Vision-Language Models: Captioning and Visual QA
GPU sandbox · jupyter
Beta

Load Qwen2-VL, caption a real image, run a battery of visual question-answering prompts, and dissect the architecture — vision encoder, projector, language model — to see exactly how pixels become tokens the LLM can reason over.

35 min · 4 steps · 3 domains · Intermediate · nca-genm

What you'll learn

  1. Load Qwen2-VL and a test image
  2. Caption the image
  3. Visual Question Answering
  4. Inspect the architecture

Prerequisites

  • Comfortable with Hugging Face transformers and processors
  • Basic PIL / image-handling in Python
  • Familiarity with tokenizers and generation APIs

Exam domains covered

Multimodal and Computer Vision · LLM Integration and Development · Model Deployment & Inference Optimization

Skills & technologies you'll practice

This intermediate-level GPU lab gives you real-world reps across:

VLM · Qwen2-VL · Image Captioning · Visual QA · Multimodal · Vision Encoder · Projector · Hugging Face

What you'll build in this vision-language model lab

Vision-language models are what's behind every "upload a photo, get an answer" product shipping in 2026 — retail visual search, document understanding, medical triage, autonomous UI agents — and the architecture is now stable enough that you can learn it in one sitting. In about 35 minutes on a real NVIDIA GPU we provision, you'll drive Qwen2-VL-2B through real captioning and visual-QA work, dissect its three-part anatomy (vision encoder, projector, language model) by walking named_parameters(), and come out with a concrete diagnostic framework for when a VLM hallucinates: is it the encoder missing detail at its patch resolution, the projector bottlenecking visual tokens, or the LM overriding weak visual signal with priors? That framework is the difference between prompt-engineering your way around a failure and actually fixing it.

The substance is how modern VLMs wire pixels into an LLM's embedding space. You'll use AutoModelForVision2Seq with the matching AutoProcessor and its apply_chat_template (critical — Qwen2-VL is sensitive to token order and image placeholder positioning, and rolling your own template is a common quality hit), call model.generate(**inputs, max_new_tokens=100) on a real image, and run a battery of counting / attribute / spatial / inference questions to surface where grounding breaks. The architecture inspection step is the payoff: you'll see that in Qwen2-VL-2B everything under visual. is the vision tower, visual.merger is the projector that spatially downsamples patches into fewer LM tokens (versus BLIP-2's Q-Former or LLaVA's plain MLP — three different bets on the same problem), and model.* is the LM itself. You'll know exactly how many parameters sit in each bucket and why Qwen's merger trades resolution for fewer KV-cache tokens per image.
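The load-and-caption flow above can be sketched as follows — a minimal, hedged example assuming the sandbox's preinstalled transformers and the checkpoint id "Qwen/Qwen2-VL-2B-Instruct" (substitute whatever id the lab provides). The heavy imports live inside the function so the message-building helper stays importable without a GPU:

```python
def build_messages(prompt: str) -> list:
    """Chat-format payload with an image placeholder, in the structure
    Qwen2-VL's processor expects (image entry before the text entry)."""
    return [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": prompt}]}]

def caption(image, prompt: str = "Describe this image.") -> str:
    """Run one caption/VQA turn. Assumes a CUDA-capable sandbox."""
    import torch
    from transformers import AutoModelForVision2Seq, AutoProcessor

    model_id = "Qwen/Qwen2-VL-2B-Instruct"  # assumption: adjust to the lab's id
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForVision2Seq.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto")

    # The built-in chat template handles image placeholder tokens and order.
    text = processor.apply_chat_template(
        build_messages(prompt), tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image],
                       return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=100)

    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = out[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
```

For the VQA battery, call `caption(img, "How many cups are on the table?")` and vary the prompt across counting, attribute, spatial, and inference questions.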

Prerequisites: comfort with Hugging Face transformers processors, basic PIL handling, and generation APIs. Qwen2-VL-2B weights (~4 GB in float16) and the processor ship preinstalled in the sandbox — you only need a browser. The same architectural pattern scales to Qwen2-VL-7B and 72B with an identical API, so the code you write here ports directly when you need more reasoning quality. If you're Googling "Qwen2-VL tutorial", "how do VLMs work under the hood", "why does my VLM hallucinate image details", or "vision encoder vs projector vs language model", this is the hands-on answer.

Frequently asked questions

What is the projector actually doing between the vision encoder and the LM?

Dimension matching and token selection. The vision encoder outputs patch features in its own hidden size (say 1280); the LM expects tokens in a different size (say 1536 for Qwen2-VL-2B). The projector (a small MLP or, in Qwen2-VL's case, a merger that also spatially downsamples) converts between the two spaces and controls how many visual tokens reach the LM. Fewer tokens = faster inference but less spatial detail; more tokens = better grounding but more KV-cache per image.
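The token-count tradeoff is easy to make concrete with back-of-envelope arithmetic. A sketch with assumed numbers (448 px input, 14 px patches, 2×2 spatial merge) — illustrative, not Qwen2-VL's exact configuration for every image size:

```python
def visual_token_count(image_px: int, patch_px: int, merge: int = 1) -> int:
    """Patches per side squared, then reduced by the spatial merge factor."""
    per_side = image_px // patch_px
    return (per_side * per_side) // (merge * merge)

patches = visual_token_count(448, 14)           # encoder output: 1024 patches
merged = visual_token_count(448, 14, merge=2)   # after a 2x2 merger: 256 tokens
```

A 4× reduction in visual tokens is a 4× reduction in per-image KV-cache — the resolution-for-speed trade the answer above describes.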

What's the Q-Former in BLIP-2 vs the MLP projector in LLaVA vs Qwen2-VL's merger?

Three different bets on the same problem. BLIP-2's Q-Former is a small learned transformer with a fixed number of query tokens — it compresses a variable image representation into a constant-length set, at the cost of an extra trainable module. LLaVA uses a plain MLP with no compression: every encoder patch becomes a token the LM sees, which is simpler to train and shows stronger grounding but scales poorly to high-res. Qwen2-VL's merger spatially downsamples adjacent patches before projection, trading some resolution for many fewer tokens.

Why does the VLM sometimes confidently hallucinate image details?

Two common causes. First, the vision encoder runs at a fixed input resolution with a fixed patch size — if the detail (a small logo, a digit, a face expression) is smaller than the patch can resolve, the encoder never saw it and no amount of better prompting recovers it. Second, the LM has strong priors: when the visual signal is ambiguous, it leans on what 'usually' happens and produces fluent, wrong answers. The reflection step asks you to pick one example from your qa_pairs and attribute the failure to the right layer.

Will Qwen2-VL-2B really fit on the GPU you provision?

Easily. Loaded in float16 it's about 4 GB of weights plus activations and the KV cache for a few hundred vision tokens — comfortably inside an 8 GB budget, let alone the 24+ GB cards the labs typically run on. The lab uses the 2B variant specifically so you can iterate quickly; the 7B and 72B variants of Qwen2-VL have the same architecture and API and scale up when you want more reasoning quality, at proportional inference cost.
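The weight-memory estimate is just parameters times bytes per parameter. A sketch using an approximate 2.2B parameter count for Qwen2-VL-2B including its vision tower (the exact count is what Step 4 has you compute):

```python
def weight_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB; float16 = 2 bytes per parameter."""
    return n_params * bytes_per_param / 1e9

fp16_gb = weight_gb(2.2e9)   # roughly 4 GB of weights in float16
fp32_gb = weight_gb(2.2e9, bytes_per_param=4)  # double that in float32
```

Activations and KV-cache sit on top of this, but for a few hundred visual tokens they stay small next to the weights.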

Do I need to write a custom chat template?

No — use the processor's built-in apply_chat_template. The Qwen2-VL processor knows how to serialize [{'role': 'user', 'content': [{'type': 'image'}, {'type': 'text', 'text': ...}]}] into the exact token sequence the model was instruction-tuned on, including the special image placeholder tokens the visual merger writes into. Rolling your own template is a common source of degraded quality — the model is sensitive to whitespace and token order, and the chat template was part of training.

How do I know my architecture bucketing in Step 4 is correct?

Inspect the module names before summing. For Qwen2-VL, everything under the visual. submodule is the vision tower and the visual.merger is the projector; everything under model. (embedding, layers, norm, lm_head) is the language side. Print [(name, p.numel()) for name, p in vlm.named_parameters()][:30] to see the prefixes, then bucket. The grader checks that all three buckets are non-zero and that they sum to more than 500M for a 2B VLM — if your buckets miss the bulk of the parameters, you forgot a prefix.
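The bucketing logic itself can be sketched like this. The `(name, numel)` pairs below are a tiny mock with Qwen2-VL-style prefixes so the example is self-contained; on the real model you would iterate `vlm.named_parameters()` instead:

```python
def bucket_params(named_numels):
    """Sum parameter counts into vision / projector / language buckets
    by module-name prefix. Check visual.merger before visual. so the
    projector isn't swallowed by the vision bucket."""
    buckets = {"vision": 0, "projector": 0, "language": 0}
    for name, n in named_numels:
        if name.startswith("visual.merger"):
            buckets["projector"] += n
        elif name.startswith("visual."):
            buckets["vision"] += n
        else:
            buckets["language"] += n   # model.*, lm_head, etc.
    return buckets

# Mock names/counts for illustration only.
mock = [("visual.patch_embed.proj.weight", 100),
        ("visual.merger.mlp.0.weight", 50),
        ("model.layers.0.self_attn.q_proj.weight", 200),
        ("lm_head.weight", 25)]
buckets = bucket_params(mock)
```

If a bucket comes out zero on the real model, print the first 30 parameter names and look for a prefix your conditions miss.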