Vision-Language Models: Captioning and Visual QA
GPU sandbox · jupyter
Beta

Load Qwen2-VL, caption a real image, run a battery of visual question-answering prompts, and dissect the architecture — vision encoder, projector, language model — to see exactly how pixels become tokens the LLM can reason over.

35 min · 4 steps · 3 domains · Intermediate · nca-genm

What you'll learn

  1. Load Qwen2-VL and a test image
  2. Caption the image
  3. Visual Question Answering
  4. Inspect the architecture

Prerequisites

  • Comfortable with Hugging Face transformers and processors
  • Basic PIL / image-handling in Python
  • Familiarity with tokenizers and generation APIs

Exam domains covered

Multimodal and Computer Vision · LLM Integration and Development · Model Deployment & Inference Optimization

Skills & technologies you'll practice

This intermediate-level GPU lab gives you real-world reps across:

VLM · Qwen2-VL · Image Captioning · Visual QA · Multimodal · Vision Encoder · Projector · Hugging Face

What you'll build in this vision-language model lab

Vision-language models are what's behind every "upload a photo, get an answer" product shipping in 2026 — retail visual search, document understanding, medical triage, autonomous UI agents — and the architecture is now stable enough that you can learn it in one sitting. In about 35 minutes on a real NVIDIA GPU we provision, you'll drive Qwen2-VL-2B through real captioning and visual-QA work, dissect its three-part anatomy (vision encoder, projector, language model) by walking named_parameters(), and come out with a concrete diagnostic framework for when a VLM hallucinates: is it the encoder missing detail at its patch resolution, the projector bottlenecking visual tokens, or the LM overriding weak visual signal with priors? That framework is the difference between prompt-engineering your way around a failure and actually fixing it.

The substance is how modern VLMs wire pixels into an LLM's embedding space. You'll use AutoModelForVision2Seq with the matching AutoProcessor and its apply_chat_template (critical — Qwen2-VL is sensitive to token order and image placeholder positioning, and rolling your own template is a common quality hit), call model.generate(**inputs, max_new_tokens=100) on a real image, and run a battery of counting / attribute / spatial / inference questions to surface where grounding breaks. The architecture inspection step is the payoff: you'll see that in Qwen2-VL-2B everything under visual. is the vision tower, visual.merger is the projector that spatially downsamples patches into fewer LM tokens (versus BLIP-2's Q-Former or LLaVA's plain MLP — three different bets on the same problem), and model.* is the LM itself. You'll know exactly how many parameters sit in each bucket and why Qwen's merger trades resolution for fewer KV-cache tokens per image.
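The load-and-caption flow above can be sketched as follows — a minimal, hedged example assuming the sandbox's preinstalled transformers and the checkpoint id "Qwen/Qwen2-VL-2B-Instruct" (substitute whatever id the lab provides). The heavy imports live inside the function so the message-building helper stays importable without a GPU:

```python
def build_messages(prompt: str) -> list:
    """Chat-format payload with an image placeholder, in the structure
    Qwen2-VL's processor expects (image entry before the text entry)."""
    return [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": prompt}]}]

def caption(image, prompt: str = "Describe this image.") -> str:
    """Run one caption/VQA turn. Assumes a CUDA-capable sandbox."""
    import torch
    from transformers import AutoModelForVision2Seq, AutoProcessor

    model_id = "Qwen/Qwen2-VL-2B-Instruct"  # assumption: adjust to the lab's id
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForVision2Seq.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto")

    # The built-in chat template handles image placeholder tokens and order.
    text = processor.apply_chat_template(
        build_messages(prompt), tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image],
                       return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=100)

    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = out[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
```

For the VQA battery, call `caption(img, "How many cups are on the table?")` and vary the prompt across counting, attribute, spatial, and inference questions.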

Prerequisites: comfort with Hugging Face transformers processors, basic PIL handling, and generation APIs. Qwen2-VL-2B weights (~4 GB in float16) and the processor ship preinstalled in the sandbox — you only need a browser. The same architectural pattern scales to Qwen2-VL-7B and 72B with an identical API, so the code you write here ports directly when you need more reasoning quality. If you're Googling "Qwen2-VL tutorial", "how do VLMs work under the hood", "why does my VLM hallucinate image details", or "vision encoder vs projector vs language model", this is the hands-on answer.

Frequently asked questions

What is the projector actually doing between the vision encoder and the LM?

Dimension matching and token selection. The vision encoder outputs patch features in its own hidden size (say 1280); the LM expects tokens in a different size (say 1536 for Qwen2-VL-2B). The projector (a small MLP or, in Qwen2-VL's case, a merger that also spatially downsamples) converts between the two spaces and controls how many visual tokens reach the LM. Fewer tokens = faster inference but less spatial detail; more tokens = better grounding but more KV-cache per image.
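The token-count tradeoff is easy to make concrete with back-of-envelope arithmetic. A sketch with assumed numbers (448 px input, 14 px patches, 2×2 spatial merge) — illustrative, not Qwen2-VL's exact configuration for every image size:

```python
def visual_token_count(image_px: int, patch_px: int, merge: int = 1) -> int:
    """Patches per side squared, then reduced by the spatial merge factor."""
    per_side = image_px // patch_px
    return (per_side * per_side) // (merge * merge)

patches = visual_token_count(448, 14)           # encoder output: 1024 patches
merged = visual_token_count(448, 14, merge=2)   # after a 2x2 merger: 256 tokens
```

A 4× reduction in visual tokens is a 4× reduction in per-image KV-cache — the resolution-for-speed trade the answer above describes.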

What's the Q-Former in BLIP-2 vs the MLP projector in LLaVA vs Qwen2-VL's merger?

Three different bets on the same problem. BLIP-2's Q-Former is a small learned transformer with a fixed number of query tokens — it compresses a variable image representation into a constant-length set, at the cost of an extra trainable module. LLaVA uses a plain MLP with no compression: every encoder patch becomes a token the LM sees, which is simpler to train and shows stronger grounding but scales poorly to high-res. Qwen2-VL's merger spatially downsamples adjacent patches before projection, trading some resolution for many fewer tokens.

Why does the VLM sometimes confidently hallucinate image details?

Two common causes. First, the vision encoder runs at a fixed input resolution with a fixed patch size — if the detail (a small logo, a digit, a face expression) is smaller than the patch can resolve, the encoder never saw it and no amount of better prompting recovers it. Second, the LM has strong priors: when the visual signal is ambiguous, it leans on what 'usually' happens and produces fluent, wrong answers. The reflection step asks you to pick one example from your qa_pairs and attribute the failure to the right layer.

Will Qwen2-VL-2B really fit on the GPU you provision?

Easily. Loaded in float16 it's about 4 GB of weights plus activations and the KV cache for a few hundred vision tokens — comfortably inside an 8 GB budget, let alone the 24+ GB cards the labs typically run on. The lab uses the 2B variant specifically so you can iterate quickly; the 7B and 72B variants of Qwen2-VL have the same architecture and API and scale up when you want more reasoning quality, at proportional inference cost.
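The weight-memory estimate is just parameters times bytes per parameter. A sketch using an approximate 2.2B parameter count for Qwen2-VL-2B including its vision tower (the exact count is what Step 4 has you compute):

```python
def weight_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB; float16 = 2 bytes per parameter."""
    return n_params * bytes_per_param / 1e9

fp16_gb = weight_gb(2.2e9)   # roughly 4 GB of weights in float16
fp32_gb = weight_gb(2.2e9, bytes_per_param=4)  # double that in float32
```

Activations and KV-cache sit on top of this, but for a few hundred visual tokens they stay small next to the weights.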

Do I need to write a custom chat template?

No — use the processor's built-in apply_chat_template. The Qwen2-VL processor knows how to serialize [{'role': 'user', 'content': [{'type': 'image'}, {'type': 'text', 'text': ...}]}] into the exact token sequence the model was instruction-tuned on, including the special image placeholder tokens the visual merger writes into. Rolling your own template is a common source of degraded quality — the model is sensitive to whitespace and token order, and the chat template was part of training.

How do I know my architecture bucketing in Step 4 is correct?

Inspect the module names before summing. For Qwen2-VL, everything under the visual. submodule is the vision tower and the visual.merger is the projector; everything under model. (embedding, layers, norm, lm_head) is the language side. Print [(name, p.numel()) for name, p in vlm.named_parameters()][:30] to see the prefixes, then bucket. The grader checks that all three buckets are non-zero and that they sum to more than 500M for a 2B VLM — if your buckets miss the bulk of the parameters, you forgot a prefix.
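The bucketing logic itself can be sketched like this. The `(name, numel)` pairs below are a tiny mock with Qwen2-VL-style prefixes so the example is self-contained; on the real model you would iterate `vlm.named_parameters()` instead:

```python
def bucket_params(named_numels):
    """Sum parameter counts into vision / projector / language buckets
    by module-name prefix. Check visual.merger before visual. so the
    projector isn't swallowed by the vision bucket."""
    buckets = {"vision": 0, "projector": 0, "language": 0}
    for name, n in named_numels:
        if name.startswith("visual.merger"):
            buckets["projector"] += n
        elif name.startswith("visual."):
            buckets["vision"] += n
        else:
            buckets["language"] += n   # model.*, lm_head, etc.
    return buckets

# Mock names/counts for illustration only.
mock = [("visual.patch_embed.proj.weight", 100),
        ("visual.merger.mlp.0.weight", 50),
        ("model.layers.0.self_attn.q_proj.weight", 200),
        ("lm_head.weight", 25)]
buckets = bucket_params(mock)
```

If a bucket comes out zero on the real model, print the first 30 parameter names and look for a prefix your conditions miss.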