Question 1

How do I pass an image to a VLM through the chat completions API?

Accepted Answer

Swap the `content` string on the user message for a list of content parts. Each part has a `type` field: `type: "text"` for the prompt, `type: "image_url"` for an image. The `image_url` can be an `https://` URL the model can fetch, or a data URL like `data:image/jpeg;base64,<b64>` — the base64-inlined form is the reliable choice inside a lab because there's no network fetch to depend on. Nothing else about the request changes — `model`, `messages`, `temperature`, and the response format are the same chat completion you already know.

Question 2

Can a VLM reason about multiple images in a single turn?

Accepted Answer

Yes. Put more than one `image_url` part in the `content` list of a single user message and the VLM attends to all of them inside one context. Step 2 of this lab puts a 3-circle image next to a 5-circle image and asks which has more; the model sees both and answers from the joint context. This is the primitive behind comparison tasks — A/B screenshots, receipt vs. ledger entry, before/after photos — and it scales up to any number of images the model's context window can hold.

Question 3

Why does function calling work with Nemotron VL but not Llama 3.2 Vision?

Accepted Answer

Function calling is a per-model capability, not a blanket feature of multimodal endpoints. `nvidia/nemotron-nano-12b-v2-vl` ships with `tools` support, so Step 3 gets back a clean `tool_calls` with validated `arguments`. `meta-llama/llama-3.2-11b-vision-instruct` does not — the NIM router returns `404 No endpoints found that support tool use` if you try. Step 4 uses this split deliberately: the agent wrapper you build has to detect which mode is available and fall back to prompt-only JSON when tools aren't supported.

Question 4

Is prompt-only JSON extraction reliable enough on a VLM?

Accepted Answer

Less reliable than function calling, but workable for a single well-scoped schema. On the fake receipt in Step 4, `meta-llama/llama-3.2-11b-vision-instruct` asked for JSON in prose produces mostly-correct output most of the time, but you need defensive parsing (strip markdown fences, tolerate trailing commas, null-check every field) and you should treat extraction failures as expected. For complex schemas or production-scale extraction you'd want a VLM that supports tools — or you'd use a VLM to describe the image and a tool-capable text model to do the actual structured extraction.

Question 5

How is the Step 3 structured extraction different from a text-only tool call?

Accepted Answer

Mechanically it's identical — same `tools` parameter, same `tool_choice`, same `function.arguments` string on the return — but the model is looking at pixels to fill the schema instead of reading text. That means extraction accuracy depends on image quality and rendering style, not just language. The lab deliberately draws a clean, high-contrast receipt so the vision stage isn't the bottleneck and you can focus on the schema-enforcement pattern. Noisy receipts would add an OCR-style error mode on top.

Question 6

What's the difference between this VLM lab and the multimodal RAG one?

Accepted Answer

This lab focuses on **VLM core capabilities**: captioning, multi-image comparison, and schema-enforced extraction from a single image. The `multimodal-rag` lab layers a VLM into a retrieval pipeline — the image becomes a *query* that gets translated into text, retrieved against a product corpus, and grounded in the top-k passages. Here the VLM answers from pixels alone; there the VLM is one component in a larger retrieval-augmented system. The lab order matters: you need the primitives from this one to understand how they plug into the RAG one.

Visual Q&A with NVIDIA VLMs

What you'll learn

Prerequisites

Exam domains covered

Skills & technologies you'll practice

What you'll build in this VLM visual Q&A lab

Frequently asked questions