Visual Q&A with NVIDIA VLMs
Hosted
Beta

Visual Q&A with NVIDIA VLMs

Send images to a Vision-Language Model via NIM, answer questions about them, extract structured fields from a receipt-style image, and compare two VLMs on the same task — all through the OpenAI-compatible chat endpoint.

30 min·4 steps·3 domains·Intermediate·ncp-aainca-genm

What you'll learn

  1. 1
    Send one image and ask a question
    Vision-Language Models use the same OpenAI-compatible chat endpoint as text-only models, but the content field inside a user message becomes a list of content parts:
  2. 2
    Reason across two images
    A VLM can compare images in a single turn — you simply pass more than one image_url part in the content list. This is the core primitive for comparison tasks (A/B screenshots, receipt vs. ledger entry, before/after photos).
  3. 3
    Structured extraction from an image
    VLMs can produce structured output via the same tools / tool_choice primitives you used in structured-output-tools — the schema is enforced at the API boundary, so you skip the regex-and-pray parsing dance.
  4. 4
    Compare two VLMs with different extraction strategies
    Not every VLM supports the tools / tool_choice API. nvidia/nemotron-nano-12b-v2-vl does (that's what step 3 used). meta-llama/llama-3.2-11b-vision-instruct does NOT — you'll get a 404 No endpoints found that support tool use from the router. That's the split you'll see in the real world, and a good agent wrapper handles both.

Prerequisites

  • Completed `react-agent-nim` or comparable NIM exposure
  • Comfortable reading/writing base64 image payloads
  • Familiarity with JSON

Exam domains covered

Multimodal AIAgent DevelopmentNVIDIA Platform Implementation

Skills & technologies you'll practice

This intermediate-level ai/ml lab gives you real-world reps across:

VLMVisionNemotron-Nano-VLLlama 3.2 VisionMultimodal

What you'll build in this VLM visual Q&A lab

Vision-Language Models are the fastest-growing surface in production LLM apps — receipt parsing, screenshot triage, document extraction, multimodal agent input — and the tool that separates teams who can ship these features from teams still trying to bolt OCR together. This lab goes from sending a single image into a VLM all the way to a head-to-head comparison between two VLMs on a structured-extraction task, all against NVIDIA NIM endpoints we provision. You finish with working code for single-image Q&A, multi-image reasoning, schema-enforced extraction from a receipt, and a tool-tolerant helper that handles the reality that not every VLM supports function calling.

The technical core is the OpenAI-compatible multimodal content format — text and image_url parts coexist inside a single user message, the VLM fuses them into one context, and the response surface is the exact same chat completion shape text-only calls return. You'll work through base64 data URL payloads (the reliable image-payload choice when you don't want a network fetch in the loop), multi-image reasoning that's just more parts in the list rather than a separate API, and schema-enforced extraction using tools plus tool_choice for a save_receipt function with fields vendor, order_id, line items, subtotal, tax, total. The final step runs the same receipt through Nemotron VL with function calling and through meta-llama/llama-3.2-11b-vision-instruct with prompt-only JSON, because Llama 3.2 Vision returns 404 No endpoints found that support tool use when you pass tools — a real production split your code needs to handle.

Prerequisites: Python, prior NIM exposure (the react-agent-nim lab works), base64 payload comfort, and basic JSON. No VLM-specific library is assumed — everything goes through the OpenAI Python SDK pointed at our managed NIM proxy, where both nvidia/nemotron-nano-12b-v2-vl and meta-llama/llama-3.2-11b-vision-instruct are reached via the same OpenAI-compatible endpoint with no GPU provisioning. About 30 minutes of focused work. You leave with a single-image Q&A call, a multi-image comparison, schema-validated extraction via function calling, and a dual-path extractor that falls back to prompt-only JSON when tools aren't supported — the same defensive shape production VLM code needs.

Frequently asked questions

How do I pass an image to a VLM through the chat completions API?

Swap the content string on the user message for a list of content parts. Each part has a type field: type: "text" for the prompt, type: "image_url" for an image. The image_url can be an https:// URL the model can fetch, or a data URL like data:image/jpeg;base64,<b64> — the base64-inlined form is the reliable choice inside a lab because there's no network fetch to depend on. Nothing else about the request changes — model, messages, temperature, and the response format are the same chat completion you already know.

Can a VLM reason about multiple images in a single turn?

Yes. Put more than one image_url part in the content list of a single user message and the VLM attends to all of them inside one context. Step 2 of this lab puts a 3-circle image next to a 5-circle image and asks which has more; the model sees both and answers from the joint context. This is the primitive behind comparison tasks — A/B screenshots, receipt vs. ledger entry, before/after photos — and it scales up to any number of images the model's context window can hold.

Why does function calling work with Nemotron VL but not Llama 3.2 Vision?

Function calling is a per-model capability, not a blanket feature of multimodal endpoints. nvidia/nemotron-nano-12b-v2-vl ships with tools support, so Step 3 gets back a clean tool_calls with validated arguments. meta-llama/llama-3.2-11b-vision-instruct does not — the NIM router returns 404 No endpoints found that support tool use if you try. Step 4 uses this split deliberately: the agent wrapper you build has to detect which mode is available and fall back to prompt-only JSON when tools aren't supported.

Is prompt-only JSON extraction reliable enough on a VLM?

Less reliable than function calling, but workable for a single well-scoped schema. On the fake receipt in Step 4, meta-llama/llama-3.2-11b-vision-instruct asked for JSON in prose produces mostly-correct output most of the time, but you need defensive parsing (strip markdown fences, tolerate trailing commas, null-check every field) and you should treat extraction failures as expected. For complex schemas or production-scale extraction you'd want a VLM that supports tools — or you'd use a VLM to describe the image and a tool-capable text model to do the actual structured extraction.

How is the Step 3 structured extraction different from a text-only tool call?

Mechanically it's identical — same tools parameter, same tool_choice, same function.arguments string on the return — but the model is looking at pixels to fill the schema instead of reading text. That means extraction accuracy depends on image quality and rendering style, not just language. The lab deliberately draws a clean, high-contrast receipt so the vision stage isn't the bottleneck and you can focus on the schema-enforcement pattern. Noisy receipts would add an OCR-style error mode on top.

What's the difference between this VLM lab and the multimodal RAG one?

This lab focuses on VLM core capabilities: captioning, multi-image comparison, and schema-enforced extraction from a single image. The multimodal-rag lab layers a VLM into a retrieval pipeline — the image becomes a query that gets translated into text, retrieved against a product corpus, and grounded in the top-k passages. Here the VLM answers from pixels alone; there the VLM is one component in a larger retrieval-augmented system. The lab order matters: you need the primitives from this one to understand how they plug into the RAG one.