Multimodal RAG with NeMo Retriever
Hosted
Beta

Build an image-query RAG system: embed a catalog with NeMo Retriever, translate an uploaded image into a retrieval query via a VLM, and ground the VLM's final answer in the retrieved passages.

35 min · 4 steps · 3 domains · Intermediate · ncp-aainca-genm

What you'll learn

  1. Embed a product corpus
    NVIDIA's llama-3.2-nv-embedqa-1b-v2 is a 2048-dim text embedding model optimized for retrieval. The NIM API follows the OpenAI /v1/embeddings spec, but adds an input_type parameter you should always set (the sketch after this list shows both values).
  2. Cosine-similarity retriever
    For a catalog of <1M rows, cosine similarity over an in-memory matrix is fast enough and lets you stay focused on the retrieval *pattern* rather than a vector DB.
  3. Describe the image, then retrieve
    The VLM acts as an image-to-text translator: take a user-uploaded image, describe it in words, and feed those words into the retriever.
  4. End-to-end grounded answer
    Now close the loop. Instead of just retrieving, feed the top-k passages back into the VLM as context and ask it a concrete grounded question: *"Given this image, which of these products is the closest match, and why?"*
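
A minimal sketch of step 1's asymmetric embedding calls, assuming the OpenAI Python SDK pointed at an OpenAI-compatible NIM endpoint. The base_url, API-key handling, helper names, and exact model string below are illustrative; input_type is a NIM-specific extension passed through the SDK's extra_body:

```python
import numpy as np
from openai import OpenAI

# Illustrative endpoint: the lab's managed NIM proxy needs no real API key.
client = OpenAI(base_url="http://nim-proxy.local/v1", api_key="not-needed")

# Model id as exposed on the endpoint; the exact string may differ.
EMBED_MODEL = "nvidia/llama-3.2-nv-embedqa-1b-v2"

def embed_corpus(passages: list[str]) -> np.ndarray:
    """Embed catalog passages for indexing (input_type='passage')."""
    resp = client.embeddings.create(
        model=EMBED_MODEL,
        input=passages,
        extra_body={"input_type": "passage"},  # NIM extension to the OpenAI spec
    )
    return np.array([d.embedding for d in resp.data])  # shape (n, 2048)

def embed_query(text: str) -> np.ndarray:
    """Embed a lookup query for retrieval (input_type='query')."""
    resp = client.embeddings.create(
        model=EMBED_MODEL,
        input=[text],
        extra_body={"input_type": "query"},
    )
    return np.array(resp.data[0].embedding)  # shape (2048,)
```

The only thing that changes between the two calls is the input_type value; everything else is a standard OpenAI embeddings request.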

Prerequisites

  • Completed `rag-pipeline-nim` or comparable RAG exposure
  • Completed `vlm-visual-qa` (for VLM basics)
  • Comfortable with numpy / cosine similarity

Exam domains covered

Retrieval-Augmented Generation · Multimodal AI · NVIDIA Platform Implementation

Skills & technologies you'll practice

This intermediate-level AI/ML lab gives you real-world reps across:

RAG · NeMo Retriever · Embeddings · Multimodal · VLM

What you'll build in this multimodal RAG lab

Multimodal RAG — where the user's query is an image rather than text — is what makes e-commerce visual search, ticket triage from screenshots, and document Q&A from photographed forms actually work. This lab builds an image-as-query retrieval pipeline against NeMo Retriever and a Nemotron VLM, both served via NVIDIA NIM endpoints we provision. You walk away with a working answer_with_rag(image, question) function, the mental model for when to use a VLM as an image-to-text translator versus a joint image-text embedding model, and a pattern that ports directly onto production multimodal search.

The technical substance is the three-layer pipeline that makes this reliable. Retrieval uses llama-3.2-nv-embedqa-1b-v2 with the asymmetric input_type flag — "passage" at indexing, "query" at lookup — to produce 2048-dim vectors, and you do cosine similarity as a single dot product over the L2-normalized corpus matrix. The VLM — nvidia/nemotron-nano-12b-v2-vl — acts as an image-to-text translator: image_to_query(data_url) turns a photo of a dark-brown wooden chair into a natural-language description that matches a corpus entry about the MidnightOak chair. The final step composes both — a chat completion with the original image part and the top-k retrieved passages both in scope — so the VLM grounds its answer in pixels and text simultaneously, dodging the classic RAG hallucination mode.
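
As a sketch of that middle layer, here is roughly what an image_to_query helper can look like, reusing the client from the embedding sketch above. The data-URL helper, prompt wording, and token limit are assumptions rather than the lab's exact code:

```python
import base64

# Model id as exposed on the endpoint; the exact string may differ.
VLM_MODEL = "nvidia/nemotron-nano-12b-v2-vl"

def to_data_url(path: str, mime: str = "image/jpeg") -> str:
    """Encode a local image file as a base64 data URL for the chat API."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return f"data:{mime};base64,{b64}"

def image_to_query(data_url: str) -> str:
    """Ask the VLM for a short, retrieval-friendly description of the image."""
    resp = client.chat.completions.create(
        model=VLM_MODEL,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this product in one sentence, focusing on "
                         "its type, material, and color, so the description "
                         "can be matched against a catalog."},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
        max_tokens=128,
    )
    return resp.choices[0].message.content.strip()
```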

Prerequisites: the rag-pipeline-nim lab (or equivalent text RAG exposure), the vlm-visual-qa lab for VLM basics, and numpy comfort with dot products and L2 normalization. The hosted environment ships with the OpenAI Python SDK and numpy preinstalled, pointing at our managed NIM proxy where the embedder, VLM, and generator all share the same endpoint — no keys, no GPU pod. About 35 minutes of focused work. You leave with a 2048-dim corpus matrix, an asymmetric-aware retriever, a VLM-powered image-to-query translator, and an end-to-end grounded answer that references the retrieved product by name — the exact shape a production visual-search pipeline takes.

Frequently asked questions

What does image-as-query actually mean in a RAG pipeline?

Traditional RAG takes a text question, embeds it, retrieves text chunks, and generates a text answer. Image-as-query flips the input: the user uploads a photo instead of typing. The pipeline converts the image into something the retriever understands — in this lab, a natural-language description produced by a VLM — and the rest of retrieval looks identical. The VLM is an image-to-text translator; the retriever is still text-over-text. This is the simplest and most portable multimodal RAG pattern.
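
Concretely, the image-as-query path is just a composition of pieces sketched elsewhere on this page: image_to_query and embed_query appear above, and retrieve is sketched under the vector-database question below. The function name and signature here are illustrative:

```python
def search_by_image(image_path: str, k: int = 3) -> list[str]:
    """Image in, top-k catalog passages out: describe, embed, retrieve."""
    description = image_to_query(to_data_url(image_path))  # VLM: pixels -> words
    query_vec = embed_query(description)                   # words -> 2048-dim vector
    return retrieve(query_vec, k)                          # cosine over the text corpus
```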

Why use a VLM to describe the image instead of an image embedding model directly?

Image embedders like CLIP produce a vector that lives in a joint image-text space, which works if your corpus is also embedded in that space. This lab's corpus is text embedded with llama-3.2-nv-embedqa-1b-v2 (a 2048-dim text-only model), so a CLIP vector wouldn't match. Using a VLM to generate a text description and then embedding that description lets you reuse your existing text retriever unchanged — same vector DB, same index, same similarity metric — and usually produces better recall on descriptive queries than a raw image vector would.

Why does input_type matter so much for NeMo Retriever embeddings?

llama-3.2-nv-embedqa-1b-v2 is asymmetric — it prepends different instruction prefixes at encoding time for passages versus queries, and it's trained so that a passage vector and a query vector of semantically related text are close in cosine space. Pass input_type="query" when indexing and you get back vectors that don't align with the passage distribution; recall drops ~10–20% on retrieval benchmarks. The lab enforces the distinction by having you implement separate embed_corpus and embed_query calls.

Do I need a vector database for this lab?

No. The corpus is small — a handful of product descriptions — so cosine-over-a-numpy-matrix is fast enough and keeps the lab focused on the retrieval pattern rather than database operations. Because the embeddings come back L2-normalized from NeMo Retriever, cosine similarity collapses to a single dot product (CORPUS_VECTORS @ query_vec), which is one numpy call. The rag-pipeline-nim lab covers the Milvus side for when your corpus outgrows memory.
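
A sketch of that retriever, assuming the embed_corpus helper from earlier and an illustrative one-entry catalog (the MidnightOak description below is made up for the example). The re-normalization is a safeguard rather than a requirement if the endpoint already returns unit-length vectors:

```python
import numpy as np

# Illustrative catalog; the lab ships its own product descriptions.
CORPUS_TEXTS = [
    "MidnightOak chair: dark-brown solid-wood dining chair with a high back.",
    # ... remaining product descriptions
]
CORPUS_VECTORS = embed_corpus(CORPUS_TEXTS)  # shape (n, 2048)
# Safeguard: re-normalize so cosine similarity reduces to a dot product.
CORPUS_VECTORS = CORPUS_VECTORS / np.linalg.norm(CORPUS_VECTORS, axis=1, keepdims=True)

def retrieve(query_vec: np.ndarray, k: int = 3) -> list[str]:
    """Top-k passages by cosine similarity, computed as one matrix-vector product."""
    query_vec = query_vec / np.linalg.norm(query_vec)
    scores = CORPUS_VECTORS @ query_vec          # (n,) cosine scores
    top = np.argsort(scores)[::-1][:k]
    return [CORPUS_TEXTS[i] for i in top]
```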

How does Step 4's end-to-end answer differ from just returning the top retrieval?

The top retrieval gives you the single best-matching product description. Step 4's answer_with_rag goes further: it builds a chat completion with the original image in the content parts and the top-k retrieved passages as context, then asks the VLM a grounded question like "which of these products is the closest match, and why?" The VLM now answers with both pixels and text in view — it can reason about color, shape, or framing from the image and reference specific product attributes from the retrieved text. This is the full multimodal RAG loop, not just retrieval.
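
A sketch of what that composition can look like, reusing the illustrative helpers above; the prompt wording and token limit are assumptions rather than the lab's exact implementation:

```python
def answer_with_rag(image_path: str, question: str, k: int = 3) -> str:
    """Answer a question about an image, grounded in the top-k retrieved passages."""
    data_url = to_data_url(image_path)
    passages = search_by_image(image_path, k)
    context = "\n".join(f"- {p}" for p in passages)
    resp = client.chat.completions.create(
        model=VLM_MODEL,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Candidate products:\n{context}\n\n{question}\n"
                         "Answer using only the candidates above and the image."},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
        max_tokens=256,
    )
    return resp.choices[0].message.content
```

Calling answer_with_rag("chair.jpg", "Which of these products is the closest match, and why?") should then return an answer that names a specific catalog entry and justifies the match from what's visible in the photo.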

What's the difference between this lab and rag-pipeline-nim?

rag-pipeline-nim is text-in/text-out RAG built on Milvus Lite and an agentic ReAct loop — the user asks a typed question, the agent decides whether to retrieve, and the answer is grounded in text chunks. This lab keeps the same retrieval concept but makes the query image-based and uses NeMo Retriever's asymmetric text embeddings plus a VLM for image-to-query translation. They're meant to be taken in order: the text pipeline teaches retrieval fundamentals, this one layers multimodality on top.