Multimodal RAG with NeMo Retriever
Build an image-query RAG system: embed a catalog with NeMo Retriever, translate an uploaded image into a retrieval query via a VLM, and ground the VLM's final answer in the retrieved passages.
What you'll learn
1. Embed a product corpus. NVIDIA's llama-3.2-nv-embedqa-1b-v2 is a 2048-dim text embedding model optimized for retrieval. The NIM API follows the OpenAI /v1/embeddings spec, but adds an input_type parameter you should always set.
2. Cosine-similarity retriever. For a catalog of <1M rows, cosine similarity over an in-memory matrix is fast enough and lets you stay focused on the retrieval *pattern* rather than a vector DB.
3. Describe the image, then retrieve. The VLM acts as an image-to-text translator: take a user-uploaded image, describe it in words, feed those words into the retriever.
4. End-to-end grounded answer. Now close the loop. Instead of just retrieving, feed the top-k passages back into the VLM as context and ask it a concrete grounded question: *"Given this image, which of these products is the closest match, and why?"*
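Step 1's input_type flag can be sketched with the OpenAI SDK the lab ships with. The model name comes from the lab; the helper names (embed, l2_normalize) and the extra_body mechanism are assumptions about the NIM proxy's OpenAI-compatible surface, not a published API:

```python
import numpy as np

EMBED_MODEL = "nvidia/llama-3.2-nv-embedqa-1b-v2"  # 2048-dim text embedder

def l2_normalize(vecs: np.ndarray) -> np.ndarray:
    """Normalize rows so cosine similarity later reduces to a dot product."""
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def embed(texts: list[str], input_type: str) -> np.ndarray:
    """input_type is 'passage' when indexing the corpus, 'query' at lookup."""
    from openai import OpenAI  # deferred import; SDK is preinstalled in the lab
    client = OpenAI()  # base_url / api_key are assumed to come from the lab env
    resp = client.embeddings.create(
        model=EMBED_MODEL,
        input=texts,
        # input_type is a NIM extension to the OpenAI spec, passed via extra_body
        extra_body={"input_type": input_type},
    )
    return l2_normalize(np.array([d.embedding for d in resp.data], dtype=np.float32))
```

Indexing then becomes `embed(corpus, "passage")` once, and each lookup is `embed([question], "query")`.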
Prerequisites
- Completed `rag-pipeline-nim` or comparable RAG exposure
- Completed `vlm-visual-qa` (for VLM basics)
- Comfortable with numpy / cosine similarity
Skills & technologies you'll practice
This intermediate-level AI/ML lab gives you real-world reps across NeMo Retriever text embeddings, cosine-similarity retrieval in numpy, VLM-based image-to-query translation, and grounded multimodal generation.
What you'll build in this multimodal RAG lab
Multimodal RAG — where the user's query is an image rather than text — is what makes e-commerce visual search, ticket triage from screenshots, and document Q&A from photographed forms actually work. This lab builds an image-as-query retrieval pipeline against NeMo Retriever and a Nemotron VLM, both served via NVIDIA NIM endpoints we provision. You walk away with a working answer_with_rag(image, question) function, the mental model for when to use a VLM as an image-to-text translator versus a joint image-text embedding model, and a pattern that ports directly onto production multimodal search.
The technical substance is the three-layer pipeline that makes this reliable. Retrieval uses llama-3.2-nv-embedqa-1b-v2 with the asymmetric input_type flag — "passage" at indexing, "query" at lookup — to produce 2048-dim vectors, and you do cosine similarity as a single dot product over the L2-normalised corpus matrix. The VLM — nvidia/nemotron-nano-12b-v2-vl — acts as an image-to-text translator: image_to_query(data_url) turns a photo of a dark-brown wooden chair into a natural-language description that matches a corpus entry about the MidnightOak chair. The final step composes both — a chat completion with the original image part and the top-k retrieved passages both in scope — so the VLM grounds its answer in pixels and text simultaneously, dodging the classic RAG hallucination mode.
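The retrieval layer is small enough to show in full. A minimal sketch of cosine top-k over an L2-normalised corpus matrix (helper and variable names are illustrative):

```python
import numpy as np

def top_k(corpus_vectors: np.ndarray, query_vec: np.ndarray, k: int = 3):
    """corpus_vectors: (N, d) with L2-normalized rows; query_vec: (d,) L2-normalized.
    Because everything is unit-length, cosine similarity over the whole corpus
    is a single matrix-vector product."""
    scores = corpus_vectors @ query_vec
    idx = np.argsort(scores)[::-1][:k]  # indices of the k highest similarities
    return idx, scores[idx]

# Tiny demo with pre-normalized 2-d vectors
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
query = np.array([0.8, 0.6])
idx, scores = top_k(corpus, query, k=2)
```

On the demo data the closest row is `[0.6, 0.8]` with cosine similarity 0.96, so `idx[0]` is 2.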
Prerequisites: the rag-pipeline-nim lab (or equivalent text RAG exposure), the vlm-visual-qa lab for VLM basics, and numpy comfort with dot products and L2 normalisation. The hosted environment ships with the OpenAI Python SDK and numpy preinstalled, pointing at our managed NIM proxy where the embedder, VLM, and generator all share the same endpoint — no keys, no GPU pod. About 35 minutes of focused work. You leave with a 2048-dim corpus matrix, an asymmetric-aware retriever, a VLM-powered image-to-query translator, and an end-to-end grounded answer that references the retrieved product by name — the exact shape production visual search ships in.
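The image-to-query step described above can be sketched as follows. The prompt wording and helper names are illustrative, and the client is assumed to point at the lab's OpenAI-compatible NIM proxy:

```python
import base64

VLM_MODEL = "nvidia/nemotron-nano-12b-v2-vl"

def to_data_url(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    """Inline the image as a data URL, the form the chat API's image_url part accepts."""
    return f"data:{mime};base64,{base64.b64encode(image_bytes).decode()}"

def image_to_query(data_url: str) -> str:
    """Ask the VLM for a retrieval-friendly text description of the image."""
    from openai import OpenAI  # deferred import; preinstalled in the lab env
    client = OpenAI()  # assumed to resolve to the managed NIM proxy
    resp = client.chat.completions.create(
        model=VLM_MODEL,
        messages=[{"role": "user", "content": [
            {"type": "text",
             "text": "Describe this product in one retrieval-friendly sentence."},
            {"type": "image_url", "image_url": {"url": data_url}},
        ]}],
    )
    return resp.choices[0].message.content.strip()
```

The returned sentence is then embedded with input_type="query" and fed to the retriever unchanged.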
Frequently asked questions
What does image-as-query actually mean in a RAG pipeline?
The user's query is an image rather than typed text. Since the corpus is indexed with a text-only embedder, the pipeline first has a VLM translate the image into a natural-language description, embeds that description as the query, and retrieves against the existing text index; the final answer is then grounded in both the image and the retrieved passages.
Why use a VLM to describe the image instead of an image embedding model directly?
The corpus is indexed with llama-3.2-nv-embedqa-1b-v2 (a 2048-dim text-only model), so a CLIP vector wouldn't match. Using a VLM to generate a text description and then embedding that description lets you reuse your existing text retriever unchanged — same vector DB, same index, same similarity metric — and usually produces better recall on descriptive queries than a raw image vector would.

Why does input_type matter so much for NeMo Retriever embeddings?
llama-3.2-nv-embedqa-1b-v2 is asymmetric — it prepends different instruction prefixes at encoding time for passages versus queries, and it's trained so that a passage vector and a query vector of semantically related text are close in cosine space. Pass input_type="query" when indexing and you get back vectors that don't align with the passage distribution; recall drops ~10–20% on retrieval benchmarks. The lab enforces the distinction by having you implement separate embed_corpus and embed_query calls.

Do I need a vector database for this lab?
No. For a catalog of this size, retrieval is a single dot product over the in-memory corpus matrix (CORPUS_VECTORS @ query_vec), which is one numpy call. The rag-pipeline-nim lab covers the Milvus side for when your corpus outgrows memory.

How does Step 4's end-to-end answer differ from just returning the top retrieval?
Returning the top hit would just echo a catalog entry; answer_with_rag goes further: it builds a chat completion with the original image in the content parts and the top-k retrieved passages as context, then asks the VLM a grounded question like "which of these products is the closest match, and why?" The VLM now answers with both pixels and text in view — it can reason about color, shape, or framing from the image and reference specific product attributes from the retrieved text. This is the full multimodal RAG loop, not just retrieval.

What's the difference between this lab and rag-pipeline-nim?
rag-pipeline-nim is text-in/text-out RAG built on Milvus Lite and an agentic ReAct loop — the user asks a typed question, the agent decides whether to retrieve, and the answer is grounded in text chunks. This lab keeps the same retrieval concept but makes the query image-based and uses NeMo Retriever's asymmetric text embeddings plus a VLM for image-to-query translation. They're meant to be taken in order: the text pipeline teaches retrieval fundamentals, this one layers multimodality on top.
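The answer_with_rag composition can be sketched as below. The function names follow the lab's description, the message shape assumes the standard OpenAI-style image_url content part, and the prompt text is illustrative:

```python
VLM_MODEL = "nvidia/nemotron-nano-12b-v2-vl"

def build_messages(data_url: str, passages: list[str], question: str) -> list[dict]:
    """Put the retrieved passages and the original image in the same prompt,
    so the VLM grounds its answer in pixels and text simultaneously."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return [{
        "role": "user",
        "content": [
            {"type": "text",
             "text": f"Retrieved products:\n{context}\n\n{question}"},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }]

def answer_with_rag(data_url: str, passages: list[str], question: str) -> str:
    from openai import OpenAI  # deferred import; preinstalled in the lab env
    client = OpenAI()  # assumed to resolve to the managed NIM proxy
    resp = client.chat.completions.create(
        model=VLM_MODEL,
        messages=build_messages(data_url, passages, question),
    )
    return resp.choices[0].message.content
```

A typical call: `answer_with_rag(to_data_url(img_bytes), top_passages, "Which of these products is the closest match, and why?")`.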