Multimodal RAG with NeMo Retriever
Hosted
Beta

Build an image-query RAG system: embed a catalog with NeMo Retriever, translate an uploaded image into a retrieval query via a VLM, and ground the VLM's final answer in the retrieved passages.

35 min · 4 steps · 3 domains · Intermediate · ncp-aainca-genm

What you'll learn

  1. Embed a product corpus
    NVIDIA's llama-3.2-nv-embedqa-1b-v2 is a 2048-dim text embedding model optimized for retrieval. The NIM API follows the OpenAI /v1/embeddings spec, but adds an input_type parameter you should always set (the sketch after this list shows both values).
  2. Cosine-similarity retriever
    For a catalog of <1M rows, cosine similarity over an in-memory matrix is fast enough and lets you stay focused on the retrieval *pattern* rather than a vector DB.
  3. Describe the image, then retrieve
    The VLM acts as an image-to-text translator: take a user-uploaded image, describe it in words, and feed those words into the retriever.
  4. End-to-end grounded answer
    Now close the loop. Instead of just retrieving, feed the top-k passages back into the VLM as context and ask it a concrete grounded question: *"Given this image, which of these products is the closest match, and why?"*
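
A minimal sketch of step 1's asymmetric embedding calls, assuming the OpenAI Python SDK pointed at an OpenAI-compatible NIM endpoint. The base_url, API-key handling, helper names, and exact model string below are illustrative; input_type is a NIM-specific extension passed through the SDK's extra_body:

```python
import numpy as np
from openai import OpenAI

# Illustrative endpoint: the lab's managed NIM proxy needs no real API key.
client = OpenAI(base_url="http://nim-proxy.local/v1", api_key="not-needed")

# Model id as exposed on the endpoint; the exact string may differ.
EMBED_MODEL = "nvidia/llama-3.2-nv-embedqa-1b-v2"

def embed_corpus(passages: list[str]) -> np.ndarray:
    """Embed catalog passages for indexing (input_type='passage')."""
    resp = client.embeddings.create(
        model=EMBED_MODEL,
        input=passages,
        extra_body={"input_type": "passage"},  # NIM extension to the OpenAI spec
    )
    return np.array([d.embedding for d in resp.data])  # shape (n, 2048)

def embed_query(text: str) -> np.ndarray:
    """Embed a lookup query for retrieval (input_type='query')."""
    resp = client.embeddings.create(
        model=EMBED_MODEL,
        input=[text],
        extra_body={"input_type": "query"},
    )
    return np.array(resp.data[0].embedding)  # shape (2048,)
```

The only thing that changes between the two calls is the input_type value; everything else is a standard OpenAI embeddings request.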

Prerequisites

  • Completed `rag-pipeline-nim` or comparable RAG exposure
  • Completed `vlm-visual-qa` (for VLM basics)
  • Comfortable with numpy / cosine similarity

Exam domains covered

Retrieval-Augmented Generation · Multimodal AI · NVIDIA Platform Implementation

Skills & technologies you'll practice

This intermediate-level AI/ML lab gives you real-world reps across:

RAG · NeMo Retriever · Embeddings · Multimodal · VLM

What you'll build in this multimodal RAG lab

Multimodal RAG — where the user's query is an image rather than text — is what makes e-commerce visual search, ticket triage from screenshots, and document Q&A from photographed forms actually work. This lab builds an image-as-query retrieval pipeline against NeMo Retriever and a Nemotron VLM, both served via NVIDIA NIM endpoints we provision. You walk away with a working answer_with_rag(image, question) function, the mental model for when to use a VLM as an image-to-text translator versus a joint image-text embedding model, and a pattern that ports directly onto production multimodal search.

The technical substance is the three-layer pipeline that makes this reliable. Retrieval uses llama-3.2-nv-embedqa-1b-v2 with the asymmetric input_type flag — "passage" at indexing, "query" at lookup — to produce 2048-dim vectors, and you do cosine similarity as a single dot product over the L2-normalized corpus matrix. The VLM — nvidia/nemotron-nano-12b-v2-vl — acts as an image-to-text translator: image_to_query(data_url) turns a photo of a dark-brown wooden chair into a natural-language description that matches a corpus entry about the MidnightOak chair. The final step composes both — a chat completion with the original image part and the top-k retrieved passages both in scope — so the VLM grounds its answer in pixels and text simultaneously, dodging the classic RAG hallucination mode.
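
As a sketch of that middle layer, here is roughly what an image_to_query helper can look like, reusing the client from the embedding sketch above. The data-URL helper, prompt wording, and token limit are assumptions rather than the lab's exact code:

```python
import base64

# Model id as exposed on the endpoint; the exact string may differ.
VLM_MODEL = "nvidia/nemotron-nano-12b-v2-vl"

def to_data_url(path: str, mime: str = "image/jpeg") -> str:
    """Encode a local image file as a base64 data URL for the chat API."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return f"data:{mime};base64,{b64}"

def image_to_query(data_url: str) -> str:
    """Ask the VLM for a short, retrieval-friendly description of the image."""
    resp = client.chat.completions.create(
        model=VLM_MODEL,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this product in one sentence, focusing on "
                         "its type, material, and color, so the description "
                         "can be matched against a catalog."},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
        max_tokens=128,
    )
    return resp.choices[0].message.content.strip()
```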

Prerequisites: the rag-pipeline-nim lab (or equivalent text RAG exposure), the vlm-visual-qa lab for VLM basics, and numpy comfort with dot products and L2 normalization. The hosted environment ships with the OpenAI Python SDK and numpy preinstalled, pointing at our managed NIM proxy where the embedder, VLM, and generator all share the same endpoint — no keys, no GPU pod. About 35 minutes of focused work. You leave with a 2048-dim corpus matrix, an asymmetric-aware retriever, a VLM-powered image-to-query translator, and an end-to-end grounded answer that references the retrieved product by name — the exact shape a production visual-search pipeline takes.

Frequently asked questions

What does image-as-query actually mean in a RAG pipeline?

Traditional RAG takes a text question, embeds it, retrieves text chunks, and generates a text answer. Image-as-query flips the input: the user uploads a photo instead of typing. The pipeline converts the image into something the retriever understands — in this lab, a natural-language description produced by a VLM — and the rest of retrieval looks identical. The VLM is an image-to-text translator; the retriever is still text-over-text. This is the simplest and most portable multimodal RAG pattern.
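
Concretely, the image-as-query path is just a composition of pieces sketched elsewhere on this page: image_to_query and embed_query appear above, and retrieve is sketched under the vector-database question below. The function name and signature here are illustrative:

```python
def search_by_image(image_path: str, k: int = 3) -> list[str]:
    """Image in, top-k catalog passages out: describe, embed, retrieve."""
    description = image_to_query(to_data_url(image_path))  # VLM: pixels -> words
    query_vec = embed_query(description)                   # words -> 2048-dim vector
    return retrieve(query_vec, k)                          # cosine over the text corpus
```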

Why use a VLM to describe the image instead of an image embedding model directly?

Image embedders like CLIP produce a vector that lives in a joint image-text space, which works if your corpus is also embedded in that space. This lab's corpus is text embedded with llama-3.2-nv-embedqa-1b-v2 (a 2048-dim text-only model), so a CLIP vector wouldn't match. Using a VLM to generate a text description and then embedding that description lets you reuse your existing text retriever unchanged — same vector DB, same index, same similarity metric — and usually produces better recall on descriptive queries than a raw image vector would.

Why does input_type matter so much for NeMo Retriever embeddings?

llama-3.2-nv-embedqa-1b-v2 is asymmetric — it prepends different instruction prefixes at encoding time for passages versus queries, and it's trained so that a passage vector and a query vector of semantically related text are close in cosine space. Pass input_type="query" when indexing and you get back vectors that don't align with the passage distribution; recall drops ~10–20% on retrieval benchmarks. The lab enforces the distinction by having you implement separate embed_corpus and embed_query calls.

Do I need a vector database for this lab?

No. The corpus is small — a handful of product descriptions — so cosine-over-a-numpy-matrix is fast enough and keeps the lab focused on the retrieval pattern rather than database operations. Because the embeddings come back L2-normalized from NeMo Retriever, cosine similarity collapses to a single dot product (CORPUS_VECTORS @ query_vec), which is one numpy call. The rag-pipeline-nim lab covers the Milvus side for when your corpus outgrows memory.
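
A sketch of that retriever, assuming the embed_corpus helper from earlier and an illustrative one-entry catalog (the MidnightOak description below is made up for the example). The re-normalization is a safeguard rather than a requirement if the endpoint already returns unit-length vectors:

```python
import numpy as np

# Illustrative catalog; the lab ships its own product descriptions.
CORPUS_TEXTS = [
    "MidnightOak chair: dark-brown solid-wood dining chair with a high back.",
    # ... remaining product descriptions
]
CORPUS_VECTORS = embed_corpus(CORPUS_TEXTS)  # shape (n, 2048)
# Safeguard: re-normalize so cosine similarity reduces to a dot product.
CORPUS_VECTORS = CORPUS_VECTORS / np.linalg.norm(CORPUS_VECTORS, axis=1, keepdims=True)

def retrieve(query_vec: np.ndarray, k: int = 3) -> list[str]:
    """Top-k passages by cosine similarity, computed as one matrix-vector product."""
    query_vec = query_vec / np.linalg.norm(query_vec)
    scores = CORPUS_VECTORS @ query_vec          # (n,) cosine scores
    top = np.argsort(scores)[::-1][:k]
    return [CORPUS_TEXTS[i] for i in top]
```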

How does Step 4's end-to-end answer differ from just returning the top retrieval?

The top retrieval gives you the single best-matching product description. Step 4's answer_with_rag goes further: it builds a chat completion with the original image in the content parts and the top-k retrieved passages as context, then asks the VLM a grounded question like "which of these products is the closest match, and why?" The VLM now answers with both pixels and text in view — it can reason about color, shape, or framing from the image and reference specific product attributes from the retrieved text. This is the full multimodal RAG loop, not just retrieval.
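
A sketch of what that composition can look like, reusing the illustrative helpers above; the prompt wording and token limit are assumptions rather than the lab's exact implementation:

```python
def answer_with_rag(image_path: str, question: str, k: int = 3) -> str:
    """Answer a question about an image, grounded in the top-k retrieved passages."""
    data_url = to_data_url(image_path)
    passages = search_by_image(image_path, k)
    context = "\n".join(f"- {p}" for p in passages)
    resp = client.chat.completions.create(
        model=VLM_MODEL,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Candidate products:\n{context}\n\n{question}\n"
                         "Answer using only the candidates above and the image."},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
        max_tokens=256,
    )
    return resp.choices[0].message.content
```

Calling answer_with_rag("chair.jpg", "Which of these products is the closest match, and why?") should then return an answer that names a specific catalog entry and justifies the match from what's visible in the photo.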

What's the difference between this lab and rag-pipeline-nim?

rag-pipeline-nim is text-in/text-out RAG built on Milvus Lite and an agentic ReAct loop — the user asks a typed question, the agent decides whether to retrieve, and the answer is grounded in text chunks. This lab keeps the same retrieval concept but makes the query image-based and uses NeMo Retriever's asymmetric text embeddings plus a VLM for image-to-query translation. They're meant to be taken in order: the text pipeline teaches retrieval fundamentals, this one layers multimodality on top.