Build a RAG Pipeline with NVIDIA NIM
Hosted
Beta

Build a complete Retrieval Augmented Generation pipeline — from document chunking to vector search to an agent that answers questions from your knowledge base.

35 min · 7 steps · 3 domains · Intermediate · ncp-aainca-genl

What you'll learn

  1. Create a Knowledge Base
    Retrieval Augmented Generation (RAG) solves a fundamental LLM limitation: models only know what they were trained on. If you ask about your company's internal docs, yesterday's news, or domain-specific data — the LLM will either hallucinate or say "I don't know."
  2. Chunk Documents
    Embedding models have a context window — typically 512 tokens. A full document might be thousands of tokens. If you embed the entire document as one vector, you lose granularity: the vector represents the "average meaning" of everything, not specific concepts.
  3. Generate Embeddings with NIM
    An embedding is a list of numbers (a vector) that represents the *meaning* of a piece of text. Texts with similar meanings have vectors that are close together in the vector space (see the toy sketch after this list).
  4. Store Vectors in Milvus
    You have vectors — now you need somewhere to store them and search by similarity. That's what a vector database does.
  5. Semantic Search
    With your vectors stored, you can now search by meaning — not keywords. This is the retrieval step that makes RAG work.
  6. Build a RAG Agent
    Now we connect everything. Instead of manually searching and stuffing context into a prompt, we give the agent a retriever tool — it decides when and what to search.
  7. Test Your RAG Pipeline
    A RAG pipeline can fail at multiple points: bad chunking, poor embeddings, irrelevant retrieval, or incorrect answer generation. Testing the final answer alone doesn't tell you where it failed.
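
To make step 3 concrete: "close together in the vector space" just means high cosine similarity between vectors. Here is a toy sketch using hypothetical 3-dimensional vectors (the lab's real NIM embeddings are 1024-dimensional):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Made-up toy vectors, purely for illustration.
cat = [0.9, 0.1, 0.2]        # "a small domestic feline"
kitten = [0.85, 0.15, 0.25]  # "a young cat"
invoice = [0.1, 0.9, 0.4]    # "a billing document"

print(cosine_similarity(cat, kitten))   # close to 1.0 -> similar meaning
print(cosine_similarity(cat, invoice))  # noticeably lower -> different meaning
```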

Prerequisites

  • Basic Python (lists, dicts, functions)
  • Completed Lab 1 (ReAct Agent) or equivalent understanding
  • Understanding of what embeddings are (vectors representing text)

Exam domains covered

Knowledge Integration and Data Handling · Agent Development · NVIDIA Platform Implementation

What you'll build in this RAG-with-NIM lab

RAG is the default architecture for most serious LLM applications shipping in 2026 — internal knowledge bots, documentation assistants, agentic search, customer support — because it's the most practical way to ground a model in data it wasn't trained on. This lab takes you from raw documents to a working RAG pipeline plus a ReAct agent that decides when to retrieve, running on NVIDIA NIM endpoints we provision. You finish with a two-phase architecture in your head — a one-time indexing pipeline (load, chunk, embed, store) and a per-query retrieval pipeline (embed, search, augment, generate) — plus a template that maps 1:1 onto production NeMo Agent Toolkit retrievers and RAGAS evaluation.
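
As a rough sketch of that two-phase shape (the helper names and objects below are illustrative placeholders, not the lab's exact code):

```python
# Phase 1: one-time indexing (load, chunk, embed, store).
def index_documents(docs, splitter, embedder, vector_store):
    chunks = [chunk for doc in docs for chunk in splitter.split_text(doc)]
    vectors = embedder.embed_documents(chunks)       # passage-side embeddings
    vector_store.insert(list(zip(chunks, vectors)))  # persist text + vectors together

# Phase 2: per-query retrieval (embed, search, augment, generate).
def answer(question, embedder, vector_store, llm, top_k=3):
    query_vector = embedder.embed_query(question)           # query-side embedding
    hits = vector_store.search(query_vector, top_k=top_k)   # nearest chunks
    context = "\n\n".join(hit.text for hit in hits)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm.invoke(prompt)
```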

The technical substance is where most RAG implementations quietly break. You work through asymmetric embeddings with nvidia/nv-embedqa-e5-v5 and learn why embed_query versus embed_documents is worth a 10–20% recall swing on the same corpus. You stand up Milvus Lite as a file-backed vector store and run cosine-similarity search over 1024-dim vectors, then graduate to an agent-based RAG pattern — a @tool-decorated search_knowledge_base handed to create_agent, where a Nemotron-powered ReAct loop decides when to retrieve instead of always stuffing context into the prompt. You see why chunk overlap preserves sentence boundaries, why agent-based RAG beats always-retrieve on questions that don't need the knowledge base, and why the RAGAS faithfulness metric is the right signal for production drift.

Prerequisites are basic Python and a working mental model of what embeddings are; prior Milvus or LangChain experience is not assumed. The hosted environment ships with langchain-nvidia-ai-endpoints, langchain-text-splitters, pymilvus, and LangGraph preinstalled; every embedding and LLM call routes through our managed NIM proxy serving nvidia/nv-embedqa-e5-v5 and nvidia/nemotron-3-super-120b-a12b — same OpenAI-compatible surface, no keys to manage, no GPU pod to provision. Expect about 35 minutes of focused work, ending with a keyword-based eval harness that runs the agent on test queries and reports pass/fail accuracy, the same shape that slots into RAGAS metrics (AnswerRelevance, ContextPrecision, Faithfulness) via nat eval for real workloads.

Frequently asked questions

Why does this lab use Milvus Lite instead of a hosted vector database?

Milvus Lite is a single-file embedded build of Milvus — no server, no Docker, no network setup. It exposes the same MilvusClient API as the distributed version, so the code you write here (create_collection, insert, search with output_fields) works unchanged against production Milvus. The point of the lab is the retrieval pattern, not the deployment topology; graduating to a hosted cluster is a config swap once the pipeline logic is solid.
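
For reference, a minimal sketch of that MilvusClient surface against a Milvus Lite file (the vector values and field names below are illustrative; the lab's notebook wires in the real embeddings):

```python
from pymilvus import MilvusClient

client = MilvusClient("rag_demo.db")  # a local file path means Milvus Lite, no server needed

# 1024 matches nv-embedqa-e5-v5's output dimension.
client.create_collection(collection_name="docs", dimension=1024)

# Each row carries the vector plus any payload fields you want returned at search time.
client.insert(
    collection_name="docs",
    data=[{"id": 0, "vector": [0.1] * 1024, "text": "Milvus is a vector database."}],
)

hits = client.search(
    collection_name="docs",
    data=[[0.1] * 1024],       # one query vector in, one result list out
    limit=3,
    output_fields=["text"],    # return the chunk text alongside the distance
)
print(hits[0][0]["entity"]["text"])
```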

What's the difference between embed_documents and embed_query on an asymmetric model?

nvidia/nv-embedqa-e5-v5 is trained with separate prefixes for passages and queries, so the two methods produce different vectors for the same text. embed_documents is what you call during indexing — it treats the input as a passage to be retrieved. embed_query is what you call at search time — it treats the input as a short question. Using the wrong one at search time typically costs 10–20% recall on retrieval-quality benchmarks, which is why the lab enforces the distinction.
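
A short sketch of the distinction, using the proxy endpoint named elsewhere on this page (exact constructor arguments may differ slightly in your own environment):

```python
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

embedder = NVIDIAEmbeddings(
    model="nvidia/nv-embedqa-e5-v5",
    base_url="http://nim-proxy.labs.svc:8080/v1",  # the lab's managed NIM proxy
)

# Indexing time: passage-side vectors for the chunks you store.
doc_vectors = embedder.embed_documents(["Milvus is a vector database."])

# Search time: query-side vector for the user's question.
query_vector = embedder.embed_query("What is Milvus?")

# On this asymmetric model the same text produces different vectors through the
# two methods; mixing them up is the 10-20% recall mistake described above.
```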

Why give the agent a retrieval tool instead of always stuffing retrieved chunks into the prompt?

Always-retrieve RAG runs an embedding call and a vector search even on questions like "what's 2+2" where the knowledge base is irrelevant, and it dilutes the prompt with chunks that don't help. Giving the ReAct agent a search_knowledge_base tool lets the LLM decide — it calls the tool when the question needs domain knowledge and skips it when it doesn't. This is the pattern the NeMo Agent Toolkit encourages via function-group retrievers and the same shape a production agent ships with.
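
In outline, the agent-based pattern looks roughly like this (a sketch: `embedder`, `client`, and the lab's `create_agent` constructor are assumed to be in scope from earlier steps, and exact return shapes may differ):

```python
from langchain_core.tools import tool
from langchain_nvidia_ai_endpoints import ChatNVIDIA

@tool
def search_knowledge_base(query: str) -> str:
    """Search the knowledge base for passages relevant to the query."""
    query_vector = embedder.embed_query(query)  # query-side embedding
    hits = client.search(collection_name="docs", data=[query_vector],
                         limit=3, output_fields=["text"])
    return "\n\n".join(hit["entity"]["text"] for hit in hits[0])

llm = ChatNVIDIA(
    model="nvidia/nemotron-3-super-120b-a12b",
    base_url="http://nim-proxy.labs.svc:8080/v1",
)

# The ReAct loop decides per question whether calling the tool is worth it.
agent = create_agent(llm, tools=[search_knowledge_base])
result = agent.invoke({"messages": [("user", "What is Milvus?")]})
```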

How is chunk_size=150 chosen, and would I use that in production?

For this lab, 150 characters is deliberately small so the splitter produces multiple chunks per document and you can see the effect of overlap on a readable scale. In production you'd usually sit in the 500–1000 character range for general text, use token-based splitters for code or tables, and consider semantic chunkers for long-form content. The tradeoff is always: smaller chunks give more precise retrieval, larger chunks preserve more context per hit.
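
For example, with the splitter the lab environment ships (the chunk_overlap value and sample text below are illustrative; pick overlap relative to your chunk size):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,    # max characters per chunk, deliberately tiny for the lab
    chunk_overlap=30,  # shared text between neighbouring chunks keeps sentences intact
)

text = (
    "Milvus is a vector database built for similarity search. "
    "It stores embeddings and retrieves the nearest vectors to a query. "
    "Milvus Lite is an embedded, single-file build of the same engine."
)

for i, chunk in enumerate(splitter.split_text(text)):
    print(f"chunk {i} ({len(chunk)} chars): {chunk}")
```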

What models does the NIM proxy expose in this lab?

Two: nvidia/nv-embedqa-e5-v5 for the 1024-dim asymmetric embeddings, and nvidia/nemotron-3-super-120b-a12b as the reasoning LLM driving the ReAct agent. Both are reached through the same OpenAI-compatible endpoint at http://nim-proxy.labs.svc:8080/v1, which is why NVIDIAEmbeddings and ChatNVIDIA point at identical base_url strings. You don't manage keys — the proxy injects credentials on your behalf.
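
Concretely, both clients point at the same base_url (a sketch; inside the hosted lab no API key is needed because the proxy injects credentials):

```python
from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings

NIM_PROXY = "http://nim-proxy.labs.svc:8080/v1"  # the lab's managed proxy

embeddings = NVIDIAEmbeddings(model="nvidia/nv-embedqa-e5-v5", base_url=NIM_PROXY)
llm = ChatNVIDIA(model="nvidia/nemotron-3-super-120b-a12b", base_url=NIM_PROXY)
```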

How is the final RAG quality actually evaluated?

Step 7 uses keyword-based correctness — for each test case, the agent runs, and the final answer is checked for an expected substring (e.g., the answer to "What is Milvus?" should contain "vector"). It's a deliberately simple harness to illustrate the pass/fail accuracy loop. In production, the same shape plugs into RAGAS metrics (AnswerRelevance, ContextPrecision, Faithfulness) via NeMo Agent Toolkit's nat eval, which the lab points you to at the end.
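
The harness is small enough to sketch in full (the test cases and the way the final message is pulled out are illustrative, not the lab's exact code; `agent` is the ReAct agent built in step 6):

```python
test_cases = [
    {"question": "What is Milvus?", "expected_keyword": "vector"},
    {"question": "How many dimensions do the embeddings have?", "expected_keyword": "1024"},
]

passed = 0
for case in test_cases:
    result = agent.invoke({"messages": [("user", case["question"])]})
    answer = result["messages"][-1].content  # final assistant message
    ok = case["expected_keyword"].lower() in answer.lower()
    passed += ok
    print(f"{'PASS' if ok else 'FAIL'}: {case['question']}")

print(f"accuracy: {passed}/{len(test_cases)}")
```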