Build a RAG Pipeline with NVIDIA NIM
Build a complete Retrieval Augmented Generation pipeline — from document chunking to vector search to an agent that answers questions from your knowledge base.
What you'll learn
1. Create a Knowledge Base. Retrieval Augmented Generation (RAG) solves a fundamental LLM limitation: models only know what they were trained on. If you ask about your company's internal docs, yesterday's news, or domain-specific data — the LLM will either hallucinate or say "I don't know."
2. Chunk Documents. Embedding models have a context window — typically 512 tokens. A full document might be thousands of tokens. If you embed the entire document as one vector, you lose granularity: the vector represents the "average meaning" of everything, not specific concepts.
3. Generate Embeddings with NIM. An embedding is a list of numbers (a vector) that represents the *meaning* of a piece of text. Texts with similar meanings have vectors that are close together in the vector space.
4. Store Vectors in Milvus. You have vectors — now you need somewhere to store them and search by similarity. That's what a vector database does.
5. Semantic Search. With your vectors stored, you can now search by meaning — not keywords. This is the retrieval step that makes RAG work.
6. Build a RAG Agent. Now we connect everything. Instead of manually searching and stuffing context into a prompt, we give the agent a retriever tool — it decides when and what to search.
7. Test Your RAG Pipeline. A RAG pipeline can fail at multiple points: bad chunking, poor embeddings, irrelevant retrieval, or incorrect answer generation. Testing the final answer alone doesn't tell you where it failed.
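The chunking-with-overlap idea behind steps 2 and 7 can be sketched in a few lines of plain Python. This is a stand-in for the lab's real splitter (the lab uses langchain-text-splitters); `chunk_text` and its parameters are hypothetical names chosen for illustration.

```python
def chunk_text(text: str, chunk_size: int = 150, overlap: int = 30) -> list[str]:
    """Split text into fixed-size character windows that overlap, so a
    sentence cut at one chunk's edge survives intact in the next chunk."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece:
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "RAG grounds a model in external data. " * 20  # 780-char toy document
chunks = chunk_text(doc)
# Adjacent chunks share their boundary characters, so nothing is lost mid-sentence.
print(len(chunks), chunks[0][-30:] == chunks[1][:30])  # → 7 True
```

Production splitters also respect sentence and paragraph boundaries rather than cutting at raw character offsets, but the overlap mechanism is the same.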
Prerequisites
- Basic Python (lists, dicts, functions)
- Completed Lab 1 (ReAct Agent) or equivalent understanding
- Understanding of what embeddings are (vectors representing text)
What you'll build in this RAG-with-NIM lab
RAG is the default architecture for every serious LLM application shipping in 2026 — internal knowledge bots, documentation assistants, agentic search, customer support — because it's the only reliable way to ground a model in data it wasn't trained on. This lab takes you from raw documents to a working RAG pipeline plus a ReAct agent that decides when to retrieve, running on NVIDIA NIM endpoints we provision. You finish with a two-phase architecture in your head — a one-time indexing pipeline (load, chunk, embed, store) and a per-query retrieval pipeline (embed, search, augment, generate) — plus a template that maps 1:1 onto production NeMo Agent Toolkit retrievers and RAGAS evaluation.
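The two-phase shape can be sketched end to end in a few lines, using a stand-in embedder so it runs without the NIM proxy. The `embed` function (a toy bag-of-words over a fixed vocabulary) and the in-memory index are illustrative substitutes; in the lab, embedding goes through NVIDIAEmbeddings with nvidia/nv-embedqa-e5-v5 and storage/search goes through Milvus.

```python
import math

VOCAB = ["chunk", "overlap", "vector", "similarity", "embedding",
         "endpoint", "milvus", "nim"]

def embed(text: str) -> list[float]:
    """Stand-in embedder: unit-normalized bag-of-words over a toy vocabulary.
    The lab's pipeline calls NVIDIAEmbeddings (1024-dim) here instead."""
    t = text.lower()
    v = [float(t.count(w)) for w in VOCAB]
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

# Phase 1 (one-time indexing): chunk -> embed -> store.
chunks = [
    "Milvus stores vectors and searches them by cosine similarity.",
    "NIM exposes an OpenAI-compatible endpoint for embeddings.",
    "Chunk overlap preserves sentences cut at chunk boundaries.",
]
index = [(embed(c), c) for c in chunks]  # in-memory stand-in for Milvus

# Phase 2 (per-query retrieval): embed the query -> rank by cosine similarity.
def search(query: str, k: int = 1) -> list[str]:
    qv = embed(query)
    ranked = sorted(index, key=lambda it: -sum(a * b for a, b in zip(qv, it[0])))
    return [text for _, text in ranked[:k]]

print(search("how does chunk overlap help?")[0])
```

Because the vectors are unit-normalized, the dot product in `search` is exactly cosine similarity, which is the metric the lab configures in Milvus.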
The technical substance is where most RAG implementations quietly break. You work through asymmetric embeddings with nvidia/nv-embedqa-e5-v5 and learn why embed_query versus embed_documents is worth a 10–20% recall swing on the same corpus. You stand up Milvus Lite as a file-backed vector store and run cosine-similarity search over 1024-dim vectors, then graduate to an agent-based RAG pattern — a @tool-decorated search_knowledge_base handed to create_agent, where a Nemotron-powered ReAct loop decides when to retrieve instead of always stuffing context into the prompt. You see why chunk overlap preserves sentence boundaries, why agent-based RAG beats always-retrieve on questions that don't need the knowledge base, and why the RAGAS faithfulness metric is the right signal for production drift.
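The decide-then-retrieve control flow can be sketched without a live model. Everything below is a stand-in: `fake_llm_decide` replaces the Nemotron ReAct step with a trivial heuristic, and this `search_knowledge_base` replaces the @tool-decorated Milvus retriever; only the shape of the loop matches what the lab builds with create_agent.

```python
def search_knowledge_base(query: str) -> str:
    """Stand-in retriever tool. In the lab this embeds the query
    and runs a cosine-similarity search against Milvus."""
    kb = {"refund": "Refunds are processed within 5 business days."}
    hits = [v for k, v in kb.items() if k in query.lower()]
    return " ".join(hits) or "No relevant documents found."

def fake_llm_decide(question: str) -> bool:
    """Stand-in for the ReAct 'Thought' step where the LLM decides
    whether the question needs the knowledge base."""
    return "refund" in question.lower() or "policy" in question.lower()

def agent(question: str) -> str:
    if fake_llm_decide(question):                   # Thought: need domain knowledge?
        context = search_knowledge_base(question)   # Action: call the tool
        return f"Based on the docs: {context}"      # Observation -> final answer
    return "Answered from the model's own knowledge."

print(agent("What is your refund policy?"))
print(agent("What is 2 + 2?"))
```

The second call never touches the knowledge base, which is the point of agent-based RAG: retrieval happens only when the question warrants it.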
Prerequisites are basic Python and a working mental model of what embeddings are; prior Milvus or LangChain experience is not assumed. The hosted environment ships with langchain-nvidia-ai-endpoints, langchain-text-splitters, pymilvus, and LangGraph preinstalled; every embedding and LLM call routes through our managed NIM proxy serving nvidia/nv-embedqa-e5-v5 and nvidia/nemotron-3-super-120b-a12b — same OpenAI-compatible surface, no keys to manage, no GPU pod to provision. Expect about 35 minutes of focused work, ending with a keyword-based eval harness that runs the agent on test queries and reports pass/fail accuracy — the same shape that slots into RAGAS metrics (AnswerRelevance, ContextPrecision, Faithfulness) via nat eval for real workloads.
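The keyword-based harness has a simple shape, sketched below against a hard-coded `toy_agent` so it runs standalone; `keyword_eval` and the test-case fields are hypothetical names, not the lab's exact API.

```python
def keyword_eval(agent, cases: list[dict]) -> float:
    """Run each test query through the agent; a case passes when every
    expected keyword appears in the answer. Returns overall accuracy."""
    passed = 0
    for case in cases:
        answer = agent(case["query"]).lower()
        ok = all(kw.lower() in answer for kw in case["keywords"])
        print(("PASS" if ok else "FAIL"), case["query"])
        passed += ok
    return passed / len(cases)

# A stand-in agent so the harness is demonstrable without the NIM proxy.
def toy_agent(query: str) -> str:
    return "Milvus stores 1024-dim vectors." if "milvus" in query.lower() else "I don't know."

cases = [
    {"query": "What does Milvus store?", "keywords": ["vectors"]},
    {"query": "Who won the 1998 World Cup?", "keywords": ["France"]},
]
accuracy = keyword_eval(toy_agent, cases)
print(f"accuracy: {accuracy:.0%}")  # 1 of 2 cases passes -> 50%
```

Keyword matching is a blunt instrument compared to RAGAS faithfulness, but the harness shape (query set in, per-case verdicts and an aggregate score out) carries over unchanged.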
Frequently asked questions
Why does this lab use Milvus Lite instead of a hosted vector database?
Milvus Lite exposes the same MilvusClient API as the distributed version, so the code you write here (create_collection, insert, search with output_fields) works unchanged against production Milvus. The point of the lab is the retrieval pattern, not the deployment topology; graduating to a hosted cluster is a config swap once the pipeline logic is solid.

What's the difference between embed_documents and embed_query on an asymmetric model?
nvidia/nv-embedqa-e5-v5 is trained with separate prefixes for passages and queries, so the two methods produce different vectors for the same text. embed_documents is what you call during indexing — it treats the input as a passage to be retrieved. embed_query is what you call at search time — it treats the input as a short question. Using the wrong one at search time typically costs 10–20% recall on retrieval-quality benchmarks, which is why the lab enforces the distinction.

Why give the agent a retrieval tool instead of always stuffing retrieved chunks into the prompt?
A search_knowledge_base tool lets the LLM decide — it calls the tool when the question needs domain knowledge and skips it when it doesn't. This is the pattern the NeMo Agent Toolkit encourages via function-group retrievers and the same shape a production agent ships with.

How is chunk_size=150 chosen, and would I use that in production?
What models does the NIM proxy expose in this lab?
Two models: nvidia/nv-embedqa-e5-v5 for the 1024-dim asymmetric embeddings, and nvidia/nemotron-3-super-120b-a12b as the reasoning LLM driving the ReAct agent. Both are reached through the same OpenAI-compatible endpoint at http://nim-proxy.labs.svc:8080/v1, which is why NVIDIAEmbeddings and ChatNVIDIA point at identical base_url strings. You don't manage keys — the proxy injects credentials on your behalf.

How is the final RAG quality actually evaluated?
In the lab, with a keyword-based harness that runs the agent on test queries and reports pass/fail accuracy. For production workloads, the same harness shape slots into RAGAS metrics (AnswerRelevance, ContextPrecision, Faithfulness) via nat eval, which the lab points you to at the end.