Build a RAG Pipeline with NVIDIA NIM
Hosted
Beta

Build a complete Retrieval Augmented Generation pipeline — from document chunking to vector search to an agent that answers questions from your knowledge base.

35 min · 7 steps · 3 domains · Intermediate · ncp-aainca-genl

What you'll learn

  1. Create a Knowledge Base
    Retrieval Augmented Generation (RAG) solves a fundamental LLM limitation: models only know what they were trained on. If you ask about your company's internal docs, yesterday's news, or domain-specific data — the LLM will either hallucinate or say "I don't know."
  2. Chunk Documents
    Embedding models have a context window — typically 512 tokens. A full document might be thousands of tokens. If you embed the entire document as one vector, you lose granularity: the vector represents the "average meaning" of everything, not specific concepts.
  3. Generate Embeddings with NIM
    An embedding is a list of numbers (a vector) that represents the *meaning* of a piece of text. Texts with similar meanings have vectors that are close together in the vector space (see the toy sketch after this list).
  4. Store Vectors in Milvus
    You have vectors — now you need somewhere to store them and search by similarity. That's what a vector database does.
  5. Semantic Search
    With your vectors stored, you can now search by meaning — not keywords. This is the retrieval step that makes RAG work.
  6. Build a RAG Agent
    Now we connect everything. Instead of manually searching and stuffing context into a prompt, we give the agent a retriever tool — it decides when and what to search.
  7. Test Your RAG Pipeline
    A RAG pipeline can fail at multiple points: bad chunking, poor embeddings, irrelevant retrieval, or incorrect answer generation. Testing the final answer alone doesn't tell you where it failed.
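
To make step 3 concrete: "close together in the vector space" just means high cosine similarity between vectors. Here is a toy sketch using hypothetical 3-dimensional vectors (the lab's real NIM embeddings are 1024-dimensional):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Made-up toy vectors, purely for illustration.
cat = [0.9, 0.1, 0.2]        # "a small domestic feline"
kitten = [0.85, 0.15, 0.25]  # "a young cat"
invoice = [0.1, 0.9, 0.4]    # "a billing document"

print(cosine_similarity(cat, kitten))   # close to 1.0 -> similar meaning
print(cosine_similarity(cat, invoice))  # noticeably lower -> different meaning
```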

Prerequisites

  • Basic Python (lists, dicts, functions)
  • Completed Lab 1 (ReAct Agent) or equivalent understanding
  • Understanding of what embeddings are (vectors representing text)

Exam domains covered

Knowledge Integration and Data Handling · Agent Development · NVIDIA Platform Implementation

What you'll build in this RAG-with-NIM lab

RAG is the default architecture for most serious LLM applications shipping in 2026 — internal knowledge bots, documentation assistants, agentic search, customer support — because it's the most practical way to ground a model in data it wasn't trained on. This lab takes you from raw documents to a working RAG pipeline plus a ReAct agent that decides when to retrieve, running on NVIDIA NIM endpoints we provision. You finish with a two-phase architecture in your head — a one-time indexing pipeline (load, chunk, embed, store) and a per-query retrieval pipeline (embed, search, augment, generate) — plus a template that maps 1:1 onto production NeMo Agent Toolkit retrievers and RAGAS evaluation.
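
As a rough sketch of that two-phase shape (the helper names and objects below are illustrative placeholders, not the lab's exact code):

```python
# Phase 1: one-time indexing (load, chunk, embed, store).
def index_documents(docs, splitter, embedder, vector_store):
    chunks = [chunk for doc in docs for chunk in splitter.split_text(doc)]
    vectors = embedder.embed_documents(chunks)       # passage-side embeddings
    vector_store.insert(list(zip(chunks, vectors)))  # persist text + vectors together

# Phase 2: per-query retrieval (embed, search, augment, generate).
def answer(question, embedder, vector_store, llm, top_k=3):
    query_vector = embedder.embed_query(question)           # query-side embedding
    hits = vector_store.search(query_vector, top_k=top_k)   # nearest chunks
    context = "\n\n".join(hit.text for hit in hits)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm.invoke(prompt)
```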

The technical substance is where most RAG implementations quietly break. You work through asymmetric embeddings with nvidia/nv-embedqa-e5-v5 and learn why embed_query versus embed_documents is worth a 10–20% recall swing on the same corpus. You stand up Milvus Lite as a file-backed vector store and run cosine-similarity search over 1024-dim vectors, then graduate to an agent-based RAG pattern — a @tool-decorated search_knowledge_base handed to create_agent, where a Nemotron-powered ReAct loop decides when to retrieve instead of always stuffing context into the prompt. You see why chunk overlap preserves sentence boundaries, why agent-based RAG beats always-retrieve on questions that don't need the knowledge base, and why the RAGAS faithfulness metric is the right signal for production drift.

Prerequisites are basic Python and a working mental model of what embeddings are; prior Milvus or LangChain experience is not assumed. The hosted environment ships with langchain-nvidia-ai-endpoints, langchain-text-splitters, pymilvus, and LangGraph preinstalled; every embedding and LLM call routes through our managed NIM proxy serving nvidia/nv-embedqa-e5-v5 and nvidia/nemotron-3-super-120b-a12b — same OpenAI-compatible surface, no keys to manage, no GPU pod to provision. Expect about 35 minutes of focused work, ending with a keyword-based eval harness that runs the agent on test queries and reports pass/fail accuracy, the same shape that slots into RAGAS metrics (AnswerRelevance, ContextPrecision, Faithfulness) via nat eval for real workloads.

Frequently asked questions

Why does this lab use Milvus Lite instead of a hosted vector database?

Milvus Lite is a single-file embedded build of Milvus — no server, no Docker, no network setup. It exposes the same MilvusClient API as the distributed version, so the code you write here (create_collection, insert, search with output_fields) works unchanged against production Milvus. The point of the lab is the retrieval pattern, not the deployment topology; graduating to a hosted cluster is a config swap once the pipeline logic is solid.
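
For reference, a minimal sketch of that MilvusClient surface against a Milvus Lite file (the vector values and field names below are illustrative; the lab's notebook wires in the real embeddings):

```python
from pymilvus import MilvusClient

client = MilvusClient("rag_demo.db")  # a local file path means Milvus Lite, no server needed

# 1024 matches nv-embedqa-e5-v5's output dimension.
client.create_collection(collection_name="docs", dimension=1024)

# Each row carries the vector plus any payload fields you want returned at search time.
client.insert(
    collection_name="docs",
    data=[{"id": 0, "vector": [0.1] * 1024, "text": "Milvus is a vector database."}],
)

hits = client.search(
    collection_name="docs",
    data=[[0.1] * 1024],       # one query vector in, one result list out
    limit=3,
    output_fields=["text"],    # return the chunk text alongside the distance
)
print(hits[0][0]["entity"]["text"])
```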

What's the difference between embed_documents and embed_query on an asymmetric model?

nvidia/nv-embedqa-e5-v5 is trained with separate prefixes for passages and queries, so the two methods produce different vectors for the same text. embed_documents is what you call during indexing — it treats the input as a passage to be retrieved. embed_query is what you call at search time — it treats the input as a short question. Using the wrong one at search time typically costs 10–20% recall on retrieval-quality benchmarks, which is why the lab enforces the distinction.
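
A short sketch of the distinction, using the proxy endpoint named elsewhere on this page (exact constructor arguments may differ slightly in your own environment):

```python
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

embedder = NVIDIAEmbeddings(
    model="nvidia/nv-embedqa-e5-v5",
    base_url="http://nim-proxy.labs.svc:8080/v1",  # the lab's managed NIM proxy
)

# Indexing time: passage-side vectors for the chunks you store.
doc_vectors = embedder.embed_documents(["Milvus is a vector database."])

# Search time: query-side vector for the user's question.
query_vector = embedder.embed_query("What is Milvus?")

# On this asymmetric model the same text produces different vectors through the
# two methods; mixing them up is the 10-20% recall mistake described above.
```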

Why give the agent a retrieval tool instead of always stuffing retrieved chunks into the prompt?

Always-retrieve RAG runs an embedding call and a vector search even on questions like "what's 2+2" where the knowledge base is irrelevant, and it dilutes the prompt with chunks that don't help. Giving the ReAct agent a search_knowledge_base tool lets the LLM decide — it calls the tool when the question needs domain knowledge and skips it when it doesn't. This is the pattern the NeMo Agent Toolkit encourages via function-group retrievers and the same shape a production agent ships with.
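
In outline, the agent-based pattern looks roughly like this (a sketch: `embedder`, `client`, and the lab's `create_agent` constructor are assumed to be in scope from earlier steps, and exact return shapes may differ):

```python
from langchain_core.tools import tool
from langchain_nvidia_ai_endpoints import ChatNVIDIA

@tool
def search_knowledge_base(query: str) -> str:
    """Search the knowledge base for passages relevant to the query."""
    query_vector = embedder.embed_query(query)  # query-side embedding
    hits = client.search(collection_name="docs", data=[query_vector],
                         limit=3, output_fields=["text"])
    return "\n\n".join(hit["entity"]["text"] for hit in hits[0])

llm = ChatNVIDIA(
    model="nvidia/nemotron-3-super-120b-a12b",
    base_url="http://nim-proxy.labs.svc:8080/v1",
)

# The ReAct loop decides per question whether calling the tool is worth it.
agent = create_agent(llm, tools=[search_knowledge_base])
result = agent.invoke({"messages": [("user", "What is Milvus?")]})
```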

How is chunk_size=150 chosen, and would I use that in production?

For this lab, 150 characters is deliberately small so the splitter produces multiple chunks per document and you can see the effect of overlap on a readable scale. In production you'd usually sit in the 500–1000 character range for general text, use token-based splitters for code or tables, and consider semantic chunkers for long-form content. The tradeoff is always: smaller chunks give more precise retrieval, larger chunks preserve more context per hit.
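
For example, with the splitter the lab environment ships (the chunk_overlap value and sample text below are illustrative; pick overlap relative to your chunk size):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,    # max characters per chunk, deliberately tiny for the lab
    chunk_overlap=30,  # shared text between neighbouring chunks keeps sentences intact
)

text = (
    "Milvus is a vector database built for similarity search. "
    "It stores embeddings and retrieves the nearest vectors to a query. "
    "Milvus Lite is an embedded, single-file build of the same engine."
)

for i, chunk in enumerate(splitter.split_text(text)):
    print(f"chunk {i} ({len(chunk)} chars): {chunk}")
```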

What models does the NIM proxy expose in this lab?

Two: nvidia/nv-embedqa-e5-v5 for the 1024-dim asymmetric embeddings, and nvidia/nemotron-3-super-120b-a12b as the reasoning LLM driving the ReAct agent. Both are reached through the same OpenAI-compatible endpoint at http://nim-proxy.labs.svc:8080/v1, which is why NVIDIAEmbeddings and ChatNVIDIA point at identical base_url strings. You don't manage keys — the proxy injects credentials on your behalf.
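
Concretely, both clients point at the same base_url (a sketch; inside the hosted lab no API key is needed because the proxy injects credentials):

```python
from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings

NIM_PROXY = "http://nim-proxy.labs.svc:8080/v1"  # the lab's managed proxy

embeddings = NVIDIAEmbeddings(model="nvidia/nv-embedqa-e5-v5", base_url=NIM_PROXY)
llm = ChatNVIDIA(model="nvidia/nemotron-3-super-120b-a12b", base_url=NIM_PROXY)
```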

How is the final RAG quality actually evaluated?

Step 7 uses keyword-based correctness — for each test case, the agent runs, and the final answer is checked for an expected substring (e.g., the answer to "What is Milvus?" should contain "vector"). It's a deliberately simple harness to illustrate the pass/fail accuracy loop. In production, the same shape plugs into RAGAS metrics (AnswerRelevance, ContextPrecision, Faithfulness) via NeMo Agent Toolkit's nat eval, which the lab points you to at the end.
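
The harness is small enough to sketch in full (the test cases and the way the final message is pulled out are illustrative, not the lab's exact code; `agent` is the ReAct agent built in step 6):

```python
test_cases = [
    {"question": "What is Milvus?", "expected_keyword": "vector"},
    {"question": "How many dimensions do the embeddings have?", "expected_keyword": "1024"},
]

passed = 0
for case in test_cases:
    result = agent.invoke({"messages": [("user", case["question"])]})
    answer = result["messages"][-1].content  # final assistant message
    ok = case["expected_keyword"].lower() in answer.lower()
    passed += ok
    print(f"{'PASS' if ok else 'FAIL'}: {case['question']}")

print(f"accuracy: {passed}/{len(test_cases)}")
```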