NCP-AAINVIDIAAgentic AIRAGRetrieval-Augmented GenerationVector DatabasesLangChain

RAG for AI Agents: Retrieval-Augmented Generation NCP-AAI Guide

Preporato TeamApril 1, 202635 min readNCP-AAI

Retrieval-Augmented Generation (RAG) is the most critical technology tested on the NVIDIA Certified Professional - Agentic AI (NCP-AAI) exam, accounting for an estimated 20-25% of all questions. As agentic AI systems move beyond simple chatbots to complex autonomous agents that access, reason about, and act on vast knowledge bases, mastering RAG architecture, implementation, and optimization is non-negotiable. This comprehensive guide merges pipeline fundamentals, chunking deep-dives, embedding benchmarks, vector database selection, reranking techniques, agentic RAG patterns, and NVIDIA platform integration into a single definitive resource for NCP-AAI exam success.

Start Here

New to NCP-AAI? Start with our Complete NCP-AAI Certification Guide for exam overview, domains, and study paths. Then use our NCP-AAI Cheat Sheet for quick reference and How to Pass NCP-AAI for exam strategies.

Quick Takeaways

RAG = Retrieval + Generation: Combine external knowledge retrieval with LLM generation to ground responses in verifiable sources
20-25% of NCP-AAI exam: RAG appears across multiple domains -- Knowledge Integration, Agent Design, NVIDIA Platform, and Evaluation
5-stage pipeline: Data Ingestion, Chunking, Embedding & Indexing, Retrieval, and Generation
Chunking is king: Single most important factor for RAG quality (30-40% of retrieval performance impact)
Hybrid search: Combining vector + keyword search improves accuracy by 15-25% over pure semantic search
NVIDIA stack: NeMo Retriever provides end-to-end enterprise RAG with NIM microservices for embedding, reranking, and generation
Agentic RAG: The agent decides when, what, and how much to retrieve -- the cutting edge tested on the exam

Preparing for NCP-AAI? Practice with 455+ exam questions

Try Free View Bundle - $19.99

What is RAG and Why It Matters for NCP-AAI

Core Concept

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by dynamically retrieving relevant information from external knowledge bases before generating responses. Instead of relying solely on the model's parametric knowledge (learned during training), RAG systems combine three components:

Retrieval Component: Searches external knowledge sources for relevant context
Augmentation Component: Injects retrieved context into the prompt
Generation Component: LLM produces response using both its knowledge and retrieved context

Without RAG:

User Query --> LLM --> Response (limited to training data, prone to hallucination)

With RAG:

User Query --> Retrieve Relevant Docs --> LLM + Retrieved Context --> Accurate Response + Citations

The Problems RAG Solves

LLMs have several fundamental limitations that RAG addresses:

Knowledge cutoff: Models only know information up to their training date
Hallucinations: Models generate plausible-sounding but incorrect information with high confidence
Domain specificity: General models lack specialized company, industry, or regulatory knowledge
Source attribution: Models cannot cite where their information comes from
Cost of updates: Retraining or fine-tuning for every knowledge change is prohibitively expensive

Why RAG is Critical for Agentic AI

Long-term memory: Agents retrieve from past conversations, experiences, and accumulated knowledge
Grounded responses: Agents cite sources and provide verifiable information for decision transparency
Dynamic knowledge: Agents access up-to-date information without retraining
Domain expertise: Enables agents to operate in specialized domains (healthcare, legal, finance) requiring expert knowledge
Privacy: Keeps proprietary data on-premises rather than embedded in model weights
Provenance: Provides citation and audit trail for agent decisions -- critical for compliance

NCP-AAI Exam Coverage

RAG systems appear prominently across multiple exam domains:

NCP-AAI Exam: RAG Coverage by Domain

Domain	RAG Topics	Exam Weight
Knowledge Integration and Agent Development	RAG pipelines, document processing, chunking strategies, embedding models	15%
Agent Design and Cognition	Memory systems, semantic search, knowledge retrieval, agentic RAG	15%
NVIDIA Platform Implementation	Vector databases, NV-Embed, NeMo Retriever, NVIDIA NIM integration	13%
Evaluation and Monitoring	Retrieval quality metrics (NDCG, MRR), relevance scoring, faithfulness	5%

Estimated RAG-Related Questions: 12-18 out of 60-70 total questions (20-25%)

RAG Pipeline Architecture: The 5 Stages

A production RAG pipeline consists of five distinct stages. The NCP-AAI exam tests each stage in depth, including component trade-offs, NVIDIA-specific tooling, and optimization strategies.

Stage Overview

+-------------------------------------------------------------+
|                   1. DATA INGESTION                         |
|  Documents (PDF, SQL, APIs) --> Load --> Parse --> Clean     |
+-------------------------------------------------------------+
                            |
                            v
+-------------------------------------------------------------+
|                   2. CHUNKING                               |
|  Full Documents --> Split --> Chunks (with overlap)          |
|  Strategy: Semantic / Fixed-size / Document-based / Agentic |
+-------------------------------------------------------------+
                            |
                            v
+-------------------------------------------------------------+
|                   3. EMBEDDING & INDEXING                    |
|  Chunks --> Embedding Model --> Vectors --> Vector Database  |
|  (e.g., NV-Embed-v2, Llama-Embed-Nemotron-8B)              |
+-------------------------------------------------------------+
                            |
                            v
+-------------------------------------------------------------+
|                   4. RETRIEVAL (Query-Time)                  |
|  User Query --> Embed Query --> Search Vector DB --> Top-K   |
|  Optional: Reranking, Hybrid Search, Multi-hop              |
+-------------------------------------------------------------+
                            |
                            v
+-------------------------------------------------------------+
|                   5. GENERATION                              |
|  Query + Retrieved Chunks --> LLM --> Final Response         |
|  Prompt Engineering: "Use only provided context..."          |
+-------------------------------------------------------------+

Advanced RAG Pipeline (2025-2026 Best Practices)

Production systems add several stages to the basic pipeline:

User Query --> Query Transformation (rewrite, expand, decompose)
            |
            v
Hybrid Retrieval (Vector + Keyword + Knowledge Graph)
            |
            v
Reranking (Cross-encoder reorders by relevance score)
            |
            v
Context Compression (Remove irrelevant parts, extract key sentences)
            |
            v
Multi-hop Reasoning (Follow-up retrieval if needed)
            |
            v
Response Generation (with citations and source attribution)
            |
            v
Guardrails & Validation (check for hallucinations, enforce grounding)

Stage 1: Document Processing and Ingestion

Data Source Types

Structured Data:

SQL Databases (PostgreSQL, MySQL, Oracle)
NoSQL (MongoDB, Cassandra, DynamoDB)
Data Warehouses (Snowflake, BigQuery, Redshift)

Unstructured Data:

Documents (PDF, DOCX, TXT, Markdown)
Web Content (HTML pages, wikis, documentation sites)
Code Repositories (GitHub, GitLab, Bitbucket)

Semi-Structured Data:

APIs (REST, GraphQL, gRPC)
Messaging (Slack, Discord, email archives)
Collaboration Tools (Notion, Confluence, SharePoint)

Document Parsing Best Practices

Challenge: Extract clean text from complex documents (PDFs with tables, images, multi-column layouts)

# Basic PDF parsing with LlamaIndex
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader(
    input_dir="./docs",
    required_exts=[".pdf", ".docx", ".txt"]
).load_data()

# Advanced: Parse tables and images from complex PDFs
from llama_index.readers.file import PyMuPDFReader

reader = PyMuPDFReader()
documents = reader.load_data(file_path="complex_report.pdf")

# LangChain document loading
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("technical_documentation.pdf")
documents = loader.load()

NCP-AAI Exam Tip: Know which parser to use for different document types:

PDFs with tables: PyMuPDF or Unstructured
HTML/Web: BeautifulSoup or Trafilatura
Code files: Tree-sitter (preserves syntax structure)
Images/Scans: OCR (Tesseract, AWS Textract) then text extraction
Multimodal PDFs: NVIDIA NeMo Retriever OCR (15x throughput improvement over open-source alternatives)

Metadata Extraction

Capturing metadata during ingestion is critical for downstream filtering and retrieval precision. Without metadata, every query must rely entirely on semantic similarity, which misses important contextual signals.

Essential metadata fields:

Source: File path, URL, database table, API endpoint
Date: Created, modified, publication date (enables temporal filtering)
Author: Document creator or contributor (enables authority weighting)
Category: Department, topic, document type (enables scoped search)
Access level: Public, internal, confidential (enables security filtering)
Version: Document version number (enables latest-version preference)
Language: Document language (enables multilingual filtering)

Why metadata matters for RAG quality:

Metadata enables query-time filtering that dramatically improves precision. For example, "Find only documents from Q4 2025" or "Search only engineering team documentation" reduces the search space before vector similarity is even computed. This is faster and more precise than relying on embeddings alone.

Implementation pattern:

# Attach metadata during ingestion
for doc in documents:
    doc.metadata["source"] = doc.file_path
    doc.metadata["department"] = classify_department(doc)
    doc.metadata["date"] = extract_date(doc)
    doc.metadata["access_level"] = determine_access(doc)

# Use metadata filtering at query time
retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 5,
        "filter": {"department": "engineering", "date": {"$gte": "2025-01-01"}}
    }
)

NCP-AAI Exam Tip: The exam tests whether you understand that metadata filtering is complementary to vector search, not a replacement. Best practice is to use metadata to narrow the search space, then vector similarity to rank within that space.

Stage 2: Chunking Strategies (Most Critical Decision)

Why Chunking Matters

Chunking is the number-one factor impacting RAG performance, responsible for 30-40% of retrieval quality.

Exam Trap

The NCP-AAI exam frequently tests chunking trade-offs. Too-large chunks lose vector specificity and retrieve irrelevant context. Too-small chunks lose context and provide incomplete information to the LLM. The correct answer is never "always use the smallest/largest chunks" -- it depends on the document type and retrieval requirements.

Chunking Strategy #1: Fixed-Size Chunking (Baseline)

Description: Split text into chunks of fixed token or character count with configurable overlap.

Best for: General-purpose RAG, when documents lack clear structure, mixed content types.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,       # ~512 tokens (good for most embeddings)
    chunk_overlap=50,     # 10% overlap to preserve context at boundaries
    length_function=len,
    separators=["\n\n", "\n", " ", ""]  # Respect paragraph boundaries
)

chunks = splitter.split_documents(documents)

Pros:

Simple, fast, predictable chunk sizes
Works with any document type
Easy to optimize (tune size and overlap)

Cons:

May break sentences or concepts mid-thought
Does not respect document structure (headings, sections)

Performance: Baseline (1.0x retrieval quality)

NCP-AAI Exam Tip: Fixed-size chunking is the most common baseline approach. Know when it is sufficient and when to upgrade.

Chunking Strategy #2: Semantic Chunking (State of the Art)

Description: Dynamically split based on semantic coherence using embeddings to detect topic boundaries.

Best for: High-quality RAG where context preservation is critical, structured documents (articles, reports, manuals).

from langchain_experimental.text_splitter import SemanticChunker
from langchain.embeddings import OpenAIEmbeddings

# Uses embedding similarity to detect topic boundaries
semantic_splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",  # or "standard_deviation"
    breakpoint_threshold_amount=85  # Split at 85th percentile similarity drop
)

chunks = semantic_splitter.split_documents(documents)

How it works:

Embed each sentence individually
Calculate cosine similarity between consecutive sentence embeddings
Split when similarity drops significantly (topic change detected)

Pros:

Preserves semantic coherence (each chunk discusses one topic)
15-25% better retrieval quality than fixed-size

Cons:

Slower (requires embedding every sentence during ingestion)
Variable chunk sizes (may exceed context window)

Performance: 1.2-1.3x retrieval quality vs. fixed-size

Chunking Strategy #3: Document-Based Chunking

Description: Split based on document structure -- headings, sections, paragraphs, or code functions.

Best for: Structured documents (Markdown, HTML, code files) where the author imposed meaningful boundaries.

from langchain.text_splitter import MarkdownHeaderTextSplitter

# Split Markdown by headers (preserves hierarchy)
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

chunks = markdown_splitter.split_text(markdown_document)
# Each chunk includes header hierarchy as metadata
# Example: {"Header 1": "NCP-AAI Guide", "Header 2": "RAG Systems"}

Pros:

Respects author's intended structure
Metadata enrichment from section titles and hierarchy
Natural chunk boundaries

Cons:

Only works for well-structured documents
Chunk size highly variable (a section could be 50 or 5,000 tokens)

Performance: 1.15-1.25x retrieval quality (when structure is meaningful)

Chunking Strategy #4: Agentic Chunking (Emerging)

Description: Use an LLM to intelligently determine chunk boundaries based on semantic understanding.

Best for: Complex documents requiring human-like comprehension (legal contracts, technical narratives, mixed-format content).

from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# LLM determines optimal chunk boundaries
agentic_prompt = PromptTemplate(
    template="""Analyze this text and determine logical chunk boundaries
    where topics change. Mark boundaries with [SPLIT].

    Text: {text}

    Output the text with [SPLIT] markers:"""
)

agentic_chunker = LLMChain(llm=llm, prompt=agentic_prompt)
marked_text = agentic_chunker.run(document.page_content)
chunks = marked_text.split("[SPLIT]")

Pros:

Highest semantic quality (simulates human chunking decisions)
Handles complex documents (legal, technical, narrative) that lack clear structure

Cons:

Expensive (LLM call per document)
Slow (not suitable for real-time ingestion of large corpora)

Performance: 1.3-1.4x retrieval quality (highest, but costly)

Chunking Strategy #5: Hierarchical Chunking

Description: Create parent-child chunk relationships where summaries serve as parents and detailed sections as children.

Best for: Technical documentation, long-form content, multi-level retrieval.

Example: Chapter summary (parent) links to section details (children). Retrieve the summary first; drill down to children if the agent needs more detail.

Pros:

Enables multi-level retrieval (overview first, detail on demand)
Excellent for iterative agentic retrieval

Cons:

Complex to implement, higher storage overhead
Requires careful parent-child linking

Chunking Strategy #6: Sliding Window Chunking

Description: Overlapping chunks with configurable stride (e.g., 512 tokens with 128-token overlap).

Best for: Precision-critical applications where context loss at boundaries is unacceptable.

Total chunks from a document:

num_chunks = ceil((doc_length - chunk_overlap) / (chunk_size - chunk_overlap))

Storage overhead from overlap:

overhead_ratio = chunk_size / (chunk_size - chunk_overlap)

Example: 512-token chunks with 128-token overlap:

overhead_ratio = 512 / (512 - 128) = 512 / 384 = 1.33x (33% more storage)
A 10,000-token document produces ceil((10000 - 128) / (512 - 128)) = ceil(9872 / 384) = 26 chunks

Overlap sweet spot: 10-20% overlap balances context preservation against storage cost. Above 25% overlap, diminishing returns on retrieval quality with significant storage increase.

NCP-AAI Exam: Chunking Performance Matrix

NCP-AAI Exam: Chunking Strategy Performance Matrix

Strategy	Retrieval Quality	Speed	Cost	Best Use Case
Fixed-size	1.0x (baseline)	Fastest	Lowest	General text, mixed content, prototyping
Semantic	1.2-1.3x	Slow (embedding per sentence)	Medium	Technical docs, reports, high-quality RAG
Document-based	1.15-1.25x	Fast	Low	Structured documents (Markdown, HTML, code)
Agentic	1.3-1.4x	Slowest (LLM per doc)	Highest	Legal, contracts, complex narratives
Hierarchical	1.2-1.3x	Medium	Medium-High	Long technical docs, multi-level retrieval
Sliding Window	1.1-1.15x	Fast	Low-Medium (33% overhead)	Precision-critical, legal citations

Optimal Chunk Size by Content Type

Content Type	Recommended Chunk Size	Overlap	Rationale
General text	512 tokens	50 tokens (10%)	Balanced specificity and context
Technical docs	300-500 tokens	50-100 tokens	Preserves complete technical concepts
Code documentation	200-400 tokens	25-50 tokens	Complete functions and classes
Legal/compliance	400-600 tokens	100-150 tokens	Maintains regulatory context and clauses
Chat/FAQ	100-200 tokens	0-25 tokens	Short, self-contained Q&A pairs
Research papers	400-800 tokens	100-200 tokens	Preserves arguments and citations
Code repositories	50-200 lines (by function)	10 lines	Preserves function boundaries

NCP-AAI Exam Strategy: Be able to recommend both the chunking strategy and chunk size based on use case requirements. The exam presents scenarios with specific document types and asks you to choose.

Chunk Enrichment Techniques

Beyond basic chunking, enrichment techniques add context that improves retrieval quality:

1. Contextual Headers: Prepend section headers and document titles to each chunk so the embedding captures the broader context:

def enrich_chunk_with_headers(chunk, doc_title, section_title):
    """Add contextual headers to improve embedding quality."""
    enriched_text = f"Document: {doc_title}\nSection: {section_title}\n\n{chunk.text}"
    chunk.text = enriched_text
    return chunk

2. Summary Augmentation: Generate a brief summary of each chunk and prepend it. The summary helps the embedding capture the main topic even when the chunk contains highly specific details.

3. Question Generation: Generate hypothetical questions that each chunk could answer, and store them as metadata. At query time, match user questions against these generated questions for better retrieval.

def generate_chunk_questions(chunk, llm):
    """Generate hypothetical questions this chunk answers."""
    prompt = f"Generate 3 questions that the following text answers:\n\n{chunk.text}"
    questions = llm.generate(prompt)
    chunk.metadata["generated_questions"] = questions
    return chunk

4. Entity Tagging: Extract named entities (people, products, organizations, dates) and store them as metadata for hybrid filtering.

These enrichment techniques add 10-20% processing time during ingestion but can improve retrieval quality by 10-15%, particularly for ambiguous queries.

Stage 3: Embedding Models and Indexing

Embedding Model Selection (2025-2026)

Embedding models convert text chunks into dense vector representations that capture semantic meaning. The choice of embedding model directly impacts retrieval quality, latency, and cost.

Top Embedding Models for NCP-AAI:

Model	Dimensions	MTEB Score	Cost	Best For
NV-Embed-v2	4096	72.31 (MTEB #1, Aug 2024)	Medium	NVIDIA ecosystem, highest quality
Llama-Embed-Nemotron-8B	4096	69.46 (MMTEB #1, Oct 2025)	Medium	Multilingual, cross-lingual tasks
text-embedding-3-large	3072	64.6	Low	General-purpose, OpenAI ecosystem
text-embedding-3-small	1536	62.3	Very Low	Budget, speed-critical
Cohere embed-v3	1024	64.5	Medium	Multilingual
BGE-large-en-v1.5	1024	63.9	Free	Open-source, self-hosted

NV-Embed-v2 achieved the number-one position on the Massive Text Embedding Benchmark (MTEB) with a score of 72.31 across 56 text embedding tasks. It also holds the top position in the retrieval sub-category with a score of 62.65 across 15 tasks. The model uses a novel architecture where the LLM attends to latent vectors for improved pooled embedding output, combined with a two-staged instruction tuning method and hard-negative mining.

Llama-Embed-Nemotron-8B is NVIDIA's newer multilingual embedding model that ranked first on the Multilingual MTEB (MMTEB) leaderboard. It demonstrates superior performance across retrieval, classification, and semantic textual similarity tasks, excelling in challenging multilingual scenarios including low-resource languages and cross-lingual setups.

Key Concept

Higher embedding dimensions do not always mean better performance. The NCP-AAI exam tests whether you understand that latency, storage cost, and diminishing returns above 1024 dimensions must be weighed against marginal quality improvements. 4096-dim embeddings are approximately 2.5x slower to search and require 4x more vector DB storage than 1024-dim embeddings.

Similarity Metrics

Cosine Similarity measures the angle between two vectors, producing a value between -1 and 1:

                    A . B           sum(A_i * B_i)
cos(theta) = --------------- = -------------------------
              ||A|| * ||B||    sqrt(sum(A_i^2)) * sqrt(sum(B_i^2))

Where:

A . B is the dot product of vectors A and B
||A|| and ||B|| are the L2 norms (magnitudes) of each vector
Result range: -1 (opposite) to 1 (identical direction)
For normalized embeddings, cosine similarity equals the dot product

When to use: Text embeddings (most common). Insensitive to vector magnitude -- focuses on semantic direction.

Euclidean Distance (L2):

d(A, B) = sqrt(sum((A_i - B_i)^2))

When to use: When magnitude matters (e.g., image embeddings). Smaller distance = more similar.

Dot Product:

A . B = sum(A_i * B_i)

When to use: Pre-normalized embeddings (faster than cosine -- no normalization step).

NCP-AAI Exam Tip: Know which metric to use for different embedding types. Cosine similarity is the default for text; dot product for pre-normalized vectors; Euclidean for when scale matters.

NVIDIA NIM Embedding Integration

For NVIDIA-ecosystem RAG deployments, you can use NIM to serve embedding models with optimized throughput:

import requests
import json

# Using NVIDIA NIM embedding endpoint
NIM_EMBEDDING_URL = "http://localhost:8000/v1/embeddings"

def embed_with_nim(texts, model="nvidia/nv-embed-v2"):
    """Generate embeddings using NVIDIA NIM embedding service."""
    response = requests.post(
        NIM_EMBEDDING_URL,
        json={
            "input": texts,
            "model": model,
            "encoding_format": "float"
        }
    )
    result = response.json()
    return [item["embedding"] for item in result["data"]]

# Batch embed chunks for indexing
chunk_texts = [chunk.page_content for chunk in chunks]
embeddings = embed_with_nim(chunk_texts)

# NIM handles automatic batching and GPU optimization
# 3x better throughput than open-source embedding servers

NIM embedding advantages over self-hosted:

Automatic request batching for optimal GPU utilization
TensorRT-optimized model serving (3-5x lower latency)
Production-ready health checks, metrics, and error handling
Simple REST API compatible with LangChain and LlamaIndex integrations

Indexing Code Examples

# LlamaIndex with Pinecone
from llama_index.core import VectorStoreIndex, Document
from llama_index.embeddings import OpenAIEmbeddings
from llama_index.vector_stores import PineconeVectorStore
import pinecone

pinecone.init(api_key="your-key", environment="us-west1-gcp")
pinecone_index = pinecone.Index("ncp-aai-docs")
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)

embed_model = OpenAIEmbeddings(model="text-embedding-3-large")
index = VectorStoreIndex.from_documents(
    documents,
    vector_store=vector_store,
    embed_model=embed_model
)

index.storage_context.persist(persist_dir="./storage")

# LangChain with Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

Stage 4: Retrieval (Query-Time)

Retrieval Method #1: Semantic Search (Baseline)

How it works: Embed the user query, find K nearest neighbor vectors in the database.

# Simple semantic search
query_engine = index.as_query_engine(
    similarity_top_k=5  # Retrieve top 5 most similar chunks
)
response = query_engine.query("What is the NCP-AAI exam structure?")

Pros: Fast, works well for most queries Cons: May miss exact keyword matches (product names, codes, acronyms) Performance: Baseline (1.0x)

Retrieval Method #2: Hybrid Search (State of the Art)

How it works: Combine vector similarity search with keyword search (BM25) using Reciprocal Rank Fusion (RRF) to merge results.

from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.retrievers import BM25Retriever
from llama_index.core.retrievers import QueryFusionRetriever

# Vector retriever (semantic)
vector_retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10
)

# Keyword retriever (BM25)
bm25_retriever = BM25Retriever.from_defaults(
    docstore=index.docstore,
    similarity_top_k=10
)

# Fusion retriever (combines both with Reciprocal Rank Fusion)
hybrid_retriever = QueryFusionRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    similarity_top_k=5,
    mode="reciprocal_rerank"  # RRF fusion algorithm
)

query_engine = RetrieverQueryEngine(retriever=hybrid_retriever)
response = query_engine.query("NCP-AAI exam Domain 2 percentage")

Why hybrid is better: Semantic search catches conceptual matches ("What are the certification requirements?") while keyword search catches exact terms ("NCP-AAI" or "Domain 2"). Together they achieve 15-25% better recall than either alone.

Reciprocal Rank Fusion (RRF) explained:

RRF merges ranked lists from multiple retrievers by assigning each document a fused score based on its rank in each list:

RRF_score(doc) = sum over all retrievers: 1 / (k + rank_in_retriever)

Where k is a constant (typically 60) that controls how much to penalize low-ranked results. Documents are then sorted by their fused RRF score.

Example: A document ranked #2 in vector search and #5 in keyword search:

Vector contribution: 1/(60+2) = 0.0161
Keyword contribution: 1/(60+5) = 0.0154
RRF score: 0.0315

A document ranked #1 in vector search but not in keyword results at all:

Vector contribution: 1/(60+1) = 0.0164
Keyword contribution: 0
RRF score: 0.0164

The first document scores higher because it appears in both result sets, which is the key insight of RRF -- documents that are relevant by multiple criteria are ranked higher.

Performance: 1.2-1.3x retrieval quality

Retrieval Method #3: Reranking (Essential for High Quality)

How it works: Retrieve a larger candidate set (20-50), then rerank with a cross-encoder model that scores query-document relevance more accurately.

Two-stage retrieval pipeline:

Stage 1 (Fast, ~50ms): Bi-encoder vector search --> Top 20-50 candidates
Stage 2 (Accurate, ~200ms): Cross-encoder reranking --> Top 3-5 for context

# LlamaIndex reranking with Cohere
from llama_index.postprocessor import CohereRerank

retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=20  # Over-retrieve candidates
)

reranker = CohereRerank(
    api_key="your-cohere-key",
    top_n=5  # Return top 5 after reranking
)

query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[reranker]
)

response = query_engine.query("Explain NVIDIA NIM deployment for RAG")

Reranker Models:

Cohere Rerank-3: Managed API, easy integration, strong multilingual support
BGE-reranker-v2: Open source, self-hosted option
NVIDIA NeMo Reranker (Nemotron Reranking NIM): Optimized for NVIDIA infrastructure, 1.6x throughput vs. open-source alternatives

NVIDIA NIM Reranking Example:

import requests

NIM_RERANKER_URL = "http://localhost:8001/v1/ranking"

def rerank_with_nim(query, documents, top_n=5):
    """Rerank documents using NVIDIA NIM reranking service."""
    response = requests.post(
        NIM_RERANKER_URL,
        json={
            "model": "nvidia/nv-rerankqa-mistral-4b-v3",
            "query": {"text": query},
            "passages": [{"text": doc} for doc in documents],
            "top_n": top_n
        }
    )
    result = response.json()
    # Returns documents sorted by relevance score
    return [(r["index"], r["logit"]) for r in result["rankings"]]

# Retrieve 20 candidates with vector search
candidates = vector_search(query, top_k=20)

# Rerank to top 5 with cross-encoder
reranked = rerank_with_nim(
    query="How does NVIDIA NIM optimize RAG latency?",
    documents=[c.text for c in candidates],
    top_n=5
)

# Use top 5 reranked documents as context for generation
context = [candidates[idx].text for idx, score in reranked]

Cross-encoder vs. bi-encoder explained:

A bi-encoder (used in Stage 1) encodes query and document independently, then compares embeddings. This is fast because documents can be pre-embedded, but it misses fine-grained query-document interactions.

A cross-encoder (used in reranking) processes query and document together as a single input, allowing attention between query and document tokens. This captures richer interactions but is slower because it cannot pre-compute document representations.

The two-stage approach combines the speed of bi-encoders (narrow from millions to 20-50 candidates) with the accuracy of cross-encoders (rerank the small candidate set).

Pros: 20-30% better precision (fewer irrelevant results in final context) Cons: Adds 150-250ms latency, additional cost per query

Performance: 1.3-1.4x retrieval quality

NCP-AAI Tip: Know when reranking justifies the latency cost. Precision-critical tasks (legal, medical, customer support) benefit most; low-latency chat may not.

Retrieval Decision Matrix

Use Case	Recommended Method	top_k	Rationale
General Q&A	Hybrid search	5	Balance speed and quality
Exact match critical	Hybrid + reranking	3	Legal docs, product codes
Low latency required	Semantic search only	3-5	Real-time chat applications
High precision needed	Hybrid + reranking + compression	3	Customer support, medical
Multi-hop reasoning	Agentic RAG (iterative)	5 per hop	Complex research tasks

Stage 5: Response Generation

Prompt Engineering for RAG

Basic RAG Prompt:

rag_prompt_template = """Use the following context to answer the question.
If the answer is not in the context, say "I don't have enough information."

Context:
{context}

Question: {question}

Answer:"""

Advanced RAG Prompt (with citations and grounding):

advanced_rag_prompt = """You are an expert assistant. Answer the question
using ONLY the provided context.

Context:
{context}

Instructions:
1. Answer based solely on the context above
2. If the context doesn't contain the answer, respond:
   "The provided documents don't contain this information."
3. Cite sources using [Source X] notation
4. If context is ambiguous, acknowledge uncertainty
5. Never extrapolate beyond what the sources state

Question: {question}

Answer (with citations):"""

Handling Hallucinations

Exam Trap

A common exam mistake is assuming that RAG eliminates hallucinations entirely. RAG reduces hallucinations but does not prevent them. The exam tests whether you know that explicit grounding instructions, constrained decoding, and post-hoc verification are still necessary even with RAG.

Problem: LLM generates plausible-sounding but false information despite having retrieved context.

Solutions:

# 1. Require minimum similarity threshold
from llama_index.core.postprocessor import SimilarityPostprocessor
similarity_filter = SimilarityPostprocessor(similarity_cutoff=0.7)

# 2. Use structured output for citations
from pydantic import BaseModel, Field
from typing import List

class RAGResponse(BaseModel):
    answer: str = Field(description="Answer to the question")
    sources: List[str] = Field(description="List of source document IDs used")
    confidence: float = Field(description="Confidence score 0-1")

# 3. Implement NeMo Guardrails
from nemoguardrails import LLMRails

rails_config = """
define flow check_hallucination:
  if bot response not grounded in context:
    bot say "I don't have reliable information on this."
"""
rails = LLMRails.from_config(rails_config)
response = rails.generate(messages=[{"role": "user", "content": query}])

Additional anti-hallucination strategies:

Constrained decoding: Enforce extractive answers (no generalization)
Confidence thresholds: Return "I don't know" if retrieved context similarity is below threshold
Post-hoc verification: Check answer entailment against retrieved context
Smaller, instruction-tuned models: Less prone to "creative" generation beyond context

LangChain RAG Implementation (End-to-End)

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader

# 1. Load documents
loader = PyPDFLoader("nvidia_documentation.pdf")
documents = loader.load()

# 2. Chunk
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512, chunk_overlap=50
)
chunks = splitter.split_documents(documents)

# 3. Embed and index
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectorstore = Chroma.from_documents(chunks, embeddings)

# 4. Create retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}
)

# 5. Build RAG chain
llm = ChatOpenAI(model="gpt-4", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # "stuff" = inject all chunks into prompt
    retriever=retriever,
    return_source_documents=True
)

# 6. Query
result = qa_chain({"query": "What is NVIDIA NIM?"})
print(result["result"])
print(result["source_documents"])

Advanced RAG Patterns for the NCP-AAI Exam

Pattern 1: Agentic RAG

Key Concept

Agentic RAG is the evolution beyond traditional RAG. Instead of always retrieving, the agent autonomously decides when, what, and how much to retrieve based on query analysis and confidence assessment. This is a high-priority topic for the NCP-AAI exam.

Key capabilities of Agentic RAG:

Adaptive Retrieval: Agent decides WHEN to retrieve (not every query needs retrieval)
Multi-hop Reasoning: Agent retrieves, analyzes, then retrieves again based on findings
Query Decomposition: Agent breaks complex queries into subqueries for parallel retrieval
Self-Correction: Agent evaluates retrieval quality and re-retrieves if results are insufficient

Decision Framework:

def agentic_retrieval(query, agent, knowledge_base):
    # Step 1: Does this query require external knowledge?
    if agent.can_answer_from_parametric_knowledge(query):
        return agent.generate(query)  # Skip retrieval

    # Step 2: Retrieve
    context = knowledge_base.retrieve(query, top_k=5)

    # Step 3: Evaluate retrieval quality
    if agent.evaluate_relevance(query, context) < 0.7:
        # Reformulate and re-retrieve
        refined_query = agent.reformulate_query(query, context)
        context = knowledge_base.retrieve(refined_query, top_k=10)

    # Step 4: Check sufficiency
    if agent.is_information_sufficient(query, context):
        return agent.generate(query, context)
    else:
        # Multi-hop: identify knowledge gaps and retrieve more
        gaps = agent.identify_gaps(query, context)
        for gap in gaps:
            additional = knowledge_base.retrieve(gap, top_k=3)
            context.extend(additional)
        return agent.generate(query, context)

NVIDIA implementation with LangGraph and ReAct agents:

from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool

# Turn query engine into a tool for the agent
query_tool = QueryEngineTool.from_defaults(
    query_engine=query_engine,
    name="knowledge_base",
    description="Search company knowledge base for factual information"
)

# Agent can retrieve multiple times, reason about results
agent = ReActAgent.from_tools([query_tool], llm=llm, verbose=True)

# Multi-hop query: agent retrieves NCP-AAI info, then AWS info, then compares
response = agent.chat("Compare NCP-AAI and AWS AI Practitioner exam formats")

LangGraph Agentic RAG with Router:

from langgraph.graph import StateGraph, END
from typing import TypedDict, List

class RAGState(TypedDict):
    query: str
    retrieved_docs: List[str]
    relevance_score: float
    response: str
    needs_retrieval: bool

def route_query(state: RAGState) -> str:
    """Router node: decide if retrieval is needed."""
    if requires_factual_knowledge(state["query"]):
        return "retrieve"
    return "generate_direct"

def retrieve(state: RAGState) -> RAGState:
    """Retrieval node: search knowledge base."""
    docs = knowledge_base.search(state["query"], top_k=5)
    state["retrieved_docs"] = docs
    return state

def grade_documents(state: RAGState) -> str:
    """Grader node: evaluate retrieval quality."""
    score = evaluate_relevance(state["query"], state["retrieved_docs"])
    state["relevance_score"] = score
    if score < 0.7:
        return "rewrite_query"  # Poor results, try again
    return "generate"  # Good results, proceed

def rewrite_query(state: RAGState) -> RAGState:
    """Rewriter node: reformulate query for better retrieval."""
    state["query"] = llm.rewrite(state["query"], state["retrieved_docs"])
    return state

# Build the agentic RAG graph
workflow = StateGraph(RAGState)
workflow.add_node("route", route_query)
workflow.add_node("retrieve", retrieve)
workflow.add_node("grade", grade_documents)
workflow.add_node("rewrite", rewrite_query)
workflow.add_node("generate", generate_response)

workflow.add_edge("route", "retrieve")
workflow.add_edge("retrieve", "grade")
workflow.add_conditional_edges("grade", grade_documents,
    {"rewrite_query": "rewrite", "generate": "generate"})
workflow.add_edge("rewrite", "retrieve")  # Loop back for re-retrieval
workflow.add_edge("generate", END)

This LangGraph pattern implements the full agentic RAG loop: route, retrieve, grade, optionally rewrite and re-retrieve, then generate. The exam tests whether you understand each node's role and when re-retrieval is triggered.

Key agentic RAG patterns tested on NCP-AAI:

Router pattern: Agent classifies query type and routes to specialized retrieval strategies (factual vs. conceptual vs. navigational)
Grader pattern: Agent evaluates retrieved document relevance before passing to generation
Hallucination checker: Agent verifies generated response is grounded in retrieved context
Query decomposition: Agent breaks complex query into simpler subqueries, retrieves for each, and synthesizes

Pattern 2: Graph RAG

Graph RAG combines vector embeddings with knowledge graphs to capture entity relationships that pure vector similarity cannot represent.

How Graph RAG works:

Entity Extraction: Extract entities (people, products, concepts) from documents using NER or LLM extraction
Relationship Mapping: Identify and store relationships between entities (e.g., "reports to", "depends on", "is part of")
Graph Construction: Build a knowledge graph where nodes are entities and edges are relationships
Hybrid Query: For each user query, perform both vector search (for semantic context) and graph traversal (for relational context)
Merged Context: Combine vector-retrieved chunks with graph-traversed relationship data before generation

When Graph RAG outperforms standard RAG:

Relational queries: "Who reports to the VP of Engineering who manages the NIM team?" requires traversing organizational relationships
Multi-entity queries: "What products use the same GPU as the NIM embedding service?" requires entity linking
Causal chains: "What caused the outage in production RAG service last week?" requires traversing incident-to-root-cause relationships
Compliance queries: "Which regulations apply to our healthcare RAG deployment?" requires mapping regulatory entity relationships

When standard RAG is sufficient:

Simple factual lookups ("What is the NCP-AAI exam duration?")
Conceptual questions ("Explain the difference between RAG and fine-tuning")
Documentation search ("How do I deploy NIM?")

Trade-offs:

Setup cost: Significantly higher -- requires entity extraction pipeline and graph database (Neo4j, Amazon Neptune)
Maintenance: Graph must be updated when relationships change
Query latency: Graph traversal adds 100-500ms depending on depth
Accuracy: For relational queries, Graph RAG can achieve 30-50% better accuracy than vector-only RAG

NCP-AAI Exam Tip: The exam tests whether you can identify when Graph RAG is necessary vs. when standard vector RAG suffices. If the question describes relational or multi-entity queries, Graph RAG is likely the answer. For simple factual retrieval, standard RAG is sufficient and less complex.

Pattern 3: Hybrid RAG

Blends multiple retrieval strategies with fusion:

Semantic search (vector similarity) for conceptual queries
Keyword search (BM25, TF-IDF) for exact term matching
Metadata filtering (date, author, category) for scoped queries
Knowledge graph traversal for relational queries
Reciprocal Rank Fusion to merge results from all sources

Pattern 4: Modular RAG

Separates retriever, reranker, and generator into independently deployable components:

Swap components without full system redesign (e.g., upgrade reranker without touching retriever)
A/B test different retrieval strategies side by side
Optimize each component independently for latency, accuracy, and cost

Advanced Retrieval Techniques

Query Transformation (HyDE -- Hypothetical Document Embeddings):

from llama_index.core.indices.query.query_transform import HyDEQueryTransform

# Generate a hypothetical answer, embed THAT, retrieve similar docs
hyde = HyDEQueryTransform(include_original=True)
query_engine = TransformQueryEngine(base_query_engine, query_transform=hyde)

# Original query: "NCP-AAI exam difficulty"
# HyDE generates: "The NCP-AAI exam is moderately difficult, requiring..."
# Embeds the hypothetical answer (closer to documents than the question)

Why HyDE works: Documents are semantically closer to answers than to questions. By embedding a hypothetical answer, the search finds more relevant documents.

Multi-Query RAG:

from langchain.retrievers.multi_query import MultiQueryRetriever

# Generate multiple query variations to improve retrieval coverage
retriever = MultiQueryRetriever.from_llm(
    retriever=base_retriever,
    llm=llm
)
# Single user query generates 3-5 variations, each retrieves independently
# Results are merged and deduplicated

Context Compression:

from llama_index.core.postprocessor import LongContextReorder

# Reorder chunks: most relevant at edges (beginning/end), less relevant in middle
# Addresses "lost in the middle" phenomenon where LLMs attend poorly to mid-context
reorder = LongContextReorder()

query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[reorder]
)

Multi-Agent RAG Orchestration

Pattern A: Retrieval Specialist Agent

Dedicated agent manages all retrieval operations
Other agents request knowledge via API
Centralized caching and optimization

Pattern B: Parallel Retrieval

Multiple agents retrieve from different sources simultaneously
Coordinator aggregates and deduplicates results
Faster for multi-source queries

Pattern C: Iterative Refinement

Agent retrieves, analyzes, identifies gaps, retrieves again
Continues until sufficient information gathered
Common in research and analysis agents

Self-Reflective RAG

The agent evaluates its own retrieval quality before generating:

Evaluation questions the agent asks itself:

Is the retrieved context relevant to my query?
Is the information sufficient to answer completely?
Are there contradictions in retrieved documents?
Do I need additional retrieval?

Actions based on reflection:

Irrelevant: Reformulate query and re-retrieve with different keywords or broader scope
Insufficient: Expand search (increase top_k, broaden query terms, search additional knowledge bases)
Contradictory: Retrieve authoritative sources to resolve conflicts, prioritize by recency and source authority
Sufficient: Proceed to generation with confidence

Self-reflective RAG implementation pattern:

def self_reflective_rag(query, knowledge_base, llm, max_attempts=3):
    """RAG with self-reflection loop for quality assurance."""
    for attempt in range(max_attempts):
        # Retrieve
        docs = knowledge_base.search(query, top_k=5)

        # Self-reflect: evaluate retrieval quality
        reflection = llm.evaluate(
            f"Are these documents relevant and sufficient to answer: '{query}'?\n"
            f"Documents: {docs}\n"
            f"Rate relevance 0-1 and explain gaps:"
        )

        if reflection.relevance_score >= 0.7:
            # Quality sufficient, generate response
            return llm.generate(query, context=docs)
        else:
            # Quality insufficient, reformulate query based on reflection
            query = llm.reformulate(query, reflection.gaps, docs)

    # Max attempts reached, generate with best available context
    return llm.generate(query, context=docs, disclaimer=True)

This pattern ensures the agent does not generate responses from poor-quality context. The exam tests whether you understand that self-reflection adds latency (each reflection loop is an additional LLM call) but significantly improves response quality for complex queries where initial retrieval may miss the mark.

RAG Evaluation and Metrics

Retrieval Quality Metrics

NDCG measures ranking quality by weighting relevant results higher when they appear at top positions. It uses graded relevance (not just binary relevant/not-relevant).

Discounted Cumulative Gain:

            K
DCG@K = sum     (2^rel_i - 1) / log2(i + 1)
           i=1

Normalized DCG:

NDCG@K = DCG@K / IDCG@K

Where:

rel_i = graded relevance of the document at position i (e.g., 0, 1, 2, 3)
IDCG@K = Ideal DCG (the DCG if documents were perfectly ranked by relevance)
Result range: 0.0 to 1.0 (1.0 = perfect ranking)

Example: For a query returning 5 documents with relevance grades [3, 2, 0, 1, 3]:

DCG@5 = (2^3-1)/log2(2) + (2^2-1)/log2(3) + (2^0-1)/log2(4) + (2^1-1)/log2(5) + (2^3-1)/log2(6)
DCG@5 = 7/1 + 3/1.585 + 0/2 + 1/2.322 + 7/2.585 = 7 + 1.893 + 0 + 0.431 + 2.708 = 12.032

Target: NDCG@10 > 0.7 for high-quality production RAG systems.

When to use: When relevance is graded (not just binary) and ranking order matters. The standard metric for search and retrieval evaluation.

Metric	Warning	Critical	Action
Avg similarity score	<0.65	<0.55	Investigate query patterns, reindex
Empty retrieval rate	>10%	>25%	Add documents, expand knowledge base
Retrieval p99 latency	>500ms	>1000ms	Scale vector DB, optimize index
Faithfulness score	<0.85	<0.75	Review grounding prompts, check context quality
End-to-end p99 latency	>3s	>5s	Profile pipeline, add caching

Dimension	RAG	Fine-Tuning
Knowledge type	Factual, domain-specific, frequently updated	Style, format, reasoning patterns
Update frequency	Real-time (add/remove docs instantly)	Requires retraining (hours to days)
Cost	Low per-query ($0.001-0.01)	High upfront ($100-10,000+), low per-query
Hallucination control	Strong (grounded in retrieved sources)	Moderate (can still hallucinate)
Latency	Higher (retrieval + generation)	Lower (generation only)
Data privacy	Data stays in vector DB (not in model weights)	Data embedded in model weights
Scalability	Add unlimited documents without retraining	Knowledge limited by model capacity
Best for	Customer support, documentation search, Q&A	Code generation style, domain reasoning, tone

Database	Type	Best For	Scale	Key Features	NVIDIA Integration
Pinecone	Managed cloud	Production ease	Billions	Serverless, auto-scaling, simple API	Native NIM support
Milvus	Open source	Enterprise scale	Billions	GPU acceleration, highly scalable	NVIDIA GPU optimizations, cuVS
Weaviate	Open source	Hybrid search	Millions-Billions	Vector + keyword + filters, GraphQL	Module ecosystem
Chroma	Embedded	Dev/prototyping	Millions	Lightweight, no server needed	Easy local setup
Qdrant	Open source	High performance	Billions	Rust-based, payload filtering	Fast indexing
FAISS	Library	Research/embedded	Billions	In-memory, GPU-accelerated	Meta library, GPU support

Index Type	Best For	Build Time	Query Speed	Memory
Flat (Exact)	<100K vectors	Instant	Slowest	Lowest
IVF	100K-10M vectors	Medium	Fast	Low
HNSW	1M-100M vectors	Slow	Fastest	High
PQ (Product Quantization)	>100M vectors	Slow	Fast	Lowest

Start Here

Quick Takeaways

What is RAG and Why It Matters for NCP-AAI

Core Concept

The Problems RAG Solves

Why RAG is Critical for Agentic AI

NCP-AAI Exam Coverage

NCP-AAI Exam: RAG Coverage by Domain

RAG Pipeline Architecture: The 5 Stages

Stage Overview

Advanced RAG Pipeline (2025-2026 Best Practices)

Stage 1: Document Processing and Ingestion

Data Source Types

Document Parsing Best Practices

Metadata Extraction

Stage 2: Chunking Strategies (Most Critical Decision)

Why Chunking Matters

Exam Trap

Chunking Strategy #1: Fixed-Size Chunking (Baseline)

Chunking Strategy #2: Semantic Chunking (State of the Art)

Chunking Strategy #3: Document-Based Chunking

Chunking Strategy #4: Agentic Chunking (Emerging)

Chunking Strategy #5: Hierarchical Chunking

Chunking Strategy #6: Sliding Window Chunking

Chunking Overlap Calculations

NCP-AAI Exam: Chunking Performance Matrix

NCP-AAI Exam: Chunking Strategy Performance Matrix

Optimal Chunk Size by Content Type

Chunk Enrichment Techniques

Stage 3: Embedding Models and Indexing

Embedding Model Selection (2025-2026)

Key Concept

Similarity Metrics

Cosine Similarity Formula

NVIDIA NIM Embedding Integration

Indexing Code Examples

Stage 4: Retrieval (Query-Time)

Retrieval Method #1: Semantic Search (Baseline)

Retrieval Method #2: Hybrid Search (State of the Art)

Retrieval Method #3: Reranking (Essential for High Quality)

Retrieval Decision Matrix

Stage 5: Response Generation

Prompt Engineering for RAG

Handling Hallucinations

Exam Trap

LangChain RAG Implementation (End-to-End)

Advanced RAG Patterns for the NCP-AAI Exam

Pattern 1: Agentic RAG

Key Concept

Pattern 2: Graph RAG

Pattern 3: Hybrid RAG

Pattern 4: Modular RAG

Advanced Retrieval Techniques

Multi-Agent RAG Orchestration

Self-Reflective RAG

RAG Evaluation and Metrics

Retrieval Quality Metrics

NDCG (Normalized Discounted Cumulative Gain)

MRR (Mean Reciprocal Rank)

Additional Retrieval Metrics

End-to-End RAG Metrics

RAG Evaluation Code

NVIDIA Platform Integration

NVIDIA NIM for RAG

NVIDIA NeMo Retriever

NVIDIA RAG Blueprint

Performance Optimization with NVIDIA Stack

Master These Concepts with Practice

Common RAG Challenges and Solutions

Problem 1: High Latency

Problem 2: Hallucination Despite RAG

Problem 3: Retrieval Quality Degradation

Problem 4: Context Window Limitations

Problem 5: Cold Start / Low-Quality Initial Results

Production RAG Monitoring and Observability

Key Metrics to Monitor

Alerting Thresholds

Observability Tools

RAG vs. Fine-Tuning: When to Use Which

RAG Security and Compliance