
Retrieval-Augmented Generation (RAG) for NCP-AAI Certification: Complete Guide

Preporato Team · December 10, 2025 · 16 min read · NCP-AAI

Retrieval-Augmented Generation (RAG) is the backbone of modern agentic AI systems and a critical component of the NCP-AAI certification exam. RAG appears throughout Domain 2 (Knowledge Integration and Agent Development—15% of the exam) and is tested in practical scenarios across all domains. This comprehensive guide covers everything you need to master RAG for the NCP-AAI exam and production deployments.

Quick Takeaways

  • RAG = Retrieval + Generation: Combine external knowledge retrieval with LLM generation
  • 15% of NCP-AAI exam: Domain 2 focuses heavily on RAG pipeline design and optimization
  • 3 core components: Document processing, retrieval, and response synthesis
  • Chunking strategy: Single most important factor for RAG performance (30-40% impact)
  • Hybrid search: Combining vector + keyword search improves accuracy by 15-25%
  • 2025 best practice: Semantic chunking + reranking + agentic RAG patterns

Preparing for NCP-AAI? Practice with 455+ exam questions

What is RAG? (NCP-AAI Definition)

Core Concept

Retrieval-Augmented Generation (RAG) is a technique that enhances Large Language Model (LLM) responses by retrieving relevant information from external knowledge sources before generating answers.

Without RAG:

User Query → LLM → Response (limited to training data)

With RAG:

User Query → Retrieve Relevant Docs → LLM + Retrieved Context → Accurate Response
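
The pattern is small enough to sketch end to end. Below is a minimal, framework-free illustration of the retrieve-then-generate loop; the chunk texts are placeholders, and the embed() helper and model names (text-embedding-3-small, gpt-4o-mini) are assumptions made for this example, not requirements.

# Minimal RAG loop: embed chunks, retrieve by cosine similarity, generate with retrieved context
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

chunks = ["<chunk 1 text>", "<chunk 2 text>", "<chunk 3 text>"]  # your pre-chunked documents
chunk_vectors = embed(chunks)

def rag_answer(query, top_k=2):
    # 1. Retrieval: rank chunks by cosine similarity to the query embedding
    q = embed([query])[0]
    scores = chunk_vectors @ q / (np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q))
    context = "\n\n".join(chunks[i] for i in scores.argsort()[::-1][:top_k])
    # 2. Generation: answer grounded in the retrieved context only
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Use only the following context to answer.\n\nContext:\n{context}\n\nQuestion: {query}"}],
    )
    return completion.choices[0].message.content

Every production technique in this guide (chunking, hybrid retrieval, reranking, guardrails) is an upgrade to one of these two steps.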

Why RAG Matters for Agentic AI

1. Overcomes LLM Limitations:

  • Knowledge cutoff: LLMs only know what was in their training data, which ends at a fixed cutoff date (often many months in the past)
  • Hallucinations: LLMs generate plausible-sounding but incorrect information
  • Domain specificity: General LLMs lack specialized company/industry knowledge

2. Essential for Intelligent Agents:

  • Long-term memory: Agents retrieve from past conversations and experiences
  • Grounded responses: Agents cite sources and provide verifiable information
  • Dynamic knowledge: Agents access up-to-date information without retraining

3. Production Requirements:

  • Privacy: Keep proprietary data on-premises (not in LLM training data)
  • Cost: Cheaper than fine-tuning LLMs for each knowledge domain
  • Maintainability: Update knowledge base without retraining models

RAG Pipeline Architecture

Standard RAG Pipeline (5 Stages)

┌─────────────────────────────────────────────────────────────┐
│                   1. DATA INGESTION                         │
│  Documents (PDF, SQL, APIs) → Load → Parse                  │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│                   2. CHUNKING                               │
│  Full Documents → Split → Chunks (with overlap)             │
│  Strategy: Semantic / Fixed-size / Document-based           │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│                   3. EMBEDDING & INDEXING                   │
│  Chunks → Embedding Model → Vectors → Vector Database       │
│  (e.g., NV-Embed-v2, text-embedding-3-large)                │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│                   4. RETRIEVAL (Query-Time)                 │
│  User Query → Embed Query → Search Vector DB → Top-K Chunks │
│  Optional: Reranking, Hybrid Search                         │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│                   5. GENERATION                             │
│  Query + Retrieved Chunks → LLM → Final Response            │
│  Prompt Engineering: "Use only provided context..."         │
└─────────────────────────────────────────────────────────────┘

Advanced RAG Pipeline (2025 Best Practices)

User Query → Query Transformation (rewrite, expand)
            ↓
Hybrid Retrieval (Vector + Keyword + Knowledge Graph)
            ↓
Reranking (Reorder by relevance score)
            ↓
Context Compression (Remove irrelevant parts)
            ↓
Multi-hop Reasoning (Follow-up retrieval if needed)
            ↓
Response Generation (with citations)
            ↓
Guardrails & Validation (check for hallucinations)
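
Each of these stages maps onto a component covered later in this guide. As a rough composition sketch (assuming an existing index built as in Stage 3, the llama-index BM25 integration package, and an LLM configured in Settings), the advanced pipeline can be wired together like this:

from llama_index.core.query_engine import RetrieverQueryEngine, TransformQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever, QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.postprocessor import SimilarityPostprocessor, LongContextReorder
from llama_index.core.indices.query.query_transform import HyDEQueryTransform

# Hybrid retrieval: dense vectors + BM25 keywords, fused with Reciprocal Rank Fusion
hybrid = QueryFusionRetriever(
    retrievers=[
        VectorIndexRetriever(index=index, similarity_top_k=10),
        BM25Retriever.from_defaults(docstore=index.docstore, similarity_top_k=10),
    ],
    similarity_top_k=5,
    num_queries=1,               # skip LLM query expansion in this sketch
    mode="reciprocal_rerank",
)

# Post-retrieval cleanup: drop weak matches, reorder for long contexts
engine = RetrieverQueryEngine(
    retriever=hybrid,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7), LongContextReorder()],
)

# Query transformation (HyDE) wraps the whole engine
advanced_engine = TransformQueryEngine(engine, query_transform=HyDEQueryTransform(include_original=True))
response = advanced_engine.query("How is RAG tested in Domain 2?")

The individual pieces (hybrid retrieval, reranking, query transformation, postprocessors) are each explained in their own sections below.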

Stage 1: Document Processing and Ingestion

Data Source Types

Structured Data:

  • SQL Databases: PostgreSQL, MySQL, Oracle
  • NoSQL: MongoDB, Cassandra, DynamoDB
  • Data Warehouses: Snowflake, BigQuery, Redshift

Unstructured Data:

  • Documents: PDF, DOCX, TXT, Markdown
  • Web Content: HTML pages, wikis, documentation
  • Code Repositories: GitHub, GitLab, Bitbucket

Semi-Structured Data:

  • APIs: REST, GraphQL, gRPC
  • Messaging: Slack, Discord, email archives
  • Collaboration Tools: Notion, Confluence, SharePoint

Document Parsing Best Practices

Challenge: Extract clean text from complex documents (PDFs, tables, images)

Solutions:

# Basic PDF parsing
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader(
    input_dir="./docs",
    required_exts=[".pdf", ".docx", ".txt"]
).load_data()

# Advanced: Parse tables and images
from llama_index.readers.file import PyMuPDFReader

# Preserves table structure and extracts images
reader = PyMuPDFReader()
documents = reader.load_data(file_path="complex_report.pdf")

NCP-AAI Exam Tip: Know which parser to use for different document types:

  • PDFs with tables: PyMuPDF or Unstructured
  • HTML/Web: BeautifulSoup or Trafilatura
  • Code files: Tree-sitter (preserves syntax structure)
  • Images/Scans: OCR (Tesseract, AWS Textract) → Text extraction
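
Of these, the HTML/Web path is the one teams most often script by hand. A minimal sketch (assuming the requests and beautifulsoup4 packages; the URL is a placeholder):

# Extract clean body text from a web page before chunking (URL is a placeholder)
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/docs/page.html", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Strip navigation, scripts, and styling so only the article text reaches the chunker
for tag in soup(["script", "style", "nav", "header", "footer"]):
    tag.decompose()

clean_text = soup.get_text(separator="\n", strip=True)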

Stage 2: Chunking Strategies (Most Critical Decision)

Why Chunking Matters

Chunking is the #1 factor impacting RAG performance (30-40% of retrieval quality).

Chunking Tradeoff:

  • Too large: Vector loses specificity, retrieves irrelevant context
  • Too small: Loses context, incomplete information for LLM

Chunking Strategy #1: Fixed-Size Chunking (Baseline)

Description: Split text into chunks of fixed token/character count with overlap

Best for: General-purpose RAG, when documents lack clear structure

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,       # 512 characters with length_function=len; pass a token counter for token-based sizing
    chunk_overlap=50,     # 10% overlap to preserve context
    length_function=len,
    separators=["\n\n", "\n", " ", ""]  # Respect paragraph boundaries
)

chunks = splitter.split_documents(documents)

Pros:

  • Simple, fast, predictable
  • Works with any document type

Cons:

  • May split sentences or concepts mid-thought
  • Doesn't respect document structure (headings, sections)

Performance: Baseline (1.0x retrieval quality)

Chunking Strategy #2: Semantic Chunking (SOTA 2025)

Description: Dynamically split based on semantic coherence using embeddings

Best for: High-quality RAG where context preservation is critical

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings  # pip install langchain-openai

# Uses embedding similarity to detect topic boundaries
semantic_splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",  # or "standard_deviation"
    breakpoint_threshold_amount=85  # Split at 85th percentile similarity drop
)

chunks = semantic_splitter.split_documents(documents)

How it works:

  1. Embed each sentence
  2. Calculate similarity between consecutive sentences
  3. Split when similarity drops significantly (topic change detected)
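
Under the hood, the boundary detection is just adjacent-sentence similarity. Here is a minimal sketch of the idea, assuming the sentence-transformers package, a small open model (all-MiniLM-L6-v2), and naive period-based sentence splitting:

# Illustrative semantic boundary detection: split where adjacent-sentence similarity drops
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunks(text, percentile=15):
    sentences = [s.strip() for s in text.split(". ") if s.strip()]  # naive sentence splitter
    if len(sentences) < 2:
        return sentences
    emb = model.encode(sentences, normalize_embeddings=True)
    sims = (emb[:-1] * emb[1:]).sum(axis=1)        # cosine similarity of each adjacent pair
    threshold = np.percentile(sims, percentile)    # low-similarity points mark topic changes
    chunks, current = [], [sentences[0]]
    for sentence, sim in zip(sentences[1:], sims):
        if sim < threshold:
            chunks.append(". ".join(current))
            current = []
        current.append(sentence)
    chunks.append(". ".join(current))
    return chunks

LangChain's SemanticChunker implements essentially this logic, with more robust sentence splitting and configurable breakpoint statistics.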

Pros:

  • Preserves semantic coherence (each chunk discusses one topic)
  • 15-25% better retrieval quality than fixed-size

Cons:

  • Slower (requires embedding every sentence)
  • Variable chunk sizes (may exceed context window)

Performance: 1.2-1.3x retrieval quality vs. fixed-size

Chunking Strategy #3: Document-Based Chunking

Description: Split based on document structure (headings, sections, paragraphs)

Best for: Structured documents (Markdown, HTML, code files)

from langchain.text_splitter import MarkdownHeaderTextSplitter

# Split Markdown by headers (preserves hierarchy)
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

chunks = markdown_splitter.split_text(markdown_document)

# Each chunk includes header hierarchy as metadata
# Example: {"Header 1": "NCP-AAI Guide", "Header 2": "RAG Systems"}

Pros:

  • Respects author's intended structure
  • Metadata enrichment (section titles, hierarchy)

Cons:

  • Only works for well-structured documents
  • Chunk size highly variable

Performance: 1.15-1.25x retrieval quality (when structure is meaningful)

Chunking Strategy #4: Agentic Chunking (Emerging 2025)

Description: Use LLM to intelligently determine chunk boundaries

Best for: Complex documents requiring human-like understanding

from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# LLM determines optimal chunk boundaries
agentic_prompt = PromptTemplate(
    input_variables=["text"],
    template="""Analyze this text and determine logical chunk boundaries
    where topics change. Mark boundaries with [SPLIT].

    Text: {text}

    Output the text with [SPLIT] markers:"""
)

agentic_chunker = LLMChain(llm=llm, prompt=agentic_prompt)
marked_text = agentic_chunker.run(document.page_content)
chunks = marked_text.split("[SPLIT]")

Pros:

  • Highest semantic quality (simulates human chunking)
  • Handles complex documents (legal, technical, narrative)

Cons:

  • Expensive (LLM call per document)
  • Slow (not suitable for real-time ingestion)

Performance: 1.3-1.4x retrieval quality (highest, but costly)

NCP-AAI Exam: Chunking Decision Matrix

Document Type | Recommended Strategy | Chunk Size | Overlap
General text | Fixed-size | 512 tokens | 50 tokens (10%)
Technical docs | Semantic | 300-800 tokens | N/A (semantic)
Structured (MD, HTML) | Document-based | Variable | N/A
Legal/contracts | Agentic | Variable | N/A
Code repositories | Document-based (by function) | 50-200 lines | 10 lines
Chat transcripts | Fixed-size with timestamp metadata | 10-20 messages | 2 messages
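
For the code-repository row, one practical way to approximate function-level chunking is LangChain's language-aware splitter, which prefers class and function boundaries as split points; the sizes below are in characters and purely illustrative:

# Function-aware chunking for code files via LangChain's language-aware splitter
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1500,     # characters (roughly 40-60 lines); tune per repository
    chunk_overlap=150,
)

with open("my_module.py") as f:            # placeholder file path
    code_chunks = python_splitter.create_documents([f.read()])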

Stage 3: Embedding and Indexing

Embedding Model Selection (2025)

Best Embedding Models for NCP-AAI:

Model | Dimensions | Performance | Cost | Best For
NV-Embed-v2 | 4096 | SOTA (72.3 MTEB) | Medium | NVIDIA ecosystem
text-embedding-3-large | 3072 | Excellent (64.6 MTEB) | Low | General-purpose
text-embedding-3-small | 1536 | Good (62.3 MTEB) | Very Low | Budget/speed
Cohere embed-v3 | 1024 | Excellent (64.5 MTEB) | Medium | Multilingual
BGE-large-en-v1.5 | 1024 | Good (63.9 MTEB) | Free | Open-source

NCP-AAI Exam Tip: Know that higher dimensions ≠ always better. Consider:

  • Latency: 4096-dim embeddings are 2.5x slower than 1024-dim
  • Storage: 4096-dim requires 4x more vector DB storage
  • Quality: Diminishing returns above 1024 dimensions for most tasks
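
The storage point is plain arithmetic: raw float32 vectors take num_vectors × dimensions × 4 bytes before any index overhead. A quick sanity check:

# Back-of-the-envelope storage for raw float32 vectors (index overhead excluded)
def storage_gb(num_vectors, dims, bytes_per_value=4):
    return num_vectors * dims * bytes_per_value / 1024**3

for dims in (1024, 1536, 3072, 4096):
    print(f"{dims:>4}-dim, 10M vectors: {storage_gb(10_000_000, dims):.1f} GB")
# 1024-dim ≈ 38 GB vs. 4096-dim ≈ 153 GB: the ~4x storage gap noted above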

Vector Database Options

Production Vector Databases:

  1. Pinecone (Managed, easiest)

    • Serverless, auto-scaling
    • Best for: Startups, prototypes
    • Cost: $70/month per 100K vectors
  2. Weaviate (Open-source, flexible)

    • Hybrid search (vector + keyword) built-in
    • Best for: Self-hosted, cost-sensitive
    • Cost: Free (self-hosted)
  3. Milvus (High-performance)

    • Handles billions of vectors
    • Best for: Large-scale enterprise
    • Cost: Free (self-hosted) or managed via Zilliz
  4. Chroma (Dev-friendly)

    • Embedded database (no server)
    • Best for: Local dev, prototypes
    • Cost: Free

NCP-AAI Exam Scenario: "Your team needs to store 100M vectors with hybrid search. Which vector DB?" (Answer: Milvus or Weaviate with production deployment)

Indexing Code Example

from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.pinecone import PineconeVectorStore
from pinecone import Pinecone

# Initialize vector store (Pinecone client v3+)
pc = Pinecone(api_key="your-key")
pinecone_index = pc.Index("ncp-aai-docs")
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)

# Create index with custom embeddings
embed_model = OpenAIEmbedding(model="text-embedding-3-large")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=embed_model
)

# Persist for later use
index.storage_context.persist(persist_dir="./storage")

Stage 4: Retrieval (Query-Time)

Retrieval Method #1: Semantic Search (Baseline)

How it works: Embed query, find K nearest neighbor vectors

# Simple semantic search
query_engine = index.as_query_engine(
    similarity_top_k=5  # Retrieve top 5 most similar chunks
)

response = query_engine.query("What is the NCP-AAI exam structure?")

Pros: Fast, works well for most queries

Cons: May miss exact keyword matches (e.g., product names, codes)

Performance: Baseline (1.0x)

Retrieval Method #2: Hybrid Search (SOTA 2025)

How it works: Combine vector similarity + keyword search (BM25) with fusion

from llama_index.core.retrievers import VectorIndexRetriever, QueryFusionRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.retrievers.bm25 import BM25Retriever  # pip install llama-index-retrievers-bm25

# Vector retriever
vector_retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10
)

# Keyword retriever (BM25)
bm25_retriever = BM25Retriever.from_defaults(
    docstore=index.docstore,
    similarity_top_k=10
)

# Fusion retriever (combines both with RRF - Reciprocal Rank Fusion)
hybrid_retriever = QueryFusionRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    similarity_top_k=5,
    mode="reciprocal_rerank"  # RRF fusion algorithm
)

# Use in query engine
query_engine = RetrieverQueryEngine(retriever=hybrid_retriever)
response = query_engine.query("NCP-AAI exam Domain 2 percentage")

Pros: 15-25% better recall than pure vector search

Cons: Slightly slower (2 retrievals + fusion)

Performance: 1.2-1.3x retrieval quality

Retrieval Method #3: Reranking (Essential for High Quality)

How it works: Retrieve 20-50 candidates, rerank with cross-encoder model

from llama_index.postprocessor.cohere_rerank import CohereRerank  # pip install llama-index-postprocessor-cohere-rerank

# Retrieve more candidates (overgenerate)
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=20  # Retrieve 20 candidates
)

# Rerank to top 5 with cross-encoder
reranker = CohereRerank(
    api_key="your-cohere-key",
    top_n=5  # Return top 5 after reranking
)

query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[reranker]
)

response = query_engine.query("Explain NVIDIA NIM deployment")

How reranking works:

  1. Bi-encoder (fast) retrieves 20 candidates (~50ms)
  2. Cross-encoder (slow but accurate) reranks to top 5 (~200ms)
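
If the Cohere API is not an option, the same two-stage pattern works with a local cross-encoder from sentence-transformers; the model name and candidate texts below are illustrative:

# Local reranking with a cross-encoder (candidates are placeholders from stage 1)
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "Explain NVIDIA NIM deployment"
candidates = ["<chunk 1 text>", "<chunk 2 text>", "<chunk 3 text>"]  # top-20 from the bi-encoder stage

# Score each (query, chunk) pair jointly, keep the highest-scoring 5
scores = reranker.predict([(query, chunk) for chunk in candidates])
top_chunks = [chunk for _, chunk in sorted(zip(scores, candidates), reverse=True)[:5]]

LlamaIndex wraps this pattern as the SentenceTransformerRerank postprocessor, which can be dropped into node_postprocessors in place of CohereRerank.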

Pros: 20-30% better precision (fewer irrelevant results)

Cons: Adds 150-250ms latency, costs $0.001 per query (Cohere)

Performance: 1.3-1.4x retrieval quality

NCP-AAI Exam: Retrieval Decision Matrix

Use Case | Recommended Method | top_k | Rationale
General Q&A | Hybrid search | 5 | Balance speed & quality
Exact match critical | Hybrid + reranking | 3 | Legal docs, product codes
Low latency required | Semantic search | 3-5 | Chat applications
High precision needed | Hybrid + reranking + compression | 3 | Customer support, medical
Multi-hop reasoning | Agentic RAG (iterative retrieval) | 5 per hop | Complex research tasks

Master These Concepts with Practice

Our NCP-AAI practice bundle includes:

  • 7 full practice exams (455+ questions)
  • Detailed explanations for every answer
  • Domain-by-domain performance tracking

30-day money-back guarantee

Stage 5: Response Generation

Prompt Engineering for RAG

Basic RAG Prompt:

rag_prompt_template = """Use the following context to answer the question.
If the answer is not in the context, say "I don't have enough information."

Context:
{context}

Question: {question}

Answer:"""

Advanced RAG Prompt (with citations):

advanced_rag_prompt = """You are an expert assistant. Answer the question using ONLY the provided context.

Context:
{context}

Instructions:
1. Answer based solely on the context above
2. If the context doesn't contain the answer, respond: "The provided documents don't contain this information."
3. Cite sources using [Source X] notation
4. If context is ambiguous, acknowledge uncertainty

Question: {question}

Answer (with citations):"""

Handling Hallucinations

Problem: LLM generates plausible-sounding but false information

Solutions:

from llama_index.core.postprocessor import SimilarityPostprocessor

# 1. Require minimum similarity threshold
similarity_filter = SimilarityPostprocessor(similarity_cutoff=0.7)

# 2. Use structured output for citations
from typing import List

from pydantic import BaseModel, Field
from llama_index.core.program import LLMTextCompletionProgram

class RAGResponse(BaseModel):
    answer: str = Field(description="Answer to the question")
    sources: List[str] = Field(description="List of source document IDs used")
    confidence: float = Field(description="Confidence score 0-1")

program = LLMTextCompletionProgram.from_defaults(
    output_cls=RAGResponse,
    prompt_template_str=advanced_rag_prompt
)

# 3. Implement guardrails (NeMo Guardrails; the flow below is illustrative pseudo-Colang)
from nemoguardrails import LLMRails, RailsConfig

colang_content = """
define flow check_hallucination:
  if bot response not grounded in context:
    bot say "I don't have reliable information on this."
"""

rails = LLMRails(RailsConfig.from_content(colang_content=colang_content))
response = rails.generate(messages=[{"role": "user", "content": query}])

Advanced RAG Techniques (2025)

Technique #1: Query Transformation

Problem: User query may not match document phrasing

Solution: Rewrite/expand query before retrieval

from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.query_engine import TransformQueryEngine

# HyDE: Generate hypothetical document, embed it, retrieve similar
hyde = HyDEQueryTransform(include_original=True)
query_engine = TransformQueryEngine(base_query_engine, query_transform=hyde)

# Original query: "NCP-AAI exam difficulty"
# HyDE generates: "The NCP-AAI exam is moderately difficult, requiring 8-12 weeks of study..."
# Embeds the hypothetical answer, retrieves similar documents

Technique #2: Multi-Hop Reasoning

Problem: Single retrieval may not contain full answer

Solution: Iteratively retrieve and reason

from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool

# Turn query engine into tool for agent
query_tool = QueryEngineTool.from_defaults(
    query_engine=query_engine,
    name="knowledge_base",
    description="Search company knowledge base"
)

# Agent can retrieve multiple times
agent = ReActAgent.from_tools([query_tool], llm=llm, verbose=True)

# Query: "Compare NCP-AAI and AWS AI practitioner exam"
# Agent: 1. Retrieve NCP-AAI info, 2. Retrieve AWS info, 3. Compare
response = agent.chat("Compare NCP-AAI and AWS AI practitioner exam")

Technique #3: Context Compression

Problem: Retrieved chunks contain irrelevant information

Solution: Extract only relevant sentences

from llama_index.core.postprocessor import LongContextReorder, SentenceEmbeddingOptimizer
from llama_index.core.query_engine import RetrieverQueryEngine

# 1. Reorder chunks (most relevant at the edges, less relevant in the middle)
reorder = LongContextReorder()

# 2. Keep only the sentences in each chunk that are most similar to the query
compressor = SentenceEmbeddingOptimizer(
    embed_model=embed_model,   # reuse the embedding model from indexing
    percentile_cutoff=0.5      # keep the top 50% of sentences per chunk
)

query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[compressor, reorder]
)

RAG Evaluation Metrics

Key Metrics for NCP-AAI

1. Retrieval Quality:

  • Recall@K: % of relevant docs in top K results (target: >85%)
  • Precision@K: % of retrieved docs that are relevant (target: >70%)
  • MRR (Mean Reciprocal Rank): 1/rank of first relevant doc (target: >0.7)

2. Generation Quality:

  • Answer Relevancy: How relevant is answer to question? (target: >0.8)
  • Faithfulness: Does answer match context? (target: >0.9)
  • Context Relevancy: Is retrieved context relevant? (target: >0.75)

3. System Performance:

  • Latency: Time from query to response (target: <2 seconds)
  • Throughput: Queries per second (target: >50 QPS)
  • Cost: $ per 1000 queries (target: <$0.50)
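
The retrieval-quality metrics above are straightforward to compute from labeled (query, relevant document IDs) pairs; the generation metrics are handled by the evaluation code in the next section. A minimal sketch:

# Retrieval metrics from labeled data: retrieved doc IDs vs. ground-truth relevant IDs
def recall_at_k(retrieved, relevant, k):
    return len(set(retrieved[:k]) & relevant) / max(len(relevant), 1)

def precision_at_k(retrieved, relevant, k):
    return len(set(retrieved[:k]) & relevant) / k

def mrr(retrieved, relevant):
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["doc_7", "doc_3", "doc_9", "doc_1", "doc_5"]   # system output (illustrative IDs)
relevant = {"doc_3", "doc_5"}                               # human-labeled ground truth
print(recall_at_k(retrieved, relevant, 5), precision_at_k(retrieved, relevant, 5), mrr(retrieved, relevant))
# -> 1.0 recall@5, 0.4 precision@5, 0.5 MRR (first relevant doc at rank 2)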

RAG Evaluation Code

from llama_index.core.evaluation import (
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    BatchEvalRunner
)

# Initialize evaluators
faithfulness_evaluator = FaithfulnessEvaluator(llm=llm)
relevancy_evaluator = RelevancyEvaluator(llm=llm)

# Create evaluation dataset
eval_questions = [
    "What is the NCP-AAI exam duration?",
    "How many questions are in NCP-AAI?",
    # ... more questions
]

# Run batch evaluation
runner = BatchEvalRunner(
    evaluators={"faithfulness": faithfulness_evaluator, "relevancy": relevancy_evaluator},
    workers=8
)

# `await` requires an async context (e.g., a notebook, or an async function driven by asyncio.run)
eval_results = await runner.aevaluate_queries(
    query_engine=query_engine,
    queries=eval_questions
)

# Results: {query: {"faithfulness": 0.92, "relevancy": 0.85}, ...}

NCP-AAI Exam Preparation Tips

High-Probability RAG Questions

1. Chunking Strategy Selection:

  • "Your team is building a RAG system for legal contracts. Which chunking strategy?" (Answer: Agentic or document-based with section preservation)
  • "What is the primary advantage of semantic chunking?" (Answer: Preserves semantic coherence, avoids splitting concepts)

2. Retrieval Optimization:

  • "How can you improve RAG retrieval quality by 20-30% without changing the embedding model?" (Answer: Implement hybrid search + reranking)
  • "What technique addresses the cold start problem where initial retrieval misses relevant documents?" (Answer: Query transformation like HyDE)

3. Performance Troubleshooting:

  • "RAG system retrieves irrelevant chunks 40% of the time. What's the most likely cause?" (Answer: Poor chunking strategy or chunks too large)
  • "How to reduce RAG latency from 3 seconds to under 1 second?" (Answer: Reduce top_k, use smaller embedding model, cache frequent queries)

Hands-On Practice Checklist

Week 1-2:

  • Build basic RAG system with fixed-size chunking
  • Experiment with chunk sizes (256, 512, 1024 tokens)
  • Compare 3 embedding models (OpenAI, Cohere, open-source)

Week 3-4:

  • Implement semantic chunking
  • Add hybrid search (vector + keyword)
  • Integrate reranking with Cohere or cross-encoder

Week 5-6:

  • Build agentic RAG with multi-hop reasoning
  • Implement query transformation (HyDE)
  • Add guardrails for hallucination prevention
  • Run evaluation on test queries

Preporato's NCP-AAI Practice Exams

Master RAG systems and all NCP-AAI domains with Preporato's 7 full-length practice exams:

  • RAG scenario questions testing chunking, retrieval, and optimization
  • Hands-on RAG challenges with real-world architectures
  • Detailed explanations comparing approaches (semantic vs. fixed-size, hybrid vs. pure vector)
  • Performance tracking by Domain 2 (Knowledge Integration)
  • $49 for all 7 exams (vs. $200 exam retake fee)

95% of Preporato users pass NCP-AAI on their first attempt. Get started today at Preporato.com!

Conclusion

RAG is the foundation of knowledge-grounded agentic AI systems and a critical component of the NCP-AAI certification. To excel in the exam and production deployments, master:

  1. Chunking strategies: Fixed-size (baseline), semantic (SOTA), document-based, agentic
  2. Retrieval methods: Semantic search, hybrid search, reranking
  3. Advanced techniques: Query transformation, multi-hop reasoning, context compression
  4. Evaluation metrics: Recall, precision, faithfulness, relevancy
  5. Production optimization: Latency, cost, accuracy tradeoffs

Key takeaway: RAG quality is 40% chunking, 30% retrieval method, 20% generation prompt, 10% embedding model. Focus your optimization efforts accordingly.

Ready to master RAG and ace the NCP-AAI certification? Start practicing with Preporato's comprehensive exam prep platform today!


Frequently Asked Questions

Q: What's the optimal chunk size for RAG systems? A: 512 tokens is the sweet spot for most use cases. Use 256 for precise retrieval, 1024 for broader context. Always experiment with your specific documents.

Q: Should I use semantic or fixed-size chunking? A: Start with fixed-size (faster, simpler). Upgrade to semantic if retrieval quality is insufficient (15-25% improvement). Semantic chunking is slower but worth it for high-quality RAG.

Q: How many chunks should I retrieve (top_k)? A: 3-5 for most use cases. Use 5-10 if adding reranking. More isn't always better—too much context confuses the LLM.

Q: What's the difference between RAG and fine-tuning? A: RAG retrieves external knowledge at query time. Fine-tuning bakes knowledge into model weights. Use RAG for dynamic knowledge, fine-tuning for style/format.

Q: Can RAG work with real-time data? A: Yes. Use streaming indices (LlamaIndex) or incremental updates. For ultra-real-time (seconds), consider direct API calls instead of indexing.

Q: How to prevent RAG hallucinations? A: (1) Strong prompt: "Answer only from context", (2) Similarity threshold (>0.7), (3) Guardrails, (4) Structured output with citations.

Q: What's the cost of running RAG at scale? A: $0.10-$0.50 per 1000 queries (embedding + LLM generation + reranking). Use caching and smaller models to reduce costs 50-70%.

Q: Does RAG require GPU? A: No. Embedding and retrieval run on CPU. Only LLM generation benefits from GPU (but can use API providers like OpenAI/Anthropic).

Ready to Pass the NCP-AAI Exam?

Join thousands who passed with Preporato practice tests

Instant access · 30-day guarantee · Updated monthly