Retrieval-Augmented Generation (RAG) is the backbone of modern agentic AI systems and a critical component of the NCP-AAI certification exam. RAG appears throughout Domain 2 (Knowledge Integration and Agent Development—15% of the exam) and is tested in practical scenarios across all domains. This comprehensive guide covers everything you need to master RAG for the NCP-AAI exam and production deployments.
Quick Takeaways
- RAG = Retrieval + Generation: Combine external knowledge retrieval with LLM generation
- 15% of NCP-AAI exam: Domain 2 focuses heavily on RAG pipeline design and optimization
- 3 core components: Document processing, retrieval, and response synthesis
- Chunking strategy: Single most important factor for RAG performance (30-40% impact)
- Hybrid search: Combining vector + keyword search improves accuracy by 15-25%
- 2025 best practice: Semantic chunking + reranking + agentic RAG patterns
Preparing for NCP-AAI? Practice with 455+ exam questions
What is RAG? (NCP-AAI Definition)
Core Concept
Retrieval-Augmented Generation (RAG) is a technique that enhances Large Language Model (LLM) responses by retrieving relevant information from external knowledge sources before generating answers.
Without RAG:
User Query → LLM → Response (limited to training data)
With RAG:
User Query → Retrieve Relevant Docs → LLM + Retrieved Context → Accurate Response
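In code, the two flows differ only in whether a retrieval step assembles context before the LLM call. A minimal, self-contained sketch of the retrieve-then-generate loop (a toy bag-of-words similarity stands in for a real embedding model):
import numpy as np

def embed(text, vocab):
    # Toy bag-of-words "embedding"; a real system calls an embedding model here
    words = text.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

def retrieve(query, docs, top_k=2):
    vocab = sorted({w for d in docs for w in d.lower().split()})
    q = embed(query, vocab)
    scores = []
    for d in docs:
        v = embed(d, vocab)
        denom = (np.linalg.norm(q) * np.linalg.norm(v)) or 1.0
        scores.append(float(q @ v) / denom)   # cosine similarity
    ranked = sorted(zip(scores, docs), reverse=True)
    return [d for _, d in ranked[:top_k]]

docs = [
    "Domain 2 of the NCP-AAI exam covers knowledge integration and agent development.",
    "RAG retrieves relevant documents before the LLM generates an answer.",
    "Fine-tuning bakes knowledge into model weights instead of retrieving it.",
]
context = retrieve("Which NCP-AAI domain covers knowledge integration?", docs)
prompt = "Use only this context to answer.\n\nContext:\n" + "\n".join(context) + "\n\nQuestion: ..."
# `prompt` is what gets sent to the LLM in the generation step
In production, embed() calls a real embedding model and the prompt goes to an LLM; the rest of this guide covers each stage in detail.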
Why RAG Matters for Agentic AI
1. Overcomes LLM Limitations:
- Knowledge cutoff: LLMs only know what was in their training data, which ends at a fixed cutoff date (often months or years before deployment)
- Hallucinations: LLMs generate plausible-sounding but incorrect information
- Domain specificity: General LLMs lack specialized company/industry knowledge
2. Essential for Intelligent Agents:
- Long-term memory: Agents retrieve from past conversations and experiences
- Grounded responses: Agents cite sources and provide verifiable information
- Dynamic knowledge: Agents access up-to-date information without retraining
3. Production Requirements:
- Privacy: Keep proprietary data on-premises (not in LLM training data)
- Cost: Cheaper than fine-tuning LLMs for each knowledge domain
- Maintainability: Update knowledge base without retraining models
RAG Pipeline Architecture
Standard RAG Pipeline (5 Stages)
1. DATA INGESTION
   Documents (PDF, SQL, APIs) → Load → Parse
        ↓
2. CHUNKING
   Full Documents → Split → Chunks (with overlap)
   Strategy: Semantic / Fixed-size / Document-based
        ↓
3. EMBEDDING & INDEXING
   Chunks → Embedding Model → Vectors → Vector Database
   (e.g., NV-Embed-v2, text-embedding-3-large)
        ↓
4. RETRIEVAL (Query-Time)
   User Query → Embed Query → Search Vector DB → Top-K Chunks
   Optional: Reranking, Hybrid Search
        ↓
5. GENERATION
   Query + Retrieved Chunks → LLM → Final Response
   Prompt Engineering: "Use only provided context..."
Advanced RAG Pipeline (2025 Best Practices)
User Query → Query Transformation (rewrite, expand)
↓
Hybrid Retrieval (Vector + Keyword + Knowledge Graph)
↓
Reranking (Reorder by relevance score)
↓
Context Compression (Remove irrelevant parts)
↓
Multi-hop Reasoning (Follow-up retrieval if needed)
↓
Response Generation (with citations)
↓
Guardrails & Validation (check for hallucinations)
Stage 1: Document Processing and Ingestion
Data Source Types
Structured Data:
- SQL Databases: PostgreSQL, MySQL, Oracle
- NoSQL: MongoDB, Cassandra, DynamoDB
- Data Warehouses: Snowflake, BigQuery, Redshift
Unstructured Data:
- Documents: PDF, DOCX, TXT, Markdown
- Web Content: HTML pages, wikis, documentation
- Code Repositories: GitHub, GitLab, Bitbucket
Semi-Structured Data:
- APIs: REST, GraphQL, gRPC
- Messaging: Slack, Discord, email archives
- Collaboration Tools: Notion, Confluence, SharePoint
Document Parsing Best Practices
Challenge: Extract clean text from complex documents (PDFs, tables, images)
Solutions:
# Basic PDF parsing
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader(
    input_dir="./docs",
    required_exts=[".pdf", ".docx", ".txt"]
).load_data()

# Advanced: parse tables and images
from llama_index.readers.file import PyMuPDFReader

# Preserves table structure and extracts images
reader = PyMuPDFReader()
documents = reader.load_data(file_path="complex_report.pdf")
NCP-AAI Exam Tip: Know which parser to use for different document types:
- PDFs with tables: PyMuPDF or Unstructured
- HTML/Web: BeautifulSoup or Trafilatura
- Code files: Tree-sitter (preserves syntax structure)
- Images/Scans: OCR (Tesseract, AWS Textract) → Text extraction
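For example, the HTML/web case above can be handled with a small helper before chunking; this sketch uses Trafilatura with a BeautifulSoup fallback (the URL is hypothetical):
import trafilatura
from bs4 import BeautifulSoup

def load_web_page(url):
    # Fetch a page and return clean article text for downstream chunking
    downloaded = trafilatura.fetch_url(url)
    text = trafilatura.extract(downloaded)  # strips navigation, ads, and other boilerplate
    if text:
        return text
    # Fallback: plain tag stripping with BeautifulSoup
    return BeautifulSoup(downloaded or "", "html.parser").get_text(separator="\n", strip=True)

page_text = load_web_page("https://example.com/ncp-aai-guide")  # hypothetical URL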
Stage 2: Chunking Strategies (Most Critical Decision)
Why Chunking Matters
Chunking is the #1 factor impacting RAG performance (30-40% of retrieval quality).
Chunking Tradeoff:
- Too large: Vector loses specificity, retrieves irrelevant context
- Too small: Loses context, incomplete information for LLM
Chunking Strategy #1: Fixed-Size Chunking (Baseline)
Description: Split text into chunks of fixed token/character count with overlap
Best for: General-purpose RAG, when documents lack clear structure
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,      # 512 characters with len(); use a token-based length_function to size by tokens
    chunk_overlap=50,    # ~10% overlap to preserve context across chunk boundaries
    length_function=len,
    separators=["\n\n", "\n", " ", ""]  # Prefer splitting at paragraph boundaries
)
chunks = splitter.split_documents(documents)
Pros:
- Simple, fast, predictable
- Works with any document type
Cons:
- May split sentences or concepts mid-thought
- Doesn't respect document structure (headings, sections)
Performance: Baseline (1.0x retrieval quality)
Chunking Strategy #2: Semantic Chunking (SOTA 2025)
Description: Dynamically split based on semantic coherence using embeddings
Best for: High-quality RAG where context preservation is critical
from langchain_experimental.text_splitter import SemanticChunker
from langchain.embeddings import OpenAIEmbeddings

# Uses embedding similarity to detect topic boundaries
semantic_splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",  # or "standard_deviation"
    breakpoint_threshold_amount=85,  # Split at 85th percentile similarity drop
)
chunks = semantic_splitter.split_documents(documents)
How it works:
- Embed each sentence
- Calculate similarity between consecutive sentences
- Split when similarity drops significantly (topic change detected)
Pros:
- Preserves semantic coherence (each chunk discusses one topic)
- 15-25% better retrieval quality than fixed-size
Cons:
- Slower (requires embedding every sentence)
- Variable chunk sizes (may exceed context window)
Performance: 1.2-1.3x retrieval quality vs. fixed-size
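The three steps above can be implemented in a few lines; this sketch uses sentence-transformers (any embedding model would work) to approximate what SemanticChunker does internally:
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunks(sentences, percentile=85):
    # Split where the similarity between consecutive sentences drops sharply
    if len(sentences) < 2:
        return [" ".join(sentences)]
    model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works
    emb = model.encode(sentences, normalize_embeddings=True)
    sims = np.array([float(emb[i] @ emb[i + 1]) for i in range(len(emb) - 1)])
    threshold = np.percentile(1 - sims, percentile)  # distance = 1 - similarity
    chunks, current = [], [sentences[0]]
    for i, sim in enumerate(sims):
        if (1 - sim) > threshold:  # large drop → likely topic change
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i + 1])
    chunks.append(" ".join(current))
    return chunks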
Chunking Strategy #3: Document-Based Chunking
Description: Split based on document structure (headings, sections, paragraphs)
Best for: Structured documents (Markdown, HTML, code files)
from langchain.text_splitter import MarkdownHeaderTextSplitter

# Split Markdown by headers (preserves hierarchy)
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
chunks = markdown_splitter.split_text(markdown_document)

# Each chunk includes header hierarchy as metadata
# Example: {"Header 1": "NCP-AAI Guide", "Header 2": "RAG Systems"}
Pros:
- Respects author's intended structure
- Metadata enrichment (section titles, hierarchy)
Cons:
- Only works for well-structured documents
- Chunk size highly variable
Performance: 1.15-1.25x retrieval quality (when structure is meaningful)
Chunking Strategy #4: Agentic Chunking (Emerging 2025)
Description: Use LLM to intelligently determine chunk boundaries
Best for: Complex documents requiring human-like understanding
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# LLM determines optimal chunk boundaries (`llm` is any configured chat/completion model)
agentic_prompt = PromptTemplate(
    input_variables=["text"],
    template="""Analyze this text and determine logical chunk boundaries
where topics change. Mark boundaries with [SPLIT].
Text: {text}
Output the text with [SPLIT] markers:""",
)
agentic_chunker = LLMChain(llm=llm, prompt=agentic_prompt)
marked_text = agentic_chunker.run(document.page_content)
chunks = marked_text.split("[SPLIT]")
Pros:
- Highest semantic quality (simulates human chunking)
- Handles complex documents (legal, technical, narrative)
Cons:
- Expensive (LLM call per document)
- Slow (not suitable for real-time ingestion)
Performance: 1.3-1.4x retrieval quality (highest, but costly)
NCP-AAI Exam: Chunking Decision Matrix
| Document Type | Recommended Strategy | Chunk Size | Overlap |
|---|---|---|---|
| General text | Fixed-size | 512 tokens | 50 tokens (10%) |
| Technical docs | Semantic | 300-800 tokens | N/A (semantic) |
| Structured (MD, HTML) | Document-based | Variable | N/A |
| Legal/contracts | Agentic | Variable | N/A |
| Code repositories | Document-based (by function) | 50-200 lines | 10 lines |
| Chat transcripts | Fixed-size with timestamp metadata | 10-20 messages | 2 messages |
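For the code-repository row, LangChain's language-aware splitter prefers to break at class and function boundaries rather than arbitrary character offsets; a short example for Python source (the file name is illustrative):
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

# Splits Python source preferentially at class/def boundaries before blank lines
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=3000,   # characters; tune toward the 50-200 line guideline above
    chunk_overlap=300,
)
code_chunks = code_splitter.create_documents([open("rag_pipeline.py").read()])  # illustrative file name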
Stage 3: Embedding and Indexing
Embedding Model Selection (2025)
Best Embedding Models for NCP-AAI:
| Model | Dimensions | Performance | Cost | Best For |
|---|---|---|---|---|
| NV-Embed-v2 | 4096 | SOTA (72.3 MTEB) | Medium | NVIDIA ecosystem |
| text-embedding-3-large | 3072 | Excellent (64.6 MTEB) | Low | General-purpose |
| text-embedding-3-small | 1536 | Good (62.3 MTEB) | Very Low | Budget/speed |
| Cohere embed-v3 | 1024 | Excellent (64.5 MTEB) | Medium | Multilingual |
| BGE-large-en-v1.5 | 1024 | Good (63.9 MTEB) | Free | Open-source |
NCP-AAI Exam Tip: Know that higher dimensions ≠ always better. Consider:
- Latency: 4096-dim embeddings are 2.5x slower than 1024-dim
- Storage: 4096-dim requires 4x more vector DB storage
- Quality: Diminishing returns above 1024 dimensions for most tasks
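A quick back-of-the-envelope calculation makes the storage tradeoff concrete (raw float32 vectors only, ignoring index overhead):
# Raw float32 vector storage for 10 million chunks (4 bytes per dimension)
num_vectors = 10_000_000
for dims in (1024, 1536, 3072, 4096):
    gb = num_vectors * dims * 4 / 1e9
    print(f"{dims:>5} dims: ~{gb:,.0f} GB")
# 1024 dims → ~41 GB; 4096 dims → ~164 GB (4x more before any index overhead)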
Vector Database Options
Production Vector Databases:
- Pinecone (Managed, easiest)
  - Serverless, auto-scaling
  - Best for: Startups, prototypes
  - Cost: $70/month per 100K vectors
- Weaviate (Open-source, flexible)
  - Hybrid search (vector + keyword) built-in
  - Best for: Self-hosted, cost-sensitive
  - Cost: Free (self-hosted)
- Milvus (High-performance)
  - Handles billions of vectors
  - Best for: Large-scale enterprise
  - Cost: Free (self-hosted) or managed via Zilliz
- Chroma (Dev-friendly)
  - Embedded database (no server); see the quick-start sketch below
  - Best for: Local dev, prototypes
  - Cost: Free
NCP-AAI Exam Scenario: "Your team needs to store 100M vectors with hybrid search. Which vector DB?" (Answer: Milvus or Weaviate with production deployment)
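At the other end of the scale, the embedded option takes only a few lines; a quick-start sketch with Chroma (collection name and documents are made up):
import chromadb

client = chromadb.Client()  # in-process, no server to run
collection = client.create_collection("ncp_aai_docs")

# Chroma embeds documents with its default embedding function
collection.add(
    ids=["doc1", "doc2"],
    documents=[
        "RAG combines retrieval with LLM generation.",
        "Hybrid search fuses vector and keyword results.",
    ],
)
results = collection.query(query_texts=["What is hybrid search?"], n_results=1)
print(results["documents"][0])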
Indexing Code Example
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.pinecone import PineconeVectorStore
import pinecone

# Initialize vector store (legacy pinecone-client v2 style init)
pinecone.init(api_key="your-key", environment="us-west1-gcp")
pinecone_index = pinecone.Index("ncp-aai-docs")
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)

# Create index with custom embeddings
embed_model = OpenAIEmbedding(model="text-embedding-3-large")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=embed_model,
)

# Persist for later use
index.storage_context.persist(persist_dir="./storage")
Stage 4: Retrieval (Query-Time)
Retrieval Method #1: Semantic Search (Baseline)
How it works: Embed query, find K nearest neighbor vectors
# Simple semantic search
query_engine = index.as_query_engine(
    similarity_top_k=5  # Retrieve top 5 most similar chunks
)
response = query_engine.query("What is the NCP-AAI exam structure?")
Pros: Fast, works well for most queries
Cons: May miss exact keyword matches (e.g., product names, codes)
Performance: Baseline (1.0x)
Retrieval Method #2: Hybrid Search (SOTA 2025)
How it works: Combine vector similarity + keyword search (BM25) with fusion
from llama_index.core.retrievers import VectorIndexRetriever, QueryFusionRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.retrievers.bm25 import BM25Retriever

# Vector retriever
vector_retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,
)

# Keyword retriever (BM25)
bm25_retriever = BM25Retriever.from_defaults(
    docstore=index.docstore,
    similarity_top_k=10,
)

# Fusion retriever (combines both with RRF - Reciprocal Rank Fusion)
hybrid_retriever = QueryFusionRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    similarity_top_k=5,
    mode="reciprocal_rerank",  # RRF fusion algorithm
)

# Use in query engine
query_engine = RetrieverQueryEngine(retriever=hybrid_retriever)
response = query_engine.query("NCP-AAI exam Domain 2 percentage")
Pros: 15-25% better recall than pure vector search
Cons: Slightly slower (2 retrievals + fusion)
Performance: 1.2-1.3x retrieval quality
Retrieval Method #3: Reranking (Essential for High Quality)
How it works: Retrieve 20-50 candidates, rerank with cross-encoder model
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.postprocessor.cohere_rerank import CohereRerank

# Retrieve more candidates (overgenerate)
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=20,  # Retrieve 20 candidates
)

# Rerank to top 5 with a cross-encoder
reranker = CohereRerank(
    api_key="your-cohere-key",
    top_n=5,  # Return top 5 after reranking
)

query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[reranker],
)
response = query_engine.query("Explain NVIDIA NIM deployment")
How reranking works:
- Bi-encoder (fast) retrieves 20 candidates (~50ms)
- Cross-encoder (slow but accurate) reranks to top 5 (~200ms)
Pros: 20-30% better precision (fewer irrelevant results)
Cons: Adds 150-250ms latency, costs $0.001 per query (Cohere)
Performance: 1.3-1.4x retrieval quality
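If a paid reranking API isn't an option, the same two-stage pattern works with an open-source cross-encoder; a minimal sketch using sentence-transformers:
from sentence_transformers import CrossEncoder

# Cross-encoder scores each (query, passage) pair jointly, unlike a bi-encoder
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, top_n=5):
    scores = cross_encoder.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_n]]

# `candidates` would be the 20 chunks returned by the first-stage retriever
top_chunks = rerank("Explain NVIDIA NIM deployment", ["chunk A ...", "chunk B ..."], top_n=2)
LlamaIndex packages the same idea as the SentenceTransformerRerank node postprocessor, so it can slot into node_postprocessors in place of CohereRerank above.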
NCP-AAI Exam: Retrieval Decision Matrix
| Use Case | Recommended Method | top_k | Rationale |
|---|---|---|---|
| General Q&A | Hybrid search | 5 | Balance speed & quality |
| Exact match critical | Hybrid + reranking | 3 | Legal docs, product codes |
| Low latency required | Semantic search | 3-5 | Chat applications |
| High precision needed | Hybrid + reranking + compression | 3 | Customer support, medical |
| Multi-hop reasoning | Agentic RAG (iterative retrieval) | 5 per hop | Complex research tasks |
Master These Concepts with Practice
Our NCP-AAI practice bundle includes:
- 7 full practice exams (455+ questions)
- Detailed explanations for every answer
- Domain-by-domain performance tracking
30-day money-back guarantee
Stage 5: Response Generation
Prompt Engineering for RAG
Basic RAG Prompt:
rag_prompt_template = """Use the following context to answer the question.
If the answer is not in the context, say "I don't have enough information."
Context:
{context}
Question: {question}
Answer:"""
Advanced RAG Prompt (with citations):
advanced_rag_prompt = """You are an expert assistant. Answer the question using ONLY the provided context.
Context:
{context}
Instructions:
1. Answer based solely on the context above
2. If the context doesn't contain the answer, respond: "The provided documents don't contain this information."
3. Cite sources using [Source X] notation
4. If context is ambiguous, acknowledge uncertainty
Question: {question}
Answer (with citations):"""
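The {context} placeholder in both templates is simply the retrieved chunks joined into one string; a small helper that numbers each chunk so the model can emit [Source X] citations (the metadata key assumes LlamaIndex-style nodes and is an assumption, not a fixed API):
def build_context(nodes):
    # Format retrieved nodes into a numbered context block for the prompt
    blocks = []
    for i, node in enumerate(nodes, start=1):
        source = node.metadata.get("file_name", "unknown")  # metadata key is an assumption
        blocks.append(f"[Source {i}] ({source})\n{node.get_content()}")
    return "\n\n".join(blocks)

# prompt = advanced_rag_prompt.format(context=build_context(retrieved_nodes), question=query)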
Handling Hallucinations
Problem: LLM generates plausible-sounding but false information
Solutions:
from typing import List
from pydantic import BaseModel, Field
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.core.program import LLMTextCompletionProgram

# 1. Require a minimum similarity threshold so low-confidence chunks are dropped
similarity_filter = SimilarityPostprocessor(similarity_cutoff=0.7)

# 2. Use structured output for citations
class RAGResponse(BaseModel):
    answer: str = Field(description="Answer to the question")
    sources: List[str] = Field(description="List of source document IDs used")
    confidence: float = Field(description="Confidence score 0-1")

program = LLMTextCompletionProgram.from_defaults(
    output_cls=RAGResponse,
    prompt_template_str=advanced_rag_prompt,
)

# 3. Implement guardrails with NeMo Guardrails (the Colang flow below is illustrative)
from nemoguardrails import LLMRails, RailsConfig

colang_content = """
define flow check_hallucination:
  if bot response not grounded in context:
    bot say "I don't have reliable information on this."
"""
rails_config = RailsConfig.from_content(colang_content=colang_content)
rails = LLMRails(rails_config)
response = rails.generate(messages=[{"role": "user", "content": query}])
Advanced RAG Techniques (2025)
Technique #1: Query Transformation
Problem: User query may not match document phrasing
Solution: Rewrite/expand query before retrieval
from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.query_engine import TransformQueryEngine

# HyDE: Generate hypothetical document, embed it, retrieve similar
hyde = HyDEQueryTransform(include_original=True)
query_engine = TransformQueryEngine(base_query_engine, query_transform=hyde)

# Original query: "NCP-AAI exam difficulty"
# HyDE generates: "The NCP-AAI exam is moderately difficult, requiring 8-12 weeks of study..."
# Embeds the hypothetical answer, retrieves similar documents
Technique #2: Multi-Hop Reasoning
Problem: Single retrieval may not contain full answer
Solution: Iteratively retrieve and reason
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool

# Turn the query engine into a tool the agent can call
query_tool = QueryEngineTool.from_defaults(
    query_engine=query_engine,
    name="knowledge_base",
    description="Search company knowledge base",
)

# Agent can retrieve multiple times (`llm` is any configured LLM)
agent = ReActAgent.from_tools([query_tool], llm=llm, verbose=True)

# Query: "Compare NCP-AAI and AWS AI practitioner exam"
# Agent: 1. Retrieve NCP-AAI info, 2. Retrieve AWS info, 3. Compare
response = agent.chat("Compare NCP-AAI and AWS AI practitioner exam")
Technique #3: Context Compression
Problem: Retrieved chunks contain irrelevant information
Solution: Extract only relevant sentences
from llama_index.core.postprocessor import (
    LongContextReorder,
    SentenceEmbeddingOptimizer,
)
from llama_index.core.query_engine import RetrieverQueryEngine

# 1. Reorder chunks (most relevant at the edges, less relevant in the middle)
reorder = LongContextReorder()

# 2. Compress: drop sentences with low embedding similarity to the query
compressor = SentenceEmbeddingOptimizer(
    embed_model=embed_model,
    percentile_cutoff=0.5,  # keep roughly the most relevant half of each chunk's sentences
)

query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[reorder, compressor],
)
RAG Evaluation Metrics
Key Metrics for NCP-AAI
1. Retrieval Quality:
- Recall@K: % of relevant docs in top K results (target: >85%)
- Precision@K: % of retrieved docs that are relevant (target: >70%)
- MRR (Mean Reciprocal Rank): 1/rank of first relevant doc (target: >0.7)
2. Generation Quality:
- Answer Relevancy: How relevant is answer to question? (target: >0.8)
- Faithfulness: Does answer match context? (target: >0.9)
- Context Relevancy: Is retrieved context relevant? (target: >0.75)
3. System Performance:
- Latency: Time from query to response (target: <2 seconds)
- Throughput: Queries per second (target: >50 QPS)
- Cost: $ per 1000 queries (target: <$0.50)
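The retrieval metrics above are straightforward to compute yourself once you have labeled (query, relevant document IDs) pairs; a minimal sketch (the LlamaIndex-based evaluation below automates the generation-side metrics):
def recall_at_k(retrieved_ids, relevant_ids, k):
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / max(len(relevant_ids), 1)

def precision_at_k(retrieved_ids, relevant_ids, k):
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / k

def mrr(retrieved_ids, relevant_ids):
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Example: 2 of 3 relevant docs appear in the top 5, the first at rank 2
retrieved = ["d7", "d2", "d9", "d4", "d1"]
relevant = ["d2", "d4", "d8"]
print(recall_at_k(retrieved, relevant, 5))     # 0.67
print(precision_at_k(retrieved, relevant, 5))  # 0.4
print(mrr(retrieved, relevant))                # 0.5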
RAG Evaluation Code
import asyncio
from llama_index.core.evaluation import (
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    BatchEvalRunner,
)

# Initialize evaluators
faithfulness_evaluator = FaithfulnessEvaluator(llm=llm)
relevancy_evaluator = RelevancyEvaluator(llm=llm)

# Create evaluation dataset
eval_questions = [
    "What is the NCP-AAI exam duration?",
    "How many questions are in NCP-AAI?",
    # ... more questions
]

# Run batch evaluation (aevaluate_queries is async, so wrap it for scripts)
runner = BatchEvalRunner(
    evaluators={"faithfulness": faithfulness_evaluator, "relevancy": relevancy_evaluator},
    workers=8,
)

async def run_eval():
    return await runner.aevaluate_queries(
        query_engine=query_engine,
        queries=eval_questions,
    )

eval_results = asyncio.run(run_eval())
# eval_results["faithfulness"] is a list of per-query EvaluationResult objects (score, passing, feedback)
NCP-AAI Exam Preparation Tips
High-Probability RAG Questions
1. Chunking Strategy Selection:
- "Your team is building a RAG system for legal contracts. Which chunking strategy?" (Answer: Agentic or document-based with section preservation)
- "What is the primary advantage of semantic chunking?" (Answer: Preserves semantic coherence, avoids splitting concepts)
2. Retrieval Optimization:
- "How can you improve RAG retrieval quality by 20-30% without changing the embedding model?" (Answer: Implement hybrid search + reranking)
- "What technique addresses the cold start problem where initial retrieval misses relevant documents?" (Answer: Query transformation like HyDE)
3. Performance Troubleshooting:
- "RAG system retrieves irrelevant chunks 40% of the time. What's the most likely cause?" (Answer: Poor chunking strategy or chunks too large)
- "How to reduce RAG latency from 3 seconds to under 1 second?" (Answer: Reduce top_k, use smaller embedding model, cache frequent queries)
Hands-On Practice Checklist
Week 1-2:
- Build basic RAG system with fixed-size chunking
- Experiment with chunk sizes (256, 512, 1024 tokens)
- Compare 3 embedding models (OpenAI, Cohere, open-source)
Week 3-4:
- Implement semantic chunking
- Add hybrid search (vector + keyword)
- Integrate reranking with Cohere or cross-encoder
Week 5-6:
- Build agentic RAG with multi-hop reasoning
- Implement query transformation (HyDE)
- Add guardrails for hallucination prevention
- Run evaluation on test queries
Preporato's NCP-AAI Practice Exams
Master RAG systems and all NCP-AAI domains with Preporato's 7 full-length practice exams:
- RAG scenario questions testing chunking, retrieval, and optimization
- Hands-on RAG challenges with real-world architectures
- Detailed explanations comparing approaches (semantic vs. fixed-size, hybrid vs. pure vector)
- Performance tracking by Domain 2 (Knowledge Integration)
- $49 for all 7 exams (vs. $200 exam retake fee)
95% of Preporato users pass NCP-AAI on their first attempt. Get started today at Preporato.com!
Conclusion
RAG is the foundation of knowledge-grounded agentic AI systems and a critical component of the NCP-AAI certification. To excel in the exam and production deployments, master:
- Chunking strategies: Fixed-size (baseline), semantic (SOTA), document-based, agentic
- Retrieval methods: Semantic search, hybrid search, reranking
- Advanced techniques: Query transformation, multi-hop reasoning, context compression
- Evaluation metrics: Recall, precision, faithfulness, relevancy
- Production optimization: Latency, cost, accuracy tradeoffs
Key takeaway: RAG quality is 40% chunking, 30% retrieval method, 20% generation prompt, 10% embedding model. Focus your optimization efforts accordingly.
Ready to master RAG and ace the NCP-AAI certification? Start practicing with Preporato's comprehensive exam prep platform today!
Frequently Asked Questions
Q: What's the optimal chunk size for RAG systems? A: 512 tokens is the sweet spot for most use cases. Use 256 for precise retrieval, 1024 for broader context. Always experiment with your specific documents.
Q: Should I use semantic or fixed-size chunking? A: Start with fixed-size (faster, simpler). Upgrade to semantic if retrieval quality is insufficient (15-25% improvement). Semantic chunking is slower but worth it for high-quality RAG.
Q: How many chunks should I retrieve (top_k)? A: 3-5 for most use cases. Use 5-10 if adding reranking. More isn't always better—too much context confuses the LLM.
Q: What's the difference between RAG and fine-tuning? A: RAG retrieves external knowledge at query time. Fine-tuning bakes knowledge into model weights. Use RAG for dynamic knowledge, fine-tuning for style/format.
Q: Can RAG work with real-time data? A: Yes. Use streaming indices (LlamaIndex) or incremental updates. For ultra-real-time (seconds), consider direct API calls instead of indexing.
Q: How to prevent RAG hallucinations? A: (1) Strong prompt: "Answer only from context", (2) Similarity threshold (>0.7), (3) Guardrails, (4) Structured output with citations.
Q: What's the cost of running RAG at scale? A: $0.10-$0.50 per 1000 queries (embedding + LLM generation + reranking). Use caching and smaller models to reduce costs 50-70%.
Q: Does RAG require GPU? A: No. Embedding and retrieval run on CPU. Only LLM generation benefits from GPU (but can use API providers like OpenAI/Anthropic).
Ready to Pass the NCP-AAI Exam?
Join thousands who passed with Preporato practice tests
