This comprehensive cheat sheet covers all 10 exam domains with architecture patterns, comparison tables, and code snippets. Based on the official NVIDIA NCP-AAI exam guide (15/15/13/13/10/10/7/5/5/5 weighting).
Domain 1: Agent Architecture and Design (15%)
Agent Architecture Patterns
| Pattern | How It Works | Best For | Trade-offs |
|---|---|---|---|
| ReAct | Interleaved Reasoning + Action loops | Dynamic tasks with tools | Flexible but can loop; higher latency |
| Plan-and-Execute | Create full plan → execute steps | Well-defined multi-step tasks | Efficient but brittle to plan changes |
| Reflexion | Execute → Self-evaluate → Retry | Accuracy-critical tasks | Higher accuracy but 2-3x more LLM calls |
| LATS | Monte Carlo Tree Search for planning | Complex optimization tasks | Best accuracy, highest compute cost |
| Tool-Only | Direct tool routing, minimal reasoning | Simple tool dispatch | Fast but limited reasoning capability |
When to use which:
Task is dynamic with unknown steps? → ReAct
Task is well-defined and sequential? → Plan-and-Execute
Accuracy is critical, latency flexible? → Reflexion
Multiple valid solution paths exist? → LATS
Simple tool routing, no reasoning? → Tool-Only
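The ReAct pattern above can be sketched as a minimal loop. Here `llm` and `tools` are hypothetical stand-ins: a model client that emits `Action:` or `Final Answer:` lines, and a dict of callable tools.

```python
# Minimal ReAct-style loop: Thought -> Action -> Observation, repeated
# until the model emits a final answer or the step budget runs out.
def react_loop(llm, tools, question, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)              # model returns its next thought/action
        transcript += step + "\n"
        if step.startswith("Final Answer:"):
            return step.removeprefix("Final Answer:").strip()
        if step.startswith("Action:"):
            name, _, arg = step.removeprefix("Action:").strip().partition(" ")
            observation = tools[name](arg)  # run the tool, feed result back
            transcript += f"Observation: {observation}\n"
    return "Step budget exhausted"
```

The step cap is the standard defense against the "can loop" trade-off noted in the table.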
Single-Agent vs Multi-Agent
| Factor | Single Agent | Multi-Agent |
|---|---|---|
| Use when | Task is linear, <5 tools | Task has distinct sub-domains |
| Complexity | Low | High (coordination overhead) |
| Latency | Lower | Higher (message passing) |
| Scalability | Limited | Better (parallel execution) |
| Debugging | Easier | Harder (distributed state) |
Multi-Agent Orchestration Patterns:
Sequential: Agent A → Agent B → Agent C (pipeline)
Parallel: [Agent A | Agent B | Agent C] run concurrently, results merged (fan-out/fan-in)
Hierarchical: Orchestrator → [Worker A, Worker B, Worker C]
Collaborative: Agents negotiate and share state
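The hierarchical pattern above can be sketched as an orchestrator that plans, fans out to workers, and merges. All names here (`decompose`, `merge`, the worker callables) are illustrative, not tied to any specific framework.

```python
# Hierarchical orchestration: plan -> fan out to workers -> merge.
def orchestrate(task, workers, decompose, merge):
    subtasks = decompose(task)                  # orchestrator plans: {worker_name: sub_task}
    results = {name: workers[name](sub)         # each worker handles its part
               for name, sub in subtasks.items()}
    return merge(results)                       # combine worker outputs into one answer
```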
Agent Communication
| Protocol | Pattern | Use Case |
|---|---|---|
| Direct messaging | Agent-to-agent | Small teams, low latency |
| Publish-subscribe | Event-driven | Loose coupling, scalability |
| Shared state | Blackboard pattern | Collaborative problem-solving |
| A2A Protocol | Cross-platform | Interoperability between frameworks |
Domain 2: Agent Development (15%)
Tool/Function Calling
# OpenAI-compatible function calling format
tools = [{
    "type": "function",
    "function": {
        "name": "search_database",
        "description": "Search product database by query",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
                "limit": {"type": "integer", "default": 5}
            },
            "required": ["query"]
        }
    }
}]

# Always validate tool parameters before execution
def safe_tool_call(tool_name, params):
    validated = validate_params(tool_name, params)
    try:
        return execute_tool(tool_name, validated)
    except ToolError as e:
        return fallback_response(tool_name, e)
Error Handling Patterns
| Pattern | When to Use | Implementation |
|---|---|---|
| Retry with backoff | Transient failures (API timeouts) | Exponential: 1s → 2s → 4s → 8s, max 3 retries |
| Circuit breaker | Repeated failures from same tool | Open after 3 failures, half-open after 30s |
| Fallback | Primary tool unavailable | Switch to alternative tool or graceful message |
| Graceful degradation | Non-critical tool failure | Continue with partial information |
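The first row of the table, retry with exponential backoff, can be sketched as below. `TransientError` is a hypothetical stand-in for timeout/connection errors raised by a tool or API client; jitter is added to avoid synchronized retry storms.

```python
import random
import time

class TransientError(Exception):
    """Stand-in for timeout/connection errors from a tool or API."""

# Exponential backoff: delays of 1s, 2s, 4s, ... up to a retry cap.
def retry_with_backoff(func, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_retries:
                raise                               # out of retries; surface the error
            delay = base_delay * (2 ** attempt)     # 1x, 2x, 4x, ...
            time.sleep(delay + random.uniform(0, 0.1 * delay))  # add jitter
```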
# Circuit breaker pattern
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30):
        self.failures = 0
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.opened_at = None
        self.state = "closed"  # closed → open → half-open

    def call(self, func, fallback, *args):
        if self.state == "open":
            if time.time() - self.opened_at > self.reset_timeout:
                self.state = "half-open"  # allow one trial call
            else:
                return fallback()
        try:
            result = func(*args)
            self.failures = 0
            self.state = "closed"
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.state = "open"
                self.opened_at = time.time()
            raise
Agent Frameworks Comparison
| Framework | Best For | Multi-Agent | Key Feature |
|---|---|---|---|
| LangChain | General agents | Via LangGraph | Largest ecosystem, most tools |
| LlamaIndex | RAG-heavy agents | Limited | Best RAG integration |
| AutoGen | Multi-agent chat | Native | Conversational agent teams |
| CrewAI | Role-based teams | Native | Role + goal + backstory agents |
| LangGraph | Stateful workflows | Native | Graph-based agent orchestration |
Domain 3: Evaluation and Tuning (13%)
Agent Evaluation Metrics
| Metric | What It Measures | Target Range |
|---|---|---|
| Task completion rate | % of tasks fully completed | >85% for production |
| Reasoning accuracy | Correctness of intermediate steps | >90% for critical tasks |
| End-to-end latency | Total time from input to output | <5s for interactive, <30s for async |
| Cost per interaction | Total LLM + tool API costs | Monitor trend, set budget alerts |
| Tool selection accuracy | % of correct tool choices | >90% |
| Hallucination rate | % of unsupported claims | <5% with RAG |
| User satisfaction (CSAT) | User-reported quality | >4.0/5.0 |
A/B Testing for Agents
1. Define hypothesis: "ReAct with CoT outperforms vanilla ReAct"
2. Split traffic: 50/50 random assignment
3. Measure: Task completion, latency, cost, user satisfaction
4. Duration: Minimum 1000 interactions per variant
5. Statistical significance: p < 0.05
6. Decision: Roll out winner, document learnings
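The significance check in step 5 can be done with a two-proportion z-test on task-completion counts, using only the standard library. The sample numbers below are illustrative.

```python
import math

# Two-proportion z-test: is variant B's completion rate significantly
# different from variant A's?
def two_proportion_z(success_a, n_a, success_b, n_b):
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)      # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

z, p = two_proportion_z(820, 1000, 865, 1000)  # 82.0% vs 86.5% completion
# roll out variant B only if p < 0.05
```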
Fine-Tuning Decision Guide
| Scenario | Approach | Why |
|---|---|---|
| Agent needs domain vocabulary | LoRA fine-tune | Adapts to terminology without full retrain |
| Agent formatting is inconsistent | Prompt engineering first | Cheaper, faster iteration |
| Tool selection is poor | Fine-tune on tool-use dataset | Improves function calling accuracy |
| General quality is low | Upgrade base model | Fine-tuning can't fix weak foundations |
Domain 4: Deployment and Scaling (13%)
Containerized Agent Deployment
# docker-compose.yml for agent system
services:
  agent-api:
    image: agent-service:latest
    deploy:
      replicas: 3
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    environment:
      - MODEL_ENDPOINT=http://nim-server:8000
      - VECTOR_DB_URL=http://chromadb:8000
  nim-server:
    image: nvcr.io/nim/meta/llama-3-8b-instruct:latest
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    ports:
      - "8000:8000"
Scaling Strategies
| Strategy | When to Use | Implementation |
|---|---|---|
| Horizontal | More concurrent users | Add agent replicas behind load balancer |
| Vertical | Larger models, more memory | Upgrade GPU (A100 → H100) |
| Auto-scaling | Variable load patterns | Scale on queue depth or latency metrics |
| GPU sharing | Multiple small models | Triton multi-model serving |
Deployment Strategies
| Strategy | Risk | Rollback | Use When |
|---|---|---|---|
| Blue/Green | Low | Instant switch | Major agent updates, new models |
| Canary | Very Low | Fast | Gradual rollout, measure impact |
| Rolling | Medium | Slow | Minor updates, stateless services |
| Shadow | None | N/A | Testing new agent in parallel |
Domain 5: Cognition, Planning, and Memory (10%)
Reasoning Frameworks
| Framework | Mechanism | Best For | Latency |
|---|---|---|---|
| Chain-of-Thought (CoT) | Step-by-step reasoning | Linear problems, math | 1.5-2x base |
| Tree-of-Thoughts (ToT) | Branching exploration | Creative/strategic tasks | 3-5x base |
| ReAct Reasoning | Thought → Action → Observation | Tool-using agents | 2-3x base |
| MCTS | Monte Carlo search over plans | Optimization problems | 5-10x base |
| Self-Consistency | Multiple CoT, majority vote | High-stakes decisions | 3-5x base |
CoT: Think → Think → Think → Answer (linear)
ToT: Think → Branch → Evaluate → Select → Think (tree)
ReAct: Think → Act → Observe → Think → Act (loop)
MCTS: Simulate → Evaluate → Backprop → Select (search)
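Self-consistency from the table reduces to sampling several answers and taking a majority vote. `sample_answer` is a hypothetical stand-in for one temperature > 0 LLM call.

```python
from collections import Counter

# Self-consistency: sample multiple chain-of-thought answers, majority vote.
def self_consistency(sample_answer, question, n_samples=5):
    answers = [sample_answer(question) for _ in range(n_samples)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / n_samples  # winning answer plus agreement ratio
```

The agreement ratio doubles as a rough confidence signal, which pairs naturally with the HITL escalation thresholds in Domain 10.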
Memory Systems
| Memory Type | Storage | Duration | Use Case |
|---|---|---|---|
| Short-term | Context window | Current session | Active conversation |
| Long-term | Vector database | Persistent | User preferences, past interactions |
| Episodic | Key-value store | Persistent | Specific past events and outcomes |
| Semantic | Knowledge graph | Persistent | Factual knowledge, relationships |
| Working | Scratchpad | Current task | Intermediate reasoning steps |
Memory Architecture Pattern:
User Input → Short-term Memory (context window)
→ Retrieve from Long-term Memory (vector DB)
→ Check Episodic Memory (similar past scenarios)
→ Agent Processes with Working Memory
→ Store important results to Long-term Memory
Context Window Management
# Token budget allocation
total_tokens = 8192 # Model context window
system_prompt = 500 # ~6% - Agent instructions
retrieved_docs = 3000 # ~37% - RAG context
conversation = 2000 # ~24% - Chat history
working_memory = 1000 # ~12% - Scratchpad
output_reserve = 1692 # ~21% - Generation space
# Compression strategies when exceeding budget:
# 1. Summarize older conversation turns
# 2. Reduce retrieved docs (top-3 → top-2)
# 3. Compress working memory
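A simple variant of strategy 1, dropping rather than summarizing the oldest turns, can be sketched as a budget-aware trimmer. `count_tokens` is a hypothetical stand-in for a real tokenizer (e.g. tiktoken); here it approximates tokens as whitespace-split words.

```python
def count_tokens(text):
    return len(text.split())  # crude word-count proxy for tokens

def trim_history(turns, budget):
    """Keep the most recent turns whose combined token count fits `budget`."""
    kept, used = [], 0
    for turn in reversed(turns):        # walk newest -> oldest
        cost = count_tokens(turn)
        if used + cost > budget:
            break                       # oldest turns past this point are dropped
        kept.append(turn)
        used += cost
    return list(reversed(kept))         # restore chronological order
```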
Domain 6: Knowledge Integration and Data Handling (10%)
RAG Pipeline
Documents → Chunk → Embed → Store (Vector DB)
↓
Query → Embed → Search → Retrieve Top-K → Rerank → Augment Prompt → LLM → Response
Chunking Strategies
| Strategy | Chunk Size | Overlap | Best For |
|---|---|---|---|
| Fixed-size | 512 tokens | 10-15% | General purpose, fast |
| Semantic | Variable | At topic breaks | Long documents, mixed topics |
| Recursive | 512-1024 | 15-20% | Structured docs (markdown, code) |
| Document-level | Full doc | N/A | Short documents, FAQs |
Best practices:
- Preserve sentence boundaries
- Include metadata (source, page, date)
- Test with your domain data
- Smaller chunks = more precise, less context
- Larger chunks = more context, less precise
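Fixed-size chunking with overlap (the first row of the table) can be sketched as below. Tokens are approximated by whitespace-split words, so a real tokenizer should be swapped in for production use.

```python
# Fixed-size chunking with overlap: sliding windows over the token stream.
def chunk_fixed(text, chunk_size=512, overlap=64):
    words = text.split()
    step = chunk_size - overlap         # windows advance by size minus overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break                       # last window already reached the end
    return chunks
```

With the defaults, 64/512 gives roughly the 10-15% overlap the table recommends.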
Vector Database Comparison
| Database | Hosting | Scalability | Best For |
|---|---|---|---|
| ChromaDB | Self-hosted | Small-medium | Prototyping, local dev |
| Pinecone | Managed cloud | Enterprise | Production, zero-ops |
| Weaviate | Both | Large | Hybrid search, GraphQL |
| FAISS | In-memory | Large | Speed-critical, read-heavy |
| Qdrant | Both | Large | Filtering + vector search |
Retrieval Optimization
# Hybrid search: BM25 (keyword) + Semantic (vector)
from langchain.retrievers import EnsembleRetriever

ensemble = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.3, 0.7],  # Weight semantic higher
)

# Reranking: Cross-encoder for precision
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Pipeline: Retrieve 20 → Rerank → Return top 3-5
candidates = ensemble.get_relevant_documents(query, k=20)
reranked = reranker.rank(query, [doc.page_content for doc in candidates],
                         return_documents=True)
final_docs = reranked[:5]  # each entry carries corpus_id, score, and text
Similarity Metrics
| Metric | Formula | When to Use |
|---|---|---|
| Cosine similarity | A·B / (‖A‖ × ‖B‖) | Default for embeddings |
| Dot product | A·B | Normalized vectors (faster) |
| Euclidean (L2) | √Σ(a-b)² | Absolute distance matters |
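The three metrics in pure Python, for reference. Note that for unit-normalized vectors cosine similarity equals the dot product, which is why dot product is the faster choice when embeddings are pre-normalized.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # dot product divided by the product of the vector norms
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean(a, b):
    # straight-line (L2) distance between the two vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```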
Domain 7: NVIDIA Platform Implementation (7%)
NVIDIA NIM (Inference Microservices)
# Deploy optimized LLM inference with NIM
docker run -it --gpus all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest

# OpenAI-compatible API
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta/llama-3.1-8b-instruct",
       "messages": [{"role": "user", "content": "Hello"}],
       "max_tokens": 100}'
Key NIM Features:
- TensorRT-LLM optimizations (3-5x speedup)
- Multi-GPU support with tensor parallelism
- OpenAI API compatibility
- Built-in health checks and metrics
Triton Inference Server
# config.pbtxt for model serving
name: "agent-llm"
platform: "tensorrt_llm"
max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [4, 8]
  max_queue_delay_microseconds: 100
}
instance_group [
  { count: 1, kind: KIND_GPU }
]
Triton Use Cases:
- Multi-model serving (host embedding + LLM + reranker)
- Dynamic batching (combine requests for efficiency)
- Model ensembles (chain retriever → LLM)
NeMo Guardrails
# Example: Prevent off-topic conversations
define user ask off topic
  "What's the weather?"
  "Tell me a joke"
  "Who won the game?"

define flow off topic
  user ask off topic
  bot refuse off topic

define bot refuse off topic
  "I'm focused on helping with [your domain]. Let me know how I can assist with that."

# Example: Require human approval for actions
define flow high stakes action
  user request financial transaction
  bot confirm with human

define bot confirm with human
  "This action requires human approval. Routing to a reviewer."
NVIDIA Platform Quick Reference
| Tool | Purpose | Key Use |
|---|---|---|
| NIM | Model deployment | Optimized inference containers |
| Triton | Model serving | Multi-model, dynamic batching |
| NeMo | Model development | Training, fine-tuning, RLHF |
| NeMo Guardrails | Agent safety | Content filtering, topic control |
| TensorRT-LLM | Optimization | Quantization, kernel fusion |
| NGC | Container registry | Pre-built AI containers |
Domain 8: Run, Monitor, and Maintain (5%)
Production Monitoring
| What to Monitor | Metric | Alert Threshold |
|---|---|---|
| Latency | P50, P95, P99 | P95 > 2x baseline |
| Task completion | Success rate | < 80% over 1 hour |
| Error rate | Errors/total | > 5% over 15 min |
| Token usage | Tokens/interaction | > 2x average |
| Cost | $/interaction | > budget threshold |
| Model drift | Quality score trend | Declining 3+ days |
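The P50/P95/P99 latency values in the table can be computed from raw samples with the standard library's `statistics.quantiles`, which with `n=100` returns the 99 percentile cut points.

```python
import statistics

def latency_percentiles(samples_ms):
    """P50/P95/P99 from a list of per-request latencies in milliseconds."""
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 percentile cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

An alerting job would compare `p95` against a stored baseline and fire when it exceeds 2x, per the table's threshold.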
Distributed Tracing for Agents
Trace: user_query_123
├── Agent Orchestrator (50ms)
│ ├── Retrieve from Vector DB (120ms)
│ ├── LLM Reasoning Step 1 (800ms)
│ ├── Tool Call: search_api (350ms)
│ ├── LLM Reasoning Step 2 (750ms)
│ └── Format Response (30ms)
└── Total: 2100ms
Tools: OpenTelemetry, LangSmith, Datadog, Grafana
Domain 9: Safety, Ethics, and Compliance (5%)
Agent Safety Guardrails
| Guardrail | Implementation | Purpose |
|---|---|---|
| Input filtering | Regex + classifier | Block prompt injection |
| Output filtering | Content classifier | Prevent harmful outputs |
| Action constraints | Allowlist of tools | Limit agent capabilities |
| Rate limiting | Token/request budgets | Prevent runaway costs |
| Sandbox execution | Isolated environments | Safe code/API execution |
| Audit logging | Immutable logs | Compliance and debugging |
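The input-filtering row can be sketched as a lightweight regex screen placed in front of a trained classifier. These patterns are illustrative and are nowhere near a complete defense against prompt injection.

```python
import re

# First-pass screen for common prompt-injection phrasings.
INJECTION_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in [
        r"ignore (all |previous |prior )*(instructions|prompts)",
        r"you are now\b",
        r"system prompt",
        r"disregard .{0,30}(rules|guidelines)",
    ]
]

def looks_like_injection(user_input):
    """True if the input matches any known injection phrasing."""
    return any(p.search(user_input) for p in INJECTION_PATTERNS)
```

Matches would typically be blocked or routed to the content classifier for a second opinion rather than rejected outright.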
Compliance Quick Reference
| Regulation | Key Requirements for Agents |
|---|---|
| GDPR | Data minimization, right to erasure, consent, right to explanation |
| CCPA | Disclosure of data usage, opt-out of data sale |
| EU AI Act | Risk classification, transparency, human oversight for high-risk |
Domain 10: Human-AI Interaction and Oversight (5%)
HITL Escalation Framework
# Confidence-based escalation
def should_escalate(agent_response):
    if agent_response.confidence < 0.7:
        return "low_confidence"
    if agent_response.involves_financial_action:
        return "high_stakes"
    if agent_response.sentiment == "frustrated":
        return "user_sentiment"
    return None  # No escalation needed

# Escalation tiers
ESCALATION_TIERS = {
    "low_confidence": "queue_for_review",    # Async review
    "high_stakes": "immediate_handoff",      # Real-time human
    "user_sentiment": "offer_human_option",  # User choice
}
Transparency Best Practices
| Practice | Implementation |
|---|---|
| Source attribution | Show which documents RAG retrieved |
| Confidence display | Show certainty level to user |
| Decision explanation | Explain why agent chose specific action |
| Limitation disclosure | State what agent cannot do |
| Human option | Always provide escalation path |
Exam Strategy Quick Tips
Time Management
- 60-70 questions in 120 minutes = ~1.7-2 min per question
- Flag uncertain questions, return at end
- Aim to finish with 10-minute buffer for review
Domain Weight Summary
Architecture + Development: 30% (~21 questions) ← FOCUS HERE
Evaluation + Deployment: 26% (~18 questions)
Cognition + Knowledge: 20% (~14 questions)
Monitor + Safety + Human-AI: 15% (~11 questions) ← Easiest points
NVIDIA Platform: 7% (~5 questions) ← Small but specific
Common Wrong Answer Patterns
| Wrong | Right |
|---|---|
| "More agents = better performance" | "Multi-agent adds coordination overhead; use only when task decomposition justifies it" |
| "Larger chunks = better RAG" | "Chunk size is a precision-recall trade-off; smaller = more precise, larger = more context" |
| "ReAct is always best" | "Plan-and-Execute is better for well-defined sequential tasks" |
| "Fine-tune first" | "Try prompt engineering first; fine-tune only when prompting fails" |
| "Autonomous is always better" | "HITL escalation is preferred for high-stakes decisions" |
Domain Coverage Checklist
- Agent Architecture (15%): Patterns, single vs multi-agent, orchestration
- Agent Development (15%): Tool calling, error handling, frameworks
- Evaluation & Tuning (13%): Metrics, A/B testing, fine-tuning decisions
- Deployment & Scaling (13%): Containers, Kubernetes, scaling strategies
- Cognition & Memory (10%): Reasoning frameworks, memory types, context management
- Knowledge Integration (10%): RAG, chunking, vector DBs, hybrid search
- NVIDIA Platform (7%): NIM, Triton, NeMo Guardrails, TensorRT-LLM
- Run & Monitor (5%): Observability, tracing, alerting, drift detection
- Safety & Ethics (5%): Guardrails, compliance (GDPR, AI Act), red-teaming
- Human-AI Interaction (5%): HITL, confidence thresholds, transparency
Additional Resources
Official NVIDIA:
- NCP-AAI Certification Page
- NVIDIA DLI Courses (free with registration)
Practice:
- Preporato NCP-AAI Practice Exams - 7 full-length tests (420+ questions)
Last Updated: March 8, 2026
Based on official NVIDIA NCP-AAI exam guide. All domain weights and topics verified against official sources.
Sources:
- NVIDIA NCP-AAI Official Certification
- NVIDIA NeMo Guardrails Documentation
- LangChain Agent Documentation
- NVIDIA NIM Documentation
