This comprehensive cheat sheet covers all 10 exam domains with architecture patterns, comparison tables, and code snippets. Based on the official NVIDIA NCP-AAI exam guide (15/15/13/13/10/10/7/5/5/5 weighting).
Domain 1: Agent Architecture and Design (15%)
Agent Architecture Patterns
| Pattern | How It Works | Best For | Trade-offs |
|---|---|---|---|
| ReAct | Interleaved Reasoning + Action loops | Dynamic tasks with tools | Flexible but can loop; higher latency |
| Plan-and-Execute | Create full plan → execute steps | Well-defined multi-step tasks | Efficient but brittle to plan changes |
| Reflexion | Execute → Self-evaluate → Retry | Accuracy-critical tasks | Higher accuracy but 2-3x more LLM calls |
| LATS | Monte Carlo Tree Search for planning | Complex optimization tasks | Best accuracy, highest compute cost |
| Tool-Only | Direct tool routing, minimal reasoning | Simple tool dispatch | Fast but limited reasoning capability |
When to use which:
Task is dynamic with unknown steps? → ReAct
Task is well-defined and sequential? → Plan-and-Execute
Accuracy is critical, latency flexible? → Reflexion
Multiple valid solution paths exist? → LATS
Simple tool routing, no reasoning? → Tool-Only
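The ReAct pattern above can be sketched as a minimal loop. Here `llm` and `tools` are hypothetical stand-ins: a model client that emits `Action:` or `Final Answer:` lines, and a dict of callable tools.

```python
# Minimal ReAct-style loop: Thought -> Action -> Observation, repeated
# until the model emits a final answer or the step budget runs out.
def react_loop(llm, tools, question, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)              # model returns its next thought/action
        transcript += step + "\n"
        if step.startswith("Final Answer:"):
            return step.removeprefix("Final Answer:").strip()
        if step.startswith("Action:"):
            name, _, arg = step.removeprefix("Action:").strip().partition(" ")
            observation = tools[name](arg)  # run the tool, feed result back
            transcript += f"Observation: {observation}\n"
    return "Step budget exhausted"
```

The step cap is the standard defense against the "can loop" trade-off noted in the table.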
Single-Agent vs Multi-Agent
| Factor | Single Agent | Multi-Agent |
|---|---|---|
| Use when | Task is linear, <5 tools | Task has distinct sub-domains |
| Complexity | Low | High (coordination overhead) |
| Latency | Lower | Higher (message passing) |
| Scalability | Limited | Better (parallel execution) |
| Debugging | Easier | Harder (distributed state) |
Multi-Agent Orchestration Patterns:
Sequential: Agent A → Agent B → Agent C (pipeline)
Parallel: [Agent A | Agent B | Agent C] run concurrently, results merged (fan-out/fan-in)
Hierarchical: Orchestrator → [Worker A, Worker B, Worker C]
Collaborative: Agents negotiate and share state
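The hierarchical pattern above can be sketched as an orchestrator that plans, fans out to workers, and merges. All names here (`decompose`, `merge`, the worker callables) are illustrative, not tied to any specific framework.

```python
# Hierarchical orchestration: plan -> fan out to workers -> merge.
def orchestrate(task, workers, decompose, merge):
    subtasks = decompose(task)                  # orchestrator plans: {worker_name: sub_task}
    results = {name: workers[name](sub)         # each worker handles its part
               for name, sub in subtasks.items()}
    return merge(results)                       # combine worker outputs into one answer
```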
Agent Communication
| Protocol | Pattern | Use Case |
|---|---|---|
| Direct messaging | Agent-to-agent | Small teams, low latency |
| Publish-subscribe | Event-driven | Loose coupling, scalability |
| Shared state | Blackboard pattern | Collaborative problem-solving |
| A2A Protocol | Cross-platform | Interoperability between frameworks |
Domain 2: Agent Development (15%)
Tool/Function Calling
# OpenAI-compatible function calling format
tools = [{
    "type": "function",
    "function": {
        "name": "search_database",
        "description": "Search product database by query",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
                "limit": {"type": "integer", "default": 5}
            },
            "required": ["query"]
        }
    }
}]

# Always validate tool parameters before execution
def safe_tool_call(tool_name, params):
    validated = validate_params(tool_name, params)
    try:
        return execute_tool(tool_name, validated)
    except ToolError as e:
        return fallback_response(tool_name, e)
Error Handling Patterns
| Pattern | When to Use | Implementation |
|---|---|---|
| Retry with backoff | Transient failures (API timeouts) | Exponential: 1s → 2s → 4s → 8s, max 3 retries |
| Circuit breaker | Repeated failures from same tool | Open after 3 failures, half-open after 30s |
| Fallback | Primary tool unavailable | Switch to alternative tool or graceful message |
| Graceful degradation | Non-critical tool failure | Continue with partial information |
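The first row of the table, retry with exponential backoff, can be sketched as below. `TransientError` is a hypothetical stand-in for timeout/connection errors raised by a tool or API client; jitter is added to avoid synchronized retry storms.

```python
import random
import time

class TransientError(Exception):
    """Stand-in for timeout/connection errors from a tool or API."""

# Exponential backoff: delays of 1s, 2s, 4s, ... up to a retry cap.
def retry_with_backoff(func, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_retries:
                raise                               # out of retries; surface the error
            delay = base_delay * (2 ** attempt)     # 1x, 2x, 4x, ...
            time.sleep(delay + random.uniform(0, 0.1 * delay))  # add jitter
```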
# Circuit breaker pattern
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30):
        self.failures = 0
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.opened_at = None
        self.state = "closed"  # closed → open → half-open

    def call(self, func, fallback, *args):
        if self.state == "open":
            if time.time() - self.opened_at > self.reset_timeout:
                self.state = "half-open"  # allow one trial call
            else:
                return fallback()
        try:
            result = func(*args)
            self.failures = 0
            self.state = "closed"
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.state = "open"
                self.opened_at = time.time()
            raise
Agent Frameworks Comparison
| Framework | Best For | Multi-Agent | Key Feature |
|---|---|---|---|
| LangChain | General agents | Via LangGraph | Largest ecosystem, most tools |
| LlamaIndex | RAG-heavy agents | Limited | Best RAG integration |
| AutoGen | Multi-agent chat | Native | Conversational agent teams |
| CrewAI | Role-based teams | Native | Role + goal + backstory agents |
| LangGraph | Stateful workflows | Native | Graph-based agent orchestration |
Domain 3: Evaluation and Tuning (13%)
Agent Evaluation Metrics
| Metric | What It Measures | Target Range |
|---|---|---|
| Task completion rate | % of tasks fully completed | >85% for production |
| Reasoning accuracy | Correctness of intermediate steps | >90% for critical tasks |
| End-to-end latency | Total time from input to output | <5s for interactive, <30s for async |
| Cost per interaction | Total LLM + tool API costs | Monitor trend, set budget alerts |
| Tool selection accuracy | % of correct tool choices | >90% |
| Hallucination rate | % of unsupported claims | <5% with RAG |
| User satisfaction (CSAT) | User-reported quality | >4.0/5.0 |
A/B Testing for Agents
1. Define hypothesis: "ReAct with CoT outperforms vanilla ReAct"
2. Split traffic: 50/50 random assignment
3. Measure: Task completion, latency, cost, user satisfaction
4. Duration: Minimum 1000 interactions per variant
5. Statistical significance: p < 0.05
6. Decision: Roll out winner, document learnings
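The significance check in step 5 can be done with a two-proportion z-test on task-completion counts, using only the standard library. The sample numbers below are illustrative.

```python
import math

# Two-proportion z-test: is variant B's completion rate significantly
# different from variant A's?
def two_proportion_z(success_a, n_a, success_b, n_b):
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)      # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

z, p = two_proportion_z(820, 1000, 865, 1000)  # 82.0% vs 86.5% completion
# roll out variant B only if p < 0.05
```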
Fine-Tuning Decision Guide
| Scenario | Approach | Why |
|---|---|---|
| Agent needs domain vocabulary | LoRA fine-tune | Adapts to terminology without full retrain |
| Agent formatting is inconsistent | Prompt engineering first | Cheaper, faster iteration |
| Tool selection is poor | Fine-tune on tool-use dataset | Improves function calling accuracy |
| General quality is low | Upgrade base model | Fine-tuning can't fix weak foundations |
Domain 4: Deployment and Scaling (13%)
Containerized Agent Deployment
# docker-compose.yml for agent system
services:
  agent-api:
    image: agent-service:latest
    deploy:
      replicas: 3
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    environment:
      - MODEL_ENDPOINT=http://nim-server:8000
      - VECTOR_DB_URL=http://chromadb:8000
  nim-server:
    image: nvcr.io/nim/meta/llama-3-8b-instruct:latest
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    ports:
      - "8000:8000"
Scaling Strategies
| Strategy | When to Use | Implementation |
|---|---|---|
| Horizontal | More concurrent users | Add agent replicas behind load balancer |
| Vertical | Larger models, more memory | Upgrade GPU (A100 → H100) |
| Auto-scaling | Variable load patterns | Scale on queue depth or latency metrics |
| GPU sharing | Multiple small models | Triton multi-model serving |
Deployment Strategies
| Strategy | Risk | Rollback | Use When |
|---|---|---|---|
| Blue/Green | Low | Instant switch | Major agent updates, new models |
| Canary | Very Low | Fast | Gradual rollout, measure impact |
| Rolling | Medium | Slow | Minor updates, stateless services |
| Shadow | None | N/A | Testing new agent in parallel |
Domain 5: Cognition, Planning, and Memory (10%)
Reasoning Frameworks
| Framework | Mechanism | Best For | Latency |
|---|---|---|---|
| Chain-of-Thought (CoT) | Step-by-step reasoning | Linear problems, math | 1.5-2x base |
| Tree-of-Thoughts (ToT) | Branching exploration | Creative/strategic tasks | 3-5x base |
| ReAct Reasoning | Thought → Action → Observation | Tool-using agents | 2-3x base |
| MCTS | Monte Carlo search over plans | Optimization problems | 5-10x base |
| Self-Consistency | Multiple CoT, majority vote | High-stakes decisions | 3-5x base |
CoT: Think → Think → Think → Answer (linear)
ToT: Think → Branch → Evaluate → Select → Think (tree)
ReAct: Think → Act → Observe → Think → Act (loop)
MCTS: Simulate → Evaluate → Backprop → Select (search)
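Self-consistency from the table reduces to sampling several answers and taking a majority vote. `sample_answer` is a hypothetical stand-in for one temperature > 0 LLM call.

```python
from collections import Counter

# Self-consistency: sample multiple chain-of-thought answers, majority vote.
def self_consistency(sample_answer, question, n_samples=5):
    answers = [sample_answer(question) for _ in range(n_samples)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / n_samples  # winning answer plus agreement ratio
```

The agreement ratio doubles as a rough confidence signal, which pairs naturally with the HITL escalation thresholds in Domain 10.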
Memory Systems
| Memory Type | Storage | Duration | Use Case |
|---|---|---|---|
| Short-term | Context window | Current session | Active conversation |
| Long-term | Vector database | Persistent | User preferences, past interactions |
| Episodic | Key-value store | Persistent | Specific past events and outcomes |
| Semantic | Knowledge graph | Persistent | Factual knowledge, relationships |
| Working | Scratchpad | Current task | Intermediate reasoning steps |
Memory Architecture Pattern:
User Input → Short-term Memory (context window)
→ Retrieve from Long-term Memory (vector DB)
→ Check Episodic Memory (similar past scenarios)
→ Agent Processes with Working Memory
→ Store important results to Long-term Memory
Context Window Management
# Token budget allocation
total_tokens = 8192 # Model context window
system_prompt = 500 # ~6% - Agent instructions
retrieved_docs = 3000 # ~37% - RAG context
conversation = 2000 # ~24% - Chat history
working_memory = 1000 # ~12% - Scratchpad
output_reserve = 1692 # ~21% - Generation space
# Compression strategies when exceeding budget:
# 1. Summarize older conversation turns
# 2. Reduce retrieved docs (top-3 → top-2)
# 3. Compress working memory
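A simple variant of strategy 1, dropping rather than summarizing the oldest turns, can be sketched as a budget-aware trimmer. `count_tokens` is a hypothetical stand-in for a real tokenizer (e.g. tiktoken); here it approximates tokens as whitespace-split words.

```python
def count_tokens(text):
    return len(text.split())  # crude word-count proxy for tokens

def trim_history(turns, budget):
    """Keep the most recent turns whose combined token count fits `budget`."""
    kept, used = [], 0
    for turn in reversed(turns):        # walk newest -> oldest
        cost = count_tokens(turn)
        if used + cost > budget:
            break                       # oldest turns past this point are dropped
        kept.append(turn)
        used += cost
    return list(reversed(kept))         # restore chronological order
```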
Domain 6: Knowledge Integration and Data Handling (10%)
RAG Pipeline
Documents → Chunk → Embed → Store (Vector DB)
↓
Query → Embed → Search → Retrieve Top-K → Rerank → Augment Prompt → LLM → Response
Chunking Strategies
| Strategy | Chunk Size | Overlap | Best For |
|---|---|---|---|
| Fixed-size | 512 tokens | 10-15% | General purpose, fast |
| Semantic | Variable | At topic breaks | Long documents, mixed topics |
| Recursive | 512-1024 | 15-20% | Structured docs (markdown, code) |
| Document-level | Full doc | N/A | Short documents, FAQs |
Best practices:
- Preserve sentence boundaries
- Include metadata (source, page, date)
- Test with your domain data
- Smaller chunks = more precise, less context
- Larger chunks = more context, less precise
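Fixed-size chunking with overlap (the first row of the table) can be sketched as below. Tokens are approximated by whitespace-split words, so a real tokenizer should be swapped in for production use.

```python
# Fixed-size chunking with overlap: sliding windows over the token stream.
def chunk_fixed(text, chunk_size=512, overlap=64):
    words = text.split()
    step = chunk_size - overlap         # windows advance by size minus overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break                       # last window already reached the end
    return chunks
```

With the defaults, 64/512 gives roughly the 10-15% overlap the table recommends.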
Vector Database Comparison
| Database | Hosting | Scalability | Best For |
|---|---|---|---|
| ChromaDB | Self-hosted | Small-medium | Prototyping, local dev |
| Pinecone | Managed cloud | Enterprise | Production, zero-ops |
| Weaviate | Both | Large | Hybrid search, GraphQL |
| FAISS | In-memory | Large | Speed-critical, read-heavy |
| Qdrant | Both | Large | Filtering + vector search |
Retrieval Optimization
# Hybrid search: BM25 (keyword) + Semantic (vector)
from langchain.retrievers import EnsembleRetriever

ensemble = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.3, 0.7],  # Weight semantic higher
)

# Reranking: Cross-encoder for precision
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Pipeline: Retrieve 20 → Rerank → Return top 3-5
candidates = ensemble.get_relevant_documents(query, k=20)
reranked = reranker.rank(query, [doc.page_content for doc in candidates],
                         return_documents=True)
final_docs = reranked[:5]  # each entry carries corpus_id, score, and text
Similarity Metrics
| Metric | Formula | When to Use |
|---|---|---|
| Cosine similarity | A·B / (‖A‖ × ‖B‖) | Default for embeddings |
| Dot product | A·B | Normalized vectors (faster) |
| Euclidean (L2) | √Σ(a-b)² | Absolute distance matters |
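The three metrics in pure Python, for reference. Note that for unit-normalized vectors cosine similarity equals the dot product, which is why dot product is the faster choice when embeddings are pre-normalized.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # dot product divided by the product of the vector norms
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean(a, b):
    # straight-line (L2) distance between the two vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```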
Domain 7: NVIDIA Platform Implementation (7%)
NVIDIA NIM (Inference Microservices)
# Deploy optimized LLM inference with NIM
docker run -it --gpus all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest

# OpenAI-compatible API
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta/llama-3.1-8b-instruct",
       "messages": [{"role": "user", "content": "Hello"}],
       "max_tokens": 100}'
Key NIM Features:
- TensorRT-LLM optimizations (3-5x speedup)
- Multi-GPU support with tensor parallelism
- OpenAI API compatibility
- Built-in health checks and metrics
Triton Inference Server
# config.pbtxt for model serving
name: "agent-llm"
platform: "tensorrt_llm"
max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [4, 8]
  max_queue_delay_microseconds: 100
}
instance_group [
  { count: 1, kind: KIND_GPU }
]
Triton Use Cases:
- Multi-model serving (host embedding + LLM + reranker)
- Dynamic batching (combine requests for efficiency)
- Model ensembles (chain retriever → LLM)
NeMo Guardrails
# Example: Prevent off-topic conversations
define user ask off topic
  "What's the weather?"
  "Tell me a joke"
  "Who won the game?"

define flow off topic
  user ask off topic
  bot refuse off topic

define bot refuse off topic
  "I'm focused on helping with [your domain]. Let me know how I can assist with that."

# Example: Require human approval for actions
define flow high stakes action
  user request financial transaction
  bot confirm with human

define bot confirm with human
  "This action requires human approval. Routing to a reviewer."
NVIDIA Platform Quick Reference
| Tool | Purpose | Key Use |
|---|---|---|
| NIM | Model deployment | Optimized inference containers |
| Triton | Model serving | Multi-model, dynamic batching |
| NeMo | Model development | Training, fine-tuning, RLHF |
| NeMo Guardrails | Agent safety | Content filtering, topic control |
| TensorRT-LLM | Optimization | Quantization, kernel fusion |
| NGC | Container registry | Pre-built AI containers |
Domain 8: Run, Monitor, and Maintain (5%)
Production Monitoring
| What to Monitor | Metric | Alert Threshold |
|---|---|---|
| Latency | P50, P95, P99 | P95 > 2x baseline |
| Task completion | Success rate | < 80% over 1 hour |
| Error rate | Errors/total | > 5% over 15 min |
| Token usage | Tokens/interaction | > 2x average |
| Cost | $/interaction | > budget threshold |
| Model drift | Quality score trend | Declining 3+ days |
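The P50/P95/P99 latency values in the table can be computed from raw samples with the standard library's `statistics.quantiles`, which with `n=100` returns the 99 percentile cut points.

```python
import statistics

def latency_percentiles(samples_ms):
    """P50/P95/P99 from a list of per-request latencies in milliseconds."""
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 percentile cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

An alerting job would compare `p95` against a stored baseline and fire when it exceeds 2x, per the table's threshold.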
Distributed Tracing for Agents
Trace: user_query_123
├── Agent Orchestrator (50ms)
│ ├── Retrieve from Vector DB (120ms)
│ ├── LLM Reasoning Step 1 (800ms)
│ ├── Tool Call: search_api (350ms)
│ ├── LLM Reasoning Step 2 (750ms)
│ └── Format Response (30ms)
└── Total: 2100ms
Tools: OpenTelemetry, LangSmith, Datadog, Grafana
Domain 9: Safety, Ethics, and Compliance (5%)
Agent Safety Guardrails
| Guardrail | Implementation | Purpose |
|---|---|---|
| Input filtering | Regex + classifier | Block prompt injection |
| Output filtering | Content classifier | Prevent harmful outputs |
| Action constraints | Allowlist of tools | Limit agent capabilities |
| Rate limiting | Token/request budgets | Prevent runaway costs |
| Sandbox execution | Isolated environments | Safe code/API execution |
| Audit logging | Immutable logs | Compliance and debugging |
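The input-filtering row can be sketched as a lightweight regex screen placed in front of a trained classifier. These patterns are illustrative and are nowhere near a complete defense against prompt injection.

```python
import re

# First-pass screen for common prompt-injection phrasings.
INJECTION_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in [
        r"ignore (all |previous |prior )*(instructions|prompts)",
        r"you are now\b",
        r"system prompt",
        r"disregard .{0,30}(rules|guidelines)",
    ]
]

def looks_like_injection(user_input):
    """True if the input matches any known injection phrasing."""
    return any(p.search(user_input) for p in INJECTION_PATTERNS)
```

Matches would typically be blocked or routed to the content classifier for a second opinion rather than rejected outright.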
Compliance Quick Reference
| Regulation | Key Requirements for Agents |
|---|---|
| GDPR | Data minimization, right to erasure, consent, right to explanation |
| CCPA | Disclosure of data usage, opt-out of data sale |
| EU AI Act | Risk classification, transparency, human oversight for high-risk |
Domain 10: Human-AI Interaction and Oversight (5%)
HITL Escalation Framework
# Confidence-based escalation
def should_escalate(agent_response):
    if agent_response.confidence < 0.7:
        return "low_confidence"
    if agent_response.involves_financial_action:
        return "high_stakes"
    if agent_response.sentiment == "frustrated":
        return "user_sentiment"
    return None  # No escalation needed

# Escalation tiers
ESCALATION_TIERS = {
    "low_confidence": "queue_for_review",    # Async review
    "high_stakes": "immediate_handoff",      # Real-time human
    "user_sentiment": "offer_human_option",  # User choice
}
Transparency Best Practices
| Practice | Implementation |
|---|---|
| Source attribution | Show which documents RAG retrieved |
| Confidence display | Show certainty level to user |
| Decision explanation | Explain why agent chose specific action |
| Limitation disclosure | State what agent cannot do |
| Human option | Always provide escalation path |
Exam Strategy Quick Tips
Time Management
- 60-70 questions in 120 minutes = ~1.7-2 min per question
- Flag uncertain questions, return at end
- Aim to finish with 10-minute buffer for review
Domain Weight Summary
Architecture + Development: 30% (~21 questions) ← FOCUS HERE
Evaluation + Deployment: 26% (~18 questions)
Cognition + Knowledge: 20% (~14 questions)
Monitor + Safety + Human-AI: 15% (~11 questions) ← Easiest points
NVIDIA Platform: 7% (~5 questions) ← Small but specific
Common Wrong Answer Patterns
| Wrong | Right |
|---|---|
| "More agents = better performance" | "Multi-agent adds coordination overhead; use only when task decomposition justifies it" |
| "Larger chunks = better RAG" | "Chunk size is a precision-recall trade-off; smaller = more precise, larger = more context" |
| "ReAct is always best" | "Plan-and-Execute is better for well-defined sequential tasks" |
| "Fine-tune first" | "Try prompt engineering first; fine-tune only when prompting fails" |
| "Autonomous is always better" | "HITL escalation is preferred for high-stakes decisions" |
Domain Coverage Checklist
- Agent Architecture (15%): Patterns, single vs multi-agent, orchestration
- Agent Development (15%): Tool calling, error handling, frameworks
- Evaluation & Tuning (13%): Metrics, A/B testing, fine-tuning decisions
- Deployment & Scaling (13%): Containers, Kubernetes, scaling strategies
- Cognition & Memory (10%): Reasoning frameworks, memory types, context management
- Knowledge Integration (10%): RAG, chunking, vector DBs, hybrid search
- NVIDIA Platform (7%): NIM, Triton, NeMo Guardrails, TensorRT-LLM
- Run & Monitor (5%): Observability, tracing, alerting, drift detection
- Safety & Ethics (5%): Guardrails, compliance (GDPR, AI Act), red-teaming
- Human-AI Interaction (5%): HITL, confidence thresholds, transparency
Additional Resources
Official NVIDIA:
- NCP-AAI Certification Page
- NVIDIA DLI Courses (free with registration)
Practice:
- Preporato NCP-AAI Practice Exams - 7 full-length tests (420+ questions)
Last Updated: March 8, 2026
Based on official NVIDIA NCP-AAI exam guide. All domain weights and topics verified against official sources.
Sources:
- NVIDIA NCP-AAI Official Certification
- NVIDIA NeMo Guardrails Documentation
- LangChain Agent Documentation
- NVIDIA NIM Documentation
