Preporato
NCP-AAINVIDIA Certified Professional - Agentic AINVIDIACheat SheetAgentic AIQuick Reference

NVIDIA NCP-AAI Cheat Sheet: Complete Agentic AI Reference [2026]

Preporato TeamMarch 8, 202612 min readNCP-AAI

Print-friendly | Download PDF | Save for Exam Day

This comprehensive cheat sheet covers all 10 exam domains with architecture patterns, comparison tables, and code snippets. Based on the official NVIDIA NCP-AAI exam guide (15/15/13/13/10/10/7/5/5/5 weighting).


Domain 1: Agent Architecture and Design (15%)

Agent Architecture Patterns

PatternHow It WorksBest ForTrade-offs
ReActInterleaved Reasoning + Action loopsDynamic tasks with toolsFlexible but can loop; higher latency
Plan-and-ExecuteCreate full plan → execute stepsWell-defined multi-step tasksEfficient but brittle to plan changes
ReflexionExecute → Self-evaluate → RetryAccuracy-critical tasksHigher accuracy but 2-3x more LLM calls
LATSMonte Carlo Tree Search for planningComplex optimization tasksBest accuracy, highest compute cost
Tool-OnlyDirect tool routing, minimal reasoningSimple tool dispatchFast but limited reasoning capability

When to use which:

Task is dynamic with unknown steps? → ReAct
Task is well-defined and sequential? → Plan-and-Execute
Accuracy is critical, latency flexible? → Reflexion
Multiple valid solution paths exist? → LATS
Simple tool routing, no reasoning? → Tool-Only

Single-Agent vs Multi-Agent

FactorSingle AgentMulti-Agent
Use whenTask is linear, <5 toolsTask has distinct sub-domains
ComplexityLowHigh (coordination overhead)
LatencyLowerHigher (message passing)
ScalabilityLimitedBetter (parallel execution)
DebuggingEasierHarder (distributed state)

Multi-Agent Orchestration Patterns:

Sequential:    Agent A → Agent B → Agent C  (pipeline)
Parallel:      Agent A ↗ Agent B ↗ Agent C  (fan-out, merge)
Hierarchical:  Orchestrator → [Worker A, Worker B, Worker C]
Collaborative: Agents negotiate and share state

Agent Communication

ProtocolPatternUse Case
Direct messagingAgent-to-agentSmall teams, low latency
Publish-subscribeEvent-drivenLoose coupling, scalability
Shared stateBlackboard patternCollaborative problem-solving
A2A ProtocolCross-platformInteroperability between frameworks

Preparing for NCP-AAI? Practice with 455+ exam questions

Domain 2: Agent Development (15%)

Tool/Function Calling

# OpenAI-compatible function calling format
tools = [{
    "type": "function",
    "function": {
        "name": "search_database",
        "description": "Search product database by query",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
                "limit": {"type": "integer", "default": 5}
            },
            "required": ["query"]
        }
    }
}]

# Always validate tool parameters before execution
def safe_tool_call(tool_name, params):
    validated = validate_params(tool_name, params)
    try:
        result = execute_tool(tool_name, validated)
        return result
    except ToolError as e:
        return fallback_response(tool_name, e)

Error Handling Patterns

PatternWhen to UseImplementation
Retry with backoffTransient failures (API timeouts)Exponential: 1s → 2s → 4s → 8s, max 3 retries
Circuit breakerRepeated failures from same toolOpen after 3 failures, half-open after 30s
FallbackPrimary tool unavailableSwitch to alternative tool or graceful message
Graceful degradationNon-critical tool failureContinue with partial information
# Circuit breaker pattern
class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30):
        self.failures = 0
        self.threshold = failure_threshold
        self.state = "closed"  # closed → open → half-open

    def call(self, func, *args):
        if self.state == "open":
            if time_since_open > self.reset_timeout:
                self.state = "half-open"
            else:
                return fallback()

        try:
            result = func(*args)
            self.failures = 0
            self.state = "closed"
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.state = "open"
            raise

Agent Frameworks Comparison

FrameworkBest ForMulti-AgentKey Feature
LangChainGeneral agentsVia LangGraphLargest ecosystem, most tools
LlamaIndexRAG-heavy agentsLimitedBest RAG integration
AutoGenMulti-agent chatNativeConversational agent teams
CrewAIRole-based teamsNativeRole + goal + backstory agents
LangGraphStateful workflowsNativeGraph-based agent orchestration

Domain 3: Evaluation and Tuning (13%)

Agent Evaluation Metrics

MetricWhat It MeasuresTarget Range
Task completion rate% of tasks fully completed>85% for production
Reasoning accuracyCorrectness of intermediate steps>90% for critical tasks
End-to-end latencyTotal time from input to output<5s for interactive, <30s for async
Cost per interactionTotal LLM + tool API costsMonitor trend, set budget alerts
Tool selection accuracy% of correct tool choices>90%
Hallucination rate% of unsupported claims<5% with RAG
User satisfaction (CSAT)User-reported quality>4.0/5.0

A/B Testing for Agents

1. Define hypothesis: "ReAct with CoT outperforms vanilla ReAct"
2. Split traffic: 50/50 random assignment
3. Measure: Task completion, latency, cost, user satisfaction
4. Duration: Minimum 1000 interactions per variant
5. Statistical significance: p < 0.05
6. Decision: Roll out winner, document learnings

Fine-Tuning Decision Guide

ScenarioApproachWhy
Agent needs domain vocabularyLoRA fine-tuneAdapts to terminology without full retrain
Agent formatting is inconsistentPrompt engineering firstCheaper, faster iteration
Tool selection is poorFine-tune on tool-use datasetImproves function calling accuracy
General quality is lowUpgrade base modelFine-tuning can't fix weak foundations

Domain 4: Deployment and Scaling (13%)

Containerized Agent Deployment

# docker-compose.yml for agent system
services:
  agent-api:
    image: agent-service:latest
    deploy:
      replicas: 3
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    environment:
      - MODEL_ENDPOINT=http://nim-server:8000
      - VECTOR_DB_URL=http://chromadb:8000

  nim-server:
    image: nvcr.io/nim/meta/llama-3-8b-instruct:latest
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    ports:
      - "8000:8000"

Scaling Strategies

StrategyWhen to UseImplementation
HorizontalMore concurrent usersAdd agent replicas behind load balancer
VerticalLarger models, more memoryUpgrade GPU (A100 → H100)
Auto-scalingVariable load patternsScale on queue depth or latency metrics
GPU sharingMultiple small modelsTriton multi-model serving

Deployment Strategies

StrategyRiskRollbackUse When
Blue/GreenLowInstant switchMajor agent updates, new models
CanaryVery LowFastGradual rollout, measure impact
RollingMediumSlowMinor updates, stateless services
ShadowNoneN/ATesting new agent in parallel

Domain 5: Cognition, Planning, and Memory (10%)

Reasoning Frameworks

FrameworkMechanismBest ForLatency
Chain-of-Thought (CoT)Step-by-step reasoningLinear problems, math1.5-2x base
Tree-of-Thoughts (ToT)Branching explorationCreative/strategic tasks3-5x base
ReAct ReasoningThought → Action → ObservationTool-using agents2-3x base
MCTSMonte Carlo search over plansOptimization problems5-10x base
Self-ConsistencyMultiple CoT, majority voteHigh-stakes decisions3-5x base
CoT:  Think → Think → Think → Answer (linear)
ToT:  Think → Branch → Evaluate → Select → Think (tree)
ReAct: Think → Act → Observe → Think → Act (loop)
MCTS: Simulate → Evaluate → Backprop → Select (search)

Memory Systems

Memory TypeStorageDurationUse Case
Short-termContext windowCurrent sessionActive conversation
Long-termVector databasePersistentUser preferences, past interactions
EpisodicKey-value storePersistentSpecific past events and outcomes
SemanticKnowledge graphPersistentFactual knowledge, relationships
WorkingScratchpadCurrent taskIntermediate reasoning steps

Memory Architecture Pattern:

User Input → Short-term Memory (context window)
           → Retrieve from Long-term Memory (vector DB)
           → Check Episodic Memory (similar past scenarios)
           → Agent Processes with Working Memory
           → Store important results to Long-term Memory

Context Window Management

# Token budget allocation
total_tokens = 8192  # Model context window

system_prompt = 500     # ~6% - Agent instructions
retrieved_docs = 3000   # ~37% - RAG context
conversation = 2000     # ~24% - Chat history
working_memory = 1000   # ~12% - Scratchpad
output_reserve = 1692   # ~21% - Generation space

# Compression strategies when exceeding budget:
# 1. Summarize older conversation turns
# 2. Reduce retrieved docs (top-3 → top-2)
# 3. Compress working memory

Domain 6: Knowledge Integration and Data Handling (10%)

RAG Pipeline

Documents → Chunk → Embed → Store (Vector DB)
                                    ↓
Query → Embed → Search → Retrieve Top-K → Rerank → Augment Prompt → LLM → Response

Chunking Strategies

StrategyChunk SizeOverlapBest For
Fixed-size512 tokens10-15%General purpose, fast
SemanticVariableAt topic breaksLong documents, mixed topics
Recursive512-102415-20%Structured docs (markdown, code)
Document-levelFull docN/AShort documents, FAQs

Best practices:

  • Preserve sentence boundaries
  • Include metadata (source, page, date)
  • Test with your domain data
  • Smaller chunks = more precise, less context
  • Larger chunks = more context, less precise

Vector Database Comparison

DatabaseHostingScalabilityBest For
ChromaDBSelf-hostedSmall-mediumPrototyping, local dev
PineconeManaged cloudEnterpriseProduction, zero-ops
WeaviateBothLargeHybrid search, GraphQL
FAISSIn-memoryLargeSpeed-critical, read-heavy
QdrantBothLargeFiltering + vector search

Retrieval Optimization

# Hybrid search: BM25 (keyword) + Semantic (vector)
from langchain.retrievers import EnsembleRetriever

ensemble = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.3, 0.7]  # Weight semantic higher
)

# Reranking: Cross-encoder for precision
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Pipeline: Retrieve 20 → Rerank → Return top 3-5
candidates = ensemble.get_relevant_documents(query, k=20)
reranked = reranker.rank(query, [doc.page_content for doc in candidates])
final_docs = reranked[:5]

Similarity Metrics

MetricFormulaWhen to Use
Cosine similarityA·B / (||A|| × ||B||)Default for embeddings
Dot productA·BNormalized vectors (faster)
Euclidean (L2)√Σ(a-b)²Absolute distance matters

Master These Concepts with Practice

Our NCP-AAI practice bundle includes:

  • 7 full practice exams (455+ questions)
  • Detailed explanations for every answer
  • Domain-by-domain performance tracking

30-day money-back guarantee

Domain 7: NVIDIA Platform Implementation (7%)

NVIDIA NIM (Inference Microservices)

# Deploy optimized LLM inference with NIM
docker run -it --gpus all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest

# OpenAI-compatible API
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta/llama-3.1-8b-instruct",
       "messages": [{"role": "user", "content": "Hello"}],
       "max_tokens": 100}'

Key NIM Features:

  • TensorRT-LLM optimizations (3-5x speedup)
  • Multi-GPU support with tensor parallelism
  • OpenAI API compatibility
  • Built-in health checks and metrics

Triton Inference Server

# config.pbtxt for model serving
name: "agent-llm"
platform: "tensorrt_llm"
max_batch_size: 8

dynamic_batching {
  preferred_batch_size: [4, 8]
  max_queue_delay_microseconds: 100
}

instance_group [
  { count: 1, kind: KIND_GPU }
]

Triton Use Cases:

  • Multi-model serving (host embedding + LLM + reranker)
  • Dynamic batching (combine requests for efficiency)
  • Model ensembles (chain retriever → LLM)

NeMo Guardrails

# Example: Prevent off-topic conversations
define user ask off topic
  "What's the weather?"
  "Tell me a joke"
  "Who won the game?"

define flow off topic
  user ask off topic
  bot refuse off topic
  "I'm focused on helping with [your domain]. Let me know
   how I can assist with that."

# Example: Require human approval for actions
define flow high stakes action
  user request financial transaction
  bot confirm with human
  "This action requires human approval. Routing to a reviewer."

NVIDIA Platform Quick Reference

ToolPurposeKey Use
NIMModel deploymentOptimized inference containers
TritonModel servingMulti-model, dynamic batching
NeMoModel developmentTraining, fine-tuning, RLHF
NeMo GuardrailsAgent safetyContent filtering, topic control
TensorRT-LLMOptimizationQuantization, kernel fusion
NGCContainer registryPre-built AI containers

Domain 8: Run, Monitor, and Maintain (5%)

Production Monitoring

What to MonitorMetricAlert Threshold
LatencyP50, P95, P99P95 > 2x baseline
Task completionSuccess rate< 80% over 1 hour
Error rateErrors/total> 5% over 15 min
Token usageTokens/interaction> 2x average
Cost$/interaction> budget threshold
Model driftQuality score trendDeclining 3+ days

Distributed Tracing for Agents

Trace: user_query_123
├── Agent Orchestrator (50ms)
│   ├── Retrieve from Vector DB (120ms)
│   ├── LLM Reasoning Step 1 (800ms)
│   ├── Tool Call: search_api (350ms)
│   ├── LLM Reasoning Step 2 (750ms)
│   └── Format Response (30ms)
└── Total: 2100ms

Tools: OpenTelemetry, LangSmith, Datadog, Grafana


Domain 9: Safety, Ethics, and Compliance (5%)

Agent Safety Guardrails

GuardrailImplementationPurpose
Input filteringRegex + classifierBlock prompt injection
Output filteringContent classifierPrevent harmful outputs
Action constraintsAllowlist of toolsLimit agent capabilities
Rate limitingToken/request budgetsPrevent runaway costs
Sandbox executionIsolated environmentsSafe code/API execution
Audit loggingImmutable logsCompliance and debugging

Compliance Quick Reference

RegulationKey Requirements for Agents
GDPRData minimization, right to erasure, consent, right to explanation
CCPADisclosure of data usage, opt-out of data sale
EU AI ActRisk classification, transparency, human oversight for high-risk

Domain 10: Human-AI Interaction and Oversight (5%)

HITL Escalation Framework

# Confidence-based escalation
def should_escalate(agent_response):
    if agent_response.confidence < 0.7:
        return "low_confidence"
    if agent_response.involves_financial_action:
        return "high_stakes"
    if agent_response.sentiment == "frustrated":
        return "user_sentiment"
    return None  # No escalation needed

# Escalation tiers
ESCALATION_TIERS = {
    "low_confidence": "queue_for_review",     # Async review
    "high_stakes": "immediate_handoff",       # Real-time human
    "user_sentiment": "offer_human_option",   # User choice
}

Transparency Best Practices

PracticeImplementation
Source attributionShow which documents RAG retrieved
Confidence displayShow certainty level to user
Decision explanationExplain why agent chose specific action
Limitation disclosureState what agent cannot do
Human optionAlways provide escalation path

Exam Strategy Quick Tips

Time Management

  • 60-70 questions in 120 minutes = ~1.7-2 min per question
  • Flag uncertain questions, return at end
  • Aim to finish with 10-minute buffer for review

Domain Weight Summary

Architecture + Development:  30%  (~21 questions) ← FOCUS HERE
Evaluation + Deployment:     26%  (~18 questions)
Cognition + Knowledge:       20%  (~14 questions)
Monitor + Safety + Human-AI: 15%  (~11 questions) ← Easiest points
NVIDIA Platform:              7%  (~5 questions)  ← Small but specific

Common Wrong Answer Patterns

WrongRight
"More agents = better performance""Multi-agent adds coordination overhead; use only when task decomposition justifies it"
"Larger chunks = better RAG""Chunk size is a precision-recall trade-off; smaller = more precise, larger = more context"
"ReAct is always best""Plan-and-Execute is better for well-defined sequential tasks"
"Fine-tune first""Try prompt engineering first; fine-tune only when prompting fails"
"Autonomous is always better""HITL escalation is preferred for high-stakes decisions"

Domain Coverage Checklist

  • Agent Architecture (15%): Patterns, single vs multi-agent, orchestration
  • Agent Development (15%): Tool calling, error handling, frameworks
  • Evaluation & Tuning (13%): Metrics, A/B testing, fine-tuning decisions
  • Deployment & Scaling (13%): Containers, Kubernetes, scaling strategies
  • Cognition & Memory (10%): Reasoning frameworks, memory types, context management
  • Knowledge Integration (10%): RAG, chunking, vector DBs, hybrid search
  • NVIDIA Platform (7%): NIM, Triton, NeMo Guardrails, TensorRT-LLM
  • Run & Monitor (5%): Observability, tracing, alerting, drift detection
  • Safety & Ethics (5%): Guardrails, compliance (GDPR, AI Act), red-teaming
  • Human-AI Interaction (5%): HITL, confidence thresholds, transparency

Additional Resources

Official NVIDIA:

Technical References:

Practice:


Print Instructions: This cheat sheet is optimized for 3-page printing. Use landscape orientation for best results.

Last Updated: March 8, 2026


Based on official NVIDIA NCP-AAI exam guide. All domain weights and topics verified against official sources.

Sources:

Ready to Pass the NCP-AAI Exam?

Join thousands who passed with Preporato practice tests

Instant access30-day guaranteeUpdated monthly