This comprehensive cheat sheet covers all 5 exam domains with formulas, comparison tables, and code snippets. Based on official NVIDIA exam guide (30/24/22/14/10 weighting).
Domain 1: Core ML & AI Knowledge (30%)
Transformer Architecture Overview
Self-Attention
Query (Q), Key (K), and Value (V) matrices interact through the attention mechanism in a step-by-step process: create Q/K/V projections from the inputs, compute attention scores (Q×Kᵀ), apply softmax normalization, and multiply by the values to produce the final output.
Multi-Head Attention
MultiHead(Q,K,V) = Concat(head₁,...,headₕ)W^O
Multi-head attention processes information in parallel: the input is split into h attention heads, each head computes attention independently over its own subspace, and the results are concatenated and passed through the final projection matrix (W^O) to produce the output.
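The split-and-concatenate pattern above can be sketched at the shape level in NumPy (the sizes here are illustrative assumptions, and the per-head attention computation is elided):

```python
import numpy as np

batch, seq, d_model, h = 2, 5, 64, 4       # illustrative sizes
d_head = d_model // h                      # each head works in d_model/h dims

x = np.random.randn(batch, seq, d_model)

# Split the model dimension into h parallel heads
heads = x.reshape(batch, seq, h, d_head).transpose(0, 2, 1, 3)  # (batch, h, seq, d_head)

# ... each head would compute attention independently here ...

# Concatenate the heads back together and apply the output projection W_O
concat = heads.transpose(0, 2, 1, 3).reshape(batch, seq, d_model)
W_O = np.random.randn(d_model, d_model)
output = concat @ W_O                      # (batch, seq, d_model)
```

Note that splitting and re-concatenating without the attention step recovers the original tensor, which is a quick sanity check on the reshapes.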
Component Comparison Table

| Component | Purpose | Key Details |
| --- | --- | --- |
| Positional Encoding | Preserve sequence order | Sinusoidal: PE(pos,2i) = sin(pos/10000^(2i/d)) |
| Self-Attention | Model token relationships | Computes Q, K, V projections |
| Multi-Head Attention | Different representation subspaces | Typically 8-16 heads, d_model/h per head |
| Feed-Forward Network | Position-wise transformation | 2 layers: d_model → 4×d_model → d_model |
| Layer Normalization | Stabilize training | Normalizes across features (not batch) |
| Residual Connections | Enable deep networks | x + Sublayer(x) pattern |
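The sinusoidal positional-encoding formula above (with cos on odd indices) can be sketched in NumPy; this is a minimal illustration, not a library implementation:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d)); PE(pos, 2i+1) uses cos."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]        # even feature indices
    angles = pos / np.power(10000, i / d_model)  # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dims: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dims: cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d_model=64)
```

At position 0 every sine dimension is 0 and every cosine dimension is 1, which makes the scheme easy to sanity-check.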
Model Architecture Types

| Type | Attention | Use Cases | Examples |
| --- | --- | --- | --- |
| Encoder-Only | Bidirectional (no masking) | Classification, NER, embeddings | BERT, RoBERTa |
| Decoder-Only | Causal/masked (autoregressive) | Text generation, completion | GPT-3/4, Llama 2/3 |
| Encoder-Decoder | Both types + cross-attention | Translation, summarization | T5, BART, mT5 |
Key Differences:
- Encoder: Can see full context (bidirectional), outputs contextualized representations
- Decoder: Causal masking (can't see future), generates text token-by-token
- Cross-Attention: Decoder queries encoder outputs (K, V from encoder, Q from decoder)
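The bidirectional-vs-causal distinction comes down to the attention mask. A minimal NumPy sketch (sequence length is an arbitrary choice):

```python
import numpy as np

seq = 4

# Decoder-style causal mask: position i may attend only to positions <= i
causal_mask = np.tril(np.ones((seq, seq), dtype=bool))

# Encoder-style bidirectional "mask": every position sees every other
full_mask = np.ones((seq, seq), dtype=bool)
```

Row i of `causal_mask` marks which positions token i can attend to, so the upper triangle (the future) is all False.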
Attention Mechanism Steps
1. Compute Scores: QKᵀ (query-key similarity)
2. Scale: Divide by √d_k (stabilize gradients)
3. Apply Mask (decoder only): Set future positions to -∞
4. Softmax: Normalize scores into attention weights
5. Weighted Sum: Multiply the weights by V to produce the output
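The attention steps can be sketched end-to-end in NumPy (a minimal illustration; the identity-matrix inputs are a toy example, not a realistic workload):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Scores -> scale -> (optional) causal mask -> softmax -> weighted sum."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # steps 1-2: QK^T / sqrt(d_k)
    if causal:                                    # step 3: hide future positions
        future = np.triu(np.ones_like(scores), k=1).astype(bool)
        scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # step 4: softmax
    return weights @ V                            # step 5: weighted sum of values

Q = K = V = np.eye(4)   # toy 4-token example, d_k = 4
out = scaled_dot_product_attention(Q, K, V, causal=True)
```

With the causal mask, the first token can only attend to itself, so its output row equals the first row of V.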
Structured Prompt Template:
[System Role]
You are an expert data analyst.
[Context]
Dataset: {data_description}
[Task]
Analyze trends and provide insights on {specific_aspect}.
[Constraints]
- Use only provided data
- Cite sources with [Source: X]
[Format]
Respond in JSON: {"trends": [], "insights": []}
- Dynamic batching (combine requests for efficiency)
- Model ensembles (chain multiple models)
LangChain Essentials
Basic Chain:
from langchain import PromptTemplate, LLMChain
from langchain.llms import HuggingFacePipeline

# Build an LLM from a local Hugging Face pipeline
llm = HuggingFacePipeline.from_model_id(model_id="gpt2", task="text-generation")

template = "Summarize in 3 sentences: {text}"
prompt = PromptTemplate(template=template, input_variables=["text"])
chain = LLMChain(llm=llm, prompt=prompt)
result = chain.run(text="Long document...")
Common Chain Types:

| Chain | Purpose | Example |
| --- | --- | --- |
| LLMChain | Single LLM call with prompt | Q&A, classification |
| SequentialChain | Chain multiple LLM calls | Multi-step reasoning |
| RetrievalQA | RAG pipeline | Knowledge-grounded QA |
| ConversationalRetrievalChain | RAG with chat history | Chatbots |
| AgentExecutor | Dynamic tool selection | Task automation |
Agent with Tools:
from langchain.agents import initialize_agent, Tool
tools = [
    Tool(name="Search", func=search_fn, description="Search the web"),
    Tool(name="Calculator", func=calc_fn, description="Do math")
]
agent = initialize_agent(
    tools, llm, agent="zero-shot-react-description", verbose=True
)
agent.run("What is 25% of the GDP of France?")
Hugging Face Transformers
Load and Generate:
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load model
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Generate
inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=100,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
PII Detection:
import re

pii_patterns = {
    'email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
    'phone': r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
    'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
    'credit_card': r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b'
}

def detect_pii(text):
    found = {}
    for pii_type, pattern in pii_patterns.items():
        matches = re.findall(pattern, text)
        if matches:
            found[pii_type] = matches
    return found
Toxicity Detection:
from transformers import pipeline

classifier = pipeline("text-classification",
                      model="unitary/toxic-bert")
result = classifier("Your text here")
# Output: [{'label': 'toxic', 'score': 0.02}]

# Filter if score > 0.7
if result[0]['score'] > 0.7:
    response = "Content filtered due to toxicity"
Bias Detection
Fairness Metrics:
Demographic Parity: P(Ŷ=1|A=0) = P(Ŷ=1|A=1)
Equal Opportunity: TPR equal across groups
Equalized Odds: TPR and FPR equal across groups
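The demographic parity metric above can be computed in a few lines. A minimal sketch, assuming binary predictions and a binary sensitive attribute A (the toy data is made up for illustration):

```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    """|P(Y_hat=1 | A=0) - P(Y_hat=1 | A=1)| -- 0 means perfect parity."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    rate_0 = y_pred[group == 0].mean()   # positive-prediction rate, group 0
    rate_1 = y_pred[group == 1].mean()   # positive-prediction rate, group 1
    return abs(rate_0 - rate_1)

# Toy predictions: group 0 gets positives 75% of the time, group 1 only 25%
gap = demographic_parity_gap(y_pred=[1, 1, 0, 1, 0, 0, 0, 1],
                             group=[0, 0, 0, 0, 1, 1, 1, 1])
```

Equal opportunity and equalized odds follow the same pattern, but compare TPR (and FPR) per group instead of raw positive-prediction rates.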
Testing Approach:
# 1. Create test sets for each demographic
test_sets = {
    'male': male_examples,
    'female': female_examples,
    'non_binary': nb_examples
}

# 2. Measure metrics per group
metrics = {}
for group, examples in test_sets.items():
    metrics[group] = {
        'accuracy': calculate_accuracy(model, examples),
        'f1': calculate_f1(model, examples)
    }

# 3. Flag if accuracy disparity > 5%
accuracies = [m['accuracy'] for m in metrics.values()]
max_disparity = max(accuracies) - min(accuracies)
if max_disparity > 0.05:
    print(f"⚠️ Bias detected: {max_disparity:.1%} disparity")
Hallucination Prevention
Strategies:
RAG: Ground responses in retrieved documents
Citations: Require model to cite sources
Confidence Scores: Filter low-confidence outputs
Verification: Cross-check facts with knowledge base
Temperature: Lower temperature (0.3-0.5) reduces creativity/hallucination
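The confidence-score strategy above can be sketched as a simple average-log-probability filter; the threshold and the per-token log-probs are illustrative assumptions, not fixed values:

```python
import numpy as np

def confident_enough(token_logprobs, min_avg_logprob=-1.0):
    """Keep an answer only if its average token log-probability clears
    a threshold; low-confidence outputs are candidates for filtering."""
    return float(np.mean(token_logprobs)) >= min_avg_logprob

# Made-up per-token log-probs for two candidate answers
high = confident_enough([-0.1, -0.3, -0.2])   # confident generation
low = confident_enough([-2.0, -3.0, -2.5])    # hesitant generation
```

In practice the per-token log-probs can come from the serving stack (e.g. by requesting token scores at generation time); this sketch only shows the filtering decision itself.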
Citation Enforcement:
prompt = """
Use ONLY the provided context to answer.
If unsure, say "I don't have enough information."
ALWAYS cite sources: [Source: document_name]
Context:
{retrieved_docs}
Question: {question}
"""
Quick Command Reference
Model Loading & Optimization
Flash Attention 2 (2-4x faster):
import torch

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
Gradient Checkpointing (reduce memory):
model.gradient_checkpointing_enable()
# Trades compute for memory (20-30% slower, 30-40% less memory)
8-bit Quantization:
model = AutoModelForCausalLM.from_pretrained(
    "model_name",
    load_in_8bit=True,
    device_map="auto"
)
# Reduces memory by ~50% with minimal accuracy loss
Tokenizer Operations
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Encode
token_ids = tokenizer.encode("Hello world", add_special_tokens=True)
# [15496, 995] + special tokens

# Decode
text = tokenizer.decode(token_ids, skip_special_tokens=True)
# "Hello world"

# Get vocab size
vocab_size = len(tokenizer)  # 50,257 for GPT-2