
NVIDIA NCA-GENL Cheat Sheet: Complete LLM Reference [2026]

Preporato Team · January 8, 2026 · 10 min read · NCA-GENL


This comprehensive cheat sheet covers all five exam domains with formulas, comparison tables, and code snippets, based on the official NVIDIA exam guide domain weighting (30/24/22/14/10%).


Domain 1: Core ML & AI Knowledge (30%)

Transformer Architecture Overview

Self-Attention

Query (Q), Key (K), and Value (V) matrices interact through the attention mechanism step by step: Q/K/V are projected from the inputs, attention scores are computed (Q×Kᵀ), softmax normalization is applied, and the resulting weights are multiplied by the values to produce the final output.

Multi-Head Attention

Multi-head attention processes information in parallel: the input is split across several attention heads, each head computes attention independently over its own representation subspace, and the results are concatenated and passed through a final projection matrix (W^O) to produce the output.

Component Comparison Table

| Component | Purpose | Key Details |
|---|---|---|
| Positional Encoding | Preserve sequence order | Sinusoidal: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) |
| Self-Attention | Model token relationships | Computes Q, K, V projections |
| Multi-Head Attention | Different representation subspaces | Typically 8-16 heads, d_model/h dims per head |
| Feed-Forward Network | Position-wise transformation | 2 layers: d_model → 4×d_model → d_model |
| Layer Normalization | Stabilize training | Normalizes across features (not batch) |
| Residual Connections | Enable deep networks | x + Sublayer(x) pattern |

Model Architecture Types

| Type | Attention | Use Cases | Examples |
|---|---|---|---|
| Encoder-Only | Bidirectional (no masking) | Classification, NER, embeddings | BERT, RoBERTa |
| Decoder-Only | Causal/masked (autoregressive) | Text generation, completion | GPT-3/4, Llama 2/3 |
| Encoder-Decoder | Both types + cross-attention | Translation, summarization | T5, BART, mT5 |


Attention Mechanism Steps

  1. Compute Scores: QK^T (query-key similarity)
  2. Scale: Divide by √d_k (stabilize gradients)
  3. Apply Mask (decoder only): Set future positions to -∞
  4. Softmax: Convert to probability distribution
  5. Weight Values: Multiply by V to get output

Scaling Factor Worked Example

With d_model = 512 and h = 8 heads, the head dimension is d_k = d_model / h = 512 / 8 = 64. Attention scores are therefore scaled by 1/√d_k = 1/√64 = 1/8 = 0.125.

Causal Masking Matrix (Decoder):

[[0, -∞, -∞, -∞],
 [0,  0, -∞, -∞],
 [0,  0,  0, -∞],
 [0,  0,  0,  0]]
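The five steps above can be sketched in NumPy; this is a minimal single-head illustration with an optional causal mask (array shapes are illustrative, not a production implementation):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """softmax(QK^T / sqrt(d_k)) V, with optional causal masking."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # Steps 1-2: score and scale
    if causal:                                      # Step 3: hide future positions
        future = np.triu(np.ones_like(scores), k=1).astype(bool)
        scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # Step 4: softmax
    return weights @ V                              # Step 5: weight the values

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))   # 4 tokens, d_k = 8
out = scaled_dot_product_attention(Q, K, V, causal=True)
```

With the causal mask applied, the first token can only attend to itself, so its output row equals V[0] exactly.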

LLM Training & Scaling

Training Stages:

  1. Pre-training: Next-token prediction on massive corpus (billions of tokens)
  2. Supervised Fine-Tuning (SFT): Task-specific adaptation
  3. RLHF (Optional): Reinforcement Learning from Human Feedback for alignment

Model Memory Rule of Thumb

Required GPU memory for inference (weights only) ≈ parameters × bytes per parameter. A 7B-parameter model at FP16 (2 bytes per parameter) needs roughly 14.00 GB.
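That rule of thumb is simple enough to script (a sketch for weights only; real deployments also need memory for the KV cache and activations):

```python
def model_memory_gb(params_billions, bytes_per_param):
    """Weight memory only: parameter count times bytes per parameter."""
    return params_billions * 1e9 * bytes_per_param / 1e9

print(model_memory_gb(7, 2))    # FP16: 14.0 GB
print(model_memory_gb(7, 0.5))  # 4-bit: 3.5 GB
```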

Context Window: the maximum number of tokens the model can attend to in a single forward pass; prompt plus generated output must fit within it.


Domain 2: Experimentation (22%)

Prompt Engineering Techniques

| Technique | Description | Token Overhead | Use Case |
|---|---|---|---|
| Zero-Shot | Direct instruction, no examples | 10-50 | Simple, well-defined tasks |
| Few-Shot | 2-10 examples in prompt | 100-500 | Pattern learning, formatting |
| Chain-of-Thought (CoT) | "Let's think step by step" | 50-200 | Reasoning, math problems |
| Self-Consistency | Sample multiple CoT paths, vote | 500+ | Complex reasoning, verification |
| ReAct | Reasoning + Acting cycles | Variable | Tool use, multi-step tasks |

Prompt Structure Best Practice:

[System Role]
You are an expert data analyst.

[Context]
Dataset: {data_description}

[Task]
Analyze trends and provide insights on {specific_aspect}.

[Constraints]
- Use only provided data
- Cite sources with [Source: X]

[Format]
Respond in JSON: {"trends": [], "insights": []}

Token Estimation Rules: for English text, roughly 4 characters ≈ 1 token and 1 token ≈ 0.75 words (so 100 tokens ≈ 75 words).
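A common heuristic for English text is about 4 characters per token; a quick estimator (a rough sketch — for exact counts, use the model's own tokenizer):

```python
def estimate_tokens(text):
    """Rough English-text heuristic: about 4 characters per token."""
    return max(1, len(text) // 4)

print(estimate_tokens("The quick brown fox jumps over the lazy dog"))  # 10
```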

Fine-Tuning Methods

| Method | Trainable Params | Memory | Speed | Best For |
|---|---|---|---|---|
| Full Fine-Tuning | 100% | Highest | Slowest | Maximum customization, large datasets |
| LoRA | <1% (0.1-1%) | Low | Fast | Most tasks, limited GPU |
| QLoRA | <1% | Lowest | Fast | Large models (65B+) on consumer GPU |
| Adapter Layers | <10% | Medium | Medium | Multi-task, modular approaches |
| Prefix Tuning | <0.1% | Very Low | Very Fast | Prompt-like tuning |

LoRA (Low-Rank Adaptation) Details

LoRA Parameter Efficiency (Worked Example)

For a single 4096 × 4096 weight matrix with rank r = 8: full fine-tuning updates 4096 × 4096 ≈ 16.78M parameters, while LoRA trains only 2 × 4096 × 8 ≈ 0.07M — a 99.6% reduction.
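The arithmetic behind that reduction as a quick script (`d` and `r` are the weight dimension and LoRA rank):

```python
def lora_trainable_params(d, r):
    """LoRA adds B (d x r) and A (r x d) to a frozen d x d weight."""
    return 2 * d * r

full = 4096 * 4096
lora = lora_trainable_params(4096, r=8)
print(f"{full/1e6:.2f}M -> {lora/1e6:.2f}M ({1 - lora/full:.1%} reduction)")
```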

Key Hyperparameters: rank r (typically 4-16), lora_alpha (scaling, often set to 2r), lora_dropout, and target_modules (commonly the attention projections q_proj and v_proj).

Hugging Face Implementation:

from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,                          # Rank
    lora_alpha=16,                # Scaling (2r rule)
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(base_model, config)
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable: {trainable_params:,} ({trainable_params/7e9*100:.2f}%)")

QLoRA Optimization

Additional Features: 4-bit NF4 quantization of the frozen base weights, double quantization of the quantization constants, and paged optimizers to absorb memory spikes.

Memory Comparison (65B model):

FP16 full:        130 GB
4-bit quantized:   33 GB
4-bit + QLoRA:     35 GB (with adapters + optimizer states)

BitsAndBytes Implementation:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

Evaluation Metrics

| Metric | Range | Higher Better? | Formula/Notes |
|---|---|---|---|
| Perplexity | [1, ∞) | ❌ Lower | PPL = exp(-(1/n) × Σ log P(w_i given context)) |
| BLEU | [0, 1] | ✅ Higher | N-gram precision (1-4 grams) with brevity penalty |
| ROUGE-L | [0, 1] | ✅ Higher | Longest Common Subsequence F1 score |
| BERTScore | [0, 1] | ✅ Higher | Cosine similarity of BERT embeddings |
| METEOR | [0, 1] | ✅ Higher | Harmonic mean of precision/recall with synonyms |

LLM Evaluation Metrics Quick Reference

| Metric | Measures | Range | Better | Best For |
|---|---|---|---|---|
| Perplexity | Language modeling quality | 1-∞ | Lower | LM evaluation |
| BLEU | N-gram overlap | 0-1 | Higher | Translation |
| ROUGE-L | Longest common subsequence | 0-1 | Higher | Summarization |
| BERTScore | Semantic similarity | 0-1 | Higher | Semantic eval |
| METEOR | Precision + recall + synonyms | 0-1 | Higher | MT evaluation |

Perplexity (PPL)

Measures how well the model predicts the next token.

Formula: PPL = exp(cross_entropy_loss)

| PPL | Quality | Interpretation |
|---|---|---|
| < 10 | Excellent | Very confident predictions |
| 10-30 | Good | Solid performance |
| 30-50 | Acceptable | Domain-dependent |
| > 50 | Poor | High uncertainty |
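Since PPL is just the exponentiated mean cross-entropy (in nats), the conversion is one line (a sketch; in practice the loss value comes from your evaluation loop):

```python
import math

def perplexity(mean_cross_entropy):
    """PPL = exp(mean token-level cross-entropy loss, in nats)."""
    return math.exp(mean_cross_entropy)

print(perplexity(2.3))  # just under 10 — "excellent" on the guide above
```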

BLEU Score

N-gram overlap for translation quality.

Formula: BLEU = BP × exp(Σ w_n log p_n), where p_n are the 1-4-gram precisions, w_n = 1/4, and BP is the brevity penalty.

| BLEU | Quality | Interpretation |
|---|---|---|
| < 0.2 | Poor | Needs improvement |
| 0.2-0.3 | Acceptable | Understandable |
| 0.3-0.5 | Good | High quality |
| > 0.5 | Very Good | Near-human quality |
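A minimal BLEU-1 (unigram-only) sketch shows the two moving parts, clipped precision and brevity penalty; full BLEU combines 1-4-gram precisions, so use sacrebleu or nltk in practice:

```python
import math
from collections import Counter

def bleu1(reference, candidate):
    """Clipped unigram precision times brevity penalty (BLEU-1 only)."""
    ref, cand = reference.split(), candidate.split()
    ref_counts = Counter(ref)
    clipped = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    precision = clipped / len(cand)
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

print(bleu1("the cat sat on the mat", "the cat sat on a mat"))  # 5/6 ≈ 0.83
```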

ROUGE-L Score Ranges

Quality interpretation for summarization tasks (F1 over the longest common subsequence).

| ROUGE-L | Quality | Interpretation |
|---|---|---|
| < 0.2 | Poor | Low overlap |
| 0.2-0.3 | Acceptable | Basic coverage |
| 0.3-0.5 | Good | Strong overlap |
| > 0.5 | Excellent | Very high overlap |
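ROUGE-L can be sketched directly from its definition — F1 over the longest common subsequence of tokens (assuming simple whitespace tokenization; libraries like rouge-score add stemming and more):

```python
def rouge_l_f1(reference, candidate):
    """F1 over the longest common subsequence (LCS) of whitespace tokens."""
    ref, cand = reference.split(), candidate.split()
    # Dynamic-programming table for LCS length
    dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref):
        for j, c in enumerate(cand):
            dp[i + 1][j + 1] = dp[i][j] + 1 if r == c else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("the cat sat on the mat", "the cat lay on the mat"))  # 5/6 ≈ 0.83
```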

Domain 3: Software Development (24%)

NVIDIA Platform Tools

NVIDIA NIM (Inference Microservices):

# Deploy optimized LLM inference
docker run -it --gpus all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3-8b-instruct:latest

# Test endpoint (OpenAI-compatible)
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "max_tokens": 50}'

Key Features: prebuilt, GPU-optimized inference containers; OpenAI-compatible API endpoints; TensorRT-LLM acceleration under the hood.

Triton Inference Server:

# config.pbtxt
name: "llama-3-8b"
platform: "pytorch_libtorch"
max_batch_size: 8

dynamic_batching {
  preferred_batch_size: [4, 8]
  max_queue_delay_microseconds: 100
}

instance_group [
  { count: 1, kind: KIND_GPU }
]

Use Cases: serving many models on shared GPUs, dynamic batching for throughput, and multi-framework deployments (PyTorch, TensorFlow, ONNX, TensorRT).

LangChain Essentials

Basic Chain:

from langchain import PromptTemplate, LLMChain
from langchain.llms import HuggingFacePipeline

# Build an LLM from a local Hugging Face model
llm = HuggingFacePipeline.from_model_id(model_id="gpt2", task="text-generation")

template = "Summarize in 3 sentences: {text}"
prompt = PromptTemplate(template=template, input_variables=["text"])

chain = LLMChain(llm=llm, prompt=prompt)
result = chain.run(text="Long document...")

Common Chain Types:

| Chain | Purpose | Example |
|---|---|---|
| LLMChain | Single LLM call with prompt | Q&A, classification |
| SequentialChain | Chain multiple LLM calls | Multi-step reasoning |
| RetrievalQA | RAG pipeline | Knowledge-grounded QA |
| ConversationalRetrievalChain | RAG with chat history | Chatbots |
| AgentExecutor | Dynamic tool selection | Task automation |

Agent with Tools:

from langchain.agents import initialize_agent, Tool

# search_fn and calc_fn are your own callables (not defined here)
tools = [
    Tool(name="Search", func=search_fn, description="Search the web"),
    Tool(name="Calculator", func=calc_fn, description="Do math")
]

agent = initialize_agent(
    tools, llm, agent="zero-shot-react-description", verbose=True
)
agent.run("What is 25% of the GDP of France?")

Hugging Face Transformers

Load and Generate:

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Generate
inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=100,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)

Generation Parameters:

| Parameter | Effect | Typical Values |
|---|---|---|
| temperature | Randomness (higher = more random) | 0.7-1.0 |
| top_p | Nucleus sampling (cumulative prob) | 0.9-0.95 |
| top_k | Sample from top K tokens | 40-50 |
| max_length | Maximum tokens to generate | Task-dependent |
| num_beams | Beam search width (1 = greedy) | 1-5 |

Domain 4: RAG Architecture & Data (14%)

RAG Pipeline

Query → Embed → Vector Search → Retrieve Top-K → Augment Prompt → LLM → Response

Component Options:

| Component | Options | Best Choice |
|---|---|---|
| Embedding Model | OpenAI Ada-002, sentence-transformers | Ada-002: general, S-T: domain-specific |
| Vector DB | Pinecone, Weaviate, ChromaDB, FAISS | Pinecone: managed, FAISS: local |
| Chunking | Fixed (512), Semantic, Recursive | Semantic: best context |
| Retrieval | Semantic only, Hybrid (semantic + BM25) | Hybrid: best accuracy |

Chunking Strategies

Fixed-Size Chunking:

chunk_size = 512      # tokens
overlap = 50          # 10-20% of chunk_size

# Rule of thumb:
# chunk_size: 256-512 tokens (balance context vs specificity)
# overlap: preserve context across chunks
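The fixed-size strategy is only a few lines over a token list (a sketch; `tokens` would come from your tokenizer):

```python
def fixed_size_chunks(tokens, chunk_size=512, overlap=50):
    """Overlapping windows over a token list; stride = chunk_size - overlap."""
    stride = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), stride)]

chunks = fixed_size_chunks(list(range(1200)), chunk_size=512, overlap=50)
print(len(chunks))  # 3 chunks; adjacent chunks share 50 tokens
```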

Best Practices: keep overlap at 10-20% of chunk size, split on natural boundaries (sentences, sections) where possible, and store source metadata with each chunk.

Semantic Chunking:

# Split by topic/section using embeddings
# 1. Compute sentence embeddings
# 2. Find semantic breaks (cosine similarity drops)
# 3. Create chunks at break points
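The break-point logic above can be sketched as follows; `toy_embed` is a keyword-indicator stand-in defined here purely for illustration (a real pipeline would use a sentence-transformer encoder):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_chunks(sentences, embed, threshold=0.5):
    """Start a new chunk wherever adjacent-sentence similarity drops."""
    vecs = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(vecs[i - 1], vecs[i]) < threshold:   # semantic break
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

def toy_embed(s):  # stand-in for a real sentence encoder
    return np.array([float("gpu" in s.lower()), float("pasta" in s.lower())])

sents = ["GPUs accelerate training.", "A GPU has many cores.", "Pasta needs boiling water."]
result = semantic_chunks(sents, toy_embed)
print(result)  # two chunks: the GPU sentences together, pasta on its own
```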

Retrieval Optimization

Similarity Metrics: cosine similarity (most common; scale-invariant), dot product (fast, equivalent to cosine for normalized vectors), and Euclidean distance (lower = more similar).

Retrieval Parameters:

top_k = 5                        # number of chunks to retrieve (typically 3-5)
similarity_threshold = 0.7       # minimum similarity to include

Reranking:

1. Retrieve top-20 candidates (fast semantic search)
2. Rerank with cross-encoder (more accurate but slower)
3. Return top-3 for LLM context
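The retrieve-then-rerank pattern, sketched with a stand-in scorer (`overlap_score` is illustrative only; a real reranker would be a cross-encoder model such as sentence-transformers' CrossEncoder):

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Score every (query, doc) pair and keep the top_n highest."""
    return sorted(candidates, key=lambda d: score_fn(query, d), reverse=True)[:top_n]

def overlap_score(q, d):  # stand-in: shared-token count, not a real cross-encoder
    return len(set(q.split()) & set(d.split()))

docs = ["gpu memory tips", "pasta recipe", "gpu kernels and memory"]
top = rerank("gpu memory", docs, overlap_score, top_n=2)
print(top)
```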

RAPIDS & cuDF

cuDF (GPU-accelerated pandas):

import cudf

# Load data on GPU
df = cudf.read_csv('data.csv')

# Operations 10-50x faster than pandas
filtered = df[df['score'] > 0.5]
grouped = df.groupby('category')['value'].mean()

When to Use RAPIDS: datasets in the millions of rows, a GPU available, and workloads dominated by DataFrame, ML, or graph operations that map to the tools below.

RAPIDS Tools:

| Tool | Purpose | CPU Equivalent |
|---|---|---|
| cuDF | DataFrame operations | pandas |
| cuML | Machine learning | scikit-learn |
| cuGraph | Graph analytics | NetworkX |
| cuPy | Array operations | NumPy |

Tokenization Methods

| Method | Used By | Vocabulary | Strengths |
|---|---|---|---|
| BPE | GPT-2/3/4 | 50K | Efficient, handles rare words |
| WordPiece | BERT | 30K | Maximizes training likelihood |
| SentencePiece | T5, Llama | Variable | Language-agnostic, no pre-tokenization |
| Unigram | XLNet | Variable | Probabilistic subwords |

BPE Example (illustrative splits; actual merges depend on the trained vocabulary):

"unhappiness" → ["un", "happiness"]
"transformer" → ["transform", "er"]


Domain 5: Trustworthy AI (10%)

Common Risks & Mitigations

| Risk | Mitigation | Implementation |
|---|---|---|
| Hallucinations | RAG, citations, verification | Ground in knowledge base |
| Bias | Diverse data, fairness testing | Test across demographics |
| PII Leakage | Input/output filtering | Regex + NER models |
| Prompt Injection | Input validation, sandboxing | Detect malicious patterns |
| Toxic Content | Content moderation | Perspective API, classifiers |

Content Filtering

PII Detection:

import re

pii_patterns = {
    'email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
    'phone': r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
    'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
    'credit_card': r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b'
}

def detect_pii(text):
    found = {}
    for pii_type, pattern in pii_patterns.items():
        matches = re.findall(pattern, text)
        if matches:
            found[pii_type] = matches
    return found

Toxicity Detection:

from transformers import pipeline

classifier = pipeline("text-classification",
                      model="unitary/toxic-bert")

def moderate(text, threshold=0.7):
    result = classifier(text)
    # e.g. [{'label': 'toxic', 'score': 0.02}]
    if result[0]['score'] > threshold:
        return "Content filtered due to toxicity"
    return text

Bias Detection

Fairness Metrics: demographic parity (equal positive-prediction rates across groups), equalized odds (equal true/false positive rates), and per-group accuracy or F1 gaps.

Testing Approach:

# 1. Create test sets for each demographic
test_sets = {
    'male': male_examples,
    'female': female_examples,
    'non_binary': nb_examples
}

# 2. Measure metrics per group
metrics = {}
for group, examples in test_sets.items():
    metrics[group] = {
        'accuracy': calculate_accuracy(model, examples),
        'f1': calculate_f1(model, examples)
    }

# 3. Flag if accuracy disparity > 5%
accuracies = [m['accuracy'] for m in metrics.values()]
max_disparity = max(accuracies) - min(accuracies)
if max_disparity > 0.05:
    print(f"⚠️ Bias detected: {max_disparity:.1%} disparity")

Hallucination Prevention

Strategies:

  1. RAG: Ground responses in retrieved documents
  2. Citations: Require model to cite sources
  3. Confidence Scores: Filter low-confidence outputs
  4. Verification: Cross-check facts with knowledge base
  5. Temperature: Lower temperature (0.3-0.5) reduces creativity/hallucination

Citation Enforcement:

prompt = """
Use ONLY the provided context to answer.
If unsure, say "I don't have enough information."
ALWAYS cite sources: [Source: document_name]

Context:
{retrieved_docs}

Question: {question}
"""

Quick Command Reference

Model Loading & Optimization

Flash Attention 2 (2-4x faster):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

Gradient Checkpointing (reduce memory):

model.gradient_checkpointing_enable()
# Trades compute for memory (20-30% slower, 30-40% less memory)

8-bit Quantization:

model = AutoModelForCausalLM.from_pretrained(
    "model_name",
    load_in_8bit=True,
    device_map="auto"
)
# Reduces memory by ~50% with minimal accuracy loss

Tokenizer Operations

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Encode
token_ids = tokenizer.encode("Hello world", add_special_tokens=True)
# [15496, 995] (GPT-2 adds no special tokens by default)

# Decode
text = tokenizer.decode(token_ids, skip_special_tokens=True)
# "Hello world"

# Get vocab size
vocab_size = len(tokenizer)  # 50,257 for GPT-2

Training Configuration

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,     # Effective batch: 4×4=16
    learning_rate=2e-5,
    warmup_steps=100,
    logging_steps=10,
    save_steps=500,
    evaluation_strategy="steps",
    eval_steps=500,
    fp16=True,                         # Mixed precision (2x speedup)
    dataloader_num_workers=4
)

Exam Strategy Quick Tips


Common Mistake Patterns

| ❌ Wrong | ✅ Right |
|---|---|
| "Self-attention reduces complexity" | "Self-attention captures long-range dependencies" |
| "LoRA updates all parameters" | "LoRA adds trainable low-rank matrices, freezes base" |
| "BLEU measures semantics" | "BLEU measures n-gram overlap, BERTScore is semantic" |
| "Encoder uses causal masking" | "Decoder uses causal masking, encoder is bidirectional" |
| "Higher perplexity is better" | "Lower perplexity is better (less surprised)" |

Formula Quick Reference

Attention:        softmax(QK^T / √d_k) × V
Perplexity:       exp(-1/n × Σ log P(w_i|context))
LoRA:             W' = W + BA, trainable = 2dr
Memory (GB):      Params × Bytes_per_param / 1e9
Parameter count:  12 × n_layers × d_model²
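The parameter-count approximation can be sanity-checked in a couple of lines (it estimates non-embedding parameters only; embeddings and layer norms fall outside the 12·L·d² rule):

```python
def approx_transformer_params(n_layers, d_model):
    """Approximate non-embedding parameters: 12 * n_layers * d_model^2."""
    return 12 * n_layers * d_model ** 2

print(approx_transformer_params(12, 768) / 1e6)    # GPT-2 small: ~85M non-embedding
print(approx_transformer_params(96, 12288) / 1e9)  # GPT-3 scale: ~174B
```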






Last Updated: January 8, 2026


Based on official NVIDIA NCA-GENL exam guide. All formulas and domain weights verified against official sources.

