This professional-level cheat sheet covers all 5 exam domains with optimization formulas, parallelism strategies, and deployment patterns. Based on the official NVIDIA exam guide (26/22/22/18/12 domain weighting).
Domain 1: Model Optimization & Deployment (26%)
TensorRT-LLM Optimization Pipeline
Model → Export (ONNX/HF) → Quantize (INT8/FP16) → Build Engine (TRT Engine) → Deploy (Triton/NIM)
Key Optimization Steps:
- Model Export: Convert PyTorch/HF to TensorRT-LLM format
- Quantization: Apply INT8/FP16/INT4 quantization
- Engine Build: Compile optimized inference engine
- Calibration: Run calibration dataset for INT8
- Deployment: Serve via Triton or NIM containers
Quantization Quick Reference
| Precision | Bits | Memory Reduction | Latency Improvement | Accuracy Impact |
|---|---|---|---|---|
| FP32 | 32 | Baseline | Baseline | None |
| FP16 | 16 | 50% | 1.5-2x faster | Minimal (<1%) |
| INT8 | 8 | 75% | 2-4x faster | Low (1-3%) |
| INT4 | 4 | 87.5% | 3-5x faster | Moderate (3-5%) |
When to Use Each:
- FP16: Default production choice, minimal accuracy loss
- INT8: When latency is critical, can tolerate slight accuracy drop
- INT4: Extreme memory constraints, edge deployment
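The memory-reduction column above follows directly from bits per parameter. A minimal sketch of the arithmetic (the 7B model size is just an illustrative choice):

```python
def model_memory_gb(n_params_b: float, bits: int) -> float:
    """Approximate weight memory for a model at a given precision.

    n_params_b: parameter count in billions.
    bits: bits per parameter (32, 16, 8, or 4).
    """
    bytes_per_param = bits / 8
    return n_params_b * 1e9 * bytes_per_param / 1024**3

# A 7B model at each precision from the table above:
for bits in (32, 16, 8, 4):
    print(f"{bits}-bit: {model_memory_gb(7, bits):.1f} GiB")
```

FP16 halves the FP32 footprint, INT8 quarters it, and INT4 cuts it to one eighth, matching the 50% / 75% / 87.5% reductions in the table.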
Calibration Methods for INT8
| Method | How It Works | Best For |
|---|---|---|
| Min-Max | Uses min/max values from calibration data | Simple, fast |
| Entropy | Minimizes KL divergence | Better accuracy |
| Percentile | Uses percentile of activation distribution | Outlier robustness |
Calibration Dataset Requirements:
- 100-1000 representative samples
- Cover expected input distribution
- Include edge cases for robustness
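The min-max method from the table above can be sketched in a few lines: the calibration data supplies the dynamic range, and a symmetric scale maps that range onto the signed 8-bit grid. This is a toy illustration of the idea, not TensorRT's implementation:

```python
def minmax_int8_scale(calibration_batches):
    """Min-max calibration: derive a symmetric INT8 scale from the
    largest absolute activation seen in the calibration data."""
    amax = max(abs(x) for batch in calibration_batches for x in batch)
    return amax / 127.0

def quantize_int8(x, scale):
    """Quantize one value to INT8, clamping to the representable range."""
    q = round(x / scale)
    return max(-128, min(127, q))

# Hypothetical calibration activations:
batches = [[-0.9, 0.3, 1.27], [0.5, -1.0]]
scale = minmax_int8_scale(batches)   # 1.27 / 127 = 0.01
print(quantize_int8(0.5, scale))     # round(0.5 / 0.01) = 50
```

Entropy and percentile calibration differ only in how they pick the clipping range: entropy minimizes KL divergence between the original and quantized distributions, while percentile discards extreme outliers before computing the scale.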
Latency Optimization Techniques
| Technique | Impact | Implementation Complexity |
|---|---|---|
| Quantization (INT8) | 2-4x latency reduction | Medium - requires calibration |
| KV-Cache Optimization | 1.5-2x for long sequences | Low - built into TensorRT-LLM |
| Continuous Batching | 2-3x throughput | Medium - Triton/NIM config |
| Speculative Decoding | 1.5-2x for generation | High - requires draft model |
| Flash Attention | 1.5-2x, less memory | Low - library swap |
| Tensor Parallelism | Linear scaling (GPUs) | Medium - model sharding |
Accuracy-Latency Trade-off Decision Tree
| Latency Target | Precision | Expected Accuracy |
|---|---|---|
| < 50ms (real-time) | INT4/INT8 | 90-95% of baseline |
| 50-100ms | INT8/FP16 | 95-98% of baseline |
| 100-200ms | FP16 | 99%+ of baseline |
| > 200ms | FP32 | 100% (baseline) |
Domain 2: GPU Acceleration & Distributed Training (22%)
Parallelism Strategies Overview
Total GPUs = Data Parallel (DP) × Tensor Parallel (TP) × Pipeline Parallel (PP)
Parallelism Strategy Selection
| Model Size | Best Strategy | Configuration Example |
|---|---|---|
| < 10B | Data Parallelism | DP=8, TP=1, PP=1 |
| 10-70B | Data + Tensor | DP=4, TP=4, PP=1 |
| 70-175B | All Three | DP=2, TP=4, PP=4 |
| > 175B | Heavy TP + PP | DP=1, TP=8, PP=8 |
Decision Factors:
- Memory-bound: Increase TP (splits attention heads)
- Compute-bound: Increase DP (more batch parallel)
- Very Deep: Use PP (split layers sequentially)
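The three parallelism degrees multiply to give the total GPU count, which makes the table's configurations easy to sanity-check:

```python
def world_size(dp: int, tp: int, pp: int) -> int:
    """Total GPUs required: the three parallelism degrees multiply."""
    return dp * tp * pp

# Configurations from the selection table above:
assert world_size(8, 1, 1) == 8    # < 10B: pure data parallelism
assert world_size(4, 4, 1) == 16   # 10-70B: data + tensor
assert world_size(2, 4, 4) == 32   # 70-175B: all three
assert world_size(1, 8, 8) == 64   # > 175B: heavy TP + PP
```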
DeepSpeed ZeRO Stages
| Stage | What's Sharded | Memory Savings | Communication |
|---|---|---|---|
| ZeRO-1 | Optimizer states | ~4x | Low |
| ZeRO-2 | + Gradients | ~8x | Medium |
| ZeRO-3 | + Parameters | ~N× (N=GPUs) | High |
When to Use:
- ZeRO-1: Default for multi-GPU, minimal overhead
- ZeRO-2: Memory constrained, acceptable comm overhead
- ZeRO-3: Extreme memory needs, willing to trade speed
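The savings ratios above follow from how ZeRO shards the three model states. A rough per-GPU estimator, assuming the standard mixed-precision Adam breakdown (2 bytes FP16 weights, 2 bytes FP16 gradients, 12 bytes optimizer states per parameter) and ignoring activations:

```python
def zero_memory_per_gpu_gb(n_params_b: float, n_gpus: int, stage: int) -> float:
    """Per-GPU model-state memory under DeepSpeed ZeRO stages 1-3.

    Assumes mixed-precision Adam: 2 B/param FP16 weights, 2 B/param FP16
    gradients, 12 B/param optimizer states (FP32 master + momentum + variance).
    """
    psi = n_params_b * 1e9
    weights, grads, opt = 2 * psi, 2 * psi, 12 * psi
    if stage == 1:                        # shard optimizer states only
        total = weights + grads + opt / n_gpus
    elif stage == 2:                      # + shard gradients
        total = weights + (grads + opt) / n_gpus
    else:                                 # stage 3: shard everything
        total = (weights + grads + opt) / n_gpus
    return total / 1024**3

# 7B model on 8 GPUs:
for s in (1, 2, 3):
    print(f"ZeRO-{s}: {zero_memory_per_gpu_gb(7, 8, s):.1f} GiB/GPU")
```

Note how stage 3 approaches the ideal 1/N split of all 16 bytes/param, at the cost of gathering parameters over the network on every forward and backward pass.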
GPU Memory Calculation
Training Memory Estimate (mixed-precision Adam, per parameter):
- Weights (FP16): 2 bytes
- Gradients (FP16): 2 bytes
- Optimizer states (FP32 master copy + Adam momentum + variance): 12 bytes
- Total: ~16 bytes/parameter for model states, plus activation memory
Gradient Checkpointing
Trade-off: Memory ↔ Compute
Without checkpointing: Store all activations → High memory
With checkpointing: Store checkpoints, recompute between → 30-50% memory reduction, 20-30% compute increase
When to Use:
- Training very large models on limited memory
- Enabling larger batch sizes
- Not latency-critical training
NCCL Communication Optimization
| Collective | Use Case | Optimization |
|---|---|---|
| AllReduce | Gradient aggregation | Ring topology for bandwidth |
| AllGather | ZeRO-3 param collection | Tree for latency |
| ReduceScatter | Gradient sharding | Ring for bandwidth |
| Broadcast | Weight distribution | Tree for latency |
Tips:
- Use NVLink for intra-node (8x A100)
- Use InfiniBand for inter-node
- Overlap communication with computation
Domain 3: Fine-Tuning & Data Preparation (22%)
Fine-Tuning Method Selection
| Method | Trainable Params | Memory | Use Case |
|---|---|---|---|
| Full Fine-Tuning | 100% | Very High | Significant domain shift |
| LoRA | 0.1-1% | Low | Most use cases |
| QLoRA | 0.1-1% | Very Low | Memory constrained |
| Adapters | 1-5% | Low | Multi-task learning |
| Prefix Tuning | <0.1% | Minimal | Prompting enhancement |
LoRA Configuration
LoRA Rank Impact: higher rank gives the adapter more capacity to capture task-specific variation, at the cost of more trainable parameters and memory.
LoRA Configuration Guidelines:
| Scenario | Rank | Alpha | Target Modules |
|---|---|---|---|
| Light adaptation | 4-8 | 8-16 | Q, V only |
| Moderate adaptation | 16-32 | 32-64 | Q, K, V, O |
| Heavy adaptation | 32-64 | 64-128 | Q, K, V, O, FFN |
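The "0.1-1% trainable parameters" figure follows from LoRA's low-rank factorization: each adapted d×d projection gains two factors, A (r×d) and B (d×r). A quick count, using Llama-7B-like dimensions as an illustrative assumption:

```python
def lora_trainable_params(d_model: int, rank: int, n_modules: int) -> int:
    """Parameters added by LoRA: each adapted square projection of width
    d_model gets two low-rank factors of shape (rank, d_model) and
    (d_model, rank)."""
    return n_modules * 2 * rank * d_model

# Llama-7B-like: d_model=4096, 32 layers, rank 16, Q and V only (2 modules/layer)
added = lora_trainable_params(4096, 16, 2 * 32)
print(added)               # 8,388,608 trainable params
print(added / 7e9 * 100)   # ≈ 0.12% of a 7B base model
```

Doubling the rank or targeting more modules scales the count linearly, which is why the heavy-adaptation row still stays well under 1% of the base model.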
QLoRA Specifics
QLoRA = LoRA + 4-bit Quantization + Double Quantization
Base Model: 4-bit NormalFloat (NF4) quantization
LoRA Weights: FP16/BF16
Gradients: Computed in BF16
Memory Savings:
- 7B model full FT: ~120GB
- 7B model LoRA: ~28GB
- 7B model QLoRA: ~10GB
Data Preparation Best Practices
Tokenization Selection:
| Tokenizer | Used By | Vocabulary Size | Best For |
|---|---|---|---|
| BPE | GPT, Llama | 32K-50K | English-centric |
| SentencePiece | T5, mT5 | 32K-100K | Multilingual |
| WordPiece | BERT | 30K | Classification |
Data Quality Checklist:
- Remove duplicates (dedup at document level)
- Filter low-quality text (perplexity filtering)
- Balance domains (prevent catastrophic forgetting)
- Validate format consistency
- Check for PII and sensitive content
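The document-level dedup step from the checklist can be sketched with exact content hashing; the whitespace/case normalization here is one reasonable choice, not a prescribed standard (production pipelines often add fuzzy dedup such as MinHash on top):

```python
import hashlib

def dedup_documents(docs):
    """Exact document-level deduplication via content hashing.
    Normalizes whitespace and case before hashing so trivially
    different copies collapse to one."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["The cat sat.", "the cat sat.  ", "A different doc."]
print(len(dedup_documents(docs)))  # 2 — the first two collapse to one
```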
Preventing Catastrophic Forgetting
| Strategy | How It Works |
|---|---|
| Data Mixing | Include 5-10% general data during fine-tuning |
| Low Learning Rate | 1e-5 to 1e-6 (10x lower than pre-training) |
| EWC | Add regularization on important weights |
| LoRA | Freeze base model, only train adapters |
Domain 4: LLM Foundations & Prompting (18%)
Attention Mechanism Variants
| Variant | Memory | Compute | Use Case |
|---|---|---|---|
| MHA (Multi-Head) | O(n²h) | Baseline | Standard transformers |
| MQA (Multi-Query) | O(n²) | Lower | Inference optimization |
| GQA (Grouped-Query) | O(n²g) | Balanced | Production LLMs (Llama 2) |
Key Insight: MQA shares K,V across heads → 8-16x less KV-cache memory
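The KV-cache saving is just the ratio of KV heads, since the cache stores one K and one V tensor per layer. A sketch with Llama-2-70B-like dimensions (80 layers, head_dim 128, 64 query heads) assumed for illustration:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per=2):
    """KV-cache size: 2 tensors (K and V) per layer, each of shape
    [batch, n_kv_heads, seq_len, head_dim], at bytes_per per element (FP16=2)."""
    elems = 2 * n_layers * batch * n_kv_heads * seq_len * head_dim
    return elems * bytes_per / 1024**3

# 4096-token context, batch 1:
mha = kv_cache_gb(80, 64, 128, 4096, 1)  # MHA: 64 KV heads
gqa = kv_cache_gb(80, 8, 128, 4096, 1)   # GQA: 8 KV-head groups
mqa = kv_cache_gb(80, 1, 128, 4096, 1)   # MQA: 1 shared KV head
print(f"MHA {mha:.1f} GiB, GQA {gqa:.2f} GiB, MQA {mqa:.3f} GiB")
print(f"MHA/MQA ratio: {mha / mqa:.0f}x")  # 64x — the ratio is the head count
```

With 64 query heads, MQA's saving is 64x; the 8-16x figure in the text corresponds to the smaller head counts of mid-sized models.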
Positional Encoding Comparison
| Encoding | Max Length | Extrapolation | Training |
|---|---|---|---|
| Absolute (Sinusoidal) | Fixed | Poor | Simple |
| Learned | Fixed | Poor | Data-hungry |
| RoPE | Flexible | Good | Complex |
| ALiBi | Flexible | Good | Simple |
Production Recommendation: RoPE (Llama) or ALiBi for long context
Prompt Engineering Techniques
| Technique | When to Use | Token Cost |
|---|---|---|
| Zero-shot | Model is capable, task is clear | Lowest |
| Few-shot | Need examples, <5 fit in context | Medium |
| Chain-of-Thought | Reasoning tasks, math, logic | High |
| Self-Consistency | Need robust answers | Very High |
| ReAct | Tool use, multi-step tasks | High |
Constrained Decoding
Output Control Methods:
- Logit Bias: Adjust token probabilities
- Grammar Constraints: Force JSON/XML format
- Stop Sequences: End generation at markers
- Top-K/Top-P: Control sampling diversity
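Two of these controls are simple enough to sketch end to end. This toy over a three-token vocabulary shows logit bias (ban a token with a large negative bias) feeding into top-p filtering; it illustrates the math, not any particular inference server's API:

```python
import math

def apply_logit_bias(logits: dict, bias: dict) -> dict:
    """Logit bias: shift selected tokens' logits before sampling.
    A large negative bias effectively bans a token."""
    return {tok: lg + bias.get(tok, 0.0) for tok, lg in logits.items()}

def top_p_filter(logits: dict, p: float = 0.9) -> dict:
    """Top-p (nucleus) filtering: keep the smallest set of highest-probability
    tokens whose cumulative probability reaches p."""
    z = sum(math.exp(v) for v in logits.values())
    ranked = sorted(((tok, math.exp(v) / z) for tok, v in logits.items()),
                    key=lambda kv: -kv[1])
    kept, cum = {}, 0.0
    for tok, prob in ranked:
        kept[tok] = prob
        cum += prob
        if cum >= p:
            break
    return kept

logits = {"yes": 3.0, "no": 2.0, "maybe": -1.0}
banned = apply_logit_bias(logits, {"maybe": -100.0})
print(sorted(top_p_filter(banned, p=0.95)))  # ['no', 'yes'] — 'maybe' is banned
```

Grammar constraints generalize the same idea: at each step, mask every token that would violate the target format before sampling.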
Domain 5: Evaluation, Monitoring & Safety (12%)
Evaluation Metrics Quick Reference
| Metric | Measures | Formula/Range |
|---|---|---|
| Perplexity | Language modeling | PPL = exp(avg cross-entropy), lower is better |
| BLEU | Translation quality | 0-100, higher is better |
| ROUGE | Summarization | 0-1, higher is better |
| BERTScore | Semantic similarity | -1 to 1, higher is better |
| Pass@k | Code generation | % of k samples that pass tests |
Perplexity: PPL = exp(−(1/N) Σᵢ log p(xᵢ)) — the exponentiated average per-token negative log-likelihood; lower is better.
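Computed directly from per-token probabilities, the metric behaves as the table describes (this is the standard definition, with made-up probabilities for illustration):

```python
import math

def perplexity(token_probs):
    """PPL = exp(average negative log-likelihood per token)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model assigning probability 0.25 to every token has PPL 4 —
# it is "as confused as" a uniform choice among 4 tokens.
print(perplexity([0.25, 0.25, 0.25]))
# Higher-confidence predictions → lower perplexity:
print(perplexity([0.9, 0.8, 0.95]) < perplexity([0.5, 0.4, 0.6]))  # True
```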
Production Monitoring Metrics
Latency Metrics:
- P50, P95, P99 latency
- Time to first token (TTFT)
- Tokens per second
Quality Metrics:
- User feedback scores
- Hallucination rate
- Task success rate
System Metrics:
- GPU utilization
- Memory usage
- Throughput (req/sec)
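P50/P95/P99 are percentiles over a window of request latencies; the tail metrics exist because averages hide slow requests. A minimal nearest-rank sketch (the latency values are made up; with only 10 samples the tail percentiles are necessarily coarse):

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample such that at least pct%
    of the samples are less than or equal to it."""
    ordered = sorted(samples)
    k = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[k]

latencies_ms = [42, 38, 45, 40, 180, 41, 39, 44, 43, 250]
print(percentile(latencies_ms, 50))  # 42 — the typical request
print(percentile(latencies_ms, 95))  # tail request
print(percentile(latencies_ms, 99))  # worst-case tail
```

Note the P50 here looks healthy while the tail is dominated by the two slow outliers — exactly the situation P95/P99 monitoring is meant to surface.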
Bias Detection Approaches
| Approach | Method | Automation |
|---|---|---|
| Demographic Parity | Equal positive rates across groups | Automated |
| Counterfactual Testing | Swap attributes, check output changes | Semi-automated |
| Red-Teaming | Adversarial probing | Manual |
| Toxicity Scoring | Perspective API, Detoxify | Automated |
Safety Guardrails Implementation
User Input → Input Filter → LLM → Output Filter → Response
(the input filter blocks harmful prompts; the output filter blocks harmful generations)
Common Guardrail Layers:
- Input: Block prompt injections, jailbreaks
- System: Limit capabilities, context length
- Output: Filter toxic/harmful content
- Post-hoc: Human review for edge cases
Quick Reference Tables
Model Size to Hardware Mapping
| Model Size | Inference (FP16) | Training (Full) | Training (LoRA) |
|---|---|---|---|
| 7B | 1x A100 80GB | 4x A100 | 1x A100 |
| 13B | 1x A100 80GB | 8x A100 | 1-2x A100 |
| 70B | 4x A100 80GB | 32x A100 | 4x A100 |
| 175B | 8x A100 80GB | 128+ A100 | 8x A100 |
Common Optimization Combinations
| Scenario | Recommended Stack |
|---|---|
| Low latency API | TensorRT-LLM + INT8 + Continuous batching |
| High throughput | vLLM + FP16 + Large batch |
| Memory constrained | QLoRA + INT4 + Gradient checkpointing |
| Maximum quality | FP16 + Full fine-tuning + Ensemble |
Training Hyperparameters (Typical Ranges)
| Parameter | Pre-training | Fine-tuning | LoRA |
|---|---|---|---|
| Learning Rate | 1e-4 to 3e-4 | 1e-5 to 5e-5 | 1e-4 to 3e-4 |
| Warmup Steps | 1000-2000 | 100-500 | 50-100 |
| Batch Size | 2M-4M tokens | 32-256 | 8-64 |
| Epochs | 1-3 | 1-5 | 1-3 |
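The warmup-steps row typically pairs with a decaying schedule. A common pattern — linear warmup to the peak learning rate, then cosine decay — sketched with illustrative LoRA-style values (2e-4 peak, 100 warmup steps):

```python
import math

def lr_at_step(step, max_lr, warmup_steps, total_steps):
    """Linear warmup to max_lr over warmup_steps, then cosine decay to zero."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr * 0.5 * (1 + math.cos(math.pi * progress))

print(lr_at_step(50, 2e-4, 100, 1000))    # 1e-4: halfway through warmup
print(lr_at_step(100, 2e-4, 100, 1000))   # 2e-4: peak
print(lr_at_step(1000, 2e-4, 100, 1000))  # 0.0: fully decayed
```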
Exam Day Quick Reminders
Optimization Domain (26%):
- TensorRT-LLM is NVIDIA's flagship inference optimizer
- INT8 needs calibration; INT4 uses NF4 format
- Continuous batching maximizes throughput
GPU Domain (22%):
- ZeRO-3 is most memory efficient but highest communication
- TP splits attention heads across GPUs
- PP introduces pipeline bubbles (use micro-batches)
Fine-Tuning Domain (22%):
- LoRA: rank 8-16 works for most tasks
- QLoRA: 4-bit base + FP16 LoRA adapters
- Always include some general data to prevent forgetting
Foundations Domain (18%):
- MQA/GQA reduce KV-cache memory
- RoPE enables length extrapolation
- CoT prompting helps reasoning tasks
Safety Domain (12%):
- Lower perplexity = better language modeling
- ROUGE for summarization, BLEU for translation
- Input AND output filters for production
Good luck on your NCP-GENL exam!
Sources
- NVIDIA TensorRT-LLM Documentation
- NVIDIA NCP-GENL Certification
- DeepSpeed ZeRO Documentation
- LoRA: Low-Rank Adaptation Paper
Last updated: February 8, 2026
