This professional-level cheat sheet covers all 5 exam domains with optimization formulas, parallelism strategies, and deployment patterns. Based on the official NVIDIA exam guide (26/22/22/18/12 domain weighting).
Domain 1: Model Optimization & Deployment (26%)
TensorRT-LLM Optimization Pipeline
Model → Export (ONNX/HF) → Quantize (INT8/FP16) → Build Engine (TRT Engine) → Deploy (Triton/NIM)
Key Optimization Steps:
- Model Export: Convert PyTorch/HF to TensorRT-LLM format
- Quantization: Apply INT8/FP16/INT4 quantization
- Engine Build: Compile optimized inference engine
- Calibration: Run calibration dataset for INT8
- Deployment: Serve via Triton or NIM containers
Quantization Quick Reference
| Precision | Bits | Memory Reduction | Latency Improvement | Accuracy Impact |
|---|---|---|---|---|
| FP32 | 32 | Baseline | Baseline | None |
| FP16 | 16 | 50% | 1.5-2x faster | Minimal (<1%) |
| INT8 | 8 | 75% | 2-4x faster | Low (1-3%) |
| INT4 | 4 | 87.5% | 3-5x faster | Moderate (3-5%) |
When to Use Each:
- FP16: Default production choice, minimal accuracy loss
- INT8: When latency is critical, can tolerate slight accuracy drop
- INT4: Extreme memory constraints, edge deployment
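The memory-reduction column above follows directly from bits per parameter. A minimal sketch of the arithmetic (the 7B model size is just an illustrative choice):

```python
def model_memory_gb(n_params_b: float, bits: int) -> float:
    """Approximate weight memory for a model at a given precision.

    n_params_b: parameter count in billions.
    bits: bits per parameter (32, 16, 8, or 4).
    """
    bytes_per_param = bits / 8
    return n_params_b * 1e9 * bytes_per_param / 1024**3

# A 7B model at each precision from the table above:
for bits in (32, 16, 8, 4):
    print(f"{bits}-bit: {model_memory_gb(7, bits):.1f} GiB")
```

FP16 halves the FP32 footprint, INT8 quarters it, and INT4 cuts it to one eighth, matching the 50% / 75% / 87.5% reductions in the table.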
Calibration Methods for INT8
| Method | How It Works | Best For |
|---|---|---|
| Min-Max | Uses min/max values from calibration data | Simple, fast |
| Entropy | Minimizes KL divergence | Better accuracy |
| Percentile | Uses percentile of activation distribution | Outlier robustness |
Calibration Dataset Requirements:
- 100-1000 representative samples
- Cover expected input distribution
- Include edge cases for robustness
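The min-max method from the table above can be sketched in a few lines: the calibration data supplies the dynamic range, and a symmetric scale maps that range onto the signed 8-bit grid. This is a toy illustration of the idea, not TensorRT's implementation:

```python
def minmax_int8_scale(calibration_batches):
    """Min-max calibration: derive a symmetric INT8 scale from the
    largest absolute activation seen in the calibration data."""
    amax = max(abs(x) for batch in calibration_batches for x in batch)
    return amax / 127.0

def quantize_int8(x, scale):
    """Quantize one value to INT8, clamping to the representable range."""
    q = round(x / scale)
    return max(-128, min(127, q))

# Hypothetical calibration activations:
batches = [[-0.9, 0.3, 1.27], [0.5, -1.0]]
scale = minmax_int8_scale(batches)   # 1.27 / 127 = 0.01
print(quantize_int8(0.5, scale))     # round(0.5 / 0.01) = 50
```

Entropy and percentile calibration differ only in how they pick the clipping range: entropy minimizes KL divergence between the original and quantized distributions, while percentile discards extreme outliers before computing the scale.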
Latency Optimization Techniques
| Technique | Impact | Implementation Complexity |
|---|---|---|
| Quantization (INT8) | 2-4x latency reduction | Medium - requires calibration |
| KV-Cache Optimization | 1.5-2x for long sequences | Low - built into TensorRT-LLM |
| Continuous Batching | 2-3x throughput | Medium - Triton/NIM config |
| Speculative Decoding | 1.5-2x for generation | High - requires draft model |
| Flash Attention | 1.5-2x, less memory | Low - library swap |
| Tensor Parallelism | Linear scaling (GPUs) | Medium - model sharding |
Accuracy-Latency Trade-off Decision Tree
| Latency Target | Precision | Expected Accuracy |
|---|---|---|
| < 50ms (real-time) | INT4/INT8 | 90-95% of baseline |
| 50-100ms | INT8/FP16 | 95-98% of baseline |
| 100-200ms | FP16 | 99%+ of baseline |
| > 200ms | FP32 | 100% (baseline) |
Domain 2: GPU Acceleration & Distributed Training (22%)
Parallelism Strategies Overview
Total GPUs = Data Parallel (DP) × Tensor Parallel (TP) × Pipeline Parallel (PP)
Parallelism Strategy Selection
| Model Size | Best Strategy | Configuration Example |
|---|---|---|
| < 10B | Data Parallelism | DP=8, TP=1, PP=1 |
| 10-70B | Data + Tensor | DP=4, TP=4, PP=1 |
| 70-175B | All Three | DP=2, TP=4, PP=4 |
| > 175B | Heavy TP + PP | DP=1, TP=8, PP=8 |
Decision Factors:
- Memory-bound: Increase TP (splits attention heads)
- Compute-bound: Increase DP (more batch parallel)
- Very Deep: Use PP (split layers sequentially)
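The three parallelism degrees multiply to give the total GPU count, which makes the table's configurations easy to sanity-check:

```python
def world_size(dp: int, tp: int, pp: int) -> int:
    """Total GPUs required: the three parallelism degrees multiply."""
    return dp * tp * pp

# Configurations from the selection table above:
assert world_size(8, 1, 1) == 8    # < 10B: pure data parallelism
assert world_size(4, 4, 1) == 16   # 10-70B: data + tensor
assert world_size(2, 4, 4) == 32   # 70-175B: all three
assert world_size(1, 8, 8) == 64   # > 175B: heavy TP + PP
```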
DeepSpeed ZeRO Stages
| Stage | What's Sharded | Memory Savings | Communication |
|---|---|---|---|
| ZeRO-1 | Optimizer states | ~4x | Low |
| ZeRO-2 | + Gradients | ~8x | Medium |
| ZeRO-3 | + Parameters | ~N× (N=GPUs) | High |
When to Use:
- ZeRO-1: Default for multi-GPU, minimal overhead
- ZeRO-2: Memory constrained, acceptable comm overhead
- ZeRO-3: Extreme memory needs, willing to trade speed
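The savings ratios above follow from how ZeRO shards the three model states. A rough per-GPU estimator, assuming the standard mixed-precision Adam breakdown (2 bytes FP16 weights, 2 bytes FP16 gradients, 12 bytes optimizer states per parameter) and ignoring activations:

```python
def zero_memory_per_gpu_gb(n_params_b: float, n_gpus: int, stage: int) -> float:
    """Per-GPU model-state memory under DeepSpeed ZeRO stages 1-3.

    Assumes mixed-precision Adam: 2 B/param FP16 weights, 2 B/param FP16
    gradients, 12 B/param optimizer states (FP32 master + momentum + variance).
    """
    psi = n_params_b * 1e9
    weights, grads, opt = 2 * psi, 2 * psi, 12 * psi
    if stage == 1:                        # shard optimizer states only
        total = weights + grads + opt / n_gpus
    elif stage == 2:                      # + shard gradients
        total = weights + (grads + opt) / n_gpus
    else:                                 # stage 3: shard everything
        total = (weights + grads + opt) / n_gpus
    return total / 1024**3

# 7B model on 8 GPUs:
for s in (1, 2, 3):
    print(f"ZeRO-{s}: {zero_memory_per_gpu_gb(7, 8, s):.1f} GiB/GPU")
```

Note how stage 3 approaches the ideal 1/N split of all 16 bytes/param, at the cost of gathering parameters over the network on every forward and backward pass.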
GPU Memory Calculation
Training Memory Estimate (mixed-precision Adam, per parameter):
- Weights (FP16): 2 bytes
- Gradients (FP16): 2 bytes
- Optimizer states (FP32 master copy + Adam momentum + variance): 12 bytes
- Total: ~16 bytes/parameter for model states, plus activation memory
Gradient Checkpointing
Trade-off: Memory ↔ Compute
Without checkpointing: Store all activations → High memory
With checkpointing: Store checkpoints, recompute between → 30-50% memory reduction, 20-30% compute increase
When to Use:
- Training very large models on limited memory
- Enabling larger batch sizes
- Not latency-critical training
NCCL Communication Optimization
| Collective | Use Case | Optimization |
|---|---|---|
| AllReduce | Gradient aggregation | Ring topology for bandwidth |
| AllGather | ZeRO-3 param collection | Tree for latency |
| ReduceScatter | Gradient sharding | Ring for bandwidth |
| Broadcast | Weight distribution | Tree for latency |
Tips:
- Use NVLink for intra-node (8x A100)
- Use InfiniBand for inter-node
- Overlap communication with computation
Domain 3: Fine-Tuning & Data Preparation (22%)
Fine-Tuning Method Selection
| Method | Trainable Params | Memory | Use Case |
|---|---|---|---|
| Full Fine-Tuning | 100% | Very High | Significant domain shift |
| LoRA | 0.1-1% | Low | Most use cases |
| QLoRA | 0.1-1% | Very Low | Memory constrained |
| Adapters | 1-5% | Low | Multi-task learning |
| Prefix Tuning | <0.1% | Minimal | Prompting enhancement |
LoRA Configuration
LoRA Rank Impact: higher rank gives the adapter more capacity to capture task-specific variation, at the cost of more trainable parameters and memory.
LoRA Configuration Guidelines:
| Scenario | Rank | Alpha | Target Modules |
|---|---|---|---|
| Light adaptation | 4-8 | 8-16 | Q, V only |
| Moderate adaptation | 16-32 | 32-64 | Q, K, V, O |
| Heavy adaptation | 32-64 | 64-128 | Q, K, V, O, FFN |
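The "0.1-1% trainable parameters" figure follows from LoRA's low-rank factorization: each adapted d×d projection gains two factors, A (r×d) and B (d×r). A quick count, using Llama-7B-like dimensions as an illustrative assumption:

```python
def lora_trainable_params(d_model: int, rank: int, n_modules: int) -> int:
    """Parameters added by LoRA: each adapted square projection of width
    d_model gets two low-rank factors of shape (rank, d_model) and
    (d_model, rank)."""
    return n_modules * 2 * rank * d_model

# Llama-7B-like: d_model=4096, 32 layers, rank 16, Q and V only (2 modules/layer)
added = lora_trainable_params(4096, 16, 2 * 32)
print(added)               # 8,388,608 trainable params
print(added / 7e9 * 100)   # ≈ 0.12% of a 7B base model
```

Doubling the rank or targeting more modules scales the count linearly, which is why the heavy-adaptation row still stays well under 1% of the base model.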
QLoRA Specifics
QLoRA = LoRA + 4-bit Quantization + Double Quantization
Base Model: 4-bit NormalFloat (NF4) quantization
LoRA Weights: FP16/BF16
Gradients: Computed in BF16
Memory Savings:
- 7B model full FT: ~120GB
- 7B model LoRA: ~28GB
- 7B model QLoRA: ~10GB
Data Preparation Best Practices
Tokenization Selection:
| Tokenizer | Used By | Vocabulary Size | Best For |
|---|---|---|---|
| BPE | GPT, Llama | 32K-50K | English-centric |
| SentencePiece | T5, mT5 | 32K-100K | Multilingual |
| WordPiece | BERT | 30K | Classification |
Data Quality Checklist:
- Remove duplicates (dedup at document level)
- Filter low-quality text (perplexity filtering)
- Balance domains (prevent catastrophic forgetting)
- Validate format consistency
- Check for PII and sensitive content
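The document-level dedup step from the checklist can be sketched with exact content hashing; the whitespace/case normalization here is one reasonable choice, not a prescribed standard (production pipelines often add fuzzy dedup such as MinHash on top):

```python
import hashlib

def dedup_documents(docs):
    """Exact document-level deduplication via content hashing.
    Normalizes whitespace and case before hashing so trivially
    different copies collapse to one."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["The cat sat.", "the cat sat.  ", "A different doc."]
print(len(dedup_documents(docs)))  # 2 — the first two collapse to one
```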
Preventing Catastrophic Forgetting
| Strategy | How It Works |
|---|---|
| Data Mixing | Include 5-10% general data during fine-tuning |
| Low Learning Rate | 1e-5 to 1e-6 (10x lower than pre-training) |
| EWC | Add regularization on important weights |
| LoRA | Freeze base model, only train adapters |
Domain 4: LLM Foundations & Prompting (18%)
Attention Mechanism Variants
| Variant | Memory | Compute | Use Case |
|---|---|---|---|
| MHA (Multi-Head) | O(n²h) | Baseline | Standard transformers |
| MQA (Multi-Query) | O(n²) | Lower | Inference optimization |
| GQA (Grouped-Query) | O(n²g) | Balanced | Production LLMs (Llama 2) |
Key Insight: MQA shares K,V across heads → 8-16x less KV-cache memory
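The KV-cache saving is just the ratio of KV heads, since the cache stores one K and one V tensor per layer. A sketch with Llama-2-70B-like dimensions (80 layers, head_dim 128, 64 query heads) assumed for illustration:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per=2):
    """KV-cache size: 2 tensors (K and V) per layer, each of shape
    [batch, n_kv_heads, seq_len, head_dim], at bytes_per per element (FP16=2)."""
    elems = 2 * n_layers * batch * n_kv_heads * seq_len * head_dim
    return elems * bytes_per / 1024**3

# 4096-token context, batch 1:
mha = kv_cache_gb(80, 64, 128, 4096, 1)  # MHA: 64 KV heads
gqa = kv_cache_gb(80, 8, 128, 4096, 1)   # GQA: 8 KV-head groups
mqa = kv_cache_gb(80, 1, 128, 4096, 1)   # MQA: 1 shared KV head
print(f"MHA {mha:.1f} GiB, GQA {gqa:.2f} GiB, MQA {mqa:.3f} GiB")
print(f"MHA/MQA ratio: {mha / mqa:.0f}x")  # 64x — the ratio is the head count
```

With 64 query heads, MQA's saving is 64x; the 8-16x figure in the text corresponds to the smaller head counts of mid-sized models.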
Positional Encoding Comparison
| Encoding | Max Length | Extrapolation | Training |
|---|---|---|---|
| Absolute (Sinusoidal) | Fixed | Poor | Simple |
| Learned | Fixed | Poor | Data-hungry |
| RoPE | Flexible | Good | Complex |
| ALiBi | Flexible | Good | Simple |
Production Recommendation: RoPE (Llama) or ALiBi for long context
Prompt Engineering Techniques
| Technique | When to Use | Token Cost |
|---|---|---|
| Zero-shot | Model is capable, task is clear | Lowest |
| Few-shot | Need examples, <5 fit in context | Medium |
| Chain-of-Thought | Reasoning tasks, math, logic | High |
| Self-Consistency | Need robust answers | Very High |
| ReAct | Tool use, multi-step tasks | High |
Constrained Decoding
Output Control Methods:
- Logit Bias: Adjust token probabilities
- Grammar Constraints: Force JSON/XML format
- Stop Sequences: End generation at markers
- Top-K/Top-P: Control sampling diversity
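Two of these controls are simple enough to sketch end to end. This toy over a three-token vocabulary shows logit bias (ban a token with a large negative bias) feeding into top-p filtering; it illustrates the math, not any particular inference server's API:

```python
import math

def apply_logit_bias(logits: dict, bias: dict) -> dict:
    """Logit bias: shift selected tokens' logits before sampling.
    A large negative bias effectively bans a token."""
    return {tok: lg + bias.get(tok, 0.0) for tok, lg in logits.items()}

def top_p_filter(logits: dict, p: float = 0.9) -> dict:
    """Top-p (nucleus) filtering: keep the smallest set of highest-probability
    tokens whose cumulative probability reaches p."""
    z = sum(math.exp(v) for v in logits.values())
    ranked = sorted(((tok, math.exp(v) / z) for tok, v in logits.items()),
                    key=lambda kv: -kv[1])
    kept, cum = {}, 0.0
    for tok, prob in ranked:
        kept[tok] = prob
        cum += prob
        if cum >= p:
            break
    return kept

logits = {"yes": 3.0, "no": 2.0, "maybe": -1.0}
banned = apply_logit_bias(logits, {"maybe": -100.0})
print(sorted(top_p_filter(banned, p=0.95)))  # ['no', 'yes'] — 'maybe' is banned
```

Grammar constraints generalize the same idea: at each step, mask every token that would violate the target format before sampling.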
Domain 5: Evaluation, Monitoring & Safety (12%)
Evaluation Metrics Quick Reference
| Metric | Measures | Formula/Range |
|---|---|---|
| Perplexity | Language modeling | PPL = exp(avg cross-entropy), lower is better |
| BLEU | Translation quality | 0-100, higher is better |
| ROUGE | Summarization | 0-1, higher is better |
| BERTScore | Semantic similarity | -1 to 1, higher is better |
| Pass@k | Code generation | % of k samples that pass tests |
Perplexity: PPL = exp(−(1/N) Σᵢ log p(xᵢ)) — the exponentiated average per-token negative log-likelihood; lower is better.
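Computed directly from per-token probabilities, the metric behaves as the table describes (this is the standard definition, with made-up probabilities for illustration):

```python
import math

def perplexity(token_probs):
    """PPL = exp(average negative log-likelihood per token)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model assigning probability 0.25 to every token has PPL 4 —
# it is "as confused as" a uniform choice among 4 tokens.
print(perplexity([0.25, 0.25, 0.25]))
# Higher-confidence predictions → lower perplexity:
print(perplexity([0.9, 0.8, 0.95]) < perplexity([0.5, 0.4, 0.6]))  # True
```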
Production Monitoring Metrics
Latency Metrics:
- P50, P95, P99 latency
- Time to first token (TTFT)
- Tokens per second
Quality Metrics:
- User feedback scores
- Hallucination rate
- Task success rate
System Metrics:
- GPU utilization
- Memory usage
- Throughput (req/sec)
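P50/P95/P99 are percentiles over a window of request latencies; the tail metrics exist because averages hide slow requests. A minimal nearest-rank sketch (the latency values are made up; with only 10 samples the tail percentiles are necessarily coarse):

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample such that at least pct%
    of the samples are less than or equal to it."""
    ordered = sorted(samples)
    k = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[k]

latencies_ms = [42, 38, 45, 40, 180, 41, 39, 44, 43, 250]
print(percentile(latencies_ms, 50))  # 42 — the typical request
print(percentile(latencies_ms, 95))  # tail request
print(percentile(latencies_ms, 99))  # worst-case tail
```

Note the P50 here looks healthy while the tail is dominated by the two slow outliers — exactly the situation P95/P99 monitoring is meant to surface.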
Bias Detection Approaches
| Approach | Method | Automation |
|---|---|---|
| Demographic Parity | Equal positive rates across groups | Automated |
| Counterfactual Testing | Swap attributes, check output changes | Semi-automated |
| Red-Teaming | Adversarial probing | Manual |
| Toxicity Scoring | Perspective API, Detoxify | Automated |
Safety Guardrails Implementation
User Input → Input Filter → LLM → Output Filter → Response
(the input filter blocks harmful prompts; the output filter blocks harmful generations)
Common Guardrail Layers:
- Input: Block prompt injections, jailbreaks
- System: Limit capabilities, context length
- Output: Filter toxic/harmful content
- Post-hoc: Human review for edge cases
Quick Reference Tables
Model Size to Hardware Mapping
| Model Size | Inference (FP16) | Training (Full) | Training (LoRA) |
|---|---|---|---|
| 7B | 1x A100 80GB | 4x A100 | 1x A100 |
| 13B | 1x A100 80GB | 8x A100 | 1-2x A100 |
| 70B | 4x A100 80GB | 32x A100 | 4x A100 |
| 175B | 8x A100 80GB | 128+ A100 | 8x A100 |
Common Optimization Combinations
| Scenario | Recommended Stack |
|---|---|
| Low latency API | TensorRT-LLM + INT8 + Continuous batching |
| High throughput | vLLM + FP16 + Large batch |
| Memory constrained | QLoRA + INT4 + Gradient checkpointing |
| Maximum quality | FP16 + Full fine-tuning + Ensemble |
Training Hyperparameters (Typical Ranges)
| Parameter | Pre-training | Fine-tuning | LoRA |
|---|---|---|---|
| Learning Rate | 1e-4 to 3e-4 | 1e-5 to 5e-5 | 1e-4 to 3e-4 |
| Warmup Steps | 1000-2000 | 100-500 | 50-100 |
| Batch Size | 2M-4M tokens | 32-256 | 8-64 |
| Epochs | 1-3 | 1-5 | 1-3 |
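The warmup-steps row typically pairs with a decaying schedule. A common pattern — linear warmup to the peak learning rate, then cosine decay — sketched with illustrative LoRA-style values (2e-4 peak, 100 warmup steps):

```python
import math

def lr_at_step(step, max_lr, warmup_steps, total_steps):
    """Linear warmup to max_lr over warmup_steps, then cosine decay to zero."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr * 0.5 * (1 + math.cos(math.pi * progress))

print(lr_at_step(50, 2e-4, 100, 1000))    # 1e-4: halfway through warmup
print(lr_at_step(100, 2e-4, 100, 1000))   # 2e-4: peak
print(lr_at_step(1000, 2e-4, 100, 1000))  # 0.0: fully decayed
```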
Exam Day Quick Reminders
Optimization Domain (26%):
- TensorRT-LLM is NVIDIA's flagship inference optimizer
- INT8 needs calibration; INT4 uses NF4 format
- Continuous batching maximizes throughput
GPU Domain (22%):
- ZeRO-3 is most memory efficient but highest communication
- TP splits attention heads across GPUs
- PP introduces pipeline bubbles (use micro-batches)
Fine-Tuning Domain (22%):
- LoRA: rank 8-16 works for most tasks
- QLoRA: 4-bit base + FP16 LoRA adapters
- Always include some general data to prevent forgetting
Foundations Domain (18%):
- MQA/GQA reduce KV-cache memory
- RoPE enables length extrapolation
- CoT prompting helps reasoning tasks
Safety Domain (12%):
- Lower perplexity = better language modeling
- ROUGE for summarization, BLEU for translation
- Input AND output filters for production
Good luck on your NCP-GENL exam!
Sources
- NVIDIA TensorRT-LLM Documentation
- NVIDIA NCP-GENL Certification
- DeepSpeed ZeRO Documentation
- LoRA: Low-Rank Adaptation Paper
Last updated: February 8, 2026
