
NCA-GENM Cheat Sheet 2026: Quick Reference for Exam Day

Preporato Team | April 2, 2026 | 12 min read | NCA-GENM


This comprehensive cheat sheet covers all 7 NCA-GENM exam domains with formulas, architecture details, comparison tables, and key facts. Based on the official NVIDIA exam blueprint (25/20/15/15/10/10/5 weighting).


Domain 1: Experimentation (25%)

Evaluation Metrics Quick Reference

| Metric | Measures | Scale | Formula Intuition | When to Use |
|---|---|---|---|---|
| FID | Quality + diversity | Lower = better | Distance between real and generated image distributions | Comparing two generators overall |
| IS | Quality + diversity | Higher = better | Classifier confidence on generated images | Quick quality check |
| CLIP Score | Text-image alignment | Higher = better | Cosine similarity of CLIP embeddings | Does image match prompt? |
| BLEU | N-gram overlap | 0-1, higher = better | Precision of n-gram matches with reference | Caption evaluation |
| CIDEr | Caption consensus | Higher = better | TF-IDF weighted n-gram similarity | Best for captioning |
| METEOR | Word alignment | 0-1, higher = better | Precision + recall with synonyms | Captioning with paraphrases |

FID (Frechet Inception Distance)
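FID fits a multivariate Gaussian (mean μ, covariance Σ) to Inception-v3 features of the real and generated image sets and computes the Fréchet distance between them: FID = ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^(1/2)). A minimal NumPy sketch of that formula (a production setup would use `scipy.linalg.sqrtm` and real Inception features; `frechet_distance` is an illustrative name):

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """FID between two feature sets (rows = samples, cols = feature dims).

    FID = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 * sqrt(S_r @ S_g))
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    s_r = np.cov(feats_real, rowvar=False)
    s_g = np.cov(feats_gen, rowvar=False)
    # Matrix square root of S_r @ S_g via eigendecomposition
    # (scipy.linalg.sqrtm is the usual choice in real implementations)
    vals, vecs = np.linalg.eig(s_r @ s_g)
    covmean = ((vecs * np.sqrt(np.abs(vals))) @ np.linalg.inv(vecs)).real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(s_r + s_g - 2.0 * covmean))
```

Identical distributions give FID ≈ 0; the score grows as the generated statistics drift away from the real ones.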

Diffusion Model Hyperparameters

| Parameter | What It Controls | Default/Typical | Low Value Effect | High Value Effect |
|---|---|---|---|---|
| Guidance Scale | Prompt adherence vs diversity | 7.5 | Creative, diverse, less accurate | Strong adherence, may over-saturate |
| Inference Steps | Quality vs speed | 30-50 | Faster but lower quality | Better quality, diminishing returns |
| Scheduler | Denoising trajectory | DDPM | Baseline, slow | DDIM/DPM-Solver: fewer steps needed |
| Seed | Reproducibility | Random | N/A | Same seed = same output |
| Negative Prompt | What NOT to generate | Empty | No constraints | Removes unwanted features |

Guidance Scale Cheat Sheet

Scale 1.0  → No guidance (plain conditional prediction)
Scale 3-5  → Creative, diverse, less prompt-accurate
Scale 7-8  → Balanced (default for most use cases)
Scale 10-12 → Strong prompt adherence
Scale 15+  → Over-saturated, artifacts likely — avoid
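The scales above are applied through classifier-free guidance: the model predicts noise twice per step, once with the prompt and once without, and the final prediction is pushed along the difference. A sketch of that combination step (array names are illustrative):

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, scale):
    """Classifier-free guidance combination of two noise predictions.

    scale = 1.0 returns the plain conditional prediction (no amplification);
    larger scales push the output further in the prompt direction.
    """
    return eps_uncond + scale * (eps_cond - eps_uncond)
```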

Scheduler Comparison

| Scheduler | Steps Needed | Speed | Quality | Notes |
|---|---|---|---|---|
| DDPM | 1000 | Slowest | High | Original, rarely used for inference |
| DDIM | 20-50 | Fast | Good | Deterministic sampling option |
| Euler | 20-30 | Fast | Good | Simple, effective |
| DPM-Solver++ | 15-25 | Fastest | Good | State-of-the-art efficiency |
| UniPC | 10-20 | Very fast | Good | Unified predictor-corrector |

Fine-Tuning Methods

| Method | Trainable Params | Data Needed | Best For | Compute |
|---|---|---|---|---|
| LoRA | <1% | 50-500 images | Style transfer | Low |
| DreamBooth | ~All (+ prior preservation) | 3-10 images | Subject personalization | High |
| Textual Inversion | 1 embedding | 5-15 images | Lightweight concept learning | Very low |
| Full Fine-Tuning | 100% | 1000+ images | Domain adaptation | Very high |
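LoRA's "<1% trainable params" follows from its low-rank factorization: instead of updating a d×d weight W, it trains ΔW = B·A with rank r ≪ d. A sketch with illustrative dimensions (per adapted matrix the trainable fraction is 2r/d; the whole-model fraction drops below 1% because only the attention projections typically get adapters):

```python
import numpy as np

d, r = 768, 8                        # hidden dim and LoRA rank (illustrative)
W = np.zeros((d, d))                 # frozen pretrained weight, never updated
A = np.random.randn(r, d) * 0.01     # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection (zero init, so
                                     # the adapted model starts identical)

delta_W = B @ A                      # low-rank update, same shape as W
trainable = A.size + B.size          # 2 * d * r parameters
ratio = trainable / W.size           # 2r/d of this one matrix
```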

Prompt Engineering for Image Generation

Effective prompt structure:

[Subject] [Details] [Style] [Quality modifiers]
Example: "A golden retriever puppy, sitting in a field of wildflowers,
oil painting style, soft lighting, highly detailed, 4k"

Negative prompt examples:

"blurry, low quality, distorted, deformed, watermark, text,
oversaturated, bad anatomy, extra limbs"

Prompt weighting (Stable Diffusion syntax):

(important concept:1.3)  → 30% more emphasis
(less important:0.7)     → 30% less emphasis


Domain 2: Core ML and AI Knowledge (20%)

Vision Transformer (ViT) Architecture

Processing Pipeline:

Image (224×224) → Split into patches (14×14 grid of 16×16 patches)
→ Linear projection (each patch → embedding vector)
→ Add position embeddings (learnable, not sinusoidal)
→ Prepend CLS token
→ Transformer encoder (self-attention across all patches)
→ CLS token output → Classification head

Key Numbers:

  • ViT-Base: 86M params, 12 layers, 768 hidden dim, 12 heads
  • ViT-Large: 307M params, 24 layers, 1024 hidden dim, 16 heads
  • Standard patch size: 16×16 pixels
  • Standard input: 224×224 → 196 patches + 1 CLS token = 197 tokens
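The patching step can be sketched with a plain reshape (note that 16×16×3 = 768 happens to equal the ViT-Base hidden dim, but the real model applies a learned linear projection rather than feeding raw pixels):

```python
import numpy as np

img = np.zeros((224, 224, 3))     # H, W, C input image
P = 16                            # patch size

# Split into a 14x14 grid of 16x16 patches, flattening each to 768 values
patches = (img.reshape(14, P, 14, P, 3)
              .transpose(0, 2, 1, 3, 4)
              .reshape(-1, P * P * 3))

tokens = 1 + patches.shape[0]     # prepend CLS token: 196 + 1 = 197
```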

CLIP Architecture

Dual Encoder Structure:

Text  → Text Encoder (Transformer)  → Text Embedding (512-d)  ─┐
                                                               ├→ Cosine Similarity
Image → Image Encoder (ViT/ResNet)  → Image Embedding (512-d) ─┘

Contrastive Training:

  • Batch of N text-image pairs
  • N correct matches (diagonal) should have HIGH similarity
  • N² - N incorrect matches (off-diagonal) should have LOW similarity
  • Loss: symmetric cross-entropy on the similarity matrix
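The symmetric loss above can be sketched in NumPy: both the rows (image→text) and the columns (text→image) of the similarity matrix are scored as N-way classification problems whose correct class is the diagonal. The 0.07 temperature is CLIP's published initial value; the function name is illustrative:

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric cross-entropy over the NxN cosine-similarity matrix."""
    # L2-normalize so dot products are cosine similarities
    img_emb = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt_emb = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img_emb @ txt_emb.T / temperature            # N x N
    n = logits.shape[0]
    log_sm_rows = logits - np.log(np.exp(logits).sum(1, keepdims=True))
    log_sm_cols = logits - np.log(np.exp(logits).sum(0, keepdims=True))
    idx = np.arange(n)
    loss_i2t = -log_sm_rows[idx, idx].mean()  # image -> matching text
    loss_t2i = -log_sm_cols[idx, idx].mean()  # text  -> matching image
    return (loss_i2t + loss_t2i) / 2
```

Perfectly matched pairs drive the loss toward 0; mismatched pairs drive it up toward the inverse-temperature scale.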

Zero-Shot Classification:

1. Create text prompts: "a photo of a [class]" for each class
2. Encode all text prompts → text embeddings
3. Encode the image → image embedding
4. Compute cosine similarity between image and each text embedding
5. Predicted class = highest similarity

CLIP Key Facts:

  • Does NOT generate images — only aligns text and image representations
  • Trained on 400M text-image pairs from the internet
  • Enables zero-shot transfer without any fine-tuning
  • CLIP Score uses this model to evaluate text-image alignment

Diffusion Models

Forward Process (Adding Noise):

x_0 (clean image) → x_1 → x_2 → ... → x_T (pure Gaussian noise)
Each step: x_t = √(α_t) × x_{t-1} + √(1 - α_t) × ε
where ε ~ N(0, I)

Reverse Process (Learned Denoising):

x_T (noise) → x_{T-1} → ... → x_1 → x_0 (generated image)
Neural network learns: ε_θ(x_t, t) ≈ ε (predict the noise)
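Because each step mixes in Gaussian noise, the forward process has a closed form: with α_t = 1 − β_t and ᾱ_t = ∏ α_s, any x_t can be sampled directly as x_t = √(ᾱ_t)·x_0 + √(1 − ᾱ_t)·ε. A sketch using the linear β schedule from DDPM (names are illustrative):

```python
import numpy as np

# Linear beta schedule as in DDPM
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)      # abar_t, shrinks toward 0

def q_sample(x0, t, alphas_cumprod, rng):
    """Jump straight from x_0 to x_t (closed-form forward process)."""
    abar_t = alphas_cumprod[t]
    eps = rng.normal(size=x0.shape)           # eps ~ N(0, I)
    return np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps, eps
```

At t = T − 1, ᾱ_t is tiny, so x_t is almost pure Gaussian noise, matching the right-hand end of the chain above.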

Latent Diffusion Model (Stable Diffusion) Pipeline

Text Prompt → CLIP Text Encoder → Text Embeddings
                                        ↓ (cross-attention)
Random Noise → [U-Net Denoiser × N steps] → Denoised Latent
                                                    ↓
                                        VAE Decoder → Generated Image

Key Components:

  • VAE Encoder: Compresses 512×512 image → 64×64 latent (8× spatial reduction)
  • U-Net: Predicts noise in latent space, conditioned on text via cross-attention
  • CLIP Text Encoder: Converts text prompt to embedding vectors
  • VAE Decoder: Converts denoised latent back to pixel space
  • Scheduler: Controls the denoising trajectory (step size, noise removal rate)

Cross-Attention vs Self-Attention

| Feature | Self-Attention | Cross-Attention |
|---|---|---|
| Q, K, V source | All from same input | Q from one modality, K/V from another |
| Purpose | Model relationships within a sequence | Fuse information across modalities |
| In Stable Diffusion | Image latent attends to itself | Image latent (Q) attends to text (K, V) |
| Result | Spatial coherence in image | Text conditioning of image generation |

Variational Autoencoder (VAE)

Architecture:

Input Image → Encoder → μ, σ (mean, std of latent distribution)
→ Reparameterization: z = μ + σ × ε (ε ~ N(0,1))
→ Decoder → Reconstructed Image
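The reparameterization line is what makes the VAE trainable: sampling z directly from N(μ, σ²) is not differentiable with respect to the encoder outputs, but writing z = μ + σ·ε with ε ~ N(0, I) moves the randomness out of the gradient path. A minimal sketch (encoders conventionally output log σ² for numerical stability):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps, with eps ~ N(0, I).

    Gradients can flow through mu and sigma; only eps is random,
    and it carries no learnable parameters.
    """
    sigma = np.exp(0.5 * log_var)
    eps = rng.normal(size=np.shape(mu))
    return mu + sigma * eps
```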

Model Architecture Comparison

| Architecture | Type | Input Modalities | Key Use Case | Examples |
|---|---|---|---|---|
| ViT | Encoder | Image | Classification, features | ViT-B/16, DINOv2 |
| CLIP | Dual Encoder | Text + Image | Alignment, retrieval | OpenCLIP, SigLIP |
| Stable Diffusion | Diffusion + VAE | Text → Image | Image generation | SD 2.1, SDXL |
| DALL-E | Diffusion/Autoregressive | Text → Image | Image generation | DALL-E 3 |
| LLaVA | Vision-Language | Image + Text → Text | Visual QA, captioning | LLaVA-1.5, LLaVA-NeXT |
| Whisper | Encoder-Decoder | Audio → Text | Speech recognition | Whisper Large-v3 |

Domain 3: Multimodal Data (15%)

Image Preprocessing Pipeline

| Step | Operation | Typical Values | Purpose |
|---|---|---|---|
| 1. Resize | Resize to model size | 224×224 (ViT), 512×512 (SD) | Match expected input |
| 2. Center Crop | Crop center region | Square crop | Remove borders |
| 3. Normalize | Scale pixel values | ImageNet: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225] | Standardize input |
| 4. To Tensor | Convert to tensor | Channel-first (C, H, W) | Framework compatibility |
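Steps 3 and 4 can be sketched directly in NumPy (resizing and cropping would normally be done with PIL or torchvision, so this sketch assumes a 224×224 RGB uint8 input):

```python
import numpy as np

def normalize_to_chw(img_uint8):
    """Scale to [0, 1], standardize with ImageNet stats, convert HWC -> CHW."""
    x = img_uint8.astype(np.float32) / 255.0
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
    x = (x - mean) / std                  # broadcasts over the channel axis
    return x.transpose(2, 0, 1)           # (H, W, C) -> (C, H, W)
```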

Data Augmentation Safety Guide

| Augmentation | Safe for Multimodal? | Risk |
|---|---|---|
| Horizontal Flip | Usually safe | Breaks "left/right" descriptions |
| Random Crop | Safe if subject visible | May crop out described objects |
| Color Jitter (mild) | Usually safe | Breaks color-specific descriptions |
| Color Jitter (strong) | Unsafe | "Red car" becomes blue |
| Rotation (small) | Safe | Breaks orientation descriptions |
| Vertical Flip | Usually unsafe | Breaks gravity, spatial relations |
| Cutout/Erasing | Risky | May remove described objects |

Audio Data Representations

| Format | Description | Dimensions | Used By |
|---|---|---|---|
| Waveform | Raw amplitude over time | 1D signal | WaveNet, SoundStream |
| Spectrogram | Time × Frequency | 2D (like image) | General audio models |
| Mel Spectrogram | Perceptual frequency scale | 2D (like image) | Whisper, audio transformers |
| MFCC | Compact spectral features | Feature vectors | Traditional speech |

Dataset Quality Checklist


Domain 4: Software Development (15%)

NVIDIA Tools Reference

| Tool | What It Does | When to Use It |
|---|---|---|
| NeMo | Full framework for building and training AI models | Custom multimodal model development |
| Picasso | Cloud service for visual content generation | Enterprise image/video generation |
| NIM | Pre-optimized model deployment microservices | Quick production deployment |
| Triton | High-performance inference server | Multi-model serving at scale |
| TensorRT | Inference optimization engine | Maximum GPU inference speed |

Hugging Face Diffusers Key API

# Load pipeline
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16
).to("cuda")

# Change scheduler
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

# Generate image
image = pipe(
    prompt="a castle on a hill, fantasy art",
    negative_prompt="blurry, low quality",
    num_inference_steps=30,
    guidance_scale=7.5,
    generator=torch.Generator("cuda").manual_seed(42)
).images[0]

# Save
image.save("output.png")

CLIP Usage Pattern

from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Zero-shot classification (image: a PIL.Image loaded beforehand)
inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=image,
    return_tensors="pt"
)
outputs = model(**inputs)
similarity = outputs.logits_per_image  # shape: [1, 2]
predicted_class = similarity.argmax()  # 0 = cat, 1 = dog

Tool Selection Decision Tree

Need to TRAIN a custom model?         → NVIDIA NeMo
Need enterprise image GENERATION?     → NVIDIA Picasso
Need to DEPLOY a model quickly?       → NVIDIA NIM
Need high-throughput MODEL SERVING?   → Triton Inference Server
Need to OPTIMIZE inference speed?     → TensorRT
Need to PROTOTYPE with diffusion?     → Hugging Face Diffusers
Need to EVALUATE text-image match?    → CLIP (via Transformers)

Master These Concepts with Practice

Our NCA-GENM practice bundle includes:

  • 7 full practice exams (455+ questions)
  • Detailed explanations for every answer
  • Domain-by-domain performance tracking

30-day money-back guarantee

Domain 5: Data Analysis and Visualization (10%)

Visualization Techniques

| Technique | What It Shows | How to Create | Interpretation |
|---|---|---|---|
| Attention Maps | Where model looks in image | Extract attention weights, overlay on image | Bright regions = high attention |
| Grad-CAM | Class-relevant image regions | Gradient-weighted activation maps | Highlights decision-relevant areas |
| t-SNE | Embedding clusters in 2D | Apply t-SNE to CLIP embeddings | Similar items cluster together |
| UMAP | Embedding structure in 2D | Faster alternative to t-SNE | Preserves global structure better |
| Training Curves | Loss over time | Plot loss vs steps/epochs | Should decrease smoothly |
| FID over Training | Generation quality over time | Compute FID periodically | Should decrease, then plateau |

Training Curve Diagnosis

| Symptom | Diagnosis | Fix |
|---|---|---|
| Training loss drops, val loss rises | Overfitting | More data, regularization, early stopping |
| Both losses remain high | Underfitting | More capacity, longer training, tuned learning rate |
| Loss spikes or NaN | Divergence | Lower learning rate, gradient clipping |
| Loss plateaus very early | Learning rate too low | Increase LR, use warmup schedule |
| FID stops improving | Training saturated | Stop training, adjust architecture |

Embedding Space Quality Indicators

Good CLIP embedding space (t-SNE visualization): semantically related items form tight, well-separated clusters, with paired text and image points close together.

Poor embedding space: clusters overlap or collapse into one diffuse blob, and matched text-image pairs land far apart.


Domain 6: Performance Optimization (10%)

Quantization Reference

| Precision | Bytes/Param | Memory (1B params) | Speed vs FP32 | Quality Loss |
|---|---|---|---|---|
| FP32 | 4 | 4 GB | 1× (baseline) | None |
| FP16 | 2 | 2 GB | ~2× | Negligible |
| BF16 | 2 | 2 GB | ~2× | Negligible (better for training) |
| INT8 | 1 | 1 GB | ~3-4× | Minimal with calibration |
| INT4 | 0.5 | 0.5 GB | ~4-6× | Noticeable, needs calibration |
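The INT8 row can be sketched as symmetric per-tensor quantization: one scale maps the largest weight magnitude to 127, and each weight is stored as a single signed byte (real toolchains such as TensorRT choose the scale more carefully via calibration data):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: 4x smaller than FP32."""
    scale = np.abs(w).max() / 127.0                         # one scale per tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP32 values; error is at most scale / 2."""
    return q.astype(np.float32) * scale
```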

Diffusion Speed Optimization

| Technique | Speedup | Quality Impact | Difficulty |
|---|---|---|---|
| FP16 inference | ~2× | Negligible | Easy |
| Fewer steps (50→25) | ~2× | Mild | Easy |
| Fast scheduler (DPM++) | ~2-3× fewer steps | Minimal | Easy |
| TensorRT compilation | ~2-3× | None | Medium |
| Step distillation | ~4-8× | Requires distilled model | Hard |
| Consistency models | 1-4 steps total | Slight quality trade-off | Hard |

Serving Optimization

| Strategy | What It Does | When to Use |
|---|---|---|
| Dynamic Batching | Groups incoming requests | Multiple concurrent users |
| Model Caching | Keeps model in GPU memory | Repeated model use |
| Request Queuing | Manages request overflow | High traffic |
| Horizontal Scaling | Multiple GPU instances | Exceeding single-GPU capacity |
| Async Processing | Non-blocking generation | Long-running tasks (image gen) |

Domain 7: Trustworthy AI (5%)

Key Concepts

| Concept | Definition | Example |
|---|---|---|
| Visual Bias | Model generates stereotypical images | "CEO" generates only white male images |
| Content Safety | Filtering harmful generated content | NSFW classifier before output delivery |
| Watermarking | Invisible markers in generated images | Proves AI origin for provenance |
| Deepfakes | Realistic fake face/video generation | Used for misinformation, fraud |
| Data Provenance | Tracking training data sources | Important for IP compliance |
| Consent | Permission for data usage | Especially for face images |

Bias Detection Methods

  1. Prompt diversity testing: Generate images for the same role/concept across demographics
  2. Demographic analysis: Measure representation in generated outputs
  3. Comparative evaluation: Compare outputs for "doctor" vs "nurse" for gender distribution
  4. Red teaming: Deliberately test edge cases and sensitive prompts
  5. Automated classifiers: Use attribute classifiers to measure output distributions
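Method 5 can be sketched as a simple distribution check: run an attribute classifier over a batch of generated images, then compare each group's share (function and label names are illustrative):

```python
from collections import Counter

def group_shares(predicted_attributes):
    """Share of each attribute label among generated images.

    A heavily skewed distribution for a neutral prompt such as "a CEO"
    is a signal of visual bias worth investigating.
    """
    counts = Counter(predicted_attributes)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}
```

Compare the returned shares against a reference distribution, and flag prompts whose outputs deviate strongly.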

Content Safety Pipeline

User Prompt → Prompt Safety Check
    ├─ [If unsafe] → Block + Notify User
    └─ [If safe]   → Generate Image → Output Safety Check
                         ├─ [If unsafe] → Block + Log
                         └─ [If safe]   → Deliver

Watermarking Methods

| Method | Visibility | Robustness | Use Case |
|---|---|---|---|
| Visible watermark | User sees it | Easy to remove | Previews, demos |
| Invisible watermark | Imperceptible | Survives crops, compression | Production, provenance |
| Metadata embedding | In file metadata | Easy to strip | Basic tracking |
| Spectral watermark | In frequency domain | High robustness | Research, verification |

Quick Reference: Numbers to Remember

| Fact | Value |
|---|---|
| ViT-Base patch size | 16×16 pixels |
| ViT-Base input size | 224×224 pixels |
| ViT-Base patches | 196 (14×14 grid) |
| CLIP training data | 400M text-image pairs |
| CLIP embedding dim | 512 |
| Stable Diffusion latent size | 64×64 (from 512×512 input) |
| Typical guidance scale | 7.5 |
| Typical inference steps | 20-50 |
| FP16 memory savings | 2× over FP32 |
| INT8 memory savings | 4× over FP32 |
| NCA-GENM exam time | 60 minutes |
| NCA-GENM questions | 50-60 |
| NCA-GENM cost | $125 |
| NCA-GENM validity | 2 years |

Exam Day Quick Tips

  1. Largest domain is Experimentation (25%) — know metrics and hyperparameters cold
  2. CLIP appears across multiple domains — architecture, evaluation, visualization
  3. Diffusion models are the core generative technology — understand forward/reverse process
  4. ViT treats images as sequences of patches — not pixels, not arbitrary regions
  5. Cross-attention fuses text and image — text provides K/V, image provides Q
  6. FID = quality + diversity (lower better) vs CLIP Score = alignment (higher better)
  7. Guidance scale 7-8 is the default — 1.0 means no guidance, 15+ causes artifacts
  8. LoRA for style, DreamBooth for subjects, Textual Inversion for lightweight concepts
  9. FP16 is almost always a free speedup — negligible quality loss
  10. Never leave a question blank — no penalty for guessing


Ready to Pass the NCA-GENM Exam?

Join thousands who passed with Preporato practice tests

Instant access | 30-day guarantee | Updated monthly
