This comprehensive cheat sheet covers all 7 NCA-GENM exam domains with formulas, architecture details, comparison tables, and key facts. Based on the official NVIDIA exam blueprint (25/20/15/15/10/10/5 weighting).
Domain 1: Experimentation (25%)
Evaluation Metrics Quick Reference
| Metric | Measures | Scale | Formula Intuition | When to Use |
|---|---|---|---|---|
| FID | Quality + diversity | Lower = better | Distance between real and generated image distributions | Comparing two generators overall |
| IS | Quality + diversity | Higher = better | Classifier confidence on generated images | Quick quality check |
| CLIP Score | Text-image alignment | Higher = better | Cosine similarity of CLIP embeddings | Does image match prompt? |
| BLEU | N-gram overlap | 0-1, higher = better | Precision of n-gram matches with reference | Caption evaluation |
| CIDEr | Caption consensus | Higher = better | TF-IDF weighted n-gram similarity | Best for captioning |
| METEOR | Word alignment | 0-1, higher = better | Precision + recall with synonyms | Captioning with paraphrases |
FID (Fréchet Inception Distance)
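FID fits a Gaussian to Inception-v3 features of real and generated images and measures the Fréchet distance between the two: FID = ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^½). A minimal numpy sketch, assuming the feature means and covariances have already been extracted:

```python
import numpy as np

def fid(mu1, sigma1, mu2, sigma2):
    """Frechet distance between two Gaussians fitted to Inception features."""
    diff = mu1 - mu2
    # Eigenvalues of sigma1 @ sigma2 are real and non-negative for PSD inputs,
    # so Tr((sigma1 sigma2)^1/2) = sum of their square roots.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_covmean = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_covmean)
```

Identical distributions give FID = 0; any shift in mean or covariance increases it, which is why lower is better.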
Diffusion Model Hyperparameters
| Parameter | What It Controls | Default/Typical | Low Value Effect | High Value Effect |
|---|---|---|---|---|
| Guidance Scale | Prompt adherence vs diversity | 7.5 | Creative, diverse, less accurate | Strong adherence, may over-saturate |
| Inference Steps | Quality vs speed | 30-50 | Faster but lower quality | Better quality, diminishing returns |
| Scheduler | Denoising trajectory | DDPM | Baseline, slow | DDIM/DPM-Solver: fewer steps needed |
| Seed | Reproducibility | Random | N/A | Same seed = same output |
| Negative Prompt | What NOT to generate | Empty | No constraints | Removes unwanted features |
Guidance Scale Cheat Sheet
Scale 1.0 → No guidance (reduces to the plain conditional prediction)
Scale 3-5 → Creative, diverse, less prompt-accurate
Scale 7-8 → Balanced (default for most use cases)
Scale 10-12 → Strong prompt adherence
Scale 15+ → Over-saturated, artifacts likely — avoid
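Under the hood, the guidance scale extrapolates from the model's unconditional noise prediction toward (and past) its conditional one: ε = ε_uncond + s·(ε_cond − ε_uncond). A minimal sketch of that update (the function name is illustrative):

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, guidance_scale):
    # Classifier-free guidance: move the prediction away from the
    # unconditional output, toward (and past) the conditional one.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

At scale 0 this returns the unconditional prediction, at scale 1 the plain conditional one, and above 1 it amplifies the direction the prompt pulls in, which is why very high scales over-saturate.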
Scheduler Comparison
| Scheduler | Steps Needed | Speed | Quality | Notes |
|---|---|---|---|---|
| DDPM | 1000 | Slowest | High | Original, rarely used for inference |
| DDIM | 20-50 | Fast | Good | Deterministic sampling option |
| Euler | 20-30 | Fast | Good | Simple, effective |
| DPM-Solver++ | 15-25 | Fastest | Good | State-of-the-art efficiency |
| UniPC | 10-20 | Very fast | Good | Unified predictor-corrector |
Fine-Tuning Methods
| Method | Trainable Params | Data Needed | Best For | Compute |
|---|---|---|---|---|
| LoRA | <1% | 50-500 images | Style transfer | Low |
| DreamBooth | ~All (+ prior) | 3-10 images | Subject personalization | High |
| Textual Inversion | 1 embedding | 5-15 images | Lightweight concept learning | Very low |
| Full Fine-Tuning | 100% | 1000+ images | Domain adaptation | Very high |
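The LoRA row can be made concrete: LoRA freezes the pretrained weight W and learns a low-rank update BA, so a d×d matrix only contributes 2·r·d trainable parameters. A numpy sketch with illustrative dimensions (per-matrix the fraction is 2r/d; because LoRA is typically applied only to the attention projections, the whole-model trainable fraction lands below 1%):

```python
import numpy as np

d, r = 768, 8                      # hidden size and LoRA rank (illustrative values)
W = np.random.randn(d, d)          # frozen pretrained weight
A = np.random.randn(r, d) * 0.01   # trainable down-projection
B = np.zeros((d, r))               # trainable up-projection, zero-initialized
W_eff = W + B @ A                  # effective weight used at inference

# Fraction of this matrix's parameters that are actually trained: 2r/d
trainable_ratio = (A.size + B.size) / W.size
```

Zero-initializing B means the adapted model starts out exactly equal to the base model, so training only gradually moves it away.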
Prompt Engineering for Image Generation
Effective prompt structure:
[Subject] [Details] [Style] [Quality modifiers]
Example: "A golden retriever puppy, sitting in a field of wildflowers,
oil painting style, soft lighting, highly detailed, 4k"
Negative prompt examples:
"blurry, low quality, distorted, deformed, watermark, text,
oversaturated, bad anatomy, extra limbs"
Prompt weighting (Stable Diffusion syntax):
(important concept:1.3) → 30% more emphasis
(less important:0.7) → 30% less emphasis
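A tiny helper for assembling a prompt in this weighted syntax (the `weighted` function is illustrative, not part of any library; the syntax itself is the A1111-style convention shown above):

```python
def weighted(term, w):
    # Stable Diffusion attention-weighting syntax: "(term:weight)"
    return f"({term}:{w})"

prompt = ", ".join([
    "A golden retriever puppy",
    "sitting in a field of wildflowers",
    weighted("oil painting style", 1.3),   # ~30% more emphasis
    "soft lighting, highly detailed, 4k",
])
```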
CLIP Architecture
Text  → Text Encoder (Transformer) → Text Embedding (512-d)  ─┐
                                                              ├→ Cosine Similarity
Image → Image Encoder (ViT/ResNet) → Image Embedding (512-d) ─┘
Contrastive Training:
Batch of N text-image pairs
N correct matches (diagonal) should have HIGH similarity
N² - N incorrect matches (off-diagonal) should have LOW similarity
Loss: symmetric cross-entropy on the similarity matrix
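The objective above can be sketched in numpy (the function name and fixed temperature are illustrative; real CLIP learns the temperature during training):

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # N x N similarity matrix
    n = logits.shape[0]

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()  # diagonal = matches

    # Symmetric: image-to-text plus text-to-image
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Minimizing this pulls the N correct pairs (the diagonal) together while pushing the N² − N mismatched pairs apart.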
Zero-Shot Classification:
1. Create text prompts: "a photo of a [class]" for each class
2. Encode all text prompts → text embeddings
3. Encode the image → image embedding
4. Compute cosine similarity between image and each text embedding
5. Predicted class = highest similarity
CLIP Key Facts:
Does NOT generate images — only aligns text and image representations
Trained on 400M text-image pairs from the internet
Enables zero-shot transfer without any fine-tuning
CLIP Score uses this model to evaluate text-image alignment
Dataset Quality Checklist
Resolution: Images meet minimum model requirements
Diversity: Balanced representation across categories
Deduplication: No exact or near-duplicate pairs
Filtering: NSFW, low-quality, and watermarked images removed
Language quality: Captions are grammatically correct, descriptive
Scale: Sufficient examples for the task (thousands for fine-tuning, millions for pre-training)
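For the deduplication item, exact duplicates can be dropped with a content hash; a minimal sketch (near-duplicates require perceptual hashing or embedding similarity, which this does not cover):

```python
import hashlib

def dedup_pairs(pairs):
    # Drop exact duplicate (image_bytes, caption) pairs by content hash.
    seen, kept = set(), []
    for img_bytes, caption in pairs:
        key = hashlib.sha256(img_bytes + caption.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append((img_bytes, caption))
    return kept
```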
Domain 4: Software Development (15%)
NVIDIA Tools Reference
| Tool | What It Does | When to Use It |
|---|---|---|
| NeMo | Full framework for building and training AI models | Custom multimodal model development |
| Picasso | Cloud service for visual content generation | Enterprise image/video generation |
| NIM | Pre-optimized model deployment microservices | Quick production deployment |
| Triton | High-performance inference server | Multi-model serving at scale |
| TensorRT | Inference optimization engine | Maximum GPU inference speed |
Hugging Face Diffusers Key API
```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

# Load pipeline
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16,
).to("cuda")

# Change scheduler
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

# Generate image
image = pipe(
    prompt="a castle on a hill, fantasy art",
    negative_prompt="blurry, low quality",
    num_inference_steps=30,
    guidance_scale=7.5,
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]

# Save
image.save("output.png")
```
CLIP Usage Pattern
```python
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Zero-shot classification (image is a PIL.Image loaded elsewhere)
inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=image,
    return_tensors="pt",
)
outputs = model(**inputs)
similarity = outputs.logits_per_image  # shape: [1, 2]
predicted_class = similarity.argmax()  # 0 = cat, 1 = dog
```
Tool Selection Decision Tree
Need to TRAIN a custom model? → NVIDIA NeMo
Need enterprise image GENERATION? → NVIDIA Picasso
Need to DEPLOY a model quickly? → NVIDIA NIM
Need high-throughput MODEL SERVING? → Triton Inference Server
Need to OPTIMIZE inference speed? → TensorRT
Need to PROTOTYPE with diffusion? → Hugging Face Diffusers
Need to EVALUATE text-image match? → CLIP (via Transformers)