This comprehensive cheat sheet covers all 7 NCA-GENM exam domains with formulas, architecture details, comparison tables, and key facts. Based on the official NVIDIA exam blueprint (25/20/15/15/10/10/5 weighting).
| Metric | Measures | Range / Direction | How It Works | Best For |
|---|---|---|---|---|
| FID | Distribution distance | Lower = better | Distance between real and generated image distributions | Comparing two generators overall |
| IS | Quality + diversity | Higher = better | Classifier confidence on generated images | Quick quality check |
| CLIP Score | Text-image alignment | Higher = better | Cosine similarity of CLIP embeddings | Does image match prompt? |
| BLEU | N-gram overlap | 0-1, higher = better | Precision of n-gram matches with reference | Caption evaluation |
| CIDEr | Caption consensus | Higher = better | TF-IDF weighted n-gram similarity | Best for captioning |
| METEOR | Word alignment | 0-1, higher = better | Precision + recall with synonyms | Captioning with paraphrases |
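The "precision of n-gram matches" idea behind BLEU can be sketched as clipped n-gram precision against a single reference caption. This is a simplified illustration only: full BLEU also combines several n-gram orders with a geometric mean and applies a brevity penalty, both omitted here. The function names and the sample captions are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_ngram_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference (the "clipping" step).
    """
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


candidate = "a dog sits on the grass"
reference = "a dog is sitting on the grass"
p1 = clipped_ngram_precision(candidate, reference, n=1)  # unigram precision
p2 = clipped_ngram_precision(candidate, reference, n=2)  # bigram precision
print(f"unigram precision: {p1:.3f}, bigram precision: {p2:.3f}")
```

Clipping is what stops a degenerate caption such as "the the the" from scoring perfectly against a reference that contains "the" only once.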
FID (Frechet Inception Distance)
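FID fits a Gaussian to feature vectors of real images and another to features of generated images, then takes the Fréchet distance between the two: ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2(S_r S_g)^(1/2)). The sketch below is a deliberate simplification that assumes diagonal covariances, in which case the trace term reduces to a sum of squared differences of per-dimension standard deviations. Real FID uses full covariance matrices over Inception-v3 features; the toy 2-D feature lists and the function name are hypothetical.

```python
import math
from statistics import fmean, pvariance


def frechet_distance_diagonal(real_feats, gen_feats):
    """Fréchet distance between two Gaussians, ASSUMING diagonal covariance.

    real_feats / gen_feats: lists of equal-length feature vectors.
    """
    dims = len(real_feats[0])
    total = 0.0
    for d in range(dims):
        real_d = [f[d] for f in real_feats]
        gen_d = [f[d] for f in gen_feats]
        mean_gap = fmean(real_d) - fmean(gen_d)
        std_gap = math.sqrt(pvariance(real_d)) - math.sqrt(pvariance(gen_d))
        total += mean_gap ** 2 + std_gap ** 2
    return total


real = [[0.0, 0.0], [2.0, 2.0]]
same = [[0.0, 0.0], [2.0, 2.0]]
shifted = [[1.0, 0.0], [3.0, 2.0]]  # first dimension shifted by +1

d_same = frechet_distance_diagonal(real, same)
d_shifted = frechet_distance_diagonal(real, shifted)
print(d_same, d_shifted)
```

Identical feature sets give a distance of zero, and the distance grows as the means or spreads drift apart, which is why lower FID is better.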
Diffusion Model Hyperparameters
| Parameter | What It Controls | Default/Typical | Low Value Effect | High Value Effect |
|---|---|---|---|---|
| Guidance Scale | Prompt adherence vs diversity | 7.5 | Creative, diverse, less accurate | Strong adherence, may over-saturate |
| Inference Steps | Quality vs speed | 30-50 | Faster but lower quality | Better quality, diminishing returns |
| Scheduler | Denoising trajectory | DDPM | Baseline, slow | DDIM/DPM-Solver: fewer steps needed |
| Seed | Reproducibility | Random | N/A | Same seed = same output |
| Negative Prompt | What NOT to generate | Empty | No constraints | Removes unwanted features |
Guidance Scale Cheat Sheet
Scale 1.0 → No guidance (classifier-free guidance off; the prompt-conditioned prediction is used as-is)
Scale 3-5 → Creative, diverse, less prompt-accurate
Scale 7-8 → Balanced (default for most use cases)
Scale 10-12 → Strong prompt adherence
Scale 15+ → Over-saturated, artifacts likely — avoid
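The scale values above can be read against the classifier-free guidance combination used in Stable Diffusion pipelines: the final noise prediction is the unconditional prediction plus the scale times the gap between the conditional and unconditional predictions. A minimal sketch with toy two-element lists standing in for the noise tensors (the numbers and the function name are hypothetical):

```python
def apply_guidance(noise_uncond, noise_cond, guidance_scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond), element-wise."""
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]


uncond = [0.0, 1.0]  # toy prediction without the prompt
cond = [1.0, 3.0]    # toy prediction with the prompt

at_0 = apply_guidance(uncond, cond, 0.0)      # equals the unconditional prediction
at_1 = apply_guidance(uncond, cond, 1.0)      # equals the conditional prediction
at_7_5 = apply_guidance(uncond, cond, 7.5)    # pushed well past the conditional prediction
print(at_0, at_1, at_7_5)
```

With this formula a scale of 1.0 reproduces the prompt-conditioned prediction with no extra push, while larger scales extrapolate further along the prompt direction, which is where over-saturation and artifacts at very high values come from.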
Scheduler Comparison
| Scheduler | Steps Needed | Speed | Quality | Notes |
|---|---|---|---|---|
| DDPM | 1000 | Slowest | High | Original, rarely used for inference |
| DDIM | 20-50 | Fast | Good | Deterministic sampling option |
| Euler | 20-30 | Fast | Good | Simple, effective |
| DPM-Solver++ | 15-25 | Fastest | Good | State-of-the-art efficiency |
| UniPC | 10-20 | Very fast | Good | Unified predictor-corrector |
Fine-Tuning Methods
| Method | Trainable Params | Data Needed | Best For | Compute |
|---|---|---|---|---|
| LoRA | <1% | 50-500 images | Style transfer | Low |
| DreamBooth | ~All (+ prior) | 3-10 images | Subject personalization | High |
| Textual Inversion | 1 embedding | 5-15 images | Lightweight concept learning | Very low |
| Full Fine-Tuning | 100% | 1000+ images | Domain adaptation | Very high |
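The "<1%" figure for LoRA follows from its low-rank factorization: instead of updating a frozen d x k weight matrix, it trains two small matrices of shapes d x r and r x k, so r * (d + k) parameters rather than d * k. A rough worked example for a single layer, assuming a 4096 x 4096 weight and rank 8 (both values are illustrative assumptions, not taken from any particular model):

```python
def lora_trainable_fraction(d, k, rank):
    """Return (LoRA params, full params, LoRA/full ratio) for one d x k layer."""
    full_params = d * k
    lora_params = rank * (d + k)  # B is d x rank, A is rank x k
    return lora_params, full_params, lora_params / full_params


lora_params, full_params, fraction = lora_trainable_fraction(d=4096, k=4096, rank=8)
print(f"{lora_params:,} of {full_params:,} parameters ({fraction:.2%})")
```

The overall trainable share for a whole model depends on which layers receive adapters and on the chosen rank, so this per-layer ratio is only an indication of scale.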
Prompt Engineering for Image Generation
Effective prompt structure:
[Subject] [Details] [Style] [Quality modifiers]
Example: "A golden retriever puppy, sitting in a field of wildflowers,
oil painting style, soft lighting, highly detailed, 4k"
Negative prompt examples:
"blurry, low quality, distorted, deformed, watermark, text,
oversaturated, bad anatomy, extra limbs"
Prompt weighting (Stable Diffusion web UI syntax, e.g. AUTOMATIC1111):
(important concept:1.3) → 30% more emphasis
(less important:0.7) → 30% less emphasis
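The prompt structure above can be assembled programmatically. The sketch below is a hypothetical helper, not part of any library: it joins the four slots with commas and formats the `(term:weight)` emphasis syntax as a plain string. Note that this weighting syntax is interpreted by front ends such as the AUTOMATIC1111 web UI; a plain Diffusers `StableDiffusionPipeline` call passes it through as literal text.

```python
def weight(term, w):
    """Format a term with the (term:weight) emphasis syntax."""
    return f"({term}:{w})"


def build_prompt(subject, details=(), style=(), quality=()):
    """Join [Subject] [Details] [Style] [Quality modifiers] into one prompt."""
    parts = [subject, *details, *style, *quality]
    return ", ".join(p for p in parts if p)


prompt = build_prompt(
    "A golden retriever puppy",
    details=["sitting in a field of wildflowers"],
    style=["oil painting style", "soft lighting"],
    quality=["highly detailed", "4k"],
)
negative_prompt = ", ".join(["blurry", "low quality", "watermark", "extra limbs"])
emphasized = build_prompt("a castle on a hill", style=[weight("fantasy art", 1.3)])
print(prompt)
print(negative_prompt)
print(emphasized)
```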
Text → Text Encoder (Transformer) → Text Embedding (512-d)
→ Cosine Similarity
Image → Image Encoder (ViT/ResNet) → Image Embedding (512-d)
Contrastive Training:
Batch of N text-image pairs
N correct matches (diagonal) should have HIGH similarity
N² - N incorrect matches (off-diagonal) should have LOW similarity
Loss: symmetric cross-entropy on the similarity matrix
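The symmetric cross-entropy over the similarity matrix can be sketched in plain Python. Rows are treated as images and columns as texts (the orientation is an arbitrary choice here), with the correct pairs on the diagonal. The temperature of 0.07 is an assumed fixed value for illustration; in CLIP it is a learned parameter. The function names and the toy matrices are hypothetical.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def clip_contrastive_loss(sim, temperature=0.07):
    """Symmetric cross-entropy on an N x N similarity matrix.

    sim[i][j] is the cosine similarity of image i and text j;
    the matching pair for row i is column i.
    """
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    # Image-to-text direction: softmax over each row, target is the diagonal.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: softmax over each column, target is the diagonal.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


aligned = [[1.0, 0.1], [0.1, 1.0]]     # high diagonal, low off-diagonal
mismatched = [[0.1, 1.0], [1.0, 0.1]]  # the wrong pairs are most similar
uniform = [[0.5, 0.5], [0.5, 0.5]]     # no signal at all

loss_aligned = clip_contrastive_loss(aligned)
loss_mismatched = clip_contrastive_loss(mismatched)
loss_uniform = clip_contrastive_loss(uniform)
print(loss_aligned, loss_mismatched, loss_uniform)
```

Training drives the loss toward the aligned case: the N diagonal entries are pulled up while the N² - N off-diagonal entries are pushed down.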
Zero-Shot Classification:
1. Create text prompts: "a photo of a [class]" for each class
2. Encode all text prompts → text embeddings
3. Encode the image → image embedding
4. Compute cosine similarity between image and each text embedding
5. Predicted class = highest similarity
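Steps 4 and 5 above reduce to a cosine-similarity lookup followed by an argmax. The sketch below uses hand-written 3-dimensional vectors as stand-ins for real CLIP embeddings (which would come from the text and image encoders and be 512-d); the vectors and function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, text_embeddings):
    """Return the prompt with the highest cosine similarity, plus all scores."""
    scores = {
        prompt: cosine_similarity(image_embedding, emb)
        for prompt, emb in text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy stand-ins for encoder outputs (steps 1-3).
text_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
}
image_embedding = [0.8, 0.3, 0.1]

label, scores = zero_shot_classify(image_embedding, text_embeddings)
print(label, scores)
```

Because the class list is just a set of text prompts, adding a new class means adding one more prompt, with no retraining.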
CLIP Key Facts:
Does NOT generate images — only aligns text and image representations
Trained on 400M text-image pairs from the internet
Enables zero-shot transfer without any fine-tuning
CLIP Score uses this model to evaluate text-image alignment
| Tool | What It Is | Best For |
|---|---|---|
| NeMo | Full framework for building and training AI models | Custom multimodal model development |
| Picasso | Cloud service for visual content generation | Enterprise image/video generation |
| NIM | Pre-optimized model deployment microservices | Quick production deployment |
| Triton | High-performance inference server | Multi-model serving at scale |
| TensorRT | Inference optimization engine | Maximum GPU inference speed |
Hugging Face Diffusers Key API
# Load pipeline
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16,  # half precision to reduce GPU memory
).to("cuda")

# Change scheduler (reuse the existing scheduler's config)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

# Generate image
image = pipe(
    prompt="a castle on a hill, fantasy art",
    negative_prompt="blurry, low quality",
    num_inference_steps=30,
    guidance_scale=7.5,
    generator=torch.Generator("cuda").manual_seed(42),  # fixed seed for reproducibility
).images[0]

# Save
image.save("output.png")
CLIP Usage Pattern
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Image to classify (here, the file saved by the Diffusers example)
image = Image.open("output.png")

# Zero-shot classification
inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=image,
    return_tensors="pt",
    padding=True,  # pad prompts to a common length
)
outputs = model(**inputs)
similarity = outputs.logits_per_image  # shape: [1, 2]
predicted_class = similarity.argmax(dim=-1).item()  # 0 = cat, 1 = dog
Tool Selection Decision Tree
Need to TRAIN a custom model? → NVIDIA NeMo
Need enterprise image GENERATION? → NVIDIA Picasso
Need to DEPLOY a model quickly? → NVIDIA NIM
Need high-throughput MODEL SERVING? → Triton Inference Server
Need to OPTIMIZE inference speed? → TensorRT
Need to PROTOTYPE with diffusion? → Hugging Face Diffusers
Need to EVALUATE text-image match? → CLIP (via Transformers)