
NCA-GENM Cheat Sheet 2026: Quick Reference for Exam Day

Preporato Team | April 2, 2026 | 12 min read | NCA-GENM


This comprehensive cheat sheet covers all 7 NCA-GENM exam domains with formulas, architecture details, comparison tables, and key facts. Based on the official NVIDIA exam blueprint (25/20/15/15/10/10/5 weighting).


Domain 1: Experimentation (25%)

Evaluation Metrics Quick Reference

| Metric | Measures | Scale | Formula Intuition | When to Use |
|---|---|---|---|---|
| FID | Quality + diversity | Lower = better | Distance between real and generated image distributions | Comparing two generators overall |
| IS | Quality + diversity | Higher = better | Classifier confidence on generated images | Quick quality check |
| CLIP Score | Text-image alignment | Higher = better | Cosine similarity of CLIP embeddings | Does image match prompt? |
| BLEU | N-gram overlap | 0-1, higher = better | Precision of n-gram matches with reference | Caption evaluation |
| CIDEr | Caption consensus | Higher = better | TF-IDF weighted n-gram similarity | Best for captioning |
| METEOR | Word alignment | 0-1, higher = better | Precision + recall with synonyms | Captioning with paraphrases |

FID (Frechet Inception Distance)
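FID fits a multivariate Gaussian (mean μ, covariance Σ) to Inception-v3 features of the real and generated image sets and computes the Fréchet distance between them: FID = ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^(1/2)). A minimal NumPy sketch of that formula (a production setup would use `scipy.linalg.sqrtm` and real Inception features; `frechet_distance` is an illustrative name):

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """FID between two feature sets (rows = samples, cols = feature dims).

    FID = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 * sqrt(S_r @ S_g))
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    s_r = np.cov(feats_real, rowvar=False)
    s_g = np.cov(feats_gen, rowvar=False)
    # Matrix square root of S_r @ S_g via eigendecomposition
    # (scipy.linalg.sqrtm is the usual choice in real implementations)
    vals, vecs = np.linalg.eig(s_r @ s_g)
    covmean = ((vecs * np.sqrt(np.abs(vals))) @ np.linalg.inv(vecs)).real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(s_r + s_g - 2.0 * covmean))
```

Identical distributions give FID ≈ 0; the score grows as the generated statistics drift away from the real ones.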

Diffusion Model Hyperparameters

| Parameter | What It Controls | Default/Typical | Low Value Effect | High Value Effect |
|---|---|---|---|---|
| Guidance Scale | Prompt adherence vs diversity | 7.5 | Creative, diverse, less accurate | Strong adherence, may over-saturate |
| Inference Steps | Quality vs speed | 30-50 | Faster but lower quality | Better quality, diminishing returns |
| Scheduler | Denoising trajectory | DDPM | Baseline, slow | DDIM/DPM-Solver: fewer steps needed |
| Seed | Reproducibility | Random | N/A | Same seed = same output |
| Negative Prompt | What NOT to generate | Empty | No constraints | Removes unwanted features |

Guidance Scale Cheat Sheet

Scale 1.0  → No guidance (plain conditional prediction)
Scale 3-5  → Creative, diverse, less prompt-accurate
Scale 7-8  → Balanced (default for most use cases)
Scale 10-12 → Strong prompt adherence
Scale 15+  → Over-saturated, artifacts likely — avoid
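The scales above are applied through classifier-free guidance: the model predicts noise twice per step, once with the prompt and once without, and the final prediction is pushed along the difference. A sketch of that combination step (array names are illustrative):

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, scale):
    """Classifier-free guidance combination of two noise predictions.

    scale = 1.0 returns the plain conditional prediction (no amplification);
    larger scales push the output further in the prompt direction.
    """
    return eps_uncond + scale * (eps_cond - eps_uncond)
```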

Scheduler Comparison

| Scheduler | Steps Needed | Speed | Quality | Notes |
|---|---|---|---|---|
| DDPM | 1000 | Slowest | High | Original, rarely used for inference |
| DDIM | 20-50 | Fast | Good | Deterministic sampling option |
| Euler | 20-30 | Fast | Good | Simple, effective |
| DPM-Solver++ | 15-25 | Fastest | Good | State-of-the-art efficiency |
| UniPC | 10-20 | Very fast | Good | Unified predictor-corrector |

Fine-Tuning Methods

| Method | Trainable Params | Data Needed | Best For | Compute |
|---|---|---|---|---|
| LoRA | <1% | 50-500 images | Style transfer | Low |
| DreamBooth | ~All (+ prior preservation) | 3-10 images | Subject personalization | High |
| Textual Inversion | 1 embedding | 5-15 images | Lightweight concept learning | Very low |
| Full Fine-Tuning | 100% | 1000+ images | Domain adaptation | Very high |
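LoRA's "<1% trainable params" follows from its low-rank factorization: instead of updating a d×d weight W, it trains ΔW = B·A with rank r ≪ d. A sketch with illustrative dimensions (per adapted matrix the trainable fraction is 2r/d; the whole-model fraction drops below 1% because only the attention projections typically get adapters):

```python
import numpy as np

d, r = 768, 8                        # hidden dim and LoRA rank (illustrative)
W = np.zeros((d, d))                 # frozen pretrained weight, never updated
A = np.random.randn(r, d) * 0.01     # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection (zero init, so
                                     # the adapted model starts identical)

delta_W = B @ A                      # low-rank update, same shape as W
trainable = A.size + B.size          # 2 * d * r parameters
ratio = trainable / W.size           # 2r/d of this one matrix
```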

Prompt Engineering for Image Generation

Effective prompt structure:

[Subject] [Details] [Style] [Quality modifiers]
Example: "A golden retriever puppy, sitting in a field of wildflowers,
oil painting style, soft lighting, highly detailed, 4k"

Negative prompt examples:

"blurry, low quality, distorted, deformed, watermark, text,
oversaturated, bad anatomy, extra limbs"

Prompt weighting (Stable Diffusion syntax):

(important concept:1.3)  → 30% more emphasis
(less important:0.7)     → 30% less emphasis


Domain 2: Core ML and AI Knowledge (20%)

Vision Transformer (ViT) Architecture

Processing Pipeline:

Image (224×224) → Split into patches (14×14 grid of 16×16 patches)
→ Linear projection (each patch → embedding vector)
→ Add position embeddings (learnable, not sinusoidal)
→ Prepend CLS token
→ Transformer encoder (self-attention across all patches)
→ CLS token output → Classification head

Key Numbers:

  • ViT-Base: 86M params, 12 layers, 768 hidden dim, 12 heads
  • ViT-Large: 307M params, 24 layers, 1024 hidden dim, 16 heads
  • Standard patch size: 16×16 pixels
  • Standard input: 224×224 → 196 patches + 1 CLS token = 197 tokens
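The patching step can be sketched with a plain reshape (note that 16×16×3 = 768 happens to equal the ViT-Base hidden dim, but the real model applies a learned linear projection rather than feeding raw pixels):

```python
import numpy as np

img = np.zeros((224, 224, 3))     # H, W, C input image
P = 16                            # patch size

# Split into a 14x14 grid of 16x16 patches, flattening each to 768 values
patches = (img.reshape(14, P, 14, P, 3)
              .transpose(0, 2, 1, 3, 4)
              .reshape(-1, P * P * 3))

tokens = 1 + patches.shape[0]     # prepend CLS token: 196 + 1 = 197
```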

CLIP Architecture

Dual Encoder Structure:

Text  → Text Encoder (Transformer)  → Text Embedding (512-d)  ─┐
                                                               ├→ Cosine Similarity
Image → Image Encoder (ViT/ResNet)  → Image Embedding (512-d) ─┘

Contrastive Training:

  • Batch of N text-image pairs
  • N correct matches (diagonal) should have HIGH similarity
  • N² - N incorrect matches (off-diagonal) should have LOW similarity
  • Loss: symmetric cross-entropy on the similarity matrix
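The symmetric loss above can be sketched in NumPy: both the rows (image→text) and the columns (text→image) of the similarity matrix are scored as N-way classification problems whose correct class is the diagonal. The 0.07 temperature is CLIP's published initial value; the function name is illustrative:

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric cross-entropy over the NxN cosine-similarity matrix."""
    # L2-normalize so dot products are cosine similarities
    img_emb = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt_emb = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img_emb @ txt_emb.T / temperature            # N x N
    n = logits.shape[0]
    log_sm_rows = logits - np.log(np.exp(logits).sum(1, keepdims=True))
    log_sm_cols = logits - np.log(np.exp(logits).sum(0, keepdims=True))
    idx = np.arange(n)
    loss_i2t = -log_sm_rows[idx, idx].mean()  # image -> matching text
    loss_t2i = -log_sm_cols[idx, idx].mean()  # text  -> matching image
    return (loss_i2t + loss_t2i) / 2
```

Perfectly matched pairs drive the loss toward 0; mismatched pairs drive it up toward the inverse-temperature scale.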

Zero-Shot Classification:

1. Create text prompts: "a photo of a [class]" for each class
2. Encode all text prompts → text embeddings
3. Encode the image → image embedding
4. Compute cosine similarity between image and each text embedding
5. Predicted class = highest similarity

CLIP Key Facts:

  • Does NOT generate images — only aligns text and image representations
  • Trained on 400M text-image pairs from the internet
  • Enables zero-shot transfer without any fine-tuning
  • CLIP Score uses this model to evaluate text-image alignment

Diffusion Models

Forward Process (Adding Noise):

x_0 (clean image) → x_1 → x_2 → ... → x_T (pure Gaussian noise)
Each step: x_t = √(α_t) × x_{t-1} + √(1 - α_t) × ε
where ε ~ N(0, I)

Reverse Process (Learned Denoising):

x_T (noise) → x_{T-1} → ... → x_1 → x_0 (generated image)
Neural network learns: ε_θ(x_t, t) ≈ ε (predict the noise)
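Because each step mixes in Gaussian noise, the forward process has a closed form: with α_t = 1 − β_t and ᾱ_t = ∏ α_s, any x_t can be sampled directly as x_t = √(ᾱ_t)·x_0 + √(1 − ᾱ_t)·ε. A sketch using the linear β schedule from DDPM (names are illustrative):

```python
import numpy as np

# Linear beta schedule as in DDPM
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)      # abar_t, shrinks toward 0

def q_sample(x0, t, alphas_cumprod, rng):
    """Jump straight from x_0 to x_t (closed-form forward process)."""
    abar_t = alphas_cumprod[t]
    eps = rng.normal(size=x0.shape)           # eps ~ N(0, I)
    return np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps, eps
```

At t = T − 1, ᾱ_t is tiny, so x_t is almost pure Gaussian noise, matching the right-hand end of the chain above.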

Latent Diffusion Model (Stable Diffusion) Pipeline

Text Prompt → CLIP Text Encoder → Text Embeddings
                                        ↓ (cross-attention)
Random Noise → [U-Net Denoiser × N steps] → Denoised Latent
                                                    ↓
                                        VAE Decoder → Generated Image

Key Components:

  • VAE Encoder: Compresses 512×512 image → 64×64 latent (8× spatial reduction)
  • U-Net: Predicts noise in latent space, conditioned on text via cross-attention
  • CLIP Text Encoder: Converts text prompt to embedding vectors
  • VAE Decoder: Converts denoised latent back to pixel space
  • Scheduler: Controls the denoising trajectory (step size, noise removal rate)

Cross-Attention vs Self-Attention

| Feature | Self-Attention | Cross-Attention |
|---|---|---|
| Q, K, V source | All from same input | Q from one modality, K/V from another |
| Purpose | Model relationships within a sequence | Fuse information across modalities |
| In Stable Diffusion | Image latent attends to itself | Image latent (Q) attends to text (K, V) |
| Result | Spatial coherence in image | Text conditioning of image generation |

Variational Autoencoder (VAE)

Architecture:

Input Image → Encoder → μ, σ (mean, std of latent distribution)
→ Reparameterization: z = μ + σ × ε (ε ~ N(0,1))
→ Decoder → Reconstructed Image
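The reparameterization line is what makes the VAE trainable: sampling z directly from N(μ, σ²) is not differentiable with respect to the encoder outputs, but writing z = μ + σ·ε with ε ~ N(0, I) moves the randomness out of the gradient path. A minimal sketch (encoders conventionally output log σ² for numerical stability):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps, with eps ~ N(0, I).

    Gradients can flow through mu and sigma; only eps is random,
    and it carries no learnable parameters.
    """
    sigma = np.exp(0.5 * log_var)
    eps = rng.normal(size=np.shape(mu))
    return mu + sigma * eps
```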

Model Architecture Comparison

| Architecture | Type | Input Modalities | Key Use Case | Examples |
|---|---|---|---|---|
| ViT | Encoder | Image | Classification, features | ViT-B/16, DINOv2 |
| CLIP | Dual Encoder | Text + Image | Alignment, retrieval | OpenCLIP, SigLIP |
| Stable Diffusion | Diffusion + VAE | Text → Image | Image generation | SD 2.1, SDXL |
| DALL-E | Diffusion/Autoregressive | Text → Image | Image generation | DALL-E 3 |
| LLaVA | Vision-Language | Image + Text → Text | Visual QA, captioning | LLaVA-1.5, LLaVA-NeXT |
| Whisper | Encoder-Decoder | Audio → Text | Speech recognition | Whisper Large-v3 |

Domain 3: Multimodal Data (15%)

Image Preprocessing Pipeline

| Step | Operation | Typical Values | Purpose |
|---|---|---|---|
| 1. Resize | Resize to model size | 224×224 (ViT), 512×512 (SD) | Match expected input |
| 2. Center Crop | Crop center region | Square crop | Remove borders |
| 3. Normalize | Scale pixel values | ImageNet: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225] | Standardize input |
| 4. To Tensor | Convert to tensor | Channel-first (C, H, W) | Framework compatibility |
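Steps 3 and 4 can be sketched directly in NumPy (resizing and cropping would normally be done with PIL or torchvision, so this sketch assumes a 224×224 RGB uint8 input):

```python
import numpy as np

def normalize_to_chw(img_uint8):
    """Scale to [0, 1], standardize with ImageNet stats, convert HWC -> CHW."""
    x = img_uint8.astype(np.float32) / 255.0
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
    x = (x - mean) / std                  # broadcasts over the channel axis
    return x.transpose(2, 0, 1)           # (H, W, C) -> (C, H, W)
```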

Data Augmentation Safety Guide

| Augmentation | Safe for Multimodal? | Risk |
|---|---|---|
| Horizontal Flip | Usually safe | Breaks "left/right" descriptions |
| Random Crop | Safe if subject visible | May crop out described objects |
| Color Jitter (mild) | Usually safe | Breaks color-specific descriptions |
| Color Jitter (strong) | Unsafe | "Red car" becomes blue |
| Rotation (small) | Safe | Breaks orientation descriptions |
| Vertical Flip | Usually unsafe | Breaks gravity, spatial relations |
| Cutout/Erasing | Risky | May remove described objects |

Audio Data Representations

| Format | Description | Dimensions | Used By |
|---|---|---|---|
| Waveform | Raw amplitude over time | 1D signal | WaveNet, SoundStream |
| Spectrogram | Time × Frequency | 2D (like image) | General audio models |
| Mel Spectrogram | Perceptual frequency scale | 2D (like image) | Whisper, audio transformers |
| MFCC | Compact spectral features | Feature vectors | Traditional speech |

Dataset Quality Checklist


Domain 4: Software Development (15%)

NVIDIA Tools Reference

| Tool | What It Does | When to Use It |
|---|---|---|
| NeMo | Full framework for building and training AI models | Custom multimodal model development |
| Picasso | Cloud service for visual content generation | Enterprise image/video generation |
| NIM | Pre-optimized model deployment microservices | Quick production deployment |
| Triton | High-performance inference server | Multi-model serving at scale |
| TensorRT | Inference optimization engine | Maximum GPU inference speed |

Hugging Face Diffusers Key API

# Load pipeline
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16
).to("cuda")

# Change scheduler
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

# Generate image
image = pipe(
    prompt="a castle on a hill, fantasy art",
    negative_prompt="blurry, low quality",
    num_inference_steps=30,
    guidance_scale=7.5,
    generator=torch.Generator("cuda").manual_seed(42)
).images[0]

# Save
image.save("output.png")

CLIP Usage Pattern

from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Zero-shot classification (image: a PIL.Image loaded beforehand)
inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=image,
    return_tensors="pt"
)
outputs = model(**inputs)
similarity = outputs.logits_per_image  # shape: [1, 2]
predicted_class = similarity.argmax()  # 0 = cat, 1 = dog

Tool Selection Decision Tree

Need to TRAIN a custom model?         → NVIDIA NeMo
Need enterprise image GENERATION?     → NVIDIA Picasso
Need to DEPLOY a model quickly?       → NVIDIA NIM
Need high-throughput MODEL SERVING?   → Triton Inference Server
Need to OPTIMIZE inference speed?     → TensorRT
Need to PROTOTYPE with diffusion?     → Hugging Face Diffusers
Need to EVALUATE text-image match?    → CLIP (via Transformers)

Master These Concepts with Practice

Our NCA-GENM practice bundle includes:

  • 7 full practice exams (455+ questions)
  • Detailed explanations for every answer
  • Domain-by-domain performance tracking

30-day money-back guarantee

Domain 5: Data Analysis and Visualization (10%)

Visualization Techniques

| Technique | What It Shows | How to Create | Interpretation |
|---|---|---|---|
| Attention Maps | Where model looks in image | Extract attention weights, overlay on image | Bright regions = high attention |
| Grad-CAM | Class-relevant image regions | Gradient-weighted activation maps | Highlights decision-relevant areas |
| t-SNE | Embedding clusters in 2D | Apply t-SNE to CLIP embeddings | Similar items cluster together |
| UMAP | Embedding structure in 2D | Faster alternative to t-SNE | Preserves global structure better |
| Training Curves | Loss over time | Plot loss vs steps/epochs | Should decrease smoothly |
| FID over Training | Generation quality over time | Compute FID periodically | Should decrease, then plateau |

Training Curve Diagnosis

| Symptom | Diagnosis | Fix |
|---|---|---|
| Training loss drops, val loss rises | Overfitting | More data, regularization, early stopping |
| Both losses remain high | Underfitting | More capacity, longer training, tuned learning rate |
| Loss spikes or NaN | Divergence | Lower learning rate, gradient clipping |
| Loss plateaus very early | Learning rate too low | Increase LR, use warmup schedule |
| FID stops improving | Training saturated | Stop training, adjust architecture |

Embedding Space Quality Indicators

Good CLIP embedding space (t-SNE visualization): semantically related items form tight, well-separated clusters, with paired text and image points close together.

Poor embedding space: clusters overlap or collapse into one diffuse blob, and matched text-image pairs land far apart.


Domain 6: Performance Optimization (10%)

Quantization Reference

| Precision | Bytes/Param | Memory (1B params) | Speed vs FP32 | Quality Loss |
|---|---|---|---|---|
| FP32 | 4 | 4 GB | 1× (baseline) | None |
| FP16 | 2 | 2 GB | ~2× | Negligible |
| BF16 | 2 | 2 GB | ~2× | Negligible (better for training) |
| INT8 | 1 | 1 GB | ~3-4× | Minimal with calibration |
| INT4 | 0.5 | 0.5 GB | ~4-6× | Noticeable, needs calibration |
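The INT8 row can be sketched as symmetric per-tensor quantization: one scale maps the largest weight magnitude to 127, and each weight is stored as a single signed byte (real toolchains such as TensorRT choose the scale more carefully via calibration data):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: 4x smaller than FP32."""
    scale = np.abs(w).max() / 127.0                         # one scale per tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP32 values; error is at most scale / 2."""
    return q.astype(np.float32) * scale
```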

Diffusion Speed Optimization

| Technique | Speedup | Quality Impact | Difficulty |
|---|---|---|---|
| FP16 inference | ~2× | Negligible | Easy |
| Fewer steps (50→25) | ~2× | Mild | Easy |
| Fast scheduler (DPM++) | ~2-3× fewer steps | Minimal | Easy |
| TensorRT compilation | ~2-3× | None | Medium |
| Step distillation | ~4-8× | Requires distilled model | Hard |
| Consistency models | 1-4 steps total | Slight quality trade-off | Hard |

Serving Optimization

| Strategy | What It Does | When to Use |
|---|---|---|
| Dynamic Batching | Groups incoming requests | Multiple concurrent users |
| Model Caching | Keeps model in GPU memory | Repeated model use |
| Request Queuing | Manages request overflow | High traffic |
| Horizontal Scaling | Multiple GPU instances | Exceeding single-GPU capacity |
| Async Processing | Non-blocking generation | Long-running tasks (image gen) |

Domain 7: Trustworthy AI (5%)

Key Concepts

| Concept | Definition | Example |
|---|---|---|
| Visual Bias | Model generates stereotypical images | "CEO" generates only white male images |
| Content Safety | Filtering harmful generated content | NSFW classifier before output delivery |
| Watermarking | Invisible markers in generated images | Proves AI origin for provenance |
| Deepfakes | Realistic fake face/video generation | Used for misinformation, fraud |
| Data Provenance | Tracking training data sources | Important for IP compliance |
| Consent | Permission for data usage | Especially for face images |

Bias Detection Methods

  1. Prompt diversity testing: Generate images for the same role/concept across demographics
  2. Demographic analysis: Measure representation in generated outputs
  3. Comparative evaluation: Compare outputs for "doctor" vs "nurse" for gender distribution
  4. Red teaming: Deliberately test edge cases and sensitive prompts
  5. Automated classifiers: Use attribute classifiers to measure output distributions
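Method 5 can be sketched as a simple distribution check: run an attribute classifier over a batch of generated images, then compare each group's share (function and label names are illustrative):

```python
from collections import Counter

def group_shares(predicted_attributes):
    """Share of each attribute label among generated images.

    A heavily skewed distribution for a neutral prompt such as "a CEO"
    is a signal of visual bias worth investigating.
    """
    counts = Counter(predicted_attributes)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}
```

Compare the returned shares against a reference distribution, and flag prompts whose outputs deviate strongly.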

Content Safety Pipeline

User Prompt → Prompt Safety Check
    ├─ [If unsafe] → Block + Notify User
    └─ [If safe]   → Generate Image → Output Safety Check
                         ├─ [If unsafe] → Block + Log
                         └─ [If safe]   → Deliver

Watermarking Methods

| Method | Visibility | Robustness | Use Case |
|---|---|---|---|
| Visible watermark | User sees it | Easy to remove | Previews, demos |
| Invisible watermark | Imperceptible | Survives crops, compression | Production, provenance |
| Metadata embedding | In file metadata | Easy to strip | Basic tracking |
| Spectral watermark | In frequency domain | High robustness | Research, verification |

Quick Reference: Numbers to Remember

| Fact | Value |
|---|---|
| ViT-Base patch size | 16×16 pixels |
| ViT-Base input size | 224×224 pixels |
| ViT-Base patches | 196 (14×14 grid) |
| CLIP training data | 400M text-image pairs |
| CLIP embedding dim | 512 |
| Stable Diffusion latent size | 64×64 (from 512×512 input) |
| Typical guidance scale | 7.5 |
| Typical inference steps | 20-50 |
| FP16 memory savings | 2× over FP32 |
| INT8 memory savings | 4× over FP32 |
| NCA-GENM exam time | 60 minutes |
| NCA-GENM questions | 50-60 |
| NCA-GENM cost | $125 |
| NCA-GENM validity | 2 years |

Exam Day Quick Tips

  1. Largest domain is Experimentation (25%) — know metrics and hyperparameters cold
  2. CLIP appears across multiple domains — architecture, evaluation, visualization
  3. Diffusion models are the core generative technology — understand forward/reverse process
  4. ViT treats images as sequences of patches — not pixels, not arbitrary regions
  5. Cross-attention fuses text and image — text provides K/V, image provides Q
  6. FID = quality + diversity (lower better) vs CLIP Score = alignment (higher better)
  7. Guidance scale 7-8 is the default — 1.0 means no guidance, 15+ causes artifacts
  8. LoRA for style, DreamBooth for subjects, Textual Inversion for lightweight concepts
  9. FP16 is almost always a free speedup — negligible quality loss
  10. Never leave a question blank — no penalty for guessing


Ready to Pass the NCA-GENM Exam?

Join thousands who passed with Preporato practice tests

Instant access | 30-day guarantee | Updated monthly
