Fine-Tuning accounts for 13% of the NCP-GENL exam, contributing approximately 8-9 questions. The domain tests your ability to select the right fine-tuning approach for a given scenario, configure parameter-efficient methods correctly, and prevent common failures like catastrophic forgetting. Unlike the NCA-GENL Associate exam, which asks "what is LoRA?", the Professional exam asks "given this model size, dataset, hardware budget, and domain complexity, configure LoRA with the right rank, alpha, target modules, and training schedule."
This guide covers the fine-tuning techniques, configuration decisions, and trade-offs tested on the NCP-GENL exam.
Navigation
This article covers the Fine-Tuning domain (13%). For related NCP-GENL topics:
For NCP-AAI candidates: our Fine-Tuning for Agentic AI article covers similar PEFT concepts from an agentic AI perspective.
Full Fine-Tuning vs Parameter-Efficient Fine-Tuning
The exam expects you to know when each approach is appropriate. The decision is not always "use PEFT" — full fine-tuning still has its place.
Full Fine-Tuning
Updates all model parameters during training. Every weight in the model is modified.
Memory requirement: The full model must fit in GPU memory along with gradients and optimizer states.
| Model Size | FP16 Weights | + Gradients (FP16) | + Adam Optimizer (FP32) | Total Training Memory |
|---|---|---|---|---|
| 7B | 14 GB | 14 GB | 56 GB | ~84 GB + activations |
| 13B | 26 GB | 26 GB | 104 GB | ~156 GB + activations |
| 70B | 140 GB | 140 GB | 560 GB | ~840 GB + activations |
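The table above follows from simple per-parameter arithmetic. A minimal sketch (a helper of my own, not an official NVIDIA tool) that reproduces the totals under the same assumptions, FP16 weights and gradients plus FP32 Adam moment estimates:

```python
def full_ft_memory_gb(n_params_billion):
    """FP16 weights (2 bytes/param) + FP16 gradients (2) + FP32 Adam
    moment estimates (4 + 4) = 12 bytes per parameter, before activations."""
    return n_params_billion * 12  # billions of params x bytes/param = GB

# 7B -> 84 GB, 13B -> 156 GB, 70B -> 840 GB, matching the table
sizes = {b: full_ft_memory_gb(b) for b in (7, 13, 70)}
```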
When to use full fine-tuning:
Small models (under 7B parameters) where you have sufficient GPU resources
Large, diverse training datasets (100K+ examples) that justify updating all parameters
Maximum quality is required and you accept the compute cost
The domain shift from pre-training data is very large (e.g., adapting a general natural-language model to code generation)
When full fine-tuning is impractical:
Models larger than 13B without access to multi-GPU clusters
Limited training data (under 10K examples) where full fine-tuning risks overfitting
When you need to serve multiple fine-tuned variants from the same base model
Parameter-Efficient Fine-Tuning (PEFT)
PEFT methods update only a small subset of parameters (typically 0.1-2% of total), freezing the rest. This dramatically reduces memory requirements and training time.
Full Fine-Tuning vs PEFT
| Factor | Full Fine-Tuning | PEFT (LoRA/QLoRA) |
|---|---|---|
| Parameters Updated | 100% | 0.1-2% |
| Memory (70B model) | ~840 GB + activations | ~35-160 GB depending on method |
| Training Speed | Slower (all gradients computed) | 2-10x faster |
| Risk of Overfitting | Lower with large datasets | Lower with small datasets |
| Quality vs Base | Can exceed or degrade significantly | Typically 90-98% of full FT quality |
| Multi-Model Serving | Each variant = full model copy | Swap small adapters on single base model |
| Catastrophic Forgetting Risk | Higher (all weights modified) | Lower (base weights frozen) |
LoRA: Low-Rank Adaptation
LoRA is the most important PEFT method for the NCP-GENL exam. It works by decomposing weight update matrices into low-rank factors, dramatically reducing the number of trainable parameters.
How LoRA Works
Instead of updating a weight matrix W directly, LoRA adds a low-rank decomposition:
W_new = W_frozen + (B x A)
Where W is the original frozen weight matrix (d x k), A is the down-projection (r x k), B is the up-projection (d x r), and r is the rank (much smaller than d or k).
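To make the decomposition concrete, here is a minimal numpy sketch (illustrative dimensions, not exam material) showing that adding the low-rank path B(Ax) is mathematically equivalent to using the merged weight W + BA:

```python
import numpy as np

# Illustrative dimensions; real LLM layers use d = k = 4096 or larger.
d, k, r = 64, 64, 8
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k))          # frozen pretrained weight
A = rng.standard_normal((r, k)) * 0.01   # down-projection (trainable)
B = rng.standard_normal((d, r)) * 0.01   # up-projection (real LoRA zero-inits B)
x = rng.standard_normal(k)

# LoRA forward pass: base path plus the low-rank update path
y = W @ x + B @ (A @ x)

# Equivalent to the merged weight, materialized once at deployment
y_merged = (W + B @ A) @ x
assert np.allclose(y, y_merged)

# Trainable parameters: r * (d + k) for LoRA vs d * k for full fine-tuning
lora_params, full_params = r * (d + k), d * k
```

Merging B @ A into W at deployment time means LoRA adds zero inference latency.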
LoRA Rank Selection
Rank is the most critical LoRA hyperparameter. The exam tests your ability to select rank based on task complexity:
| Rank (r) | Trainable Params (8B model, 4 modules) | Best For | Notes / Risk |
|---|---|---|---|
| r=4 | ~4.2M (0.05%) | Simple style/tone changes | Under-capacity for complex tasks |
| r=8 | ~8.4M (0.1%) | Light domain adaptation | Good default for simple tasks |
| r=16 | ~16.8M (0.21%) | Standard domain adaptation | Recommended default |
| r=32 | ~33.6M (0.42%) | Complex domain adaptation, multi-task | Higher memory, potential overfitting on small data |
| r=64 | ~67.2M (0.84%) | Near full fine-tuning expressiveness | Diminishing returns vs compute cost |
Alpha Scaling:
The alpha parameter controls the magnitude of the LoRA update: effective_update = (alpha / r) x (B x A). Common convention: set alpha = 2 x r (e.g., alpha=32 for r=16). This means the effective learning rate for LoRA is scaled by alpha/r.
Exam Trap: Rank vs Alpha
A common exam question presents a LoRA adapter that is underperforming on complex domain adaptation with r=8. The correct fix is to increase rank to r=32 (giving the adapter more capacity), not to increase alpha (which just scales the update magnitude, risking instability) or increase training epochs (which risks overfitting without adding capacity). Conversely, if a high-rank adapter overfits on small data, reduce rank for regularization.
Target Module Selection
Which weight matrices to apply LoRA to affects both quality and parameter count:
| Target Modules | Modules | Params (8B, r=16) | Quality | When to Use |
|---|---|---|---|---|
| q_proj, v_proj | 2 | ~8.4M | Good | Budget-constrained, simple tasks |
| q_proj, k_proj, v_proj, o_proj | 4 | ~16.8M | Better | Recommended default |
| All attention + MLP | 7 | ~29.4M | Best | Complex domain adaptation |
Exam insight: Applying LoRA to MLP layers (gate_proj, up_proj, down_proj) in addition to attention provides 1.75x more parameters at the same rank, which can significantly improve quality for complex tasks without increasing rank.
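The rank, alpha, and target-module choices above map directly onto configuration fields. A hedged sketch using the Hugging Face peft library's LoraConfig (NeMo exposes equivalent knobs in its own config; module names below follow Llama-style checkpoints):

```python
from peft import LoraConfig

# Sketch of a LoRA configuration for complex domain adaptation.
config = LoraConfig(
    r=16,                      # rank: capacity of the low-rank update
    lora_alpha=32,             # alpha = 2 x r convention from above
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP (extra capacity)
    ],
    lora_dropout=0.05,         # light regularization
    task_type="CAUSAL_LM",
)
```

For budget-constrained runs, dropping the MLP entries and keeping only q_proj and v_proj mirrors the first row of the table.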
LoRA Parameter Efficiency: Worked Example
For a single 4096 x 4096 weight matrix fine-tuned with LoRA at r=8:
Full fine-tune: 4096 x 4096 = 16.78M trainable parameters
LoRA: 2 x 4096 x 8 = 0.07M trainable parameters
Parameter reduction: 99.6%
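The same arithmetic as a reusable helper (a hypothetical function for illustration, not part of any library):

```python
def lora_reduction(d, k, r):
    """Trainable parameters for one d x k weight matrix: full vs LoRA."""
    full = d * k            # every entry of W is updated
    lora = r * (d + k)      # A is r x k, B is d x r
    reduction_pct = 100 * (1 - lora / full)
    return full, lora, reduction_pct

# 4096 x 4096 matrix at r=8: 16.78M -> 0.07M, a 99.6% reduction
full, lora, pct = lora_reduction(4096, 4096, 8)
```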
QLoRA: Quantized LoRA
QLoRA combines LoRA with 4-bit quantization of the base model, enabling fine-tuning of very large models on limited hardware.
How QLoRA Works
Base model quantized to 4-bit NF4: The frozen base model is stored in 4-bit NormalFloat format (0.5 bytes per parameter instead of 2 bytes for BF16)
LoRA adapters in BF16: The trainable LoRA matrices remain in full BF16 precision for training stability
Double quantization: The quantization constants themselves are quantized, saving an additional ~0.37 bits per parameter
Paged optimizers: Optimizer states use unified memory to spill to CPU when GPU memory is full
QLoRA Memory vs Standard LoRA
Memory_QLoRA = P_base x 0.5 + P_LoRA x (2 + 2 + 8) + QO
where P_base is the number of base-model parameters (0.5 bytes each in NF4), P_LoRA is the number of adapter parameters (2 bytes for BF16 weights + 2 bytes for gradients + 8 bytes for FP32 Adam moment estimates), and QO is the quantization overhead (scaling constants and double-quantization metadata).
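The formula can be sketched as a small estimator (a helper of my own; the 200M adapter-parameter figure is an assumed r=32 attention+MLP configuration, and the 1 GB overhead allowance is a rough stand-in for QO):

```python
def qlora_memory_gb(base_params_b, lora_params_m, overhead_gb=1.0):
    """Estimate QLoRA training memory in GB, excluding activations.
    NF4 base: 0.5 bytes/param. Adapters: 2 (BF16 weights) + 2 (gradients)
    + 8 (FP32 Adam m and v) = 12 bytes/param, plus quantization overhead."""
    base = base_params_b * 0.5            # billions of params x bytes = GB
    adapters = lora_params_m * 12 / 1000  # millions of params x bytes = GB
    return base + adapters + overhead_gb

# 70B base with ~200M adapter params: 35 + 2.4 + 1 = ~38.4 GB,
# which fits the single 48 GB A6000 scenario discussed below.
estimate = qlora_memory_gb(70, 200)
```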
When to Use QLoRA vs LoRA vs Full Fine-Tuning
Use QLoRA when you have limited GPU memory
QLoRA is the correct choice when your GPU hardware cannot hold the base model in BF16. Typical scenario: fine-tuning a 70B model on a single A6000 48GB GPU or 1-2 A100 40GB GPUs. QLoRA fits because it stores the base model at 0.5 bytes per parameter instead of 2 bytes. The accuracy trade-off from NF4 quantization of the base model is typically <1% when combined with a reasonable LoRA rank (r=16-32).
Use standard LoRA when you have sufficient GPU memory
Use full fine-tuning when you need maximum quality and have the compute
Special case: LoRA for multi-tenant serving
Other PEFT Methods
The exam tests awareness of PEFT methods beyond LoRA, though LoRA/QLoRA questions dominate.
Prefix Tuning
Prepends learnable "virtual tokens" (prefix vectors) to the input at each transformer layer. The model processes these virtual tokens alongside the real input, learning task-specific conditioning.
Trainable parameters: prefix_length x hidden_dim x n_layers. For prefix length 20, hidden dim 4096, 32 layers: 20 x 4096 x 32 = 2.6M parameters.
When to use: Sequence-to-sequence tasks, translation, summarization. Less effective than LoRA for instruction following and chat.
Adapter Layers
Inserts small bottleneck layers (down-projection, nonlinearity, up-projection) between existing transformer layers. Each adapter typically has a hidden dimension of 64-256.
Trainable parameters: 2 x adapter_dim x hidden_dim x n_adapters x n_layers. For adapter_dim=64, hidden_dim=4096, 2 adapters per layer, 32 layers: 2 x 64 x 4096 x 2 x 32 = 33.6M parameters.
When to use: Tasks requiring additional representational capacity beyond what LoRA provides. Adapters add new parameters rather than modifying existing weight matrices.
IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations)
Learns per-element scaling vectors for keys, values, and feed-forward activations. Even fewer parameters than LoRA — scales rather than adds.
Trainable parameters: (d_k + d_v + d_ff) x n_layers. Dramatically smaller than LoRA but less expressive.
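The parameter formulas quoted for prefix tuning and adapters can be checked with a few lines (helper names are my own, for illustration):

```python
def prefix_tuning_params(prefix_len, hidden_dim, n_layers):
    """Learnable prefix vectors injected at every transformer layer."""
    return prefix_len * hidden_dim * n_layers

def adapter_params(adapter_dim, hidden_dim, adapters_per_layer, n_layers):
    """Bottleneck adapters: down- and up-projection (biases ignored)."""
    return 2 * adapter_dim * hidden_dim * adapters_per_layer * n_layers

# Prefix length 20, hidden 4096, 32 layers -> ~2.6M parameters
prefix = prefix_tuning_params(20, 4096, 32)
# adapter_dim 64, 2 adapters/layer, 32 layers -> ~33.6M parameters
adapters = adapter_params(64, 4096, 2, 32)
```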
PEFT Methods Comparison
| Method | Params (8B model) | Quality | Memory Overhead | Exam Focus |
|---|---|---|---|---|
| LoRA (r=16) | 16.8M (0.21%) | High | Low | Primary focus |
| QLoRA (r=16) | 16.8M (0.21%) | High (NF4 base) | Very low | Primary focus |
| Prefix Tuning | 2.6M (0.03%) | Moderate | Very low | Awareness |
| Adapter Layers | 33.6M (0.42%) | High | Low | Awareness |
| IA3 | ~0.6M (0.008%) | Lower | Minimal | Awareness |
Fine-Tuning with NVIDIA NeMo Framework
NVIDIA NeMo is the primary framework for fine-tuning LLMs on NVIDIA hardware. The exam tests NeMo-specific configuration and workflow.
NeMo Fine-Tuning Pipeline
1. Prepare dataset (JSONL format)
→ {"input": "instruction", "output": "response"}
2. Load base model (.nemo checkpoint)
→ Llama, Mistral, Nemotron, etc.
3. Configure PEFT method
→ LoRA rank, alpha, target modules
4. Configure training
→ Learning rate, batch size, epochs, scheduler
5. Train with NeMo Launcher
→ Single-GPU or distributed (Megatron-LM backend)
6. Export adapter
→ Small .nemo file (50-200MB)
7. Deploy
→ Merge adapter into base or serve separately via Triton/NIM
NeMo expects data in JSONL format for supervised fine-tuning (SFT):
{"input": "Summarize the following research paper abstract:", "output": "The paper presents..."}
{"input": "Translate to French: The weather is nice today.", "output": "Le temps est beau aujourd'hui."}
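A small sketch for writing and sanity-checking this JSONL shape before training (the helper functions are illustrative, not NeMo utilities; exact ingestion options vary by NeMo version):

```python
import json
import os
import tempfile

# The two records from the example above.
records = [
    {"input": "Summarize the following research paper abstract:",
     "output": "The paper presents..."},
    {"input": "Translate to French: The weather is nice today.",
     "output": "Le temps est beau aujourd'hui."},
]

def write_sft_jsonl(path, rows):
    """Write one JSON object per line, the shape NeMo SFT expects."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

def validate_sft_jsonl(path):
    """Fail fast on records missing non-empty input/output fields."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, 1):
            row = json.loads(line)
            if not (row.get("input") and row.get("output")):
                raise ValueError(f"line {n}: missing input/output")
            count += 1
    return count

path = os.path.join(tempfile.gettempdir(), "sft_demo.jsonl")
write_sft_jsonl(path, records)
n_valid = validate_sft_jsonl(path)
```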
Preventing Catastrophic Forgetting
Catastrophic forgetting occurs when fine-tuning causes the model to lose previously learned capabilities. The exam tests multiple prevention strategies.
Prevention Strategies
| Strategy | How It Works | Effectiveness | Exam Frequency |
|---|---|---|---|
| PEFT methods (LoRA/QLoRA) | Freeze base weights, only train adapters | Very high — base capabilities fully preserved | High |
| Data mixing | Include general-purpose data alongside domain data | High | High |
| Low learning rate | Reduce learning rate to minimize weight changes | Moderate | Medium |
| Early stopping | Stop before overfitting on domain data | Moderate | Medium |
| Elastic Weight Consolidation (EWC) | Penalize changes to weights important for prior tasks | High (compute-intensive) | Low |
| Regularization dropout | Increase dropout during fine-tuning | Moderate | Low |
Data Mixing Ratios:
The exam often asks about optimal data mixing ratios for preventing catastrophic forgetting. A commonly cited guideline is to blend roughly 30-40% general-purpose data with 60-70% domain data.
When a question asks "how do you prevent catastrophic forgetting when fine-tuning a 70B model for medical text generation?", the answer often starts with "use LoRA." Because LoRA freezes the base model weights and only trains small adapter matrices, the original capabilities are inherently preserved. Additional data mixing provides an extra safety margin. Full fine-tuning questions about catastrophic forgetting require more explicit prevention (data mixing + low LR + early stopping).
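A minimal data-mixing sketch, assuming a 35% general-purpose ratio (inside the 30-40% range discussed above; the function is illustrative, not an official NeMo utility):

```python
import random

def mix_datasets(domain, general, general_ratio=0.35, seed=0):
    """Blend domain examples with sampled general-purpose examples so
    that general data makes up general_ratio of the final mix."""
    n_general = round(len(domain) * general_ratio / (1 - general_ratio))
    rng = random.Random(seed)
    mixed = domain + rng.sample(general, min(n_general, len(general)))
    rng.shuffle(mixed)
    return mixed

# 650 domain examples + 350 sampled general examples = 35% general mix
domain = [("support", i) for i in range(650)]
general = [("general", i) for i in range(1000)]
mixed = mix_datasets(domain, general)
```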
Fine-Tuning Data Preparation
This topic sits at the overlap between the Fine-Tuning domain (13%) and the Data Preparation domain (9%). The quality of training data directly determines fine-tuning success.
Data Quality Checklist
Diversity: Cover the full range of expected inputs and outputs
Accuracy: All output labels/responses must be correct
Consistency: Same format and style across examples
Volume: 1K-10K examples for LoRA, 10K-100K+ for full fine-tuning
Deduplication: Remove duplicate or near-duplicate examples
Length distribution: Match expected production input/output lengths
Edge cases: Include boundary conditions and unusual inputs
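The deduplication item in the checklist can be sketched as a cheap first pass (a hypothetical helper; true near-duplicate detection needs MinHash or embedding similarity):

```python
import hashlib

def dedupe_exact(examples):
    """Drop exact duplicates after light normalization
    (lowercasing and whitespace collapsing)."""
    seen, kept = set(), []
    for ex in examples:
        text = " ".join((ex["input"] + " " + ex["output"]).lower().split())
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept

examples = [
    {"input": "Reset my password", "output": "Go to Settings > Security."},
    {"input": "reset  my password", "output": "go to settings > security."},
    {"input": "Cancel my order", "output": "Open Orders and select Cancel."},
]
deduped = dedupe_exact(examples)  # first two records collapse into one
```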
Training Data Size Guidelines
| Fine-Tuning Method | Minimum Examples | Optimal Range | Diminishing Returns |
|---|---|---|---|
| LoRA (r=8) | 500 | 1K-5K | >10K for simple tasks |
| LoRA (r=16-32) | 1K | 5K-20K | >50K |
| QLoRA (r=16-32) | 1K | 5K-20K | >50K |
| Full Fine-Tuning | 10K | 50K-500K | Model and task dependent |
| Instruction Tuning | 5K | 10K-100K | >500K |
RLHF and DPO: Alignment Fine-Tuning
The exam tests awareness of alignment techniques at a conceptual level, not implementation depth.
Reinforcement Learning from Human Feedback (RLHF)
RLHF aligns model outputs with human preferences using a three-stage pipeline:
Supervised Fine-Tuning (SFT): Fine-tune the base model on high-quality instruction-response pairs
Reward Model Training: Train a separate model to score outputs based on human preference rankings
PPO Optimization: Use proximal policy optimization to fine-tune the SFT model, maximizing the reward model's score while staying close to the SFT model (KL divergence penalty)
Direct Preference Optimization (DPO)
DPO simplifies RLHF by eliminating the reward model and PPO stages. It directly optimizes the model using preference pairs (chosen vs rejected responses).
DPO advantages over RLHF:
No separate reward model needed (reduces memory and complexity)
More stable training (no PPO hyperparameter tuning)
Comparable alignment quality for most use cases
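The DPO objective itself fits in a few lines. A sketch of the per-pair loss (my own helper; the inputs are summed log-probabilities of each response under the policy and the frozen reference model):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - ref margin)).
    beta limits how far the policy may drift from the reference (SFT) model."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# At initialization the policy equals the reference, so the margin is 0
# and the loss is log(2); favoring the chosen response lowers the loss.
loss_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)
loss_better = dpo_loss(-9.0, -12.0, -10.0, -12.0)
```

Note how the reference log-probabilities play the role of RLHF's KL penalty: the model is rewarded for widening the chosen-vs-rejected margin relative to the reference, not in absolute terms.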
RLHF vs DPO
| Aspect | RLHF | DPO |
|---|---|---|
| Training Stages | 3 (SFT + Reward Model + PPO) | 2 (SFT + DPO) |
| Reward Model | Required (separate model) | Not required |
| Training Stability | Harder to tune (PPO sensitive) | More stable (direct optimization) |
| Compute Cost | Higher (multiple models in memory) | Lower (single model) |
| Quality | Slightly higher ceiling with careful tuning | Comparable for most tasks |
| Exam Focus | Conceptual understanding | When to prefer over RLHF |
Practice Questions
Question 1: You need to fine-tune a 70B model for medical report generation using 8,000 high-quality training examples. You have access to a single A6000 48GB GPU. Which approach is most appropriate?
A) Full fine-tuning with gradient checkpointing
B) LoRA with r=16 targeting all attention modules
C) QLoRA with r=32, NF4 quantization, targeting attention and MLP modules
D) Prefix tuning with prefix length 50
Answer: C. The 70B model in BF16 requires 140GB, far exceeding the 48GB GPU even with gradient checkpointing (A is impossible). Standard LoRA (B) still needs the base model in BF16 (140GB) — does not fit. QLoRA (C) stores the base model in NF4 (35GB), fitting on the 48GB GPU with room for LoRA adapters and activations. r=32 with attention+MLP targeting provides sufficient capacity for medical domain adaptation. Prefix tuning (D) could fit but provides significantly lower quality for this type of task.
Question 2: After fine-tuning a 13B model with LoRA (r=8) on 3,000 customer support examples, the model performs well on support queries but has significantly degraded general reasoning capabilities. What is the most likely cause and fix?
A) LoRA rank is too high — reduce to r=4
B) The base model was corrupted during LoRA training — retrain from checkpoint
C) The training data caused catastrophic forgetting — add general-purpose data mixing at 30-40% ratio
D) LoRA is inherently unable to preserve base capabilities — switch to full fine-tuning
Answer: C. Option A is wrong because r=8 is already low, and reducing rank would hurt domain performance. B is wrong because LoRA freezes base weights, so they cannot be corrupted. D is wrong because LoRA preserves base weights by design. The most likely issue is that the 3,000 examples are narrowly focused, and even though LoRA freezes base weights, the adapter outputs can dominate the model's behavior for general queries. The fix (C) is to include general-purpose examples in the training mix so the adapter does not over-specialize. Note: this is a nuanced scenario — LoRA typically prevents catastrophic forgetting, but extreme domain-specific training data can cause the adapter to steer outputs away from general capabilities.
Question 3: You are serving 5 different LoRA adapters (medical, legal, financial, code, support) from a single 70B base model. What is the total GPU memory requirement compared to serving 5 separate full fine-tuned models?
A) 5x base model + 5x adapter overhead
B) 1x base model + 5x adapter overhead
C) 1x base model + 1x adapter overhead (swap at request time)
D) 5x base model (adapters are merged)
Answer: B. The base model is loaded once (140GB in FP16 for 70B). Each LoRA adapter is small (50-200MB) and can be loaded alongside the base model. All 5 adapters can be in memory simultaneously (~1GB total), with routing logic selecting the appropriate adapter per request. The alternative — 5 separate full fine-tuned models — would require 5 x 140GB = 700GB. LoRA serving reduces this to ~141GB, a 5x memory savings.