
NCP-GENL Fine-Tuning Guide: LoRA, QLoRA & PEFT for Production LLMs

Preporato Team · April 2, 2026 · 17 min read · NCP-GENL

Fine-Tuning accounts for 13% of the NCP-GENL exam, contributing approximately 8-9 questions. The domain tests your ability to select the right fine-tuning approach for a given scenario, configure parameter-efficient methods correctly, and prevent common failures like catastrophic forgetting. Unlike the NCA-GENL Associate exam, which asks "what is LoRA?", the Professional exam asks "given this model size, dataset, hardware budget, and domain complexity, configure LoRA with the right rank, alpha, target modules, and training schedule."

This guide covers the fine-tuning techniques, configuration decisions, and trade-offs tested on the NCP-GENL exam.


Full Fine-Tuning vs Parameter-Efficient Fine-Tuning

The exam expects you to know when each approach is appropriate. The decision is not always "use PEFT" — full fine-tuning still has its place.

Full Fine-Tuning

Updates all model parameters during training. Every weight in the model is modified.

Memory requirement: The full model must fit in GPU memory along with gradients and optimizer states.

| Model Size | FP16 Weights | + Gradients (FP16) | + Adam Optimizer (FP32) | Total Training Memory |
|---|---|---|---|---|
| 7B | 14 GB | 14 GB | 56 GB | ~84 GB + activations |
| 13B | 26 GB | 26 GB | 104 GB | ~156 GB + activations |
| 70B | 140 GB | 140 GB | 560 GB | ~840 GB + activations |
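These totals can be reproduced with a rough bytes-per-parameter estimate. The helper below is an illustrative sketch (FP16 weights and gradients at 2 bytes each, FP32 Adam momentum and variance at 8 bytes combined, activations excluded):

```python
def full_ft_memory_gb(n_params_billion: float) -> dict:
    """Training-memory breakdown for full fine-tuning with Adam.
    1B params at 1 byte/param = 1 GB, so multipliers are bytes/param."""
    weights = n_params_billion * 2    # FP16 weights
    grads = n_params_billion * 2      # FP16 gradients
    optimizer = n_params_billion * 8  # FP32 Adam momentum + variance
    return {"weights": weights, "grads": grads, "optimizer": optimizer,
            "total": weights + grads + optimizer}

print(full_ft_memory_gb(7)["total"])   # 84 GB for a 7B model (+ activations)
print(full_ft_memory_gb(70)["total"])  # 840 GB for a 70B model
```

Real runs also need activation memory, CUDA context, and allocator overhead, which is why the table marks every total as "+ activations".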

When to use full fine-tuning:

  • Small models (under 7B parameters) where you have sufficient GPU resources
  • Large, diverse training datasets (100K+ examples) that justify updating all parameters
  • Maximum quality is required and you accept the compute cost
  • The domain shift from pre-training data is very large (e.g., adapting an English model to code generation)

When full fine-tuning is impractical:

  • Models larger than 13B without access to multi-GPU clusters
  • Limited training data (under 10K examples) where full fine-tuning risks overfitting
  • When you need to serve multiple fine-tuned variants from the same base model

Parameter-Efficient Fine-Tuning (PEFT)

PEFT methods update only a small subset of parameters (typically 0.1-2% of total), freezing the rest. This dramatically reduces memory requirements and training time.

Full Fine-Tuning vs PEFT

| Factor | Full Fine-Tuning | PEFT (LoRA/QLoRA) |
|---|---|---|
| Parameters Updated | 100% | 0.1-2% |
| Memory (70B model) | ~840 GB + activations | ~35-160 GB depending on method |
| Training Speed | Slower (all gradients computed) | 2-10x faster |
| Risk of Overfitting | Lower with large datasets | Lower with small datasets |
| Quality vs Base | Can exceed or degrade significantly | Typically 90-98% of full FT quality |
| Multi-Model Serving | Each variant = full model copy | Swap small adapters on single base model |
| Catastrophic Forgetting Risk | Higher (all weights modified) | Lower (base weights frozen) |


LoRA: Low-Rank Adaptation

LoRA is the most important PEFT method for the NCP-GENL exam. It works by decomposing weight update matrices into low-rank factors, dramatically reducing the number of trainable parameters.

How LoRA Works

Instead of updating a weight matrix W directly, LoRA adds a low-rank decomposition:

W_new = W_frozen + (B x A)

Where W is the original frozen weight matrix (d x k), A is the down-projection (r x k), B is the up-projection (d x r), and r is the rank (much smaller than d or k).
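The update can be demonstrated with toy matrices in plain Python (illustrative only; real implementations operate on framework tensors and can merge B x A into the frozen weight for inference):

```python
def matmul(M, N):
    """Multiply two matrices given as lists of rows."""
    return [[sum(M[i][t] * N[t][j] for t in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

def matadd(M, N):
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(M, N)]

# Frozen weight W (d x k) with d = k = 3; rank r = 1 adapter.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
B = [[0.5], [0.0], [0.0]]   # d x r (up-projection)
A = [[1.0, 1.0, 1.0]]       # r x k (down-projection)

W_new = matadd(W, matmul(B, A))  # W_new = W_frozen + B @ A
print(W_new[0])  # [1.5, 0.5, 0.5]
```

Only A and B are trained, so the trainable parameter count is r x (d + k) per adapted matrix instead of d x k.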

LoRA Rank Selection

Rank is the most critical LoRA hyperparameter. The exam tests your ability to select rank based on task complexity:

| Rank (r) | Trainable Params (8B model, 4 modules) | Best For | Notes / Risk |
|---|---|---|---|
| r=4 | ~4.2M (0.05%) | Simple style/tone changes | Under-capacity for complex tasks |
| r=8 | ~8.4M (0.1%) | Light domain adaptation | Good default for simple tasks |
| r=16 | ~16.8M (0.21%) | Standard domain adaptation | Recommended default |
| r=32 | ~33.6M (0.42%) | Complex domain adaptation, multi-task | Higher memory, potential overfitting on small data |
| r=64 | ~67.2M (0.84%) | Near full fine-tuning expressiveness | Diminishing returns vs compute cost |

Alpha Scaling: The alpha parameter controls the magnitude of the LoRA update: effective_update = (alpha / r) x (B x A). Common convention: set alpha = 2 x r (e.g., alpha=32 for r=16). This means the effective learning rate for LoRA is scaled by alpha/r.
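Both the rank table and the alpha/r convention can be sanity-checked numerically. A sketch assuming an 8B-class geometry (32 layers, 4096 x 4096 attention projections); `lora_trainable_params` is an illustrative helper, not a PEFT-library function:

```python
def lora_trainable_params(d: int, k: int, r: int,
                          n_modules: int, n_layers: int) -> int:
    """Each adapted d x k matrix gains A (r x k) plus B (d x r) parameters."""
    return (r * k + d * r) * n_modules * n_layers

r, alpha = 16, 32
n = lora_trainable_params(d=4096, k=4096, r=r, n_modules=4, n_layers=32)
print(f"{n / 1e6:.1f}M trainable ({n / 8e9:.2%} of 8B)")  # ~16.8M, ~0.21%
print(f"update scale alpha/r = {alpha / r}")              # 2.0 with alpha = 2*r
```

Doubling the rank doubles the trainable parameters; doubling alpha only doubles the update magnitude, which is exactly the distinction the exam trap below hinges on.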

Exam Trap: Rank vs Alpha

A common exam question presents a LoRA adapter that is underperforming on complex domain adaptation with r=8. The correct fix is to increase rank to r=32 (giving the adapter more capacity), not to increase alpha (which just scales the update magnitude, risking instability) or increase training epochs (which risks overfitting without adding capacity). Conversely, if a high-rank adapter overfits on small data, reduce rank for regularization.

Target Module Selection

Which weight matrices to apply LoRA to affects both quality and parameter count:

| Target Modules | Module Count | Params (8B, r=16) | Quality | When to Use |
|---|---|---|---|---|
| q_proj, v_proj | 2 | ~8.4M | Good | Budget-constrained, simple tasks |
| q_proj, k_proj, v_proj, o_proj | 4 | ~16.8M | Better | Recommended default |
| All attention + MLP | 7 | ~29.4M | Best | Complex domain adaptation |

Exam insight: Applying LoRA to MLP layers (gate_proj, up_proj, down_proj) in addition to attention provides 1.75x more parameters at the same rank, which can significantly improve quality for complex tasks without increasing rank.

LoRA Parameter Efficiency: Worked Example

For a single 4096 x 4096 weight matrix, full fine-tuning updates 4096 x 4096 = 16.78M parameters. LoRA with r=8 trains only 2 x 4096 x 8 ≈ 0.07M parameters, a 99.6% reduction.

QLoRA: Quantized LoRA

QLoRA combines LoRA with 4-bit quantization of the base model, enabling fine-tuning of very large models on limited hardware.

How QLoRA Works

  1. Base model quantized to 4-bit NF4: The frozen base model is stored in 4-bit NormalFloat format (0.5 bytes per parameter instead of 2 bytes for BF16)
  2. LoRA adapters in BF16: The trainable LoRA matrices remain in full BF16 precision for training stability
  3. Double quantization: The quantization constants themselves are quantized, saving an additional ~0.37 bits per parameter
  4. Paged optimizers: Optimizer states use unified memory to spill to CPU when GPU memory is full
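Steps 1-3 boil down to bytes per parameter. A minimal sketch (`base_weight_memory_gb` is a hypothetical helper, not a bitsandbytes or NeMo API):

```python
# Approximate frozen-base-model weight memory per storage format.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "nf4": 0.5}

def base_weight_memory_gb(n_params_billion: float, fmt: str = "nf4") -> float:
    """Weight memory only; ignores quantization-constant overhead,
    which double quantization shrinks to roughly 0.127 bits/param."""
    return n_params_billion * BYTES_PER_PARAM[fmt]

print(base_weight_memory_gb(70, "bf16"))  # 140.0 GB -- does not fit a 48 GB GPU
print(base_weight_memory_gb(70, "nf4"))   # 35.0 GB  -- fits, with room for adapters
```

This is the arithmetic behind QLoRA's headline capability: the same 70B base model shrinks from 140 GB to 35 GB, leaving headroom for BF16 adapters, gradients, and activations.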


When to Use QLoRA vs LoRA vs Full Fine-Tuning

QLoRA is the correct choice when your GPU hardware cannot hold the base model in BF16. Typical scenario: fine-tuning a 70B model on a single A6000 48GB GPU or 1-2 A100 40GB GPUs. QLoRA fits because it stores the base model at 0.5 bytes per parameter instead of 2 bytes. The accuracy trade-off from NF4 quantization of the base model is typically <1% when combined with a reasonable LoRA rank (r=16-32).

Other PEFT Methods

The exam tests awareness of PEFT methods beyond LoRA, though LoRA/QLoRA questions dominate.

Prefix Tuning

Prepends learnable "virtual tokens" (prefix vectors) to the input at each transformer layer. The model processes these virtual tokens alongside the real input, learning task-specific conditioning.

Trainable parameters: prefix_length x hidden_dim x n_layers. For prefix length 20, hidden dim 4096, 32 layers: 20 x 4096 x 32 = 2.6M parameters.

When to use: Sequence-to-sequence tasks, translation, summarization. Less effective than LoRA for instruction following and chat.

Adapter Layers

Inserts small bottleneck layers (down-projection, nonlinearity, up-projection) between existing transformer layers. Each adapter typically has a hidden dimension of 64-256.

Trainable parameters: 2 x adapter_dim x hidden_dim x n_adapters x n_layers. For adapter_dim=64, hidden_dim=4096, 2 adapters per layer, 32 layers: 2 x 64 x 4096 x 2 x 32 = 33.6M parameters.

When to use: Tasks requiring additional representational capacity beyond what LoRA provides. Adapters add new parameters rather than modifying existing weight matrices.

IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations)

Learns per-element scaling vectors for keys, values, and feed-forward activations. Even fewer parameters than LoRA — scales rather than adds.

Trainable parameters: (d_k + d_v + d_ff) x n_layers. Dramatically smaller than LoRA but less expressive.
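The parameter formulas for prefix tuning and adapter layers can be checked with the same assumed geometry as earlier (hidden dim 4096, 32 layers); the function names are illustrative:

```python
def prefix_tuning_params(prefix_len: int, hidden: int, n_layers: int) -> int:
    """Learnable prefix vectors injected at every transformer layer."""
    return prefix_len * hidden * n_layers

def adapter_params(adapter_dim: int, hidden: int,
                   per_layer: int, n_layers: int) -> int:
    """Down-projection + up-projection per adapter; biases ignored."""
    return 2 * adapter_dim * hidden * per_layer * n_layers

print(prefix_tuning_params(20, 4096, 32) / 1e6)  # ~2.6M
print(adapter_params(64, 4096, 2, 32) / 1e6)     # ~33.6M
```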

PEFT Methods Comparison

| Method | Params (8B model) | Quality | Memory Overhead | Exam Focus |
|---|---|---|---|---|
| LoRA (r=16) | 16.8M (0.21%) | High | Low | Primary focus |
| QLoRA (r=16) | 16.8M (0.21%) | High (NF4 base) | Very low | Primary focus |
| Prefix Tuning | 2.6M (0.03%) | Moderate | Very low | Awareness |
| Adapter Layers | 33.6M (0.42%) | High | Low | Awareness |
| IA3 | ~0.6M (0.008%) | Lower | Minimal | Awareness |

Fine-Tuning with NVIDIA NeMo Framework

NVIDIA NeMo is the primary framework for fine-tuning LLMs on NVIDIA hardware. The exam tests NeMo-specific configuration and workflow.

NeMo Fine-Tuning Pipeline

1. Prepare dataset (JSONL format)
   → {"input": "instruction", "output": "response"}

2. Load base model (.nemo checkpoint)
   → Llama, Mistral, Nemotron, etc.

3. Configure PEFT method
   → LoRA rank, alpha, target modules

4. Configure training
   → Learning rate, batch size, epochs, scheduler

5. Train with NeMo Launcher
   → Single-GPU or distributed (Megatron-LM backend)

6. Export adapter
   → Small .nemo file (50-200MB)

7. Deploy
   → Merge adapter into base or serve separately via Triton/NIM

NeMo LoRA Configuration

from nemo.collections.nlp.models.language_modeling import MegatronGPTSFTModel

# PEFT configuration
peft_cfg = {
    "peft_scheme": "lora",
    "lora_tuning": {
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
        "adapter_dim": 16,       # rank
        "alpha": 32,             # alpha = 2 * rank
        "adapter_dropout": 0.05,
    }
}

# Training configuration
training_cfg = {
    "trainer": {
        "max_epochs": 3,
        "precision": "bf16-mixed",
        "devices": 1,             # single GPU for LoRA
        "accumulate_grad_batches": 4,
    },
    "model": {
        "learning_rate": 2e-4,    # higher LR safe with LoRA
        "weight_decay": 0.01,
        "warmup_steps": 100,
    }
}

NeMo Data Format

NeMo expects data in JSONL format for supervised fine-tuning (SFT):

{"input": "Summarize the following research paper abstract:", "output": "The paper presents..."}
{"input": "Translate to French: The weather is nice today.", "output": "Le temps est beau aujourd'hui."}

For instruction tuning (chat format):

{"conversations": [{"role": "user", "content": "Explain quantization."}, {"role": "assistant", "content": "Quantization reduces..."}]}
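A small helper for producing the SFT format above (a generic sketch; NeMo's own dataset classes perform stricter validation):

```python
import json

def to_sft_jsonl(examples) -> str:
    """Serialize {'input': ..., 'output': ...} records, one JSON object per line."""
    lines = []
    for ex in examples:
        assert set(ex) >= {"input", "output"}, f"missing keys: {ex}"
        lines.append(json.dumps(ex, ensure_ascii=False))
    return "\n".join(lines) + "\n"

examples = [
    {"input": "Summarize the abstract:", "output": "The paper presents..."},
    {"input": "Translate to French: Hello.", "output": "Bonjour."},
]
jsonl = to_sft_jsonl(examples)
print(jsonl.count("\n"))  # 2 records, one per line
```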

Master These Concepts with Practice

Our NCP-GENL practice bundle includes:

  • 7 full practice exams (455+ questions)
  • Detailed explanations for every answer
  • Domain-by-domain performance tracking

30-day money-back guarantee

Catastrophic Forgetting Prevention

Catastrophic forgetting occurs when fine-tuning causes the model to lose previously learned capabilities. The exam tests multiple prevention strategies.

Prevention Strategies

| Strategy | How It Works | Effectiveness | Exam Frequency |
|---|---|---|---|
| PEFT methods (LoRA/QLoRA) | Freeze base weights, only train adapters | Very high — base capabilities fully preserved | High |
| Data mixing | Include general-purpose data alongside domain data | High | High |
| Low learning rate | Reduce learning rate to minimize weight changes | Moderate | Medium |
| Early stopping | Stop before overfitting on domain data | Moderate | Medium |
| Elastic Weight Consolidation (EWC) | Penalize changes to weights important for prior tasks | High (compute-intensive) | Low |
| Regularization dropout | Increase dropout during fine-tuning | Moderate | Low |

Data Mixing Ratios:

The exam often asks about optimal data mixing ratios for preventing catastrophic forgetting:

| Scenario | Domain Data | General Data | Rationale |
|---|---|---|---|
| Light domain adaptation | 70-80% | 20-30% | Primarily learning new domain, light preservation |
| Heavy domain adaptation | 50-60% | 40-50% | Equal emphasis on new knowledge and preservation |
| Instruction tuning | 100% instruction data | 0% | Format change only, not domain shift |
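The ratios above translate into a simple sampling step. A minimal sketch (`mix_datasets` is a hypothetical helper; production pipelines typically interleave at the batch level with a fixed seed):

```python
import random

def mix_datasets(domain, general, domain_ratio=0.7, seed=0):
    """Build a training set with the requested domain/general split."""
    n_general = round(len(domain) * (1 - domain_ratio) / domain_ratio)
    rng = random.Random(seed)
    mixed = list(domain) + rng.sample(general, min(n_general, len(general)))
    rng.shuffle(mixed)
    return mixed

domain = [f"domain-{i}" for i in range(700)]
general = [f"general-{i}" for i in range(5000)]
mixed = mix_datasets(domain, general, domain_ratio=0.7)
print(len(mixed))  # 1000 examples: 700 domain + 300 general
```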

Exam Pattern: LoRA = Built-in Forgetting Prevention

When a question asks "how do you prevent catastrophic forgetting when fine-tuning a 70B model for medical text generation?", the answer often starts with "use LoRA." Because LoRA freezes the base model weights and only trains small adapter matrices, the original capabilities are inherently preserved. Additional data mixing provides an extra safety margin. Full fine-tuning questions about catastrophic forgetting require more explicit prevention (data mixing + low LR + early stopping).

Fine-Tuning Data Preparation

Fine-Tuning (13%) overlaps with the Data Preparation domain (9%): the quality of your training data directly determines fine-tuning success.

Data Quality Checklist

  1. Diversity: Cover the full range of expected inputs and outputs
  2. Accuracy: All output labels/responses must be correct
  3. Consistency: Same format and style across examples
  4. Volume: 1K-10K examples for LoRA, 10K-100K+ for full fine-tuning
  5. Deduplication: Remove duplicate or near-duplicate examples
  6. Length distribution: Match expected production input/output lengths
  7. Edge cases: Include boundary conditions and unusual inputs
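Item 5, deduplication, can be approximated with exact matching on normalized text (a minimal sketch; near-duplicate detection in practice uses MinHash or embedding similarity):

```python
def dedupe(examples):
    """Drop exact duplicates after whitespace/case normalization."""
    seen, unique = set(), []
    for ex in examples:
        key = " ".join((ex["input"] + " " + ex["output"]).lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique

data = [
    {"input": "Explain LoRA.", "output": "LoRA adds low-rank adapters."},
    {"input": "Explain  LoRA.", "output": "lora adds low-rank adapters."},  # near-identical
    {"input": "Explain QLoRA.", "output": "QLoRA quantizes the base model."},
]
print(len(dedupe(data)))  # 2
```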

Training Data Size Guidelines

| Fine-Tuning Method | Minimum Examples | Optimal Range | Diminishing Returns |
|---|---|---|---|
| LoRA (r=8) | 500 | 1K-5K | >10K for simple tasks |
| LoRA (r=16-32) | 1K | 5K-20K | >50K |
| QLoRA (r=16-32) | 1K | 5K-20K | >50K |
| Full Fine-Tuning | 10K | 50K-500K | Model and task dependent |
| Instruction Tuning | 5K | 10K-100K | >500K |

RLHF and DPO: Alignment Fine-Tuning

The exam tests awareness of alignment techniques at a conceptual level, not implementation depth.

Reinforcement Learning from Human Feedback (RLHF)

RLHF aligns model outputs with human preferences using a three-stage pipeline:

  1. Supervised Fine-Tuning (SFT): Fine-tune the base model on high-quality instruction-response pairs
  2. Reward Model Training: Train a separate model to score outputs based on human preference rankings
  3. PPO Optimization: Use proximal policy optimization to fine-tune the SFT model, maximizing the reward model's score while staying close to the SFT model (KL divergence penalty)

Direct Preference Optimization (DPO)

DPO simplifies RLHF by eliminating the reward model and PPO stages. It directly optimizes the model using preference pairs (chosen vs rejected responses).

DPO advantages over RLHF:

  • No separate reward model needed (reduces memory and complexity)
  • More stable training (no PPO hyperparameter tuning)
  • Comparable alignment quality for most use cases
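The DPO objective for one preference pair can be written out directly. A sketch with scalar sequence log-probabilities; beta (typically ~0.1-0.5) controls how far the policy may drift from the reference model:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * ((logpi_c - logref_c) - (logpi_r - logref_r)))."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1 / (1 + math.exp(-margin)))

# Policy already prefers the chosen response relative to the reference model,
# so the margin is positive and the loss is below log(2).
print(round(dpo_loss(-10.0, -14.0, -12.0, -13.0, beta=0.1), 4))  # 0.5544
```

Note there is no reward model anywhere in the computation: the preference pair and the frozen reference log-probabilities are all that is needed, which is exactly the simplification over RLHF.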

RLHF vs DPO

| Aspect | RLHF | DPO |
|---|---|---|
| Training Stages | 3 (SFT + Reward Model + PPO) | 2 (SFT + DPO) |
| Reward Model | Required (separate model) | Not required |
| Training Stability | Harder to tune (PPO sensitive) | More stable (direct optimization) |
| Compute Cost | Higher (multiple models in memory) | Lower (single model) |
| Quality | Slightly higher ceiling with careful tuning | Comparable for most tasks |
| Exam Focus | Conceptual understanding | When to prefer over RLHF |

Practice Questions

Question 1: You need to fine-tune a 70B model for medical report generation using 8,000 high-quality training examples. You have access to a single A6000 48GB GPU. Which approach is most appropriate?

A) Full fine-tuning with gradient checkpointing
B) LoRA with r=16 targeting all attention modules
C) QLoRA with r=32, NF4 quantization, targeting attention and MLP modules
D) Prefix tuning with prefix length 50

Answer: C. The 70B model in BF16 requires 140GB, far exceeding the 48GB GPU even with gradient checkpointing (A is impossible). Standard LoRA (B) still needs the base model in BF16 (140GB) — does not fit. QLoRA (C) stores the base model in NF4 (35GB), fitting on the 48GB GPU with room for LoRA adapters and activations. r=32 with attention+MLP targeting provides sufficient capacity for medical domain adaptation. Prefix tuning (D) could fit but provides significantly lower quality for this type of task.

Question 2: After fine-tuning a 13B model with LoRA (r=8) on 3,000 customer support examples, the model performs well on support queries but has significantly degraded general reasoning capabilities. What is the most likely cause and fix?

A) LoRA rank is too high — reduce to r=4
B) The base model was corrupted during LoRA training — retrain from checkpoint
C) The training data caused catastrophic forgetting — add general-purpose data mixing at 30-40% ratio
D) LoRA is inherently unable to preserve base capabilities — switch to full fine-tuning

Answer: C. A is wrong because r=8 is already low and reducing rank would hurt domain performance. B is wrong because LoRA freezes base weights, so they cannot be corrupted. D is wrong because LoRA preserves base weights by design. The most likely issue is that the 3,000 examples are narrowly focused: even though LoRA freezes the base weights, the adapter outputs can dominate the model's behavior on general queries. The fix (C) is to include general-purpose examples in the training mix so the adapter does not over-specialize. Note: this is a nuanced scenario. LoRA typically prevents catastrophic forgetting, but extremely narrow domain-specific training data can still cause the adapter to steer outputs away from general capabilities.

Question 3: You are serving 5 different LoRA adapters (medical, legal, financial, code, support) from a single 70B base model. What is the total GPU memory requirement compared to serving 5 separate full fine-tuned models?

A) 5x base model + 5x adapter overhead
B) 1x base model + 5x adapter overhead
C) 1x base model + 1x adapter overhead (swap at request time)
D) 5x base model (adapters are merged)

Answer: B. The base model is loaded once (140GB in FP16 for 70B). Each LoRA adapter is small (50-200MB) and can be loaded alongside the base model. All 5 adapters can be in memory simultaneously (~1GB total), with routing logic selecting the appropriate adapter per request. The alternative — 5 separate full fine-tuned models — would require 5 x 140GB = 700GB. LoRA serving reduces this to ~141GB, a 5x memory savings.

For comprehensive practice across all 10 NCP-GENL domains, try our NCP-GENL practice exams.

Summary: Fine-Tuning Key Takeaways

| Concept | Key Fact for the Exam |
|---|---|
| LoRA | Low-rank adapters. Typical: r=16, alpha=32, target 4 attention modules. 0.1-0.3% trainable params. |
| QLoRA | LoRA + NF4 base model. 4x memory reduction. Enables 70B on single 48GB GPU. |
| Rank selection | r=4 (simple), r=8 (light), r=16 (standard), r=32 (complex), r=64 (max PEFT capacity). |
| Target modules | q,v (minimal) vs full attention (default) vs attention+MLP (maximum quality). |
| Alpha | Set alpha = 2 x rank as default. Effective update scales as alpha/r. |
| Full FT vs PEFT | Full FT: large data + compute. PEFT: limited data/compute, multi-tenant serving. |
| Catastrophic forgetting | LoRA inherently prevents it (frozen base). Data mixing (20-40% general) adds safety. |
| NeMo Framework | NVIDIA's FT framework. JSONL data format. Megatron-LM backend for distributed training. |
| RLHF vs DPO | RLHF: 3-stage (SFT+RM+PPO). DPO: 2-stage (SFT+DPO), simpler, comparable quality. |
| Multi-tenant serving | 1 base model + N adapters. Each adapter ~50-200MB. Dramatic memory savings. |

For the full preparation strategy, see our How to Pass NCP-GENL guide and NCP-GENL Cheat Sheet for quick reference.
