
NCP-GENL Fine-Tuning Guide: LoRA, QLoRA & PEFT for Production LLMs

Preporato Team · April 2, 2026 · 17 min read · NCP-GENL

Fine-Tuning accounts for 13% of the NCP-GENL exam, contributing approximately 8-9 questions. The domain tests your ability to select the right fine-tuning approach for a given scenario, configure parameter-efficient methods correctly, and prevent common failures like catastrophic forgetting. Unlike the NCA-GENL Associate exam, which asks "what is LoRA?", the Professional exam asks "given this model size, dataset, hardware budget, and domain complexity, configure LoRA with the right rank, alpha, target modules, and training schedule."

This guide covers the fine-tuning techniques, configuration decisions, and trade-offs tested on the NCP-GENL exam.


Full Fine-Tuning vs Parameter-Efficient Fine-Tuning

The exam expects you to know when each approach is appropriate. The decision is not always "use PEFT" — full fine-tuning still has its place.

Full Fine-Tuning

Updates all model parameters during training. Every weight in the model is modified.

Memory requirement: The full model must fit in GPU memory along with gradients and optimizer states.

| Model Size | FP16 Weights | + Gradients (FP16) | + Adam Optimizer (FP32) | Total Training Memory |
|---|---|---|---|---|
| 7B | 14 GB | 14 GB | 56 GB | ~84 GB + activations |
| 13B | 26 GB | 26 GB | 104 GB | ~156 GB + activations |
| 70B | 140 GB | 140 GB | 560 GB | ~840 GB + activations |
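These totals can be reproduced with a rough bytes-per-parameter estimate. The helper below is an illustrative sketch (FP16 weights and gradients at 2 bytes each, FP32 Adam momentum and variance at 8 bytes combined, activations excluded):

```python
def full_ft_memory_gb(n_params_billion: float) -> dict:
    """Training-memory breakdown for full fine-tuning with Adam.
    1B params at 1 byte/param = 1 GB, so multipliers are bytes/param."""
    weights = n_params_billion * 2    # FP16 weights
    grads = n_params_billion * 2      # FP16 gradients
    optimizer = n_params_billion * 8  # FP32 Adam momentum + variance
    return {"weights": weights, "grads": grads, "optimizer": optimizer,
            "total": weights + grads + optimizer}

print(full_ft_memory_gb(7)["total"])   # 84 GB for a 7B model (+ activations)
print(full_ft_memory_gb(70)["total"])  # 840 GB for a 70B model
```

Real runs also need activation memory, CUDA context, and allocator overhead, which is why the table marks every total as "+ activations".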

When to use full fine-tuning:

  • Small models (under 7B parameters) where you have sufficient GPU resources
  • Large, diverse training datasets (100K+ examples) that justify updating all parameters
  • Maximum quality is required and you accept the compute cost
  • The domain shift from pre-training data is very large (e.g., adapting an English model to code generation)

When full fine-tuning is impractical:

  • Models larger than 13B without access to multi-GPU clusters
  • Limited training data (under 10K examples) where full fine-tuning risks overfitting
  • When you need to serve multiple fine-tuned variants from the same base model

Parameter-Efficient Fine-Tuning (PEFT)

PEFT methods update only a small subset of parameters (typically 0.1-2% of total), freezing the rest. This dramatically reduces memory requirements and training time.

Full Fine-Tuning vs PEFT

| Factor | Full Fine-Tuning | PEFT (LoRA/QLoRA) |
|---|---|---|
| Parameters Updated | 100% | 0.1-2% |
| Memory (70B model) | ~840 GB + activations | ~35-160 GB depending on method |
| Training Speed | Slower (all gradients computed) | 2-10x faster |
| Risk of Overfitting | Lower with large datasets | Lower with small datasets |
| Quality vs Base | Can exceed or degrade significantly | Typically 90-98% of full FT quality |
| Multi-Model Serving | Each variant = full model copy | Swap small adapters on single base model |
| Catastrophic Forgetting Risk | Higher (all weights modified) | Lower (base weights frozen) |


LoRA: Low-Rank Adaptation

LoRA is the most important PEFT method for the NCP-GENL exam. It works by decomposing weight update matrices into low-rank factors, dramatically reducing the number of trainable parameters.

How LoRA Works

Instead of updating a weight matrix W directly, LoRA adds a low-rank decomposition:

W_new = W_frozen + (B x A)

Where W is the original frozen weight matrix (d x k), A is the down-projection (r x k), B is the up-projection (d x r), and r is the rank (much smaller than d or k).
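The update can be demonstrated with toy matrices in plain Python (illustrative only; real implementations operate on framework tensors and can merge B x A into the frozen weight for inference):

```python
def matmul(M, N):
    """Multiply two matrices given as lists of rows."""
    return [[sum(M[i][t] * N[t][j] for t in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

def matadd(M, N):
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(M, N)]

# Frozen weight W (d x k) with d = k = 3; rank r = 1 adapter.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
B = [[0.5], [0.0], [0.0]]   # d x r (up-projection)
A = [[1.0, 1.0, 1.0]]       # r x k (down-projection)

W_new = matadd(W, matmul(B, A))  # W_new = W_frozen + B @ A
print(W_new[0])  # [1.5, 0.5, 0.5]
```

Only A and B are trained, so the trainable parameter count is r x (d + k) per adapted matrix instead of d x k.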

LoRA Rank Selection

Rank is the most critical LoRA hyperparameter. The exam tests your ability to select rank based on task complexity:

| Rank (r) | Trainable Params (8B model, 4 modules) | Best For | Notes / Risk |
|---|---|---|---|
| r=4 | ~4.2M (0.05%) | Simple style/tone changes | Under-capacity for complex tasks |
| r=8 | ~8.4M (0.1%) | Light domain adaptation | Good default for simple tasks |
| r=16 | ~16.8M (0.21%) | Standard domain adaptation | Recommended default |
| r=32 | ~33.6M (0.42%) | Complex domain adaptation, multi-task | Higher memory, potential overfitting on small data |
| r=64 | ~67.2M (0.84%) | Near full fine-tuning expressiveness | Diminishing returns vs compute cost |

Alpha Scaling: The alpha parameter controls the magnitude of the LoRA update: effective_update = (alpha / r) x (B x A). Common convention: set alpha = 2 x r (e.g., alpha=32 for r=16). This means the effective learning rate for LoRA is scaled by alpha/r.
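Both the rank table and the alpha/r convention can be sanity-checked numerically. A sketch assuming an 8B-class geometry (32 layers, 4096 x 4096 attention projections); `lora_trainable_params` is an illustrative helper, not a PEFT-library function:

```python
def lora_trainable_params(d: int, k: int, r: int,
                          n_modules: int, n_layers: int) -> int:
    """Each adapted d x k matrix gains A (r x k) plus B (d x r) parameters."""
    return (r * k + d * r) * n_modules * n_layers

r, alpha = 16, 32
n = lora_trainable_params(d=4096, k=4096, r=r, n_modules=4, n_layers=32)
print(f"{n / 1e6:.1f}M trainable ({n / 8e9:.2%} of 8B)")  # ~16.8M, ~0.21%
print(f"update scale alpha/r = {alpha / r}")              # 2.0 with alpha = 2*r
```

Doubling the rank doubles the trainable parameters; doubling alpha only doubles the update magnitude, which is exactly the distinction the exam trap below hinges on.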

Exam Trap: Rank vs Alpha

A common exam question presents a LoRA adapter that is underperforming on complex domain adaptation with r=8. The correct fix is to increase rank to r=32 (giving the adapter more capacity), not to increase alpha (which just scales the update magnitude, risking instability) or increase training epochs (which risks overfitting without adding capacity). Conversely, if a high-rank adapter overfits on small data, reduce rank for regularization.

Target Module Selection

Which weight matrices to apply LoRA to affects both quality and parameter count:

| Target Modules | Module Count | Params (8B, r=16) | Quality | When to Use |
|---|---|---|---|---|
| q_proj, v_proj | 2 | ~8.4M | Good | Budget-constrained, simple tasks |
| q_proj, k_proj, v_proj, o_proj | 4 | ~16.8M | Better | Recommended default |
| All attention + MLP | 7 | ~29.4M | Best | Complex domain adaptation |

Exam insight: Applying LoRA to MLP layers (gate_proj, up_proj, down_proj) in addition to attention provides 1.75x more parameters at the same rank, which can significantly improve quality for complex tasks without increasing rank.

LoRA Parameter Efficiency: Worked Example

For a single 4096 x 4096 weight matrix, full fine-tuning updates 4096 x 4096 = 16.78M parameters. LoRA with r=8 trains only 2 x 4096 x 8 ≈ 0.07M parameters, a 99.6% reduction.

QLoRA: Quantized LoRA

QLoRA combines LoRA with 4-bit quantization of the base model, enabling fine-tuning of very large models on limited hardware.

How QLoRA Works

  1. Base model quantized to 4-bit NF4: The frozen base model is stored in 4-bit NormalFloat format (0.5 bytes per parameter instead of 2 bytes for BF16)
  2. LoRA adapters in BF16: The trainable LoRA matrices remain in full BF16 precision for training stability
  3. Double quantization: The quantization constants themselves are quantized, saving an additional ~0.37 bits per parameter
  4. Paged optimizers: Optimizer states use unified memory to spill to CPU when GPU memory is full
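Steps 1-3 boil down to bytes per parameter. A minimal sketch (`base_weight_memory_gb` is a hypothetical helper, not a bitsandbytes or NeMo API):

```python
# Approximate frozen-base-model weight memory per storage format.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "nf4": 0.5}

def base_weight_memory_gb(n_params_billion: float, fmt: str = "nf4") -> float:
    """Weight memory only; ignores quantization-constant overhead,
    which double quantization shrinks to roughly 0.127 bits/param."""
    return n_params_billion * BYTES_PER_PARAM[fmt]

print(base_weight_memory_gb(70, "bf16"))  # 140.0 GB -- does not fit a 48 GB GPU
print(base_weight_memory_gb(70, "nf4"))   # 35.0 GB  -- fits, with room for adapters
```

This is the arithmetic behind QLoRA's headline capability: the same 70B base model shrinks from 140 GB to 35 GB, leaving headroom for BF16 adapters, gradients, and activations.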


When to Use QLoRA vs LoRA vs Full Fine-Tuning

QLoRA is the correct choice when your GPU hardware cannot hold the base model in BF16. Typical scenario: fine-tuning a 70B model on a single A6000 48GB GPU or 1-2 A100 40GB GPUs. QLoRA fits because it stores the base model at 0.5 bytes per parameter instead of 2 bytes. The accuracy trade-off from NF4 quantization of the base model is typically <1% when combined with a reasonable LoRA rank (r=16-32).

Other PEFT Methods

The exam tests awareness of PEFT methods beyond LoRA, though LoRA/QLoRA questions dominate.

Prefix Tuning

Prepends learnable "virtual tokens" (prefix vectors) to the input at each transformer layer. The model processes these virtual tokens alongside the real input, learning task-specific conditioning.

Trainable parameters: prefix_length x hidden_dim x n_layers. For prefix length 20, hidden dim 4096, 32 layers: 20 x 4096 x 32 = 2.6M parameters.

When to use: Sequence-to-sequence tasks, translation, summarization. Less effective than LoRA for instruction following and chat.

Adapter Layers

Inserts small bottleneck layers (down-projection, nonlinearity, up-projection) between existing transformer layers. Each adapter typically has a hidden dimension of 64-256.

Trainable parameters: 2 x adapter_dim x hidden_dim x n_adapters x n_layers. For adapter_dim=64, hidden_dim=4096, 2 adapters per layer, 32 layers: 2 x 64 x 4096 x 2 x 32 = 33.6M parameters.

When to use: Tasks requiring additional representational capacity beyond what LoRA provides. Adapters add new parameters rather than modifying existing weight matrices.

IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations)

Learns per-element scaling vectors for keys, values, and feed-forward activations. Even fewer parameters than LoRA — scales rather than adds.

Trainable parameters: (d_k + d_v + d_ff) x n_layers. Dramatically smaller than LoRA but less expressive.
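The parameter formulas for prefix tuning and adapter layers can be checked with the same assumed geometry as earlier (hidden dim 4096, 32 layers); the function names are illustrative:

```python
def prefix_tuning_params(prefix_len: int, hidden: int, n_layers: int) -> int:
    """Learnable prefix vectors injected at every transformer layer."""
    return prefix_len * hidden * n_layers

def adapter_params(adapter_dim: int, hidden: int,
                   per_layer: int, n_layers: int) -> int:
    """Down-projection + up-projection per adapter; biases ignored."""
    return 2 * adapter_dim * hidden * per_layer * n_layers

print(prefix_tuning_params(20, 4096, 32) / 1e6)  # ~2.6M
print(adapter_params(64, 4096, 2, 32) / 1e6)     # ~33.6M
```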

PEFT Methods Comparison

| Method | Params (8B model) | Quality | Memory Overhead | Exam Focus |
|---|---|---|---|---|
| LoRA (r=16) | 16.8M (0.21%) | High | Low | Primary focus |
| QLoRA (r=16) | 16.8M (0.21%) | High (NF4 base) | Very low | Primary focus |
| Prefix Tuning | 2.6M (0.03%) | Moderate | Very low | Awareness |
| Adapter Layers | 33.6M (0.42%) | High | Low | Awareness |
| IA3 | ~0.6M (0.008%) | Lower | Minimal | Awareness |

Fine-Tuning with NVIDIA NeMo Framework

NVIDIA NeMo is the primary framework for fine-tuning LLMs on NVIDIA hardware. The exam tests NeMo-specific configuration and workflow.

NeMo Fine-Tuning Pipeline

1. Prepare dataset (JSONL format)
   → {"input": "instruction", "output": "response"}

2. Load base model (.nemo checkpoint)
   → Llama, Mistral, Nemotron, etc.

3. Configure PEFT method
   → LoRA rank, alpha, target modules

4. Configure training
   → Learning rate, batch size, epochs, scheduler

5. Train with NeMo Launcher
   → Single-GPU or distributed (Megatron-LM backend)

6. Export adapter
   → Small .nemo file (50-200MB)

7. Deploy
   → Merge adapter into base or serve separately via Triton/NIM

NeMo LoRA Configuration

from nemo.collections.nlp.models.language_modeling import MegatronGPTSFTModel

# PEFT configuration
peft_cfg = {
    "peft_scheme": "lora",
    "lora_tuning": {
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
        "adapter_dim": 16,       # rank
        "alpha": 32,             # alpha = 2 * rank
        "adapter_dropout": 0.05,
    }
}

# Training configuration
training_cfg = {
    "trainer": {
        "max_epochs": 3,
        "precision": "bf16-mixed",
        "devices": 1,             # single GPU for LoRA
        "accumulate_grad_batches": 4,
    },
    "model": {
        "learning_rate": 2e-4,    # higher LR safe with LoRA
        "weight_decay": 0.01,
        "warmup_steps": 100,
    }
}

NeMo Data Format

NeMo expects data in JSONL format for supervised fine-tuning (SFT):

{"input": "Summarize the following research paper abstract:", "output": "The paper presents..."}
{"input": "Translate to French: The weather is nice today.", "output": "Le temps est beau aujourd'hui."}

For instruction tuning (chat format):

{"conversations": [{"role": "user", "content": "Explain quantization."}, {"role": "assistant", "content": "Quantization reduces..."}]}
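A small helper for producing the SFT format above (a generic sketch; NeMo's own dataset classes perform stricter validation):

```python
import json

def to_sft_jsonl(examples) -> str:
    """Serialize {'input': ..., 'output': ...} records, one JSON object per line."""
    lines = []
    for ex in examples:
        assert set(ex) >= {"input", "output"}, f"missing keys: {ex}"
        lines.append(json.dumps(ex, ensure_ascii=False))
    return "\n".join(lines) + "\n"

examples = [
    {"input": "Summarize the abstract:", "output": "The paper presents..."},
    {"input": "Translate to French: Hello.", "output": "Bonjour."},
]
jsonl = to_sft_jsonl(examples)
print(jsonl.count("\n"))  # 2 records, one per line
```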

Master These Concepts with Practice

Our NCP-GENL practice bundle includes:

  • 7 full practice exams (455+ questions)
  • Detailed explanations for every answer
  • Domain-by-domain performance tracking

30-day money-back guarantee

Catastrophic Forgetting Prevention

Catastrophic forgetting occurs when fine-tuning causes the model to lose previously learned capabilities. The exam tests multiple prevention strategies.

Prevention Strategies

| Strategy | How It Works | Effectiveness | Exam Frequency |
|---|---|---|---|
| PEFT methods (LoRA/QLoRA) | Freeze base weights, only train adapters | Very high — base capabilities fully preserved | High |
| Data mixing | Include general-purpose data alongside domain data | High | High |
| Low learning rate | Reduce learning rate to minimize weight changes | Moderate | Medium |
| Early stopping | Stop before overfitting on domain data | Moderate | Medium |
| Elastic Weight Consolidation (EWC) | Penalize changes to weights important for prior tasks | High (compute-intensive) | Low |
| Regularization dropout | Increase dropout during fine-tuning | Moderate | Low |

Data Mixing Ratios:

The exam often asks about optimal data mixing ratios for preventing catastrophic forgetting:

| Scenario | Domain Data | General Data | Rationale |
|---|---|---|---|
| Light domain adaptation | 70-80% | 20-30% | Primarily learning new domain, light preservation |
| Heavy domain adaptation | 50-60% | 40-50% | Equal emphasis on new knowledge and preservation |
| Instruction tuning | 100% instruction data | 0% | Format change only, not domain shift |
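The ratios above translate into a simple sampling step. A minimal sketch (`mix_datasets` is a hypothetical helper; production pipelines typically interleave at the batch level with a fixed seed):

```python
import random

def mix_datasets(domain, general, domain_ratio=0.7, seed=0):
    """Build a training set with the requested domain/general split."""
    n_general = round(len(domain) * (1 - domain_ratio) / domain_ratio)
    rng = random.Random(seed)
    mixed = list(domain) + rng.sample(general, min(n_general, len(general)))
    rng.shuffle(mixed)
    return mixed

domain = [f"domain-{i}" for i in range(700)]
general = [f"general-{i}" for i in range(5000)]
mixed = mix_datasets(domain, general, domain_ratio=0.7)
print(len(mixed))  # 1000 examples: 700 domain + 300 general
```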

Exam Pattern: LoRA = Built-in Forgetting Prevention

When a question asks "how do you prevent catastrophic forgetting when fine-tuning a 70B model for medical text generation?", the answer often starts with "use LoRA." Because LoRA freezes the base model weights and only trains small adapter matrices, the original capabilities are inherently preserved. Additional data mixing provides an extra safety margin. Full fine-tuning questions about catastrophic forgetting require more explicit prevention (data mixing + low LR + early stopping).

Fine-Tuning Data Preparation

Fine-Tuning (13%) overlaps with the Data Preparation domain (9%): the quality of your training data directly determines fine-tuning success.

Data Quality Checklist

  1. Diversity: Cover the full range of expected inputs and outputs
  2. Accuracy: All output labels/responses must be correct
  3. Consistency: Same format and style across examples
  4. Volume: 1K-10K examples for LoRA, 10K-100K+ for full fine-tuning
  5. Deduplication: Remove duplicate or near-duplicate examples
  6. Length distribution: Match expected production input/output lengths
  7. Edge cases: Include boundary conditions and unusual inputs
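Item 5, deduplication, can be approximated with exact matching on normalized text (a minimal sketch; near-duplicate detection in practice uses MinHash or embedding similarity):

```python
def dedupe(examples):
    """Drop exact duplicates after whitespace/case normalization."""
    seen, unique = set(), []
    for ex in examples:
        key = " ".join((ex["input"] + " " + ex["output"]).lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique

data = [
    {"input": "Explain LoRA.", "output": "LoRA adds low-rank adapters."},
    {"input": "Explain  LoRA.", "output": "lora adds low-rank adapters."},  # near-identical
    {"input": "Explain QLoRA.", "output": "QLoRA quantizes the base model."},
]
print(len(dedupe(data)))  # 2
```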

Training Data Size Guidelines

| Fine-Tuning Method | Minimum Examples | Optimal Range | Diminishing Returns |
|---|---|---|---|
| LoRA (r=8) | 500 | 1K-5K | >10K for simple tasks |
| LoRA (r=16-32) | 1K | 5K-20K | >50K |
| QLoRA (r=16-32) | 1K | 5K-20K | >50K |
| Full Fine-Tuning | 10K | 50K-500K | Model and task dependent |
| Instruction Tuning | 5K | 10K-100K | >500K |

RLHF and DPO: Alignment Fine-Tuning

The exam tests awareness of alignment techniques at a conceptual level, not implementation depth.

Reinforcement Learning from Human Feedback (RLHF)

RLHF aligns model outputs with human preferences using a three-stage pipeline:

  1. Supervised Fine-Tuning (SFT): Fine-tune the base model on high-quality instruction-response pairs
  2. Reward Model Training: Train a separate model to score outputs based on human preference rankings
  3. PPO Optimization: Use proximal policy optimization to fine-tune the SFT model, maximizing the reward model's score while staying close to the SFT model (KL divergence penalty)

Direct Preference Optimization (DPO)

DPO simplifies RLHF by eliminating the reward model and PPO stages. It directly optimizes the model using preference pairs (chosen vs rejected responses).

DPO advantages over RLHF:

  • No separate reward model needed (reduces memory and complexity)
  • More stable training (no PPO hyperparameter tuning)
  • Comparable alignment quality for most use cases
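The DPO objective for one preference pair can be written out directly. A sketch with scalar sequence log-probabilities; beta (typically ~0.1-0.5) controls how far the policy may drift from the reference model:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * ((logpi_c - logref_c) - (logpi_r - logref_r)))."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1 / (1 + math.exp(-margin)))

# Policy already prefers the chosen response relative to the reference model,
# so the margin is positive and the loss is below log(2).
print(round(dpo_loss(-10.0, -14.0, -12.0, -13.0, beta=0.1), 4))  # 0.5544
```

Note there is no reward model anywhere in the computation: the preference pair and the frozen reference log-probabilities are all that is needed, which is exactly the simplification over RLHF.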

RLHF vs DPO

| Aspect | RLHF | DPO |
|---|---|---|
| Training Stages | 3 (SFT + Reward Model + PPO) | 2 (SFT + DPO) |
| Reward Model | Required (separate model) | Not required |
| Training Stability | Harder to tune (PPO sensitive) | More stable (direct optimization) |
| Compute Cost | Higher (multiple models in memory) | Lower (single model) |
| Quality | Slightly higher ceiling with careful tuning | Comparable for most tasks |
| Exam Focus | Conceptual understanding | When to prefer over RLHF |

Practice Questions

Question 1: You need to fine-tune a 70B model for medical report generation using 8,000 high-quality training examples. You have access to a single A6000 48GB GPU. Which approach is most appropriate?

A) Full fine-tuning with gradient checkpointing
B) LoRA with r=16 targeting all attention modules
C) QLoRA with r=32, NF4 quantization, targeting attention and MLP modules
D) Prefix tuning with prefix length 50

Answer: C. The 70B model in BF16 requires 140GB, far exceeding the 48GB GPU even with gradient checkpointing (A is impossible). Standard LoRA (B) still needs the base model in BF16 (140GB) — does not fit. QLoRA (C) stores the base model in NF4 (35GB), fitting on the 48GB GPU with room for LoRA adapters and activations. r=32 with attention+MLP targeting provides sufficient capacity for medical domain adaptation. Prefix tuning (D) could fit but provides significantly lower quality for this type of task.

Question 2: After fine-tuning a 13B model with LoRA (r=8) on 3,000 customer support examples, the model performs well on support queries but has significantly degraded general reasoning capabilities. What is the most likely cause and fix?

A) LoRA rank is too high — reduce to r=4
B) The base model was corrupted during LoRA training — retrain from checkpoint
C) The training data caused catastrophic forgetting — add general-purpose data mixing at 30-40% ratio
D) LoRA is inherently unable to preserve base capabilities — switch to full fine-tuning

Answer: C. A is wrong because r=8 is already low and reducing rank would hurt domain performance. B is wrong because LoRA freezes base weights, so they cannot be corrupted. D is wrong because LoRA preserves base weights by design. The most likely issue is that the 3,000 examples are narrowly focused: even though LoRA freezes the base weights, the adapter outputs can dominate the model's behavior on general queries. The fix (C) is to include general-purpose examples in the training mix so the adapter does not over-specialize. Note: this is a nuanced scenario. LoRA typically prevents catastrophic forgetting, but extremely narrow domain-specific training data can still cause the adapter to steer outputs away from general capabilities.

Question 3: You are serving 5 different LoRA adapters (medical, legal, financial, code, support) from a single 70B base model. What is the total GPU memory requirement compared to serving 5 separate full fine-tuned models?

A) 5x base model + 5x adapter overhead
B) 1x base model + 5x adapter overhead
C) 1x base model + 1x adapter overhead (swap at request time)
D) 5x base model (adapters are merged)

Answer: B. The base model is loaded once (140GB in FP16 for 70B). Each LoRA adapter is small (50-200MB) and can be loaded alongside the base model. All 5 adapters can be in memory simultaneously (~1GB total), with routing logic selecting the appropriate adapter per request. The alternative — 5 separate full fine-tuned models — would require 5 x 140GB = 700GB. LoRA serving reduces this to ~141GB, a 5x memory savings.

For comprehensive practice across all 10 NCP-GENL domains, try our NCP-GENL practice exams.

Summary: Fine-Tuning Key Takeaways

| Concept | Key Fact for the Exam |
|---|---|
| LoRA | Low-rank adapters. Typical: r=16, alpha=32, target 4 attention modules. 0.1-0.3% trainable params. |
| QLoRA | LoRA + NF4 base model. 4x memory reduction. Enables 70B on single 48GB GPU. |
| Rank selection | r=4 (simple), r=8 (light), r=16 (standard), r=32 (complex), r=64 (max PEFT capacity). |
| Target modules | q,v (minimal) vs full attention (default) vs attention+MLP (maximum quality). |
| Alpha | Set alpha = 2 x rank as default. Effective update scales as alpha/r. |
| Full FT vs PEFT | Full FT: large data + compute. PEFT: limited data/compute, multi-tenant serving. |
| Catastrophic forgetting | LoRA inherently prevents it (frozen base). Data mixing (20-40% general) adds safety. |
| NeMo Framework | NVIDIA's FT framework. JSONL data format. Megatron-LM backend for distributed training. |
| RLHF vs DPO | RLHF: 3-stage (SFT+RM+PPO). DPO: 2-stage (SFT+DPO), simpler, comparable quality. |
| Multi-tenant serving | 1 base model + N adapters. Each adapter ~50-200MB. Dramatic memory savings. |

For the full preparation strategy, see our How to Pass NCP-GENL guide and NCP-GENL Cheat Sheet for quick reference.
