Fine-Tune an LLM with LoRA and QLoRA
GPU sandbox · notebook
Beta

Fine-Tune an LLM with LoRA and QLoRA

Fine-tune Meta Llama 3 8B on a custom instruction dataset using LoRA and QLoRA. Learn parameter-efficient fine-tuning from data preparation through evaluation — the #1 most demanded AI skill.

45 min · 7 steps · 3 domains · Intermediate · ncp-genl · ncp-ads · nca-genl

What you'll learn

  1. Explore the Base Model
    In this lab, you'll take Meta Llama 3 8B — a raw base model with no instruction-following ability — and fine-tune it into a medical Q&A specialist. By the end, your model will go from incoherent text completion to structured, accurate clinical answers.
  2. Prepare the Training Dataset
    Fine-tuning is only as good as your data. A poorly formatted dataset teaches the model bad habits. A well-structured one makes it dramatically better.
  3. Understand LoRA — How It Works
    LoRA (Low-Rank Adaptation) is the most popular parameter-efficient fine-tuning method. Instead of updating all 8 billion parameters, LoRA freezes the original model and injects small trainable matrices into specific layers.
  4. Fine-Tune with QLoRA
    Now we combine everything: the tokenized dataset from Step 2, LoRA adapters from Step 3, and Hugging Face's Trainer API.
  5. Understand Quantization — FP16 vs 4-bit
    You just trained with QLoRA — but what exactly does "4-bit" mean, and when would you choose differently?
  6. Evaluate the Fine-Tuned Model
    This is the payoff. You'll compare the raw base model (which can't follow instructions) against your fine-tuned version (which can).
  7. Merge and Export
    Right now your fine-tuned model is stored as two parts: the frozen base weights and a small LoRA adapter. This step merges them into one deployable checkpoint.

Prerequisites

  • Basic Python (functions, loops, dicts)
  • Familiarity with PyTorch tensors
  • Understanding of what LLMs are and how they generate text

Exam domains covered

Fine-Tuning & Data Preparation · GPU Acceleration & Distributed Training · Evaluation, Monitoring & Safety

Skills & technologies you'll practice

This intermediate-level GPU lab gives you real-world reps across:

LoRA · QLoRA · Fine-Tuning · Llama 3 · PEFT · Hugging Face · bitsandbytes

What you'll build in this LoRA fine-tuning lab

Across seven steps you'll turn a raw Llama 3 8B base model into a medical-Q&A specialist using QLoRA, and prove the change with perplexity. Step 1 loads the base checkpoint at /models/meta-llama--Meta-Llama-3-8B in FP16 and fires the same three clinical prompts you'll revisit after training. Step 2 filters the Alpaca dataset down to 2,000 medical examples, formats them into the ### Instruction: ... ### Response: ... template, and saves tokenized train/val splits to disk. Step 3 constructs a LoraConfig(r=16, lora_alpha=32, target_modules=['q_proj','k_proj','v_proj','o_proj'], task_type=TaskType.CAUSAL_LM) and wraps the model with get_peft_model. Step 4 re-loads the base in 4-bit via BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type='nf4', ...), calls prepare_model_for_kbit_training, and runs the Hugging Face Trainer for 100 steps with per_device_train_batch_size=4, gradient_accumulation_steps=4, learning_rate=2e-4. Step 5 is a quantization deep dive comparing FP16 vs NF4 VRAM. Step 6 toggles disable_adapter_layers() and enable_adapter_layers() on a single PeftModel to measure base vs adapted perplexity on the held-out set. Step 7 calls merge_and_unload() and exports a deployable checkpoint.
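The `### Instruction: ... ### Response: ...` template from Step 2 can be sketched as a tiny formatter. This is a minimal sketch assuming Alpaca-style instruction/input/output fields; `format_example` is a hypothetical helper, not the lab's actual code:

```python
def format_example(instruction: str, response: str, inp: str = "") -> str:
    """Render one Alpaca-style example into an Instruction/Response prompt.

    Field names mirror the Alpaca dataset; whether the lab's template
    includes the optional Input block is an assumption.
    """
    prompt = f"### Instruction:\n{instruction}\n"
    if inp:
        prompt += f"### Input:\n{inp}\n"
    prompt += f"### Response:\n{response}"
    return prompt


example = format_example(
    "What is hypertension?",
    "Chronically elevated arterial blood pressure.",
)
print(example)
```

Every training example is rendered through one function like this so the model sees a single, consistent structure — the point Step 2 makes about well-formatted data.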

The mental model you leave with: LoRA isn't a smaller model, it's a structural shortcut. You freeze the 8B base weights and train two tiny matrices B and A per attention projection, so the effective update W + BA is learnable with ~0.1% of the parameters and a fraction of the optimizer memory. QLoRA adds the trick that makes it fit a consumer 24 GB card: base weights stored as NF4 (a data-distribution-aware 4-bit format with double quantization), compute done in FP16, gradients flowing only into the LoRA adapters. gradient_checkpointing_enable() trades recompute for activation memory; bnb_4bit_use_double_quant=True quantizes the quantization constants themselves. The live loss curve plus the base-vs-adapted perplexity bar chart are the ground truth that the recipe actually works.
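That ~0.1% figure is easy to sanity-check. A back-of-envelope count, assuming the public Llama 3 8B shapes (4096 hidden size, 1024 KV projection width under grouped-query attention, 32 decoder layers) — illustrative arithmetic, not the lab's grader:

```python
# LoRA on a (d_out, d_in) projection adds r * (d_in + d_out) parameters
# (matrix A is (r, d_in), matrix B is (d_out, r)).
hidden, kv_dim, layers, r = 4096, 1024, 32, 16

per_layer = (
    r * (hidden + hidden)    # q_proj: 4096 -> 4096
    + r * (hidden + kv_dim)  # k_proj: 4096 -> 1024 (grouped-query attention)
    + r * (hidden + kv_dim)  # v_proj: 4096 -> 1024
    + r * (hidden + hidden)  # o_proj: 4096 -> 4096
)
trainable = per_layer * layers
print(f"{trainable:,} adapter params ≈ {trainable / 8.03e9:.2%} of the base")
# -> 13,631,488 adapter params ≈ 0.17% of the base
```

Roughly 13.6M trainable parameters against an 8B frozen base — the same order as the ~0.1% quoted above, and the reason the optimizer state stays tiny.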

Prerequisites are a little Python, a rough sense of PyTorch tensors, and knowing what next-token prediction is — no fine-tuning experience assumed; the lab walks you through PEFT from first principles. Budget about 45 minutes on the real NVIDIA GPU sandbox we provision per session: the base model, the Alpaca dataset, transformers, peft, bitsandbytes, and Trainer are all preinstalled. Each of the seven steps is auto-graded by a notebook check: Step 2 verifies the tokenized datasets landed on disk, Step 4 opens adapter_config.json and confirms r=16 plus the expected target_modules, Step 6 requires ft_ppl < base_ppl, Step 7 checks that safetensors and tokenizer files are present under merged-model/.

Frequently asked questions

Why does rank=16 update less than 1% of the weights but still move perplexity meaningfully?

Because the gradient signal on a specialization task lives in a low-dimensional subspace of the 4096×4096 attention matrices. LoRA parameterises the update as W + BA where A is (4096, r) and B is (r, 4096), so at r=16 you train 2 × 16 × 4096 = 131,072 parameters per projection instead of 16.7 million — a 128× compression. The LoRA paper and every subsequent ablation show r=8-16 captures essentially all of the gain on instruction-tuning style tasks. Higher rank mostly adds overfitting risk, not capacity.

What does merge_and_unload() actually do and when would I NOT call it?

It computes W' = W + (α/r) × B @ A for each adapted projection, replaces the frozen weights with W' in place, and strips the PEFT wrapper so the result is a plain LlamaForCausalLM. Deployment-friendly: vLLM, TensorRT-LLM, and Triton all accept it as a normal model. You would NOT merge if you want to hot-swap multiple adapters on one base at serve time (vLLM's --enable-lora path), or if you're stacking several LoRAs and plan to scale α differently per request. Merge is a one-way door.
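A toy-scale sketch of that merge arithmetic — 3×3 instead of 4096×4096, with made-up adapter values standing in for trained weights:

```python
# W' = W + (alpha/r) * B @ A, computed in place, as merge_and_unload() does
# per adapted projection. Shapes and values here are illustrative only.
d, r = 3, 2
scale = 32 / 16  # alpha / r with the lab's lora_alpha=32, r=16

W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen base
B = [[0.1, 0.0], [0.0, 0.1], [0.0, 0.0]]  # (d, r): pretend-trained (zero before training)
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # (r, d): pretend-trained

for i in range(d):
    for j in range(d):
        W[i][j] += scale * sum(B[i][k] * A[k][j] for k in range(r))

print(W[0][0], W[1][1], W[2][2])  # -> 1.2 1.2 1.0
```

After the loop there is no separate adapter left to toggle — the update has been baked into W, which is why merging is described above as a one-way door.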

Why NF4 specifically, not INT4?

Neural network weights are approximately normally distributed around zero. INT4 assigns quantization bins uniformly, wasting resolution on rare large values. NF4 (NormalFloat4) places its 16 bins at the theoretical quantiles of a unit normal, so each bin carries equal weight-probability mass — you get lower reconstruction error for the same 4 bits per parameter. Combined with bnb_4bit_use_double_quant=True, which quantizes the per-block absmax constants themselves, you save roughly 0.37 bits per parameter versus single-level NF4. The result is <1% quality loss against FP16 on most instruction-tuning benchmarks.
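You can reproduce the flavor of the NF4 codebook with the standard library's NormalDist. This is a sketch of the equal-probability-mass idea only — bitsandbytes builds its actual table a bit differently (it pins an exact zero level and treats the two halves asymmetrically):

```python
from statistics import NormalDist

# Place 2**4 = 16 levels at equal-mass quantiles of a unit normal,
# then rescale so the codebook spans [-1, 1].
nd = NormalDist()
k = 16
quantiles = [nd.inv_cdf((i + 0.5) / k) for i in range(k)]
levels = [q / max(abs(q) for q in quantiles) for q in quantiles]

print(levels[0], levels[-1])  # endpoints land at -1.0 and 1.0
```

Notice the levels cluster near zero and spread out in the tails — exactly the opposite of INT4's uniform spacing, and the reason NF4 wastes less resolution on rare large weights.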

Why is learning_rate=2e-4 so much higher than a full fine-tune would use?

Full fine-tunes of Llama-class models typically run at 1e-5 to 5e-5 because every parameter is moving and the loss landscape is fragile. LoRA only trains a tiny subspace of freshly initialized adapter weights (A is Kaiming-init, B is zero — the initial update is exactly zero, so the model starts identical to the base), so it tolerates — and benefits from — a much larger LR. 1e-4 to 3e-4 with warmup is the standard LoRA/QLoRA range. Go lower and 100 steps isn't enough to move the adapters; go higher and training becomes unstable on small datasets.
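The zero-init claim is checkable in a few lines — a toy sketch with made-up shapes, standing in for PEFT's actual initialization:

```python
import random

# A gets a small random init (Kaiming-style in the real library); B starts
# at exactly zero. So the update B @ A is the zero matrix, and the adapted
# model's first forward pass matches the frozen base bit-for-bit.
d, r = 8, 2
A = [[random.gauss(0, 0.02) for _ in range(d)] for _ in range(r)]
B = [[0.0] * r for _ in range(d)]

delta = [[sum(B[i][k] * A[k][j] for k in range(r)) for j in range(d)]
         for i in range(d)]
print(all(v == 0.0 for row in delta for v in row))  # -> True
```

Training only ever has to push B away from zero, which is why the large learning rate is safe: there are no pretrained weights in the update path to destabilize.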

Can I really fine-tune an 8B model on a single 24 GB GPU?

Yes, and QLoRA is specifically designed to make it fit. The 4-bit base is about 5 GB. Adapter weights plus their optimizer state are under 1 GB. Activations with gradient_checkpointing_enable() stay below 10 GB at per_device_train_batch_size=4 and max_length=512. The effective batch size of 16 comes from gradient_accumulation_steps=4, which costs wall time but zero VRAM. If you tried to do the same job in FP16 without LoRA you'd need about 80 GB for weights + Adam state alone. The sandbox pod we provision sits comfortably in the QLoRA envelope.
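Those numbers add up in a few lines of arithmetic — illustrative estimates matching the figures above, not measurements from the sandbox:

```python
# Rough VRAM budget in GB for QLoRA on Llama 3 8B. All values are
# back-of-envelope; real usage varies with sequence length and overhead.
params = 8.03e9                    # base model parameter count
adapter = 13.6e6                   # LoRA adapter parameter count (r=16)

base_4bit = params * 0.5 / 1e9     # NF4 weights: ~0.5 bytes/param (~4 GB, ~5 with overhead)
adapter_fp16 = adapter * 2 / 1e9   # trainable adapter weights in half precision
adam_fp32 = adapter * 8 / 1e9      # two FP32 Adam moments per trainable param
activations = 10.0                 # upper bound with gradient checkpointing

qlora_total = base_4bit + adapter_fp16 + adam_fp32 + activations
full_fp16 = params * (2 + 8) / 1e9  # FP16 weights + FP32 Adam moments, no LoRA

print(f"QLoRA ≈ {qlora_total:.0f} GB, full FP16 fine-tune ≈ {full_fp16:.0f} GB")
```

The adapter's entire training state (weights plus optimizer moments) rounds to nothing next to the activations — that asymmetry, not any single trick, is what puts an 8B model inside 24 GB.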

The check says ft_ppl < base_ppl — what counts as a good improvement?

On 100 training steps over the medical-Alpaca slice with the hyperparameters in Step 4, most runs drop perplexity on the held-out validation set by 15-40% relative to the base. That's a meaningful signal — the adapters learned the instruction-response format plus some medical terminology distribution — without being so large that it implies overfitting to 1,800 training examples. Longer training or higher rank pushes the gap wider but also raises the risk of catastrophic forgetting on general text; you can see that tradeoff up close in the continued-pretraining lab.
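In practice you read perplexity off the validation loss. A sketch with made-up losses that sit inside that 15-40% band — the loss values are hypothetical, not from a real run:

```python
import math

# Perplexity is exp(mean cross-entropy loss); the relative drop is what
# the Step 6 check is really measuring.
base_loss, ft_loss = 2.90, 2.55   # hypothetical held-out losses
base_ppl = math.exp(base_loss)
ft_ppl = math.exp(ft_loss)
drop = 1 - ft_ppl / base_ppl

print(f"{base_ppl:.1f} -> {ft_ppl:.1f} ({drop:.0%} lower)")
```

A small absolute change in loss translates into a large relative perplexity change because of the exponential — worth keeping in mind when you eyeball the loss curve in Step 4.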