Fine-Tune an LLM with LoRA and QLoRA (Jupyter)
GPU sandbox · jupyter
Beta

Fine-tune Meta Llama 3 8B on a custom instruction dataset using LoRA and QLoRA. Learn parameter-efficient fine-tuning from data preparation through evaluation — the #1 most demanded AI skill.

45 min · 7 steps · 3 domains · Intermediate · ncp-genl · ncp-ads · nca-genl

What you'll learn

  1. Explore the Base Model
  2. Prepare the Training Dataset
  3. Understand LoRA — How It Works
  4. Fine-Tune with QLoRA
  5. Understanding Quantization — FP16 vs 4-bit
  6. Evaluate the Fine-Tuned Model
  7. Merge and Export

Prerequisites

  • Basic Python (functions, loops, dicts)
  • Familiarity with PyTorch tensors
  • Understanding of what LLMs are and how they generate text

Exam domains covered

Fine-Tuning & Data Preparation · GPU Acceleration & Distributed Training · Evaluation, Monitoring & Safety

Skills & technologies you'll practice

This intermediate-level GPU lab gives you real-world reps across:

LoRA · QLoRA · Fine-Tuning · Llama 3 · PEFT · JupyterLab

What you'll build in this QLoRA-in-JupyterLab lab

LoRA and QLoRA are among the highest-leverage LLM skills of 2026: fine-tuning a Llama-class model used to be a multi-node distributed training job, and now it's a 45-minute project you run in one notebook on a consumer-grade GPU. You'll walk away with a fine-tuned Llama 3 8B medical Q&A model merged to a deployable checkpoint; a working mental model of rank (r=16), alpha (α=32), and which projections to adapt (q_proj, k_proj, v_proj, o_proj); the VRAM math that separates FP16 (which won't fit on a 4090 once you add optimizer state) from NF4 (15+ GB of headroom for training, evaluation, and a running kernel); and live loss/perplexity curves rendered inline with matplotlib as the run progresses. No pip install, no CUDA driver wrangling: you land in a live NVIDIA pod with Llama 3, PEFT, bitsandbytes, and a JupyterLab kernel already running.

The technical substance is the three-way interplay between parameter-efficient fine-tuning, 4-bit quantization, and notebook-resident state. LoraConfig(r=16, lora_alpha=32, target_modules=['q_proj','k_proj','v_proj','o_proj']) trains under 1% of parameters while still shifting the model's behavior; BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type='nf4', bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.float16) stores weights as NF4 quantiles fitted to the normal distribution, then dequantizes them on the fly for FP16 compute. You'll see directly why merge_and_unload() (the W + (α/r) × B @ A bake-in that hands you a plain LlamaForCausalLM) differs from PeftModel.from_pretrained (runtime composition with togglable adapters), and you'll hit the Jupyter-specific trap that catches everyone their first time: del model doesn't free VRAM, because IPython's Out[] cache plus the _, __, ___ shortcuts pin the whole computation graph. The correct cleanup is %reset -f out, then gc.collect(), then torch.cuda.empty_cache(), in that order; otherwise the 4-bit reload OOMs on a 24 GB card even though the NF4 footprint is a quarter of FP16's.
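The "under 1% of parameters" claim is easy to sanity-check by hand. A back-of-envelope sketch, assuming the published Llama 3 8B dimensions (hidden size 4096, 32 layers, and 1024-wide k/v projections due to grouped-query attention); the helper name is illustrative, not a PEFT API:

```python
# Back-of-envelope LoRA parameter count for r=16 on
# q_proj, k_proj, v_proj, o_proj in Llama 3 8B (sketch).
HIDDEN = 4096      # Llama 3 8B hidden size
KV_WIDTH = 1024    # k/v projection output width (grouped-query attention)
LAYERS = 32
R = 16

def lora_params(d_in: int, d_out: int, r: int) -> int:
    # LoRA adds A with shape (r, d_in) and B with shape (d_out, r)
    return r * d_in + d_out * r

per_layer = (
    lora_params(HIDDEN, HIDDEN, R)      # q_proj
    + lora_params(HIDDEN, KV_WIDTH, R)  # k_proj
    + lora_params(HIDDEN, KV_WIDTH, R)  # v_proj
    + lora_params(HIDDEN, HIDDEN, R)    # o_proj
)
total_lora = per_layer * LAYERS
base_params = 8.03e9  # ~8B base model

print(f"trainable LoRA params: {total_lora:,}")             # ~13.6M
print(f"fraction of base: {total_lora / base_params:.4%}")  # well under 1%
```

Roughly 13.6 million trainable parameters against an 8-billion-parameter base: about 0.17%, comfortably under the 1% figure.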

Prerequisites are basic Python, a rough grip on PyTorch tensors, and knowing that LLMs predict the next token — no prior PEFT or QLoRA experience assumed. The sandbox is a real NVIDIA GPU pod we provision per session with JupyterLab, HuggingFace transformers, peft, bitsandbytes, datasets, matplotlib, and the Llama 3 8B base checkpoint all preinstalled and pinned to compatible versions. Checks run inline after each step (from preporato_labs import Lab; lab = Lab(...).check(N)) against kernel-resident state and on-disk artifacts — adapter_config.json must show r==16 with q_proj in target_modules, ft_ppl < base_ppl after training, and safetensors under merged-model/ at the end. The merged checkpoint is drop-in ready for vLLM or TensorRT-LLM, which is where you take it next.

Frequently asked questions

Why do I need to explicitly reset IPython output history between the LoRA steps?

Because JupyterLab silently caches every cell's return value in Out[<n>], plus the _, __, ___ shortcuts. If a cell ever returned a tensor or a model, that reference pins the whole computation graph, and del model doesn't free it. Step 4 unloads the Step 3 model before loading the 4-bit base, and the cleanup has to call get_ipython().run_line_magic('reset', '-f out'), then gc.collect(), then torch.cuda.empty_cache(), in that order. Skip the history reset and the new QLoRA load OOMs on a 24 GB card. This isn't a quirk of this lab; it's how you work with large models in any notebook.
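As a ready-to-adapt sketch, the cleanup cell looks like this (run inside Jupyter only; `model` and `trainer` stand in for whatever names you actually bound):

```python
import gc
import torch

# Notebook cleanup cell. Order matters:
# 1) drop your own references, then IPython's Out[] cache
#    (and _, __, ___), which otherwise pin the old model,
# 2) let Python's GC actually collect the now-unreferenced tensors,
# 3) return freed blocks from PyTorch's caching allocator to CUDA.
del model, trainer
get_ipython().run_line_magic("reset", "-f out")
gc.collect()
torch.cuda.empty_cache()
```

Reversing steps 2 and 3 is the classic mistake: empty_cache() can only release memory the garbage collector has already reclaimed.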

Why plot the loss curve inline instead of using TensorBoard?

Because a 100-step training run finishes in ~3 minutes inside one notebook kernel; spinning up TensorBoard would take longer than watching the loss land. Step 4 also passes report_to='none' and disable_tqdm=True — tqdm progress bars render as literal \r garbage in notebook output — and uses a custom TrainerCallback.on_log that prints one clean Step N/M | loss=... | lr=... line per step. The final matplotlib cell reads trainer.state.log_history and draws training loss with validation checkpoints overlaid as red dots, right next to the cell that produced it.
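The callback's on_log hook boils down to formatting each trainer.state.log_history entry into a single line. A minimal standalone sketch with fake entries (the helper name and dict shapes are illustrative, not the lab's exact callback):

```python
# Format one clean log line per training step, in the style
# "Step N/M | loss=... | lr=..." described above.
def format_log(entry: dict, max_steps: int) -> str:
    return (f"Step {entry['step']}/{max_steps} | "
            f"loss={entry['loss']:.4f} | "
            f"lr={entry['learning_rate']:.2e}")

# Fake log_history entries, shaped like what Trainer records.
history = [
    {"step": 10, "loss": 2.1432, "learning_rate": 2e-4},
    {"step": 20, "loss": 1.8740, "learning_rate": 1.9e-4},
]
for entry in history:
    print(format_log(entry, max_steps=100))
# → Step 10/100 | loss=2.1432 | lr=2.00e-04
# → Step 20/100 | loss=1.8740 | lr=1.90e-04
```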

What's the practical difference between PeftModel.from_pretrained and merge_and_unload()?

PeftModel.from_pretrained(base, path) loads the ~30 MB adapter on top of the base in memory — the base tensors are shared, not copied, so you can toggle the adapter on and off with disable_adapter_layers() / enable_adapter_layers(). That's what Step 6 uses to compare base-vs-fine-tuned perplexity from a single loaded model. merge_and_unload() in Step 7 computes W + (α/r) × B @ A for each adapted projection, writes that back into the base weights, drops the PEFT wrapper, and returns a plain LlamaForCausalLM you can hand to vLLM or TensorRT-LLM. One is a runtime composition; the other is a permanent bake-in.
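Both paths compute the same function, which you can verify with a toy numpy check of the W + (α/r) × B @ A identity (dimensions shrunk for illustration; the real lab uses d=4096, r=16, α=32):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 4            # toy dimensions

W = rng.normal(size=(d, d))      # frozen base projection weight
A = rng.normal(size=(r, d))      # LoRA A (trained)
B = rng.normal(size=(d, r))      # LoRA B (trained)
x = rng.normal(size=(d,))

# Runtime composition (PeftModel): base output + scaled adapter path.
y_adapter_on = W @ x + (alpha / r) * (B @ (A @ x))
y_adapter_off = W @ x            # what disable_adapter_layers() computes

# Permanent bake-in (merge_and_unload): fold the adapter into W.
W_merged = W + (alpha / r) * (B @ A)
y_merged = W_merged @ x

assert np.allclose(y_merged, y_adapter_on)       # merge preserves behavior
assert not np.allclose(y_merged, y_adapter_off)  # but is no longer togglable
print("merged output matches adapter-on output")
```

The toggle is exactly what makes runtime composition the right tool for Step 6's base-vs-fine-tuned comparison, and the bake-in the right tool for Step 7's export.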

Why does the lab use bnb_4bit_compute_dtype=torch.float16 if weights are stored in 4 bits?

Because compute precision and storage precision are independent knobs. NF4 storage keeps each weight as 4 bits plus a per-block absmax; when a matmul happens, those values are dequantized on the fly into the compute_dtype and the actual arithmetic runs in FP16. FP16 compute is fast on the RTX-class cards this sandbox uses and matches the activation dtype, so no extra casts. On H100 / A100 you'd typically switch to bnb_4bit_compute_dtype=torch.bfloat16 for better numerical headroom during backprop.
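The storage-vs-compute split can be made concrete with a toy block-wise absmax scheme. This is a uniform int4-style sketch for clarity, not bitsandbytes' actual NF4 codebook (which uses 16 levels placed at normal-distribution quantiles):

```python
import numpy as np

# Toy block-wise absmax quantization: store small integer codes plus
# one absmax per block; dequantize to a *separate* compute dtype.
def quantize(w, block=64):
    blocks = w.reshape(-1, block)
    absmax = np.abs(blocks).max(axis=1, keepdims=True)
    codes = np.round(blocks / absmax * 7).astype(np.int8)  # range [-7, 7]
    return codes, absmax

def dequantize(codes, absmax, compute_dtype=np.float16):
    # Storage dtype (int codes) and compute dtype (fp16) are independent.
    return (codes / 7 * absmax).astype(compute_dtype)

w = np.random.default_rng(0).normal(size=(4, 64)).astype(np.float32)
codes, absmax = quantize(w.reshape(1, -1))
w_hat = dequantize(codes, absmax).reshape(4, 64)

print("stored as:", codes.dtype, "| computed as:", w_hat.dtype)
print("max abs error:", float(np.abs(w - w_hat.astype(np.float32)).max()))
```

Swapping compute_dtype here changes only the arithmetic precision of downstream matmuls, never the 4-bit storage, which is exactly the knob bnb_4bit_compute_dtype exposes.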

How much VRAM should I expect to see in Step 5's chart?

Llama 3 8B weighs about 32 GB in FP32, ~16 GB in FP16, ~8 GB in INT8, and ~5 GB in NF4. Step 5 measures the NF4 size directly via torch.cuda.memory_allocated() and reloads the FP16 variant in a comparison cell. The matplotlib chart overlays a 24 GB RTX 4090 limit line so the visual punchline lands: FP16 barely fits and leaves no room for optimizer state plus activations plus a KV cache for evaluation, while NF4 leaves ~19 GB free — which is what enables training plus eval plus a running notebook kernel to coexist without OOMs.
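The arithmetic behind those figures is just bytes-per-parameter times parameter count. A sketch (1 GB taken as 1e9 bytes; the ~0.6 bytes/param NF4 figure is an assumption that folds in per-block absmax overhead on top of the raw 0.5 bytes of a 4-bit code):

```python
# Rough VRAM footprint of the 8B weights alone, by storage precision.
PARAMS = 8.03e9
bytes_per_param = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "nf4": 0.6}

for dtype, bpp in bytes_per_param.items():
    gb = PARAMS * bpp / 1e9
    print(f"{dtype:>5}: {gb:5.1f} GB  (under a 24 GB 4090: {gb < 24})")
```

Weights are only the floor: optimizer state, activations, and the KV cache stack on top, which is why FP16's nominal 16 GB leaves no practical training headroom on a 24 GB card while NF4's ~5 GB does.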

Can I re-run a single step without burning through the whole notebook again?

Yes. Because the kernel persists and Steps 2 and 4 save checkpoints to disk (train_data/, val_data/, lora-output/final/), you can close the tab at any point and reopen the notebook. Step 6's eval can be re-run from just PeftModel.from_pretrained(base_model, 'lora-output/final') without redoing Step 4. The same sandbox resumes in the same pod — a kernel restart plus re-running the cleanup cells is usually all you need. That's the entire reason the lab writes adapters to disk at Step 4 instead of holding them purely in memory.
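The resume path is short. A sketch of the reload cell, assuming the base model is already loaded in the kernel under the name `base_model`:

```python
# Resume Step 6's eval from the on-disk adapter without re-running
# Step 4's training. The ~30 MB adapter loads on top of the base;
# base tensors are shared, not copied.
from peft import PeftModel

model = PeftModel.from_pretrained(base_model, "lora-output/final")
model.eval()
```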