Fine-Tune an LLM with LoRA and QLoRA (Jupyter)
Fine-tune Meta Llama 3 8B on a custom instruction dataset using LoRA and QLoRA. Learn parameter-efficient fine-tuning from data preparation through evaluation — the #1 most demanded AI skill.
What you'll learn
1. Explore the Base Model
2. Prepare the Training Dataset
3. Understand LoRA — How It Works
4. Fine-Tune with QLoRA
5. Understand Quantization — FP16 vs 4-bit
6. Evaluate the Fine-Tuned Model
7. Merge and Export
Prerequisites
- Basic Python (functions, loops, dicts)
- Familiarity with PyTorch tensors
- Understanding of what LLMs are and how they generate text
Skills & technologies you'll practice
This intermediate-level GPU lab gives you real-world reps across LoRA and QLoRA fine-tuning, 4-bit NF4 quantization, HuggingFace transformers, peft, bitsandbytes, and GPU memory management in JupyterLab.
What you'll build in this QLoRA-in-JupyterLab lab
LoRA and QLoRA are the highest-leverage LLM skills of 2026 — fine-tuning a Llama-class model on a consumer-grade GPU used to be a multi-node distributed training job, and now it's a 45-minute project you run in one notebook. You'll walk away with a fine-tuned Llama 3 8B medical Q&A model merged to a deployable checkpoint, a working mental model of rank (r=16), alpha (α=32), and which projections to adapt (q_proj, k_proj, v_proj, o_proj), the VRAM math that separates FP16 (won't fit on a 4090 once you add optimizer state) from NF4 (15+ GB headroom for training plus eval plus a running kernel), and live loss/perplexity curves rendered inline with matplotlib as the run progresses. No pip install, no CUDA driver wrangling — you land in a live NVIDIA pod with Llama 3, PEFT, bitsandbytes, and a JupyterLab kernel already running.
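That VRAM math can be sanity-checked in a few lines. This is a back-of-envelope sketch, not a measurement: the ~8B parameter count and the NF4 per-parameter overhead (quantization constants on top of ~0.5 bytes/param) are rough illustrative assumptions.

```python
# Back-of-envelope VRAM math for an ~8B-parameter model on a 24 GB card.
# Parameter count and NF4 overhead are rough assumptions for illustration.
PARAMS = 8.0e9
GB = 1024**3

fp16_weights = PARAMS * 2.0 / GB    # 2 bytes per param
nf4_weights = PARAMS * 0.55 / GB    # ~0.5 bytes per param + quant constants

# Naive full FP16 fine-tuning would also need FP16 gradients plus
# FP32 Adam moments (roughly 8 more bytes per param) -- far past 24 GB.
fp16_training = fp16_weights * 2 + PARAMS * 8.0 / GB

print(f"FP16 weights alone:    {fp16_weights:5.1f} GB")
print(f"NF4 weights:           {nf4_weights:5.1f} GB")
print(f"Naive FP16 training:   {fp16_training:5.1f} GB")
print(f"NF4 headroom on 24 GB: {24 - nf4_weights:5.1f} GB")
```

The punchline is the last line: 4-bit storage leaves most of a 24 GB card free for LoRA optimizer state, activations, and an eval-time KV cache.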
The technical substance is the three-way interplay between parameter-efficient fine-tuning, 4-bit quantization, and notebook-resident state. LoraConfig(r=16, lora_alpha=32, target_modules=['q_proj','k_proj','v_proj','o_proj']) trains under 1% of parameters while still shifting the model's behavior; BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type='nf4', bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.float16) stores weights as NF4 quantiles fitted to the normal distribution, then dequantizes on the fly for FP16 compute. You'll see directly why merge_and_unload() (the W + (α/r) × B @ A bake-in that hands you a plain LlamaForCausalLM) is different from PeftModel.from_pretrained (runtime composition with togglable adapters), and you'll hit the Jupyter-specific trap that catches everyone their first time: del model doesn't free VRAM because IPython's Out[] cache plus _, __, ___ pin the whole computation graph. The correct cleanup is %reset -f out, then gc.collect(), then torch.cuda.empty_cache(), in that order — otherwise the 4-bit reload OOMs on a 24 GB card despite the NF4 footprint being a quarter of FP16.
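Pieced together, the two configs and the cleanup sequence described above look roughly like this as notebook cells. This is a sketch, not the lab's exact code: the model id is a placeholder for whatever checkpoint path the sandbox preinstalls, and the surrounding training loop is omitted.

```python
import gc
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 storage with double quantization; arithmetic runs in FP16.
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

lora_cfg = LoraConfig(
    r=16,            # rank of the low-rank update
    lora_alpha=32,   # effective update is scaled by alpha/r
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",   # placeholder for the lab's local checkpoint
    quantization_config=bnb_cfg,
    device_map="auto",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()   # well under 1% of total

# Later, before reloading anything large: notebook-safe VRAM cleanup, in order.
del model
get_ipython().run_line_magic("reset", "-f out")  # drop Out[] and _, __, ___ refs
gc.collect()
torch.cuda.empty_cache()
```

The cleanup ordering matters: the history reset releases the Python references, gc.collect() actually destroys the tensors, and only then can empty_cache() return the freed blocks to the driver.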
Prerequisites are basic Python, a rough grip on PyTorch tensors, and knowing that LLMs predict the next token — no prior PEFT or QLoRA experience assumed. The sandbox is a real NVIDIA GPU pod we provision per session with JupyterLab, HuggingFace transformers, peft, bitsandbytes, datasets, matplotlib, and the Llama 3 8B base checkpoint all preinstalled and pinned to compatible versions. Checks run inline after each step (from preporato_labs import Lab; lab = Lab(...).check(N)) against kernel-resident state and on-disk artifacts — adapter_config.json must show r==16 with q_proj in target_modules, ft_ppl < base_ppl after training, and safetensors under merged-model/ at the end. The merged checkpoint is drop-in ready for vLLM or TensorRT-LLM, which is where you take it next.
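The ft_ppl < base_ppl check is just exponentiated mean cross-entropy. A minimal sketch of that relationship, with made-up loss values standing in for the real eval numbers:

```python
import math

def perplexity(mean_ce_loss: float) -> float:
    # Perplexity is exp of the mean per-token cross-entropy (in nats).
    return math.exp(mean_ce_loss)

# Hypothetical eval losses on the held-out medical Q&A set:
base_loss, ft_loss = 2.9, 2.1
base_ppl, ft_ppl = perplexity(base_loss), perplexity(ft_loss)

print(f"base_ppl={base_ppl:.1f}  ft_ppl={ft_ppl:.1f}")
assert ft_ppl < base_ppl   # the condition the Step 6 check enforces
```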
Frequently asked questions
Why do I need to explicitly reset IPython output history between the LoRA steps?
IPython caches every cell's return value in Out[<n>] and the _, __, ___ shortcuts. If a cell ever returned a tensor or a model, that reference pins the whole computation graph — del model doesn't free it. Step 4 unloads the Step 3 model before loading the 4-bit base, and the cleanup has to call get_ipython().run_line_magic('reset', '-f out'), then gc.collect(), then torch.cuda.empty_cache(), in that order. Skip the history reset and the new QLoRA load OOMs on a 24 GB card. This isn't a quirk of this lab — it's how you work with large models in any notebook.
Why plot the loss curve inline instead of using TensorBoard?
The lab's TrainingArguments set report_to='none' and disable_tqdm=True — tqdm progress bars render as literal \r garbage in notebook output — and use a custom TrainerCallback.on_log that prints one clean Step N/M | loss=... | lr=... line per step. The final matplotlib cell reads trainer.state.log_history and draws training loss with validation checkpoints overlaid as red dots, right next to the cell that produced it.
What's the practical difference between PeftModel.from_pretrained and merge_and_unload()?
PeftModel.from_pretrained(base, path) loads the ~30 MB adapter on top of the base in memory — the base tensors are shared, not copied, so you can toggle the adapter on and off with disable_adapter_layers() / enable_adapter_layers(). That's what Step 6 uses to compare base-vs-fine-tuned perplexity from a single loaded model. merge_and_unload() in Step 7 computes W + (α/r) × B @ A for each adapted projection, writes that back into the base weights, drops the PEFT wrapper, and returns a plain LlamaForCausalLM you can hand to vLLM or TensorRT-LLM. One is a runtime composition; the other is a permanent bake-in.
Why does the lab use bnb_4bit_compute_dtype=torch.float16 if weights are stored in 4 bits?
Storage and compute are separate: the weights sit in 4-bit NF4 but are dequantized on the fly to the compute_dtype, so the actual arithmetic runs in FP16. FP16 compute is fast on the RTX-class cards this sandbox uses and matches the activation dtype, so no extra casts. On H100 / A100 you'd typically switch to bnb_4bit_compute_dtype=torch.bfloat16 for better numerical headroom during backprop.
How much VRAM should I expect to see in Step 5's chart?
Step 5 loads the NF4 base, measures torch.cuda.memory_allocated(), and reloads the FP16 variant in a comparison cell. The matplotlib chart overlays a 24 GB RTX 4090 limit line so the visual punchline lands: FP16 barely fits and leaves no room for optimizer state plus activations plus a KV cache for evaluation, while NF4 leaves ~19 GB free — which is what enables training plus eval plus a running notebook kernel to coexist without OOMs.
Can I re-run a single step without burning through the whole notebook again?
Yes. Because each step writes its artifacts to disk (train_data/, val_data/, lora-output/final/), you can close the tab at any point and reopen the notebook. Step 6's eval can be re-run from just PeftModel.from_pretrained(base_model, 'lora-output/final') without redoing Step 4. The same sandbox resumes in the same pod — a kernel restart plus re-running the cleanup cells is usually all you need. That's the entire reason the lab writes adapters to disk at Step 4 instead of holding them purely in memory.
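The merge formula in the answers above can be verified on toy numbers: runtime composition and the baked-in weight produce identical outputs. This is a pure-Python illustration with made-up 2×2 matrices, not the peft implementation.

```python
# Toy check that W' = W + (alpha/r) * B @ A behaves the same as applying
# the adapter at runtime. Shapes: W is (out,in), A is (r,in), B is (out,r).
r, alpha = 1, 2
scale = alpha / r

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (identity for clarity)
A = [[0.5, -0.5]]              # LoRA down-projection, (r=1, in=2)
B = [[2.0], [4.0]]             # LoRA up-projection,   (out=2, r=1)
x = [3.0, 1.0]                 # one input vector

def matvec(M, v):
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

# Runtime composition (PeftModel-style): y = W x + scale * B (A x)
runtime = [wx + scale * bax
           for wx, bax in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Merge-and-unload-style: bake the update into W once, then plain matvec
merged_W = [[W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(r))
             for j in range(2)] for i in range(2)]
merged = matvec(merged_W, x)

print(runtime, merged)
assert all(abs(a - b) < 1e-9 for a, b in zip(runtime, merged))
```

Same arithmetic, two execution strategies: the runtime form keeps the adapter togglable, the merged form is what you ship to an inference engine.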