Fine-Tune an LLM with LoRA and QLoRA
Fine-tune Meta Llama 3 8B on a custom instruction dataset using LoRA and QLoRA. Learn parameter-efficient fine-tuning from data preparation through evaluation — one of the most in-demand AI skills.
What you'll learn
1. Explore the Base Model. In this lab, you'll take Meta Llama 3 8B — a raw base model with no instruction-following ability — and fine-tune it into a medical Q&A specialist. By the end, your model will go from incoherent text completion to structured, accurate clinical answers.
2. Prepare the Training Dataset. Fine-tuning is only as good as your data. A poorly formatted dataset teaches the model bad habits. A well-structured one makes it dramatically better.
3. Understand LoRA — How It Works. LoRA (Low-Rank Adaptation) is the most popular parameter-efficient fine-tuning method. Instead of updating all 8 billion parameters, LoRA freezes the original model and injects small trainable matrices into specific layers.
4. Fine-Tune with QLoRA. Now we combine everything: the tokenized dataset from Step 2, LoRA adapters from Step 3, and Hugging Face's Trainer API.
5. Understand Quantization — FP16 vs 4-bit. You just trained with QLoRA — but what exactly does "4-bit" mean, and when would you choose differently?
6. Evaluate the Fine-Tuned Model. This is the payoff. You'll compare the raw base model (which can't follow instructions) against your fine-tuned version (which can).
7. Merge and Export. Right now your fine-tuned model is stored as two parts: the frozen base weights and a small LoRA adapter. This step folds them into a single deployable checkpoint.
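The freeze-and-inject idea behind LoRA fits in a few lines of numpy. This is an illustrative sketch of the mechanism, not the peft library's implementation; the hidden size matches Llama 3 8B's attention projections, and the rank and alpha are the lab's values.

```python
import numpy as np

# Illustrative LoRA layer: the base weight W is frozen; only (A, B) would be trained.
rng = np.random.default_rng(0)
d, r, alpha = 4096, 16, 32               # Llama 3 8B hidden size (assumed), lab's r/alpha

W = rng.standard_normal((d, d)) * 0.02   # frozen base projection weight
A = rng.standard_normal((r, d)) * 0.01   # trainable, small random init
B = np.zeros((d, r))                     # trainable, zero init, so BA = 0 at the start

def lora_forward(x):
    # frozen path plus low-rank update path, scaled by alpha / r
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.standard_normal((1, d))
assert np.allclose(lora_forward(x), x @ W.T)   # at init the adapter changes nothing

# Per projection, the adapter is 2*r*d parameters against d*d frozen ones.
print(f"trainable fraction of this layer: {(A.size + B.size) / W.size:.2%}")
```

Zero-initializing B is the standard LoRA trick: training starts from the base model's exact behavior and the update grows from nothing.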
Prerequisites
- Basic Python (functions, loops, dicts)
- Familiarity with PyTorch tensors
- Understanding of what LLMs are and how they generate text
Skills & technologies you'll practice
This intermediate-level GPU lab gives you real-world reps across parameter-efficient fine-tuning with LoRA and QLoRA, 4-bit quantization with bitsandbytes, instruction-dataset preparation, and the Hugging Face transformers, peft, and Trainer APIs.
What you'll build in this LoRA fine-tuning lab
Across seven steps you'll turn a raw Llama 3 8B base model into a medical-Q&A specialist using QLoRA, and prove the change with perplexity.
- Step 1 loads the base checkpoint at /models/meta-llama--Meta-Llama-3-8B in FP16 and fires the same three clinical prompts you'll revisit after training.
- Step 2 filters the Alpaca dataset down to 2,000 medical examples, formats them into the ### Instruction: ... ### Response: ... template, and saves tokenized train/val splits to disk.
- Step 3 constructs a LoraConfig(r=16, lora_alpha=32, target_modules=['q_proj','k_proj','v_proj','o_proj'], task_type=TaskType.CAUSAL_LM) and wraps the model with get_peft_model.
- Step 4 reloads the base in 4-bit via BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type='nf4', ...), calls prepare_model_for_kbit_training, and runs the Hugging Face Trainer for 100 steps with per_device_train_batch_size=4, gradient_accumulation_steps=4, and learning_rate=2e-4.
- Step 5 is a quantization deep dive comparing FP16 vs NF4 VRAM.
- Step 6 toggles disable_adapter_layers() and enable_adapter_layers() on a single PeftModel to measure base vs adapted perplexity on the held-out set.
- Step 7 calls merge_and_unload() and exports a deployable checkpoint.
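The Step 2 template can be previewed with a tiny formatter. The field names below (instruction, input, output) follow the public Alpaca schema and are an assumption about the lab's exact code, not a copy of it:

```python
# Hypothetical helper mirroring Step 2's prompt template (Alpaca-style field names assumed).
def format_example(example: dict) -> str:
    """Render one record into the ### Instruction / ### Response template."""
    prompt = f"### Instruction:\n{example['instruction']}\n\n"
    if example.get("input"):                      # optional context field in Alpaca records
        prompt += f"### Input:\n{example['input']}\n\n"
    prompt += f"### Response:\n{example['output']}"
    return prompt

sample = {
    "instruction": "List two common symptoms of iron-deficiency anemia.",
    "input": "",
    "output": "Fatigue and pallor are among the most common symptoms.",
}
print(format_example(sample))
```

Keeping the template byte-identical between training and evaluation matters: the model learns the delimiters as part of the task.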
The mental model you leave with: LoRA isn't a smaller model, it's a structural shortcut. You freeze the 8B base weights and train two tiny matrices B and A per attention projection, so the effective update W + BA is learnable with ~0.1% of the parameters and a fraction of the optimizer memory. QLoRA adds the trick that makes it fit a consumer 24 GB card: base weights stored as NF4 (a data-distribution-aware 4-bit format with double quantization), compute done in FP16, gradients flowing only into the LoRA adapters. gradient_checkpointing_enable() trades recompute for activation memory; bnb_4bit_use_double_quant=True quantizes the quantization constants themselves. The live loss curve plus the base-vs-adapted perplexity bar chart are the ground truth that the recipe actually works.
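These numbers check out on the back of an envelope. The layer count and hidden size below are the standard Llama 3 8B shapes, and the quantization block sizes are the QLoRA paper's defaults; all are assumptions for illustration rather than values read from the lab:

```python
# LoRA parameter count: A (r x d) plus B (d x r) per targeted projection.
n_layers, d_model, r = 32, 4096, 16          # Llama 3 8B attention shapes (assumed)
targets_per_layer = 4                        # q_proj, k_proj, v_proj, o_proj
lora_params = n_layers * targets_per_layer * 2 * r * d_model
base_params = 8_000_000_000

print(f"LoRA params: {lora_params / 1e6:.1f} M")
print(f"fraction trained: {lora_params / base_params:.2%}")

# Optimizer memory: Adam keeps two FP32 moments per *trainable* parameter.
adam_full = base_params * 2 * 4 / 1e9        # full fine-tune: 64 GB of Adam state
adam_lora = lora_params * 2 * 4 / 1e9        # LoRA: a rounding error by comparison
fp16_weights = base_params * 2 / 1e9         # 16 GB, so weights + Adam is ~80 GB full FT
print(f"full FT weights+Adam: {fp16_weights + adam_full:.0f} GB vs LoRA Adam: {adam_lora:.2f} GB")

# Double quantization: NF4 stores an FP32 absmax per 64-weight block (0.5 bits/weight
# of overhead); quantizing those constants to 8 bits in blocks of 256 shrinks that.
single = 32 / 64
double = 8 / 64 + 32 / (64 * 256)
print(f"bits/weight: {4 + single:.3f} single-quant vs {4 + double:.3f} double-quant")
```

The ~0.1% figure in the text counts only trainable parameters against the full 8B; per targeted layer the fraction is higher, but the conclusion is the same: the optimizer state all but disappears.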
Prerequisites are a little Python, a rough sense of PyTorch tensors, and knowing what next-token prediction is — no fine-tuning experience assumed; the lab walks you through PEFT from first principles. Budget about 45 minutes on the real NVIDIA GPU sandbox we provision per session: the base model, the Alpaca dataset, transformers, peft, bitsandbytes, and Trainer are all preinstalled. Each of the seven steps is auto-graded by a notebook check: Step 2 verifies the tokenized datasets landed on disk, Step 4 opens adapter_config.json and confirms r=16 plus the expected target_modules, Step 6 requires ft_ppl < base_ppl, Step 7 checks that safetensors and tokenizer files are present under merged-model/.
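Step 6's metric in miniature: perplexity is the exponential of the mean per-token cross-entropy, so the grader's ft_ppl < base_ppl check is really a comparison of held-out losses. The loss values below are made up for illustration, not lab outputs:

```python
import math

def perplexity(nll_per_token):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    return math.exp(sum(nll_per_token) / len(nll_per_token))

base_losses = [3.2, 3.5, 3.1, 3.4]   # hypothetical base-model NLLs on held-out tokens
ft_losses   = [2.1, 2.4, 2.0, 2.3]   # hypothetical fine-tuned NLLs on the same tokens

base_ppl, ft_ppl = perplexity(base_losses), perplexity(ft_losses)
assert ft_ppl < base_ppl             # the condition the Step 6 grader enforces
print(f"base ppl {base_ppl:.1f} vs fine-tuned ppl {ft_ppl:.1f}")
```

Because exp is monotonic, any drop in mean loss is a drop in perplexity; the exponential just makes the gap easier to read.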
Frequently asked questions
Why does rank=16 update less than 1% of the weights but still move perplexity meaningfully?
Because the weight changes that fine-tuning needs tend to have low intrinsic rank: a rank-16 BA per projection captures the dominant directions of adaptation, and those small updates sit in every attention layer, so their effect compounds through the network. Perplexity responds to the output distribution, not to how many parameters moved.
What does merge_and_unload() actually do and when would I NOT call it?
merge_and_unload() folds each LoRA update (the scaled BA product) into the frozen base weights and hands you back a plain LlamaForCausalLM. Deployment-friendly: vLLM, TensorRT-LLM, and Triton all accept it as a normal model. You would NOT merge if you want to hot-swap multiple adapters on one base at serve time (vLLM's --enable-lora path), or if you're stacking several LoRAs and plan to scale α differently per request. Merge is a one-way door.
Why NF4 specifically, not INT4?
NF4 (NormalFloat4) spaces its 16 quantization levels to match the roughly normal distribution of pretrained weights, so it wastes fewer levels on values that rarely occur than a uniform INT4 grid does. Add bnb_4bit_use_double_quant=True, which quantizes the per-block absmax constants themselves, and you save roughly another 0.37 bits per weight versus single-level NF4. The result is <1% quality loss against FP16 on most instruction-tuning benchmarks.
Why is learning_rate=2e-4 so much higher than a full fine-tune would use?
Because only the adapters move. The 8B base is frozen, B starts at zero so the update ramps up from nothing, and optimizer state covers ~0.1% of the parameters, so step sizes that would destabilize a full fine-tune (typically 1e-5 to 2e-5) are safe here; LoRA recipes routinely use 1e-4 to 3e-4.
Can I really fine-tune an 8B model on a single 24 GB GPU?
Yes. With the base weights stored in NF4, the quantized weights, LoRA adapters, optimizer state, and checkpointed activations under gradient_checkpointing_enable() stay below 10 GB at per_device_train_batch_size=4 and max_length=512. The effective batch size of 16 comes from gradient_accumulation_steps=4, which costs wall time but zero VRAM. If you tried to do the same job in FP16 without LoRA you'd need about 80 GB for weights + Adam state alone. The sandbox pod we provision sits comfortably in the QLoRA envelope.
The check says ft_ppl < base_ppl — what counts as a good improvement?
The grader only requires a strict improvement, so any measurable drop passes. A substantial gap is typical here because the base model has never seen the ### Instruction: / ### Response: template, so the fine-tuned model wins on format alone before any medical knowledge kicks in. Treat the size of the drop as a sanity check on your run, not a benchmark score.
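What the one-way merge means numerically can be checked in a few lines. This is an illustrative numpy sketch of the W + BA fold with LoRA's alpha/r scaling, not the PEFT implementation:

```python
import numpy as np

# Fold the scaled low-rank update into the base weight, then drop the adapter.
rng = np.random.default_rng(1)
d, r, alpha = 256, 16, 32
W = rng.standard_normal((d, d))          # base weight
A = rng.standard_normal((r, d))          # trained adapter half (r x d)
B = rng.standard_normal((d, r))          # trained adapter half (d x r)

W_merged = W + (alpha / r) * B @ A       # one-time fold; A and B can now be discarded

x = rng.standard_normal((3, d))
adapted = x @ W.T + (alpha / r) * (x @ A.T) @ B.T   # base + adapter, two matmuls
merged  = x @ W_merged.T                            # merged model, one matmul
assert np.allclose(adapted, merged)      # identical outputs, cheaper inference
```

The "one-way door" is visible here: once B @ A is added into W, you can no longer recover the original base weights or swap in a different adapter without reloading the checkpoint.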