Fine-Tune an LLM with LoRA and QLoRA
GPU sandbox · notebook
Beta

Fine-Tune an LLM with LoRA and QLoRA

Fine-tune Meta Llama 3 8B on a custom instruction dataset using LoRA and QLoRA. Learn parameter-efficient fine-tuning from data preparation through evaluation — the #1 most demanded AI skill.

45 min · 7 steps · 3 domains · Intermediate · ncp-genl · ncp-ads · nca-genl

What you'll learn

  1. Explore the Base Model
    In this lab, you'll take Meta Llama 3 8B — a raw base model with no instruction-following ability — and fine-tune it into a medical Q&A specialist. By the end, your model will go from incoherent text completion to structured, accurate clinical answers.
  2. Prepare the Training Dataset
    Fine-tuning is only as good as your data. A poorly formatted dataset teaches the model bad habits. A well-structured one makes it dramatically better.
  3. Understand LoRA — How It Works
    LoRA (Low-Rank Adaptation) is the most popular parameter-efficient fine-tuning method. Instead of updating all 8 billion parameters, LoRA freezes the original model and injects small trainable matrices into specific layers.
  4. Fine-Tune with QLoRA
    Now we combine everything: the tokenized dataset from Step 2, LoRA adapters from Step 3, and Hugging Face's Trainer API.
  5. Understand Quantization — FP16 vs 4-bit
    You just trained with QLoRA — but what exactly does "4-bit" mean, and when would you choose differently?
  6. Evaluate the Fine-Tuned Model
    This is the payoff. You'll compare the raw base model (which can't follow instructions) against your fine-tuned version (which can).
  7. Merge and Export
    Right now your fine-tuned model is stored as two parts: the frozen base weights and a small LoRA adapter. This step merges them into one deployable checkpoint.

Prerequisites

  • Basic Python (functions, loops, dicts)
  • Familiarity with PyTorch tensors
  • Understanding of what LLMs are and how they generate text

Exam domains covered

Fine-Tuning & Data Preparation · GPU Acceleration & Distributed Training · Evaluation, Monitoring & Safety

Skills & technologies you'll practice

This intermediate-level GPU lab gives you real-world reps across:

LoRA · QLoRA · Fine-Tuning · Llama 3 · PEFT · Hugging Face · bitsandbytes

What you'll build in this LoRA fine-tuning lab

Across seven steps you'll turn a raw Llama 3 8B base model into a medical-Q&A specialist using QLoRA, and prove the change with perplexity. Step 1 loads the base checkpoint at /models/meta-llama--Meta-Llama-3-8B in FP16 and fires the same three clinical prompts you'll revisit after training. Step 2 filters the Alpaca dataset down to 2,000 medical examples, formats them into the ### Instruction: ... ### Response: ... template, and saves tokenized train/val splits to disk. Step 3 constructs a LoraConfig(r=16, lora_alpha=32, target_modules=['q_proj','k_proj','v_proj','o_proj'], task_type=TaskType.CAUSAL_LM) and wraps the model with get_peft_model. Step 4 re-loads the base in 4-bit via BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type='nf4', ...), calls prepare_model_for_kbit_training, and runs the Hugging Face Trainer for 100 steps with per_device_train_batch_size=4, gradient_accumulation_steps=4, learning_rate=2e-4. Step 5 is a quantization deep dive comparing FP16 vs NF4 VRAM. Step 6 toggles disable_adapter_layers() and enable_adapter_layers() on a single PeftModel to measure base vs adapted perplexity on the held-out set. Step 7 calls merge_and_unload() and exports a deployable checkpoint.
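The `### Instruction: ... ### Response: ...` template from Step 2 can be sketched as a tiny formatter. This is a minimal sketch assuming Alpaca-style instruction/input/output fields; `format_example` is a hypothetical helper, not the lab's actual code:

```python
def format_example(instruction: str, response: str, inp: str = "") -> str:
    """Render one Alpaca-style example into an Instruction/Response prompt.

    Field names mirror the Alpaca dataset; whether the lab's template
    includes the optional Input block is an assumption.
    """
    prompt = f"### Instruction:\n{instruction}\n"
    if inp:
        prompt += f"### Input:\n{inp}\n"
    prompt += f"### Response:\n{response}"
    return prompt


example = format_example(
    "What is hypertension?",
    "Chronically elevated arterial blood pressure.",
)
print(example)
```

Every training example is rendered through one function like this so the model sees a single, consistent structure — the point Step 2 makes about well-formatted data.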

The mental model you leave with: LoRA isn't a smaller model, it's a structural shortcut. You freeze the 8B base weights and train two tiny matrices B and A per attention projection, so the effective update W + BA is learnable with ~0.1% of the parameters and a fraction of the optimizer memory. QLoRA adds the trick that makes it fit a consumer 24 GB card: base weights stored as NF4 (a data-distribution-aware 4-bit format with double quantization), compute done in FP16, gradients flowing only into the LoRA adapters. gradient_checkpointing_enable() trades recompute for activation memory; bnb_4bit_use_double_quant=True quantizes the quantization constants themselves. The live loss curve plus the base-vs-adapted perplexity bar chart are the ground truth that the recipe actually works.
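That ~0.1% figure is easy to sanity-check. A back-of-envelope count, assuming the public Llama 3 8B shapes (4096 hidden size, 1024 KV projection width under grouped-query attention, 32 decoder layers) — illustrative arithmetic, not the lab's grader:

```python
# LoRA on a (d_out, d_in) projection adds r * (d_in + d_out) parameters
# (matrix A is (r, d_in), matrix B is (d_out, r)).
hidden, kv_dim, layers, r = 4096, 1024, 32, 16

per_layer = (
    r * (hidden + hidden)    # q_proj: 4096 -> 4096
    + r * (hidden + kv_dim)  # k_proj: 4096 -> 1024 (grouped-query attention)
    + r * (hidden + kv_dim)  # v_proj: 4096 -> 1024
    + r * (hidden + hidden)  # o_proj: 4096 -> 4096
)
trainable = per_layer * layers
print(f"{trainable:,} adapter params ≈ {trainable / 8.03e9:.2%} of the base")
# -> 13,631,488 adapter params ≈ 0.17% of the base
```

Roughly 13.6M trainable parameters against an 8B frozen base — the same order as the ~0.1% quoted above, and the reason the optimizer state stays tiny.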

Prerequisites are a little Python, a rough sense of PyTorch tensors, and knowing what next-token prediction is — no fine-tuning experience assumed; the lab walks you through PEFT from first principles. Budget about 45 minutes on the real NVIDIA GPU sandbox we provision per session: the base model, the Alpaca dataset, transformers, peft, bitsandbytes, and Trainer are all preinstalled. Each of the seven steps is auto-graded by a notebook check: Step 2 verifies the tokenized datasets landed on disk, Step 4 opens adapter_config.json and confirms r=16 plus the expected target_modules, Step 6 requires ft_ppl < base_ppl, Step 7 checks that safetensors and tokenizer files are present under merged-model/.

Frequently asked questions

Why does rank=16 update less than 1% of the weights but still move perplexity meaningfully?

Because the gradient signal on a specialization task lives in a low-dimensional subspace of the 4096×4096 attention matrices. LoRA parameterises the update as W + BA where A is (4096, r) and B is (r, 4096), so at r=16 you train 2 × 16 × 4096 = 131,072 parameters per projection instead of 16.7 million — a 128× compression. The LoRA paper and every subsequent ablation show r=8-16 captures essentially all of the gain on instruction-tuning style tasks. Higher rank mostly adds overfitting risk, not capacity.

What does merge_and_unload() actually do and when would I NOT call it?

It computes W' = W + (α/r) × B @ A for each adapted projection, replaces the frozen weights with W' in place, and strips the PEFT wrapper so the result is a plain LlamaForCausalLM. Deployment-friendly: vLLM, TensorRT-LLM, and Triton all accept it as a normal model. You would NOT merge if you want to hot-swap multiple adapters on one base at serve time (vLLM's --enable-lora path), or if you're stacking several LoRAs and plan to scale α differently per request. Merge is a one-way door.
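A toy-scale sketch of that merge arithmetic — 3×3 instead of 4096×4096, with made-up adapter values standing in for trained weights:

```python
# W' = W + (alpha/r) * B @ A, computed in place, as merge_and_unload() does
# per adapted projection. Shapes and values here are illustrative only.
d, r = 3, 2
scale = 32 / 16  # alpha / r with the lab's lora_alpha=32, r=16

W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen base
B = [[0.1, 0.0], [0.0, 0.1], [0.0, 0.0]]  # (d, r): pretend-trained (zero before training)
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # (r, d): pretend-trained

for i in range(d):
    for j in range(d):
        W[i][j] += scale * sum(B[i][k] * A[k][j] for k in range(r))

print(W[0][0], W[1][1], W[2][2])  # -> 1.2 1.2 1.0
```

After the loop there is no separate adapter left to toggle — the update has been baked into W, which is why merging is described above as a one-way door.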

Why NF4 specifically, not INT4?

Neural network weights are approximately normally distributed around zero. INT4 assigns quantization bins uniformly, wasting resolution on rare large values. NF4 (NormalFloat4) places its 16 bins at the theoretical quantiles of a unit normal, so each bin carries equal weight-probability mass — you get lower reconstruction error for the same 4 bits per parameter. Combined with bnb_4bit_use_double_quant=True, which quantizes the per-block absmax constants themselves, you save roughly 0.37 bits per parameter versus single-level NF4. The result is <1% quality loss against FP16 on most instruction-tuning benchmarks.
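You can reproduce the flavor of the NF4 codebook with the standard library's NormalDist. This is a sketch of the equal-probability-mass idea only — bitsandbytes builds its actual table a bit differently (it pins an exact zero level and treats the two halves asymmetrically):

```python
from statistics import NormalDist

# Place 2**4 = 16 levels at equal-mass quantiles of a unit normal,
# then rescale so the codebook spans [-1, 1].
nd = NormalDist()
k = 16
quantiles = [nd.inv_cdf((i + 0.5) / k) for i in range(k)]
levels = [q / max(abs(q) for q in quantiles) for q in quantiles]

print(levels[0], levels[-1])  # endpoints land at -1.0 and 1.0
```

Notice the levels cluster near zero and spread out in the tails — exactly the opposite of INT4's uniform spacing, and the reason NF4 wastes less resolution on rare large weights.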

Why is learning_rate=2e-4 so much higher than a full fine-tune would use?

Full fine-tunes of Llama-class models typically run at 1e-5 to 5e-5 because every parameter is moving and the loss landscape is fragile. LoRA only trains a tiny subspace of freshly initialized adapter weights (A is Kaiming-init, B is zero — the initial update is exactly zero, so the model starts identical to the base), so it tolerates — and benefits from — a much larger LR. 1e-4 to 3e-4 with warmup is the standard LoRA/QLoRA range. Go lower and 100 steps isn't enough to move the adapters; go higher and training becomes unstable on small datasets.
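The zero-init claim is checkable in a few lines — a toy sketch with made-up shapes, standing in for PEFT's actual initialization:

```python
import random

# A gets a small random init (Kaiming-style in the real library); B starts
# at exactly zero. So the update B @ A is the zero matrix, and the adapted
# model's first forward pass matches the frozen base bit-for-bit.
d, r = 8, 2
A = [[random.gauss(0, 0.02) for _ in range(d)] for _ in range(r)]
B = [[0.0] * r for _ in range(d)]

delta = [[sum(B[i][k] * A[k][j] for k in range(r)) for j in range(d)]
         for i in range(d)]
print(all(v == 0.0 for row in delta for v in row))  # -> True
```

Training only ever has to push B away from zero, which is why the large learning rate is safe: there are no pretrained weights in the update path to destabilize.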

Can I really fine-tune an 8B model on a single 24 GB GPU?

Yes, and QLoRA is specifically designed to make it fit. The 4-bit base is about 5 GB. Adapter weights plus their optimizer state are under 1 GB. Activations with gradient_checkpointing_enable() stay below 10 GB at per_device_train_batch_size=4 and max_length=512. The effective batch size of 16 comes from gradient_accumulation_steps=4, which costs wall time but zero VRAM. If you tried to do the same job in FP16 without LoRA you'd need about 80 GB for weights + Adam state alone. The sandbox pod we provision sits comfortably in the QLoRA envelope.
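Those numbers add up in a few lines of arithmetic — illustrative estimates matching the figures above, not measurements from the sandbox:

```python
# Rough VRAM budget in GB for QLoRA on Llama 3 8B. All values are
# back-of-envelope; real usage varies with sequence length and overhead.
params = 8.03e9                    # base model parameter count
adapter = 13.6e6                   # LoRA adapter parameter count (r=16)

base_4bit = params * 0.5 / 1e9     # NF4 weights: ~0.5 bytes/param (~4 GB, ~5 with overhead)
adapter_fp16 = adapter * 2 / 1e9   # trainable adapter weights in half precision
adam_fp32 = adapter * 8 / 1e9      # two FP32 Adam moments per trainable param
activations = 10.0                 # upper bound with gradient checkpointing

qlora_total = base_4bit + adapter_fp16 + adam_fp32 + activations
full_fp16 = params * (2 + 8) / 1e9  # FP16 weights + FP32 Adam moments, no LoRA

print(f"QLoRA ≈ {qlora_total:.0f} GB, full FP16 fine-tune ≈ {full_fp16:.0f} GB")
```

The adapter's entire training state (weights plus optimizer moments) rounds to nothing next to the activations — that asymmetry, not any single trick, is what puts an 8B model inside 24 GB.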

The check says ft_ppl < base_ppl — what counts as a good improvement?

On 100 training steps over the medical-Alpaca slice with the hyperparameters in Step 4, most runs drop perplexity on the held-out validation set by 15-40% relative to the base. That's a meaningful signal — the adapters learned the instruction-response format plus some medical terminology distribution — without being so large that it implies overfitting to 1,800 training examples. Longer training or higher rank pushes the gap wider but also raises the risk of catastrophic forgetting on general text; you can see that tradeoff up close in the continued-pretraining lab.
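In practice you read perplexity off the validation loss. A sketch with made-up losses that sit inside that 15-40% band — the loss values are hypothetical, not from a real run:

```python
import math

# Perplexity is exp(mean cross-entropy loss); the relative drop is what
# the Step 6 check is really measuring.
base_loss, ft_loss = 2.90, 2.55   # hypothetical held-out losses
base_ppl = math.exp(base_loss)
ft_ppl = math.exp(ft_loss)
drop = 1 - ft_ppl / base_ppl

print(f"{base_ppl:.1f} -> {ft_ppl:.1f} ({drop:.0%} lower)")
```

A small absolute change in loss translates into a large relative perplexity change because of the exponential — worth keeping in mind when you eyeball the loss curve in Step 4.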