Fine-Tune an LLM with LoRA and QLoRA (Jupyter)
GPU sandbox · jupyter
Beta

Fine-tune Meta Llama 3 8B on a custom instruction dataset using LoRA and QLoRA. Learn parameter-efficient fine-tuning from data preparation through evaluation — the #1 most demanded AI skill.

45 min · 7 steps · 3 domains · Intermediate · ncp-genl · ncp-ads · nca-genl

What you'll learn

  1. Explore the Base Model
  2. Prepare the Training Dataset
  3. Understand LoRA — How It Works
  4. Fine-Tune with QLoRA
  5. Understanding Quantization — FP16 vs 4-bit
  6. Evaluate the Fine-Tuned Model
  7. Merge and Export

Prerequisites

  • Basic Python (functions, loops, dicts)
  • Familiarity with PyTorch tensors
  • Understanding of what LLMs are and how they generate text

Exam domains covered

Fine-Tuning & Data Preparation · GPU Acceleration & Distributed Training · Evaluation, Monitoring & Safety

Skills & technologies you'll practice

This intermediate-level GPU lab gives you real-world reps across:

LoRA · QLoRA · Fine-Tuning · Llama 3 · PEFT · JupyterLab

What you'll build in this QLoRA-in-JupyterLab lab

LoRA and QLoRA are among the highest-leverage LLM skills of 2026: fine-tuning a Llama-class model used to be a multi-node distributed training job, and now it's a 45-minute project you run in one notebook on a consumer-grade GPU. You'll walk away with a fine-tuned Llama 3 8B medical Q&A model merged to a deployable checkpoint; a working mental model of rank (r=16), alpha (α=32), and which projections to adapt (q_proj, k_proj, v_proj, o_proj); the VRAM math that separates FP16 (which won't fit on a 4090 once you add optimizer state) from NF4 (15+ GB of headroom for training, evaluation, and a running kernel); and live loss/perplexity curves rendered inline with matplotlib as the run progresses. No pip install, no CUDA driver wrangling: you land in a live NVIDIA pod with Llama 3, PEFT, bitsandbytes, and a JupyterLab kernel already running.

The technical substance is the three-way interplay between parameter-efficient fine-tuning, 4-bit quantization, and notebook-resident state. LoraConfig(r=16, lora_alpha=32, target_modules=['q_proj','k_proj','v_proj','o_proj']) trains under 1% of parameters while still shifting the model's behavior; BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type='nf4', bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.float16) stores weights as NF4 quantiles fitted to the normal distribution, then dequantizes them on the fly for FP16 compute. You'll see directly why merge_and_unload() (the W + (α/r) × B @ A bake-in that hands you a plain LlamaForCausalLM) differs from PeftModel.from_pretrained (runtime composition with togglable adapters), and you'll hit the Jupyter-specific trap that catches everyone their first time: del model doesn't free VRAM, because IPython's Out[] cache plus the _, __, ___ shortcuts pin the whole computation graph. The correct cleanup is %reset -f out, then gc.collect(), then torch.cuda.empty_cache(), in that order; otherwise the 4-bit reload OOMs on a 24 GB card even though the NF4 footprint is a quarter of FP16's.
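The "under 1% of parameters" claim is easy to sanity-check by hand. A back-of-envelope sketch, assuming the published Llama 3 8B dimensions (hidden size 4096, 32 layers, and 1024-wide k/v projections due to grouped-query attention); the helper name is illustrative, not a PEFT API:

```python
# Back-of-envelope LoRA parameter count for r=16 on
# q_proj, k_proj, v_proj, o_proj in Llama 3 8B (sketch).
HIDDEN = 4096      # Llama 3 8B hidden size
KV_WIDTH = 1024    # k/v projection output width (grouped-query attention)
LAYERS = 32
R = 16

def lora_params(d_in: int, d_out: int, r: int) -> int:
    # LoRA adds A with shape (r, d_in) and B with shape (d_out, r)
    return r * d_in + d_out * r

per_layer = (
    lora_params(HIDDEN, HIDDEN, R)      # q_proj
    + lora_params(HIDDEN, KV_WIDTH, R)  # k_proj
    + lora_params(HIDDEN, KV_WIDTH, R)  # v_proj
    + lora_params(HIDDEN, HIDDEN, R)    # o_proj
)
total_lora = per_layer * LAYERS
base_params = 8.03e9  # ~8B base model

print(f"trainable LoRA params: {total_lora:,}")             # ~13.6M
print(f"fraction of base: {total_lora / base_params:.4%}")  # well under 1%
```

Roughly 13.6 million trainable parameters against an 8-billion-parameter base: about 0.17%, comfortably under the 1% figure.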

Prerequisites are basic Python, a rough grip on PyTorch tensors, and knowing that LLMs predict the next token — no prior PEFT or QLoRA experience assumed. The sandbox is a real NVIDIA GPU pod we provision per session with JupyterLab, HuggingFace transformers, peft, bitsandbytes, datasets, matplotlib, and the Llama 3 8B base checkpoint all preinstalled and pinned to compatible versions. Checks run inline after each step (from preporato_labs import Lab; lab = Lab(...).check(N)) against kernel-resident state and on-disk artifacts — adapter_config.json must show r==16 with q_proj in target_modules, ft_ppl < base_ppl after training, and safetensors under merged-model/ at the end. The merged checkpoint is drop-in ready for vLLM or TensorRT-LLM, which is where you take it next.

Frequently asked questions

Why do I need to explicitly reset IPython output history between the LoRA steps?

Because JupyterLab silently caches every cell's return value in Out[<n>], plus the _, __, ___ shortcuts. If a cell ever returned a tensor or a model, that reference pins the whole computation graph, and del model doesn't free it. Step 4 unloads the Step 3 model before loading the 4-bit base, and the cleanup has to call get_ipython().run_line_magic('reset', '-f out'), then gc.collect(), then torch.cuda.empty_cache(), in that order. Skip the history reset and the new QLoRA load OOMs on a 24 GB card. This isn't a quirk of this lab; it's how you work with large models in any notebook.
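As a ready-to-adapt sketch, the cleanup cell looks like this (run inside Jupyter only; `model` and `trainer` stand in for whatever names you actually bound):

```python
import gc
import torch

# Notebook cleanup cell. Order matters:
# 1) drop your own references, then IPython's Out[] cache
#    (and _, __, ___), which otherwise pin the old model,
# 2) let Python's GC actually collect the now-unreferenced tensors,
# 3) return freed blocks from PyTorch's caching allocator to CUDA.
del model, trainer
get_ipython().run_line_magic("reset", "-f out")
gc.collect()
torch.cuda.empty_cache()
```

Reversing steps 2 and 3 is the classic mistake: empty_cache() can only release memory the garbage collector has already reclaimed.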

Why plot the loss curve inline instead of using TensorBoard?

Because a 100-step training run finishes in ~3 minutes inside one notebook kernel; spinning up TensorBoard would take longer than watching the loss land. Step 4 also passes report_to='none' and disable_tqdm=True — tqdm progress bars render as literal \r garbage in notebook output — and uses a custom TrainerCallback.on_log that prints one clean Step N/M | loss=... | lr=... line per step. The final matplotlib cell reads trainer.state.log_history and draws training loss with validation checkpoints overlaid as red dots, right next to the cell that produced it.
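The callback's on_log hook boils down to formatting each trainer.state.log_history entry into a single line. A minimal standalone sketch with fake entries (the helper name and dict shapes are illustrative, not the lab's exact callback):

```python
# Format one clean log line per training step, in the style
# "Step N/M | loss=... | lr=..." described above.
def format_log(entry: dict, max_steps: int) -> str:
    return (f"Step {entry['step']}/{max_steps} | "
            f"loss={entry['loss']:.4f} | "
            f"lr={entry['learning_rate']:.2e}")

# Fake log_history entries, shaped like what Trainer records.
history = [
    {"step": 10, "loss": 2.1432, "learning_rate": 2e-4},
    {"step": 20, "loss": 1.8740, "learning_rate": 1.9e-4},
]
for entry in history:
    print(format_log(entry, max_steps=100))
# → Step 10/100 | loss=2.1432 | lr=2.00e-04
# → Step 20/100 | loss=1.8740 | lr=1.90e-04
```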

What's the practical difference between PeftModel.from_pretrained and merge_and_unload()?

PeftModel.from_pretrained(base, path) loads the ~30 MB adapter on top of the base in memory — the base tensors are shared, not copied, so you can toggle the adapter on and off with disable_adapter_layers() / enable_adapter_layers(). That's what Step 6 uses to compare base-vs-fine-tuned perplexity from a single loaded model. merge_and_unload() in Step 7 computes W + (α/r) × B @ A for each adapted projection, writes that back into the base weights, drops the PEFT wrapper, and returns a plain LlamaForCausalLM you can hand to vLLM or TensorRT-LLM. One is a runtime composition; the other is a permanent bake-in.
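Both paths compute the same function, which you can verify with a toy numpy check of the W + (α/r) × B @ A identity (dimensions shrunk for illustration; the real lab uses d=4096, r=16, α=32):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 4            # toy dimensions

W = rng.normal(size=(d, d))      # frozen base projection weight
A = rng.normal(size=(r, d))      # LoRA A (trained)
B = rng.normal(size=(d, r))      # LoRA B (trained)
x = rng.normal(size=(d,))

# Runtime composition (PeftModel): base output + scaled adapter path.
y_adapter_on = W @ x + (alpha / r) * (B @ (A @ x))
y_adapter_off = W @ x            # what disable_adapter_layers() computes

# Permanent bake-in (merge_and_unload): fold the adapter into W.
W_merged = W + (alpha / r) * (B @ A)
y_merged = W_merged @ x

assert np.allclose(y_merged, y_adapter_on)       # merge preserves behavior
assert not np.allclose(y_merged, y_adapter_off)  # but is no longer togglable
print("merged output matches adapter-on output")
```

The toggle is exactly what makes runtime composition the right tool for Step 6's base-vs-fine-tuned comparison, and the bake-in the right tool for Step 7's export.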

Why does the lab use bnb_4bit_compute_dtype=torch.float16 if weights are stored in 4 bits?

Because compute precision and storage precision are independent knobs. NF4 storage keeps each weight as 4 bits plus a per-block absmax; when a matmul happens, those values are dequantized on the fly into the compute_dtype and the actual arithmetic runs in FP16. FP16 compute is fast on the RTX-class cards this sandbox uses and matches the activation dtype, so no extra casts. On H100 / A100 you'd typically switch to bnb_4bit_compute_dtype=torch.bfloat16 for better numerical headroom during backprop.
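The storage-vs-compute split can be made concrete with a toy block-wise absmax scheme. This is a uniform int4-style sketch for clarity, not bitsandbytes' actual NF4 codebook (which uses 16 levels placed at normal-distribution quantiles):

```python
import numpy as np

# Toy block-wise absmax quantization: store small integer codes plus
# one absmax per block; dequantize to a *separate* compute dtype.
def quantize(w, block=64):
    blocks = w.reshape(-1, block)
    absmax = np.abs(blocks).max(axis=1, keepdims=True)
    codes = np.round(blocks / absmax * 7).astype(np.int8)  # range [-7, 7]
    return codes, absmax

def dequantize(codes, absmax, compute_dtype=np.float16):
    # Storage dtype (int codes) and compute dtype (fp16) are independent.
    return (codes / 7 * absmax).astype(compute_dtype)

w = np.random.default_rng(0).normal(size=(4, 64)).astype(np.float32)
codes, absmax = quantize(w.reshape(1, -1))
w_hat = dequantize(codes, absmax).reshape(4, 64)

print("stored as:", codes.dtype, "| computed as:", w_hat.dtype)
print("max abs error:", float(np.abs(w - w_hat.astype(np.float32)).max()))
```

Swapping compute_dtype here changes only the arithmetic precision of downstream matmuls, never the 4-bit storage, which is exactly the knob bnb_4bit_compute_dtype exposes.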

How much VRAM should I expect to see in Step 5's chart?

Llama 3 8B weighs about 32 GB in FP32, ~16 GB in FP16, ~8 GB in INT8, and ~5 GB in NF4. Step 5 measures the NF4 size directly via torch.cuda.memory_allocated() and reloads the FP16 variant in a comparison cell. The matplotlib chart overlays a 24 GB RTX 4090 limit line so the visual punchline lands: FP16 barely fits and leaves no room for optimizer state plus activations plus a KV cache for evaluation, while NF4 leaves ~19 GB free — which is what enables training plus eval plus a running notebook kernel to coexist without OOMs.
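The arithmetic behind those figures is just bytes-per-parameter times parameter count. A sketch (1 GB taken as 1e9 bytes; the ~0.6 bytes/param NF4 figure is an assumption that folds in per-block absmax overhead on top of the raw 0.5 bytes of a 4-bit code):

```python
# Rough VRAM footprint of the 8B weights alone, by storage precision.
PARAMS = 8.03e9
bytes_per_param = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "nf4": 0.6}

for dtype, bpp in bytes_per_param.items():
    gb = PARAMS * bpp / 1e9
    print(f"{dtype:>5}: {gb:5.1f} GB  (under a 24 GB 4090: {gb < 24})")
```

Weights are only the floor: optimizer state, activations, and the KV cache stack on top, which is why FP16's nominal 16 GB leaves no practical training headroom on a 24 GB card while NF4's ~5 GB does.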

Can I re-run a single step without burning through the whole notebook again?

Yes. Because the kernel persists and Steps 2 and 4 save checkpoints to disk (train_data/, val_data/, lora-output/final/), you can close the tab at any point and reopen the notebook. Step 6's eval can be re-run from just PeftModel.from_pretrained(base_model, 'lora-output/final') without redoing Step 4. The same sandbox resumes in the same pod — a kernel restart plus re-running the cleanup cells is usually all you need. That's the entire reason the lab writes adapters to disk at Step 4 instead of holding them purely in memory.
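The resume path is short. A sketch of the reload cell, assuming the base model is already loaded in the kernel under the name `base_model`:

```python
# Resume Step 6's eval from the on-disk adapter without re-running
# Step 4's training. The ~30 MB adapter loads on top of the base;
# base tensors are shared, not copied.
from peft import PeftModel

model = PeftModel.from_pretrained(base_model, "lora-output/final")
model.eval()
```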