Quantize & Optimize LLMs with bitsandbytes
Load a model in fp16, INT8, and NF4, then benchmark the three precisions on VRAM, latency, and output quality. See where quantization wins and where it costs you.
What you'll learn
1. Load TinyLlama in fp16 (the baseline)
2. Load in INT8 and NF4 with bitsandbytes
3. Latency benchmark across precisions
4. Quality: does quantization actually preserve accuracy?
Prerequisites
- Comfortable loading HuggingFace models with transformers
- Basic PyTorch + CUDA tensor operations
- Rough intuition for fp32 / fp16 / int8 bit widths
Skills & technologies you'll practice
This intermediate-level GPU lab gives you real-world reps with bitsandbytes, transformers, and PyTorch/CUDA on a live NVIDIA GPU.
What you'll measure in this quantization lab
Quantization is the skill that decides whether you can deploy an LLM at all — it's the difference between a model that fits on the GPU you have versus one you couldn't afford to run. In 40 minutes you'll produce the evaluation matrix any engineer should own before picking a precision in production: VRAM, generation latency per token, and perplexity drift for FP16, INT8, and NF4 on the same model, measured with the methodology that actually holds up (warmup runs, torch.cuda.synchronize() around timed regions, labels=input_ids perplexity on a held-out passage). You'll leave with a mental model of why NF4 enables QLoRA on a single consumer GPU, why INT8 can be slower than FP16 on cards without native INT8 Tensor Cores, and the exact shape of the memory/latency/quality tradeoff — numbers you can cite, not theory.
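The perplexity methodology named above (labels=input_ids on a held-out passage) can be sketched as follows. A minimal sketch, assuming `model` and `tokenizer` are an already-loaded TinyLlama pair; `drift_pct` is a hypothetical helper for the "1–3% drift" numbers the lab reports:

```python
import math

def perplexity(model, tokenizer, text: str) -> float:
    """Teacher-forced perplexity on a held-out passage (labels=input_ids)."""
    import torch  # heavy dependency, only needed on the GPU path
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # labels=input_ids makes the model return mean cross-entropy loss
        out = model(input_ids=enc.input_ids, labels=enc.input_ids)
    return math.exp(out.loss.item())

def drift_pct(ppl_quant: float, ppl_fp16: float) -> float:
    """Relative perplexity degradation vs. the fp16 baseline, in percent."""
    return 100.0 * (ppl_quant - ppl_fp16) / ppl_fp16
```

Running `perplexity` once per precision and comparing with `drift_pct` produces the quality column of the evaluation matrix.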
The technical substance lives in the flag combinations and the measurements. INT8 via BitsAndBytesConfig(load_in_8bit=True) uses LLM.int8() with outlier-aware mixed-decomposition — signed 8-bit integers for normal channels, FP16 for the outlier columns that carry disproportionate signal. NF4 via the four-flag recipe (load_in_4bit=True, bnb_4bit_quant_type='nf4', bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.float16) stores weights as 16 quantiles fitted to a standard normal distribution — close to how transformer weights are actually distributed — with double-quantization of the scale factors for an extra ~0.4 bits/param. You'll watch NF4 drop VRAM to roughly a quarter of FP16 at 1–3% perplexity drift, which is the whole reason fine-tuning an 8B model on a 24 GB card is possible. You'll also see why raw QPS is a lie in quantization benchmarks: without a warmup generation and synchronized timing, your numbers measure Python returning from the launch call, not the GPU finishing work. Step 4's reflection grounds out in when to graduate past bitsandbytes to calibration-based methods (GPTQ, AWQ) or FP8 on H100 via TensorRT-LLM.
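The two flag combinations described above look like this in code. A config sketch, not the lab's exact solution; the checkpoint name is an assumption:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed checkpoint name

# INT8: LLM.int8() with outlier-aware mixed decomposition
int8_cfg = BitsAndBytesConfig(load_in_8bit=True)

# NF4: the four-flag recipe
nf4_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # 16 quantiles fitted to a standard normal
    bnb_4bit_use_double_quant=True,        # quantize the scale factors too (~0.4 bits/param)
    bnb_4bit_compute_dtype=torch.float16,  # matmuls run in fp16
)

model_int8 = AutoModelForCausalLM.from_pretrained(
    MODEL, quantization_config=int8_cfg, device_map="auto")
model_nf4 = AutoModelForCausalLM.from_pretrained(
    MODEL, quantization_config=nf4_cfg, device_map="auto")
```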
Prerequisites are comfort loading HuggingFace models with transformers, basic PyTorch on CUDA, and a rough intuition for FP32/FP16/INT8 bit widths. The sandbox is a real NVIDIA GPU pod we provision per session: bitsandbytes, transformers, and a matched CUDA runtime are preinstalled, and TinyLlama-1.1B is staged locally so the first from_pretrained doesn't burn your budget on a download. Checks enforce plausible measurements rather than magic numbers: FP16 VRAM in the 1–6 GB band for TinyLlama, INT8/FP16 VRAM ratio between 0.3 and 0.8, NF4/FP16 ratio between 0.15 and 0.5, INT8 perplexity degradation under 30%, NF4 under 50%.
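The plausibility bands above can be written down as a small predicate. A hypothetical helper mirroring the stated bands, not the grader's actual code:

```python
def checks_pass(vram_fp16_gb: float, vram_int8_gb: float, vram_nf4_gb: float,
                ppl_fp16: float, ppl_int8: float, ppl_nf4: float) -> bool:
    """Plausibility bands from the lab description, not magic numbers."""
    int8_ratio = vram_int8_gb / vram_fp16_gb
    nf4_ratio = vram_nf4_gb / vram_fp16_gb
    return (
        1.0 <= vram_fp16_gb <= 6.0      # TinyLlama fp16 footprint band
        and 0.3 <= int8_ratio <= 0.8    # INT8 should roughly halve VRAM
        and 0.15 <= nf4_ratio <= 0.5    # NF4 should roughly quarter it
        and ppl_int8 / ppl_fp16 < 1.3   # < 30% perplexity degradation
        and ppl_nf4 / ppl_fp16 < 1.5    # < 50% perplexity degradation
    )
```

Any measurement that trips one of these bands usually means a methodology bug (no warmup, missing sync, wrong device) rather than a property of the quantization itself.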
Frequently asked questions
What's the actual difference between INT8 and NF4 in practice?
INT8 (LLM.int8()) keeps signed 8-bit integers for normal channels and falls back to FP16 for the outlier columns, roughly halving VRAM. NF4 stores weights as 16 quantiles fitted to a standard normal distribution, with double-quantized scale factors, cutting VRAM to roughly a quarter of FP16 at 1–3% perplexity drift. Steps 2 and 4 put exact numbers on both.
Why might INT8 be slower than fp16 on my GPU?
bitsandbytes dequantizes weights back to fp16 for the actual matmul on GPUs without native INT8 Tensor Cores (most consumer cards: RTX 3060/3070/3080/3090, T4). The dequant overhead can dominate the savings. Native INT8 paths exist on A100/H100 via CUTLASS/cuBLAS or TensorRT-LLM, and that's where the speedup shows up. Step 3's latency table makes this visceral; you're not imagining it.
Is NF4 good enough for production inference, or just training?
Do I need a calibration dataset to run this lab?
No. bitsandbytes quantizes weights at load time without calibration data; calibration-based methods like GPTQ and AWQ are what you graduate to later, and Step 4's reflection covers when.
How do I get trustworthy latency numbers when CUDA is asynchronous?
Run at least one warmup generation first (the first generate() call compiles kernels and primes caches), then wrap the timed region with torch.cuda.synchronize() on both ends. Without that sync, your timer captures Python returning from the launch call, not the GPU finishing the work. The Step 3 hint calls this out specifically because it's the single most common reason quantization benchmarks look wrong.
How is each step graded?
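The warmup-plus-sync pattern can be sketched as below, assuming `model` and `inputs` are an already-loaded model and tokenized prompt; `per_token_ms` is a hypothetical convenience helper:

```python
import time

def timed_generate(model, inputs, new_tokens: int = 64) -> float:
    """Wall-clock seconds for one generate() call, timed correctly on CUDA."""
    import torch  # only needed on the GPU path
    model.generate(**inputs, max_new_tokens=new_tokens)  # warmup: compiles kernels, primes caches
    torch.cuda.synchronize()             # warmup fully finished before the timer starts
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=new_tokens)
    torch.cuda.synchronize()             # wait for the GPU, not just the kernel launch
    return time.perf_counter() - start

def per_token_ms(elapsed_s: float, new_tokens: int) -> float:
    """Convert a timed region into the per-token latency the lab reports."""
    return 1000.0 * elapsed_s / new_tokens
```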
Step 1 checks that vram_fp16 is a finite positive number in the 1–6 GB range for TinyLlama. Step 2 enforces vram_nf4 < vram_int8 < vram_fp16 with plausible ratios. Step 3 requires a latencies dict with fp16/int8/nf4 keys and positive values. Step 4 runs all three perplexities and checks that the INT8/fp16 and NF4/fp16 ratios stay below 1.3× and 1.5× respectively.