Quantize & Optimize LLMs with bitsandbytes
Load a model in fp16, INT8, and NF4, then benchmark the three precisions on VRAM, latency, and output quality. See where quantization wins and where it costs you.
What you'll learn
1. Load TinyLlama in fp16 (the baseline)
2. Load in INT8 and NF4 with bitsandbytes
3. Latency benchmark across precisions
4. Quality: does quantization actually preserve accuracy?
Prerequisites
- Comfortable loading HuggingFace models with transformers
- Basic PyTorch + CUDA tensor operations
- Rough intuition for fp32 / fp16 / int8 bit widths
Skills & technologies you'll practice
This intermediate-level GPU lab gives you real-world reps with bitsandbytes, transformers, and PyTorch/CUDA on a live NVIDIA GPU.
What you'll measure in this quantization lab
Quantization is the skill that decides whether you can deploy an LLM at all — it's the difference between a model that fits on the GPU you have versus one you couldn't afford to run. In 40 minutes you'll produce the evaluation matrix any engineer should own before picking a precision in production: VRAM, generation latency per token, and perplexity drift for FP16, INT8, and NF4 on the same model, measured with the methodology that actually holds up (warmup runs, torch.cuda.synchronize() around timed regions, labels=input_ids perplexity on a held-out passage). You'll leave with a mental model of why NF4 enables QLoRA on a single consumer GPU, why INT8 can be slower than FP16 on cards without native INT8 Tensor Cores, and the exact shape of the memory/latency/quality tradeoff — numbers you can cite, not theory.
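The perplexity methodology named above (labels=input_ids on a held-out passage) can be sketched as follows. A minimal sketch, assuming `model` and `tokenizer` are an already-loaded TinyLlama pair; `drift_pct` is a hypothetical helper for the "1–3% drift" numbers the lab reports:

```python
import math

def perplexity(model, tokenizer, text: str) -> float:
    """Teacher-forced perplexity on a held-out passage (labels=input_ids)."""
    import torch  # heavy dependency, only needed on the GPU path
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # labels=input_ids makes the model return mean cross-entropy loss
        out = model(input_ids=enc.input_ids, labels=enc.input_ids)
    return math.exp(out.loss.item())

def drift_pct(ppl_quant: float, ppl_fp16: float) -> float:
    """Relative perplexity degradation vs. the fp16 baseline, in percent."""
    return 100.0 * (ppl_quant - ppl_fp16) / ppl_fp16
```

Running `perplexity` once per precision and comparing with `drift_pct` produces the quality column of the evaluation matrix.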
The technical substance lives in the flag combinations and the measurements. INT8 via BitsAndBytesConfig(load_in_8bit=True) uses LLM.int8() with outlier-aware mixed-decomposition — signed 8-bit integers for normal channels, FP16 for the outlier columns that carry disproportionate signal. NF4 via the four-flag recipe (load_in_4bit=True, bnb_4bit_quant_type='nf4', bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.float16) stores weights as 16 quantiles fitted to a standard normal distribution — close to how transformer weights are actually distributed — with double-quantization of the scale factors for an extra ~0.4 bits/param. You'll watch NF4 drop VRAM to roughly a quarter of FP16 at 1–3% perplexity drift, which is the whole reason fine-tuning an 8B model on a 24 GB card is possible. You'll also see why raw QPS is a lie in quantization benchmarks: without a warmup generation and synchronized timing, your numbers measure Python returning from the launch call, not the GPU finishing work. Step 4's reflection grounds out in when to graduate past bitsandbytes to calibration-based methods (GPTQ, AWQ) or FP8 on H100 via TensorRT-LLM.
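The two flag combinations described above look like this in code. A config sketch, not the lab's exact solution; the checkpoint name is an assumption:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed checkpoint name

# INT8: LLM.int8() with outlier-aware mixed decomposition
int8_cfg = BitsAndBytesConfig(load_in_8bit=True)

# NF4: the four-flag recipe
nf4_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # 16 quantiles fitted to a standard normal
    bnb_4bit_use_double_quant=True,        # quantize the scale factors too (~0.4 bits/param)
    bnb_4bit_compute_dtype=torch.float16,  # matmuls run in fp16
)

model_int8 = AutoModelForCausalLM.from_pretrained(
    MODEL, quantization_config=int8_cfg, device_map="auto")
model_nf4 = AutoModelForCausalLM.from_pretrained(
    MODEL, quantization_config=nf4_cfg, device_map="auto")
```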
Prerequisites are comfort loading HuggingFace models with transformers, basic PyTorch on CUDA, and a rough intuition for FP32/FP16/INT8 bit widths. The sandbox is a real NVIDIA GPU pod we provision per session: bitsandbytes, transformers, and a matched CUDA runtime are preinstalled, and TinyLlama-1.1B is staged locally so the first from_pretrained doesn't burn your budget on a download. Checks enforce plausible measurements rather than magic numbers: FP16 VRAM in the 1–6 GB band for TinyLlama, INT8/FP16 VRAM ratio between 0.3 and 0.8, NF4/FP16 ratio between 0.15 and 0.5, INT8 perplexity degradation under 30%, NF4 under 50%.
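The plausibility bands above can be written down as a small predicate. A hypothetical helper mirroring the stated bands, not the grader's actual code:

```python
def checks_pass(vram_fp16_gb: float, vram_int8_gb: float, vram_nf4_gb: float,
                ppl_fp16: float, ppl_int8: float, ppl_nf4: float) -> bool:
    """Plausibility bands from the lab description, not magic numbers."""
    int8_ratio = vram_int8_gb / vram_fp16_gb
    nf4_ratio = vram_nf4_gb / vram_fp16_gb
    return (
        1.0 <= vram_fp16_gb <= 6.0      # TinyLlama fp16 footprint band
        and 0.3 <= int8_ratio <= 0.8    # INT8 should roughly halve VRAM
        and 0.15 <= nf4_ratio <= 0.5    # NF4 should roughly quarter it
        and ppl_int8 / ppl_fp16 < 1.3   # < 30% perplexity degradation
        and ppl_nf4 / ppl_fp16 < 1.5    # < 50% perplexity degradation
    )
```

Any measurement that trips one of these bands usually means a methodology bug (no warmup, missing sync, wrong device) rather than a property of the quantization itself.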
Frequently asked questions
What's the actual difference between INT8 and NF4 in practice?
INT8 (LLM.int8()) keeps signed 8-bit integers for normal channels and falls back to FP16 for the outlier columns, roughly halving VRAM. NF4 stores weights as 16 quantiles fitted to a standard normal distribution, with double-quantized scale factors, cutting VRAM to roughly a quarter of FP16 at 1–3% perplexity drift. Steps 2 and 4 put exact numbers on both.
Why might INT8 be slower than fp16 on my GPU?
bitsandbytes dequantizes weights back to fp16 for the actual matmul on GPUs without native INT8 Tensor Cores (most consumer cards: RTX 3060/3070/3080/3090, T4). The dequant overhead can dominate the savings. Native INT8 paths exist on A100/H100 via CUTLASS/cuBLAS or TensorRT-LLM, and that's where the speedup shows up. Step 3's latency table makes this visceral; you're not imagining it.
Is NF4 good enough for production inference, or just training?
Do I need a calibration dataset to run this lab?
No. bitsandbytes quantizes weights at load time without calibration data; calibration-based methods like GPTQ and AWQ are what you graduate to later, and Step 4's reflection covers when.
How do I get trustworthy latency numbers when CUDA is asynchronous?
Run at least one warmup generation first (the first generate() call compiles kernels and primes caches), then wrap the timed region with torch.cuda.synchronize() on both ends. Without that sync, your timer captures Python returning from the launch call, not the GPU finishing the work. The Step 3 hint calls this out specifically because it's the single most common reason quantization benchmarks look wrong.
How is each step graded?
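The warmup-plus-sync pattern can be sketched as below, assuming `model` and `inputs` are an already-loaded model and tokenized prompt; `per_token_ms` is a hypothetical convenience helper:

```python
import time

def timed_generate(model, inputs, new_tokens: int = 64) -> float:
    """Wall-clock seconds for one generate() call, timed correctly on CUDA."""
    import torch  # only needed on the GPU path
    model.generate(**inputs, max_new_tokens=new_tokens)  # warmup: compiles kernels, primes caches
    torch.cuda.synchronize()             # warmup fully finished before the timer starts
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=new_tokens)
    torch.cuda.synchronize()             # wait for the GPU, not just the kernel launch
    return time.perf_counter() - start

def per_token_ms(elapsed_s: float, new_tokens: int) -> float:
    """Convert a timed region into the per-token latency the lab reports."""
    return 1000.0 * elapsed_s / new_tokens
```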
Step 1 checks that vram_fp16 is a finite positive number in the 1–6 GB range for TinyLlama. Step 2 enforces vram_nf4 < vram_int8 < vram_fp16 with plausible ratios. Step 3 requires a latencies dict with fp16/int8/nf4 keys and positive values. Step 4 runs all three perplexities and checks that the INT8/fp16 and NF4/fp16 ratios stay below 1.3× and 1.5× respectively.