Quantize & Optimize LLMs with bitsandbytes

Load a model in fp16, INT8, and NF4, then benchmark the three precisions on VRAM, latency, and output quality. See where quantization wins and where it costs you.

40 min · 4 steps · 2 domains · Intermediate · ncp-genl · ncp-ads · nca-genl · nca-genm

What you'll learn

  1. Load TinyLlama in fp16 (the baseline)
  2. Load in INT8 and NF4 with bitsandbytes
  3. Latency benchmark across precisions
  4. Quality: does quantization actually preserve accuracy?

Prerequisites

  • Comfortable loading HuggingFace models with transformers
  • Basic PyTorch + CUDA tensor operations
  • Rough intuition for fp32 / fp16 / int8 bit widths

Exam domains covered

Model Deployment & Inference Optimization · GPU Acceleration & Distributed Training

Skills & technologies you'll practice

This intermediate-level GPU lab gives you real-world reps across:

Quantization · bitsandbytes · INT8 · NF4 · LLM.int8() · QLoRA · Precision

What you'll measure in this quantization lab

Quantization is the skill that decides whether you can deploy an LLM at all — it's the difference between a model that fits on the GPU you have and one you couldn't afford to run. In 40 minutes you'll produce the evaluation matrix any engineer should own before picking a precision in production: VRAM, generation latency per token, and perplexity drift for FP16, INT8, and NF4 on the same model. Every number is measured with the methodology that actually holds up: warmup runs, torch.cuda.synchronize() around timed regions, and labels=input_ids perplexity on a held-out passage. You'll leave with a mental model of why NF4 enables QLoRA on a single consumer GPU, why INT8 can be slower than FP16 on cards without native INT8 Tensor Cores, and the exact shape of the memory/latency/quality tradeoff — numbers you can cite, not theory.
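The perplexity measurement reduces to one formula: exponentiate the mean negative log-likelihood per token. In the lab the mean NLL comes from the model's cross-entropy loss via model(input_ids, labels=input_ids).loss; the pure-Python sketch below (hypothetical `perplexity` helper, illustrative inputs) just shows the arithmetic.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood over tokens).

    In the lab the mean NLL is the model's cross-entropy loss from
    model(input_ids, labels=input_ids); here we feed raw per-token
    log-probabilities to demonstrate the formula itself.
    """
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model that assigns every token probability 0.25 has perplexity 4:
print(perplexity([math.log(0.25)] * 10))  # ≈ 4.0
```

Comparing this value across FP16, INT8, and NF4 runs on the same held-out passage is exactly the "perplexity drift" the matrix tracks.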

The technical substance lives in the flag combinations and the measurements. INT8 via BitsAndBytesConfig(load_in_8bit=True) uses LLM.int8() with outlier-aware mixed decomposition — signed 8-bit integers for normal channels, FP16 for the outlier columns that carry disproportionate signal. NF4 via the four-flag recipe (load_in_4bit=True, bnb_4bit_quant_type='nf4', bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.float16) stores weights as 16 quantiles fitted to a standard normal distribution — close to how transformer weights are actually distributed — with double quantization of the scale factors shaving off another ~0.4 bits/param. You'll watch NF4 drop VRAM to roughly a quarter of FP16 at 1–3% perplexity drift, which is the whole reason fine-tuning an 8B model on a 24 GB card is possible. You'll also see why raw QPS is a lie in quantization benchmarks: without a warmup generation and synchronized timing, your numbers measure Python returning from the launch call, not the GPU finishing its work. Step 4's reflection closes with when to graduate past bitsandbytes to calibration-based methods (GPTQ, AWQ) or to FP8 on H100 via TensorRT-LLM.
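In code, the two flag recipes look roughly like this — a sketch assuming the sandbox's preinstalled transformers/bitsandbytes stack and the staged TinyLlama checkpoint (the exact model id is an assumption):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumption: the staged checkpoint

# INT8: LLM.int8() with outlier-aware mixed decomposition
int8_cfg = BitsAndBytesConfig(load_in_8bit=True)

# NF4: the four-flag QLoRA recipe
nf4_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # 16 quantiles fitted to a normal dist
    bnb_4bit_use_double_quant=True,        # quantize the scale factors too
    bnb_4bit_compute_dtype=torch.float16,  # matmuls still run in fp16
)

# Load once per precision; device_map="auto" places layers on the GPU
model_nf4 = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=nf4_cfg, device_map="auto"
)
```

Swap in int8_cfg for the INT8 variant; the fp16 baseline is the same from_pretrained call with torch_dtype=torch.float16 and no quantization_config.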

Prerequisites are comfort loading HuggingFace models with transformers, basic PyTorch on CUDA, and a rough intuition for FP32/FP16/INT8 bit widths. The sandbox is a real NVIDIA GPU pod we provision per session — bitsandbytes, transformers, and a matched CUDA runtime are preinstalled, and TinyLlama-1.1B is staged locally so the first from_pretrained doesn't burn your budget on a download. Checks enforce plausible measurements rather than magic numbers: FP16 VRAM in the 1–6 GB band for TinyLlama, INT8/FP16 ratio between 0.3–0.8, NF4/FP16 ratio 0.15–0.5, INT8 perplexity degradation under 30%, NF4 under 50%.

Frequently asked questions

What's the actual difference between INT8 and NF4 in practice?

INT8 (LLM.int8()) uses signed 8-bit integers with a separate FP16 path for outlier features, halving weight memory. NF4 is a 4-bit format whose 16 quantiles are fitted to a standard normal distribution — close to how transformer weights are actually distributed — and combined with double-quantization of the scale factors it gets you roughly a 4× memory cut at 1–3% perplexity drift. Use INT8 when you have bandwidth to burn and outlier channels matter; NF4 when you want to fit a big model on a small card.
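The memory claims are simple arithmetic on bits per parameter: FP16 stores 16 bits per weight, INT8 stores 8, NF4 stores 4 plus a small scale-factor overhead. A weights-only back-of-envelope (hypothetical helper; real VRAM also includes activations, the KV cache, and the CUDA context):

```python
def weight_gib(n_params, bits_per_param):
    # Weights-only footprint; ignores activations, KV cache, CUDA context.
    return n_params * bits_per_param / 8 / 1024**3

n = 1.1e9  # TinyLlama-1.1B
for name, bits in [("fp16", 16), ("int8", 8), ("nf4", 4)]:
    # nf4 additionally carries a few tenths of a bit/param of scale overhead
    print(f"{name}: {weight_gib(n, bits):.2f} GiB")
```

That halving from fp16 to int8, and quartering to nf4, is the ordering the lab's VRAM checks enforce.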

Why might INT8 be slower than fp16 on my GPU?

Because bitsandbytes dequantizes weights back to fp16 for the actual matmul on GPUs without native INT8 Tensor Cores (most consumer cards — RTX 3060/3070/3080/3090, T4). The dequant overhead can dominate the savings. Native INT8 paths exist on A100/H100 + CUTLASS/cuBLAS or via TensorRT-LLM, and that's where the speedup shows up. Step 3's latency table makes this visceral — you're not imagining it.

Is NF4 good enough for production inference, or just training?

NF4 was designed for QLoRA (fine-tuning big models on a single consumer GPU), and for inference it's fine for many use cases — the perplexity hit is usually 1–3%. For highest-quality production inference, calibration-based methods like GPTQ or AWQ typically edge it out because they use representative activations to pick quantization scales, and on H100 you'll likely want FP8 via TensorRT-LLM. The lab gives you the methodology to evaluate any of these on your own model.

Do I need a calibration dataset to run this lab?

No. bitsandbytes is zero-calibration by design — you just load the model with the right config and it works. The tradeoff is that calibration-based 4-bit methods (GPTQ, AWQ) can reach better perplexity at the same bit width because they've seen real activations. For this lab you measure perplexity on a held-out English passage to compare precisions, but you're not calibrating the quantizer.

How do I get trustworthy latency numbers when CUDA is asynchronous?

Warmup outside the timing region (one throwaway generate() call compiles kernels and primes caches), then wrap the timed region with torch.cuda.synchronize() on both ends. Without that sync, your timer captures Python returning from the launch call, not the GPU finishing the work. The Step 3 hint calls this out specifically because it's the single most common reason quantization benchmarks look wrong.
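That pattern can be sketched as a small harness. The names timed_avg, generate_fn, and sync_fn are illustrative; on CUDA you'd pass the real generate call and torch.cuda.synchronize.

```python
import time

def timed_avg(generate_fn, sync_fn=lambda: None, warmup=1, runs=3):
    """Warmup outside the timed region, sync on both ends of it."""
    for _ in range(warmup):
        generate_fn()            # throwaway run: compiles kernels, primes caches
    sync_fn()                    # drain queued GPU work before starting the clock
    t0 = time.perf_counter()
    for _ in range(runs):
        generate_fn()
    sync_fn()                    # make sure the GPU actually finished
    return (time.perf_counter() - t0) / runs
```

A plausible CUDA call site: timed_avg(lambda: model.generate(**inputs, max_new_tokens=64), torch.cuda.synchronize). Without the second sync_fn(), the timer stops while kernels are still in flight.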

How is each step graded?

Automatically, by running checks against your variables. Step 1 validates that vram_fp16 is a finite positive number in the 1–6 GB range for TinyLlama. Step 2 enforces vram_nf4 < vram_int8 < vram_fp16 with plausible ratios. Step 3 requires a latencies dict with fp16/int8/nf4 keys and positive values. Step 4 computes perplexity for all three precisions and checks that the INT8/fp16 and NF4/fp16 ratios stay below 1.3× and 1.5× respectively.
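The plausibility bands described above could look like this — a hypothetical checker mirroring the stated ranges (check_step2/check_step4 are illustrative names, not the lab's actual grader):

```python
def check_step2(vram_fp16, vram_int8, vram_nf4):
    # Ordering first, then the ratio bands from the lab description.
    assert vram_nf4 < vram_int8 < vram_fp16, "memory should shrink with bit width"
    assert 0.3 <= vram_int8 / vram_fp16 <= 0.8, "INT8/FP16 ratio out of band"
    assert 0.15 <= vram_nf4 / vram_fp16 <= 0.5, "NF4/FP16 ratio out of band"

def check_step4(ppl_fp16, ppl_int8, ppl_nf4):
    # Perplexity degradation caps: 30% for INT8, 50% for NF4.
    assert ppl_int8 / ppl_fp16 < 1.3, "INT8 perplexity degraded > 30%"
    assert ppl_nf4 / ppl_fp16 < 1.5, "NF4 perplexity degraded > 50%"
```

Band checks like these reward any honest measurement rather than one magic number, so your results can vary with the GPU you're assigned and still pass.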