Batch Size & Precision Sweep: Finding Your Sweet Spot
Sweep batch sizes and numerical precisions (fp32, fp16, bf16) on a real model to find the throughput/VRAM knee, then ship a production recommendation with SKU-aware precision picks and an accuracy gate.
What you'll learn
1. Batch size sweep
2. Precision sweep
3. Combined sweep — pick the actual production config
4. Ship the production recommendation
Prerequisites
- Comfortable with PyTorch model forward passes
- Basic understanding of GPU memory and throughput
- Familiarity with fp16/bf16/fp32 at a conceptual level
Skills & technologies you'll practice
This intermediate-level GPU lab gives you real-world reps across batch-size sweeps, fp32/fp16/bf16 precision selection, VRAM budgeting, and SKU-aware deployment recommendations.
What you'll sweep in this batch-and-precision lab
'What batch size and precision should we deploy?' is the question every inference team gets asked the week before launch, and the wrong answer shows up as either a 3 AM OOM or a half-GPU burned for nothing. This lab replaces guessing with a measured sweep. You'll walk away with a throughput-vs-VRAM knee for batch size, a precision comparison across fp32/fp16/bf16 with real output-drift numbers, a cross-product sweep that picks the actual deployment config inside a VRAM budget, and a production_recommendation dict plus an accuracy_gate spec your CI can enforce to block silent precision regressions. About 40 minutes on a live NVIDIA GPU pod — PyTorch, CUDA, and autocast are preinstalled.
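The sweep machinery can be sketched in a few lines of PyTorch. Everything below is a minimal stand-in (toy linear model, placeholder sizes, bf16 only), not the lab's harness; for the fp32 baseline you would simply skip the autocast context:

```python
import time
import torch

def measure(model, batch_size, dtype, device, iters=5, seq_len=8, d_model=16):
    """Time forward passes; report samples/sec and peak VRAM in MB (0.0 on CPU).
    Model and sizes are toy placeholders, not the lab's workload."""
    x = torch.randn(batch_size, seq_len, d_model, device=device)
    if device == "cuda":
        torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    with torch.no_grad(), torch.autocast(device_type=device, dtype=dtype):
        for _ in range(iters):
            model(x)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for kernels before stopping the clock
    elapsed = time.perf_counter() - start
    peak_mb = torch.cuda.max_memory_allocated() / 2**20 if device == "cuda" else 0.0
    return batch_size * iters / elapsed, peak_mb

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(16, 16).to(device).eval()
sweep = {bs: measure(model, bs, torch.bfloat16, device) for bs in (1, 2, 4, 8)}
```

Note the `synchronize()` call: CUDA kernels launch asynchronously, so timing without it measures launch overhead, not execution.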
The substance is where the real insights compound.
- Batch sweep: throughput flattens before VRAM runs out, because once you saturate compute units, bigger batches buy you latency, not samples/sec — the knee is where marginal throughput / marginal VRAM drops below useful.
- Precision sweep: fp16 typically lands ≥1.15× fp32 throughput with cosine similarity >0.95 on well-conditioned transformers, but bf16 keeps fp32's 8-bit exponent range (only 7 mantissa bits), which resists overflow on models with wide activations where fp16's 5-bit exponent struggles.
- Combined sweep: changing precision and raising batch size simultaneously usually delivers a 3×+ lift vs fp32 @ batch=1 inside the same memory envelope — that's where the real deployment win lives.
The SKU twist: H100 has native FP8 Tensor Cores (E4M3/E5M2) and markedly stronger bf16 throughput, so FP8 via TensorRT-LLM or vLLM is often the H100 choice; A100 lacks hardware FP8, so bf16 dominates. Picking the same precision for both SKUs wastes H100 capacity or destabilizes A100.
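The knee-picking rule (marginal throughput per marginal VRAM) can be made concrete. This is an illustrative sketch; the sweep numbers and the 0.3 cutoff are invented for demonstration, not values from the lab:

```python
def find_knee(sweep):
    """Pick the knee batch size from an ordered list of
    (batch_size, samples_per_sec, peak_vram_mb) tuples: the last point
    whose marginal throughput per marginal MB of VRAM stays above a
    fraction of the first step's ratio. The 0.3 cutoff is illustrative."""
    base = None
    knee = sweep[0][0]
    for (b0, t0, v0), (b1, t1, v1) in zip(sweep, sweep[1:]):
        ratio = (t1 - t0) / max(v1 - v0, 1e-9)  # marginal throughput / marginal VRAM
        if base is None:
            base = ratio
        if ratio < 0.3 * base:  # marginal gain has collapsed; stop here
            break
        knee = b1
    return knee

# Invented sweep: throughput nearly flat past batch=4 while VRAM keeps growing.
sweep = [(1, 100, 500), (2, 190, 900), (4, 350, 1700), (8, 380, 3300), (16, 390, 6500)]
# find_knee(sweep) picks 4: the step to 8 buys 30 samples/sec for 1600 MB.
```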
The judgment trap: cosine similarity is a sanity check, not a quality gate. Two outputs can be cos-sim 0.99 and still produce meaningfully different generations because activation vectors are close in Euclidean terms but decode to different tokens. The real release gate is a task-level metric — accuracy, BLEU, perplexity, grounded-answer rate — measured on a representative set. The second trap: measured peak VRAM on a synthetic input isn't production peak. KV-cache spikes from long sequences, fragmentation from variable-length requests, or a batch that lands with unusually tall activations can push a tightly-fitted config OOM. The 5-40% oom_margin_pct band exists because below 5% you'll see OOMs in production and above 40% you're burning a half-GPU.
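The cos-sim trap is easy to demonstrate with toy logits: two vectors whose cosine similarity is far above 0.99 but whose argmax (the greedily decoded token) differs. The numbers are invented for illustration:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def argmax(v):
    return max(range(len(v)), key=v.__getitem__)

# Two toy logit vectors: nearly identical directions, different winning token.
fp32_logits = [2.00, 1.99, 0.10, 0.05]
fp16_logits = [1.99, 2.00, 0.10, 0.05]

sim = cosine(fp32_logits, fp16_logits)  # ~0.9999, well past any 0.95 gate
# Yet greedy decoding picks token 0 under fp32 and token 1 under fp16.
```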
Prereqs: PyTorch forward-pass comfort, a rough feel for GPU memory, conceptual familiarity with fp16/bf16/fp32. Preinstalled: PyTorch, CUDA, torch.autocast, JupyterLab. Grading enforces real structure:
- The batch sweep needs ≥4 distinct ordered batch sizes with a ≥2× max/min throughput speedup.
- The precision sweep requires all three precisions, with fp16 ≥1.15× fp32 and cos-sim >0.95 for fp16/bf16.
- The combined sweep must deliver ≥3× over fp32 @ batch=1.
- The sku_to_preferred_precision map must treat A100 and H100 distinctly.
- accuracy_gate must have metric + threshold + measurement method.
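Those grading checks translate almost mechanically into a validator. A sketch, assuming hypothetical data shapes for each artifact (the lab's actual grader and field names may differ):

```python
def validate_sweeps(batch_sweep, precision_sweep, combined_best_speedup,
                    sku_map, accuracy_gate):
    """Collect grading-style violations; an empty list means all checks pass.
    All argument shapes here are assumptions, not the lab's real schema."""
    errors = []
    batches = [b for b, _ in batch_sweep]
    if len(set(batches)) < 4 or batches != sorted(batches):
        errors.append("need >=4 distinct, ordered batch sizes")
    throughputs = [t for _, t in batch_sweep]
    if max(throughputs) / min(throughputs) < 2.0:
        errors.append("batch sweep must show >=2x max/min throughput")
    if not {"fp32", "fp16", "bf16"} <= precision_sweep.keys():
        errors.append("need fp32, fp16, and bf16 entries")
    else:
        if precision_sweep["fp16"]["throughput"] < 1.15 * precision_sweep["fp32"]["throughput"]:
            errors.append("fp16 must be >=1.15x fp32 throughput")
        for p in ("fp16", "bf16"):
            if precision_sweep[p]["cos_sim"] <= 0.95:
                errors.append(f"{p} cosine similarity must exceed 0.95")
    if combined_best_speedup < 3.0:
        errors.append("combined sweep must deliver >=3x over fp32 @ batch=1")
    if sku_map.get("A100") == sku_map.get("H100"):
        errors.append("SKU map must treat A100 and H100 distinctly")
    if not {"metric", "threshold", "measurement"} <= accuracy_gate.keys():
        errors.append("accuracy_gate needs metric, threshold, measurement")
    return errors

# Invented passing example:
errors = validate_sweeps(
    batch_sweep=[(1, 100), (2, 190), (4, 350), (8, 380)],
    precision_sweep={
        "fp32": {"throughput": 100, "cos_sim": 1.00},
        "fp16": {"throughput": 121, "cos_sim": 0.98},
        "bf16": {"throughput": 118, "cos_sim": 0.99},
    },
    combined_best_speedup=3.4,
    sku_map={"A100": "bf16", "H100": "fp8"},
    accuracy_gate={"metric": "accuracy", "threshold": 0.98,
                   "measurement": "held-out eval set"},
)
```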
Frequently asked questions
Why sweep batch size instead of just picking the biggest that fits?
Because throughput flattens before VRAM runs out. Once the compute units are saturated, bigger batches buy you latency, not samples/sec; the knee is where marginal throughput per marginal VRAM drops below useful, and everything past it costs memory without buying throughput.
fp16 vs bf16 — when does the choice actually matter?
When activation magnitudes get wide. fp16's 5-bit exponent overflows on models with wide activations; bf16 keeps fp32's 8-bit exponent range (at the cost of only 7 mantissa bits), so it resists overflow where fp16 struggles. On well-conditioned transformers, either usually works.
Why does the production recommendation include an OOM safety margin?
Because measured peak VRAM on a synthetic input isn't production peak: KV-cache spikes from long sequences, fragmentation from variable-length requests, or a batch with unusually tall activations can push a tightly-fitted config OOM. The lab asks for a 5-40% oom_margin_pct because that's the band where you're paranoid but not wasteful. Below 5% you will see OOMs in production; above 40% you're burning a half-GPU for nothing.
Is cosine similarity a good enough quality gate?
No; ship an accuracy_gate with a task-level metric instead. Two model outputs can be cosine similarity 0.99 and still produce meaningfully different generations — the activation vectors are close in Euclidean terms but decode to different tokens. Cosine similarity is a useful sanity check for finding gross regressions (a precision mode that broke the model); the real release gate is the downstream metric users actually care about — accuracy, BLEU, perplexity, grounded-answer rate — measured on a representative set.
Why should my A100 and H100 preferred precisions differ?
Because the hardware differs. H100 has native FP8 Tensor Cores (E4M3/E5M2) and markedly stronger bf16 throughput, so FP8 via TensorRT-LLM or vLLM is often the H100 choice; A100 lacks hardware FP8, so bf16 dominates. The sku_to_preferred_precision map in Step 4 is how you surface that difference to whoever runs the deployment.
What do the grading checks actually enforce?
Step 1 requires ≥4 distinct ordered batch sizes, a ≥2× max/min throughput speedup, and a knee_batch_size that's one of the sampled batches. Step 2 requires entries for fp32, fp16, and bf16 with fp16 ≥1.15× fp32 throughput and cosine similarity >0.95 for the non-fp32 modes. Step 3 requires ≥6 combined points, an fp32 @ batch=1 baseline, and a best_config that fits its stated VRAM budget and delivers ≥3× over the baseline. Step 4 requires a complete production_recommendation, a non-trivial SKU map, and an accuracy_gate dict.
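The "fits its stated VRAM budget" check combines naturally with the OOM margin when picking a best_config from the combined sweep. A sketch with invented field names and numbers:

```python
def pick_best_config(points, vram_budget_mb, oom_margin_pct=15.0):
    """From combined-sweep points, pick the highest-throughput config whose
    measured peak VRAM still leaves the requested OOM headroom.
    Field names and the 15% default are illustrative assumptions."""
    usable = vram_budget_mb * (1 - oom_margin_pct / 100.0)
    fitting = [p for p in points if p["peak_vram_mb"] <= usable]
    if not fitting:
        return None  # nothing fits with this margin; loosen batch or precision
    return max(fitting, key=lambda p: p["throughput"])

# Invented combined-sweep points on a hypothetical 8 GB budget:
points = [
    {"precision": "fp32", "batch": 1,  "throughput": 100, "peak_vram_mb": 1200},
    {"precision": "fp16", "batch": 8,  "throughput": 420, "peak_vram_mb": 5200},
    {"precision": "bf16", "batch": 16, "throughput": 510, "peak_vram_mb": 7400},
    {"precision": "bf16", "batch": 32, "throughput": 530, "peak_vram_mb": 7900},
]
best = pick_best_config(points, vram_budget_mb=8192)
# With a 15% margin (usable ~6963 MB), the 7400/7900 MB configs are rejected
# even though they are faster: headroom wins over raw throughput.
```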