Batch Size & Precision Sweep: Finding Your Sweet Spot
GPU sandbox · jupyter


Sweep batch sizes and numerical precisions (fp32, fp16, bf16) on a real model to find the throughput/VRAM knee, then ship a production recommendation with SKU-aware precision picks and an accuracy gate.

40 min · 4 steps · 2 domains · Intermediate · ncp-genl · nca-aiio · ncp-ads · nca-genl · nca-genm

What you'll learn

  1. Batch size sweep
  2. Precision sweep
  3. Combined sweep — pick the actual production config
  4. Ship the production recommendation

Prerequisites

  • Comfortable with PyTorch model forward passes
  • Basic understanding of GPU memory and throughput
  • Familiarity with fp16/bf16/fp32 at a conceptual level

Exam domains covered

Model Deployment & Inference Optimization · GPU Acceleration & Distributed Training

Skills & technologies you'll practice

This intermediate-level GPU lab gives you real-world reps across:

Inference · Throughput · VRAM · Mixed Precision · fp16 · bf16 · Benchmarking · Capacity Planning

What you'll sweep in this batch-and-precision lab

'What batch size and precision should we deploy?' is the question every inference team gets asked the week before launch, and the wrong answer shows up as either a 3 AM OOM or a half-GPU burned for nothing. This lab replaces guessing with a measured sweep. You'll walk away with a throughput-vs-VRAM knee for batch size, a precision comparison across fp32/fp16/bf16 with real output-drift numbers, a cross-product sweep that picks the actual deployment config inside a VRAM budget, and a production_recommendation dict plus an accuracy_gate spec your CI can enforce to block silent precision regressions. About 40 minutes on a live NVIDIA GPU pod — PyTorch, CUDA, and autocast are preinstalled.

The substance is where the real insights compound.

  • Batch sweep: throughput flattens before VRAM runs out, because once you saturate compute units, bigger batches buy you latency, not samples/sec — the knee is where marginal throughput per marginal VRAM drops below a useful rate.
  • Precision sweep: fp16 typically lands ≥1.15× fp32 throughput with cosine similarity >0.95 on well-conditioned transformers, but bf16 keeps fp32's 8-bit exponent range (at only 7 mantissa bits), which resists overflow on models with wide activations where fp16's 5-bit exponent struggles.
  • Combined sweep: changing precision and raising batch size simultaneously usually delivers a 3×+ lift vs fp32 @ batch=1 inside the same memory envelope — that's where the real deployment win lives.
  • The SKU twist: H100 has native FP8 Tensor Cores (E4M3/E5M2) and markedly stronger bf16 throughput, so FP8 via TensorRT-LLM or vLLM is often the H100 choice; A100 lacks hardware FP8, so bf16 dominates. Picking the same precision for both SKUs wastes H100 capacity or destabilizes A100.
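The shape of one sweep point can be sketched as below. This is a minimal illustration, not the lab's graded code: the tiny `Linear` model, the iteration count, and the result-dict keys are all placeholder assumptions.

```python
import contextlib
import time
import torch

def measure(model, batch_size, dtype, device="cpu", iters=3):
    """One sweep point: samples/sec plus peak VRAM (reported as 0 on CPU)."""
    x = torch.randn(batch_size, 16, device=device)
    # autocast only when running a reduced precision; fp32 is the baseline path
    amp = (torch.autocast(device_type=device, dtype=dtype)
           if dtype != torch.float32 else contextlib.nullcontext())
    if device == "cuda":
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad(), amp:
        for _ in range(iters):
            model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    peak_mb = torch.cuda.max_memory_allocated() / 2**20 if device == "cuda" else 0.0
    return {"throughput": iters * batch_size / elapsed, "peak_vram_mb": peak_mb}

# stand-in model; the lab runs this against a real transformer on the GPU pod
model = torch.nn.Linear(16, 16).eval()
sweep = {b: measure(model, b, torch.float32) for b in (1, 4, 16, 64)}
```

On the pod you'd run the same loop with `device="cuda"` and `dtype` in `{torch.float32, torch.float16, torch.bfloat16}` and cross-product it with batch sizes.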

The judgment trap: cosine similarity is a sanity check, not a quality gate. Two outputs can be cos-sim 0.99 and still produce meaningfully different generations because activation vectors are close in Euclidean terms but decode to different tokens. The real release gate is a task-level metric — accuracy, BLEU, perplexity, grounded-answer rate — measured on a representative set. The second trap: measured peak VRAM on a synthetic input isn't production peak. KV-cache spikes from long sequences, fragmentation from variable-length requests, or a batch that lands with unusually tall activations can push a tightly-fitted config OOM. The 5-40% oom_margin_pct band exists because below 5% you'll see OOMs in production and above 40% you're burning a half-GPU.

Prereqs: PyTorch forward-pass comfort, a rough feel for GPU memory, conceptual familiarity with fp16/bf16/fp32. Preinstalled: PyTorch, CUDA, torch.autocast, JupyterLab. Grading enforces real structure: the batch sweep needs ≥4 distinct ordered batch sizes with ≥2× max/min throughput speedup, the precision sweep requires all three precisions with fp16 ≥1.15× fp32 and cos-sim >0.95 for fp16/bf16, the combined sweep must deliver ≥3× over fp32 @ batch=1, the sku_to_preferred_precision map must treat A100 and H100 distinctly, and accuracy_gate must have metric + threshold + measurement method.
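The Step 4 artifacts might look roughly like the sketch below. Field values, the extra keys, and the `gate_passes` helper are illustrative assumptions; only `oom_margin_pct`, `sku_to_preferred_precision`, and the metric/threshold/measurement requirements come from the lab spec.

```python
# Hypothetical deployment recommendation shaped to satisfy the grading checks.
production_recommendation = {
    "batch_size": 16,
    "precision": "bf16",
    "oom_margin_pct": 20,  # must land in the enforced 5-40% band
    "sku_to_preferred_precision": {"A100": "bf16", "H100": "fp8"},
}

# Task-level quality gate: metric + threshold + measurement method.
accuracy_gate = {
    "metric": "exact_match_accuracy",
    "threshold": 0.98,  # candidate must keep >=98% of the fp32 score
    "measurement": "held-out eval set, fp32 baseline vs candidate precision",
}

def gate_passes(candidate_score, fp32_score, gate):
    """CI check: block the rollout when the downstream metric regresses."""
    return candidate_score >= gate["threshold"] * fp32_score
```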

Frequently asked questions

Why sweep batch size instead of just picking the biggest that fits?

Because throughput flattens before VRAM runs out. Once you saturate the GPU's compute units, bigger batches buy you mostly latency, not samples/second — and they move your p99 further away from your SLA. The 'knee' is the point where marginal throughput per marginal VRAM drops below a useful rate. In Step 1 you'll plot the curve and pick it by eye; in production that curve also needs to respect p99 latency, which is why Step 4 asks for tradeoffs, not just the peak point.
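One way to make "marginal throughput per marginal VRAM drops below a useful rate" concrete is the heuristic below. The threshold value and the tuple layout are illustrative, not the lab's definition:

```python
def pick_knee(points, min_gain_per_mb=0.05):
    """points: (batch_size, samples_per_sec, peak_vram_mb) tuples, sorted
    by batch size. Walk the curve and keep growing the batch while each
    extra MB of VRAM still buys at least `min_gain_per_mb` samples/sec."""
    knee = points[0][0]
    for (b0, t0, m0), (b1, t1, m1) in zip(points, points[1:]):
        marginal = (t1 - t0) / max(m1 - m0, 1e-9)
        if marginal < min_gain_per_mb:
            break  # curve has flattened: paying VRAM for almost no throughput
        knee = b1
    return knee

# made-up curve with the typical shape: near-linear early, flat late
curve = [(1, 100, 500), (2, 190, 900), (4, 340, 1700),
         (8, 380, 3300), (16, 390, 6500)]
```

In Step 1 you pick the knee by eye from the plot; an automated rule like this is just one way to encode the same judgment.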

fp16 vs bf16 — when does the choice actually matter?

bf16 keeps fp32's exponent range (8 bits) but only has 7 bits of mantissa, so it resists overflow/underflow better than fp16 (5-bit exponent, 10-bit mantissa) in training and for models with wide activation ranges. For inference on well-conditioned transformers, fp16 often wins on throughput on A100/consumer cards. On H100, bf16 and fp16 are both TC-accelerated and the gap narrows. The lab has you measure cosine similarity for both — that's the sanity check before you commit.
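The exponent-range difference is easy to see with a standalone check (the value 70000 is just an example chosen to sit above fp16's max finite value of 65504):

```python
import torch

x = torch.tensor(70000.0)            # above fp16's max finite value (65504)
assert torch.isinf(x.half())         # fp16: 5-bit exponent -> overflows to inf
assert torch.isfinite(x.bfloat16())  # bf16: fp32's 8-bit exponent -> still finite

print(torch.finfo(torch.float16).max)   # 65504.0
print(torch.finfo(torch.bfloat16).max)  # ~3.39e38, same range as fp32
```

This is exactly the failure mode wide-activation models hit under fp16 autocast, and why bf16 is the safer default when the activation range is unknown.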

Why does the production recommendation include an OOM safety margin?

Because measured peak VRAM on a synthetic input isn't production peak VRAM. KV-cache spikes from long sequences, fragmentation from variable-length requests, a batch that happens to land with unusually tall activations — any of these can push a tightly-fitted config over the edge. The lab enforces a 5–40% margin in oom_margin_pct because that's the band where you're paranoid but not wasteful. Below 5% you will see OOMs in production; above 40% you're burning a half-GPU for nothing.
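A sketch of the margin check, assuming headroom is measured as a percentage of total device VRAM (the function names and GB figures are illustrative):

```python
def headroom_pct(measured_peak_gb, device_vram_gb):
    """Percent of device VRAM left free at the measured synthetic peak."""
    return 100.0 * (device_vram_gb - measured_peak_gb) / device_vram_gb

def passes_margin(measured_peak_gb, device_vram_gb, lo=5.0, hi=40.0):
    """True when the config sits inside the enforced 5-40% oom_margin_pct band:
    paranoid enough to absorb KV-cache spikes, not so padded it wastes the GPU."""
    return lo <= headroom_pct(measured_peak_gb, device_vram_gb) <= hi

# e.g. 68 GB measured peak on an 80 GB card -> 15% headroom, inside the band
```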

Is cosine similarity a good enough quality gate?

No, and that's exactly why Step 4 asks you to author an accuracy_gate with a task-level metric instead. Two model outputs can be cosine similarity 0.99 and still produce meaningfully different generations — the activation vectors are close in Euclidean terms but decode to different tokens. Cosine similarity is a useful sanity check for finding gross regressions (a precision mode that broke the model); the real release gate is the downstream metric users actually care about — accuracy, BLEU, perplexity, grounded-answer rate — measured on a representative set.
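The failure mode is easy to reproduce with made-up logits: two vectors can be nearly parallel yet argmax-decode to different tokens.

```python
import torch

a = torch.tensor([2.00, 1.99, 0.10])  # fp32 logits (illustrative numbers)
b = torch.tensor([1.99, 2.00, 0.10])  # reduced-precision logits, tiny drift

cos = torch.nn.functional.cosine_similarity(a, b, dim=0)
assert cos > 0.99                # "passes" the cosine sanity check...
assert a.argmax() != b.argmax()  # ...but greedy decoding picks a different token
```

Scaled up to a vocabulary of tens of thousands of logits over hundreds of decode steps, this is how a cos-sim-0.99 precision mode still changes generations, and why the release gate has to be a task-level metric.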

Why should my A100 and H100 preferred precisions differ?

H100 has native FP8 Tensor Cores (E4M3/E5M2) and markedly stronger bf16 throughput, so FP8 via TensorRT-LLM or vLLM is often the production choice there. A100 lacks hardware FP8, so bf16 (and fp16 on older deployments) dominate. Picking the same precision for both wastes H100 capacity or destabilizes A100 — the sku_to_preferred_precision map in Step 4 is how you surface that difference to whoever runs the deployment.
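A minimal sketch of the SKU map and lookup, assuming the precision strings and the fallback choice shown here; the lab only requires that A100 and H100 be treated distinctly.

```python
# Illustrative defaults, not the lab's graded answer.
SKU_TO_PREFERRED_PRECISION = {
    "H100": "fp8",   # native FP8 Tensor Cores (E4M3/E5M2), strong bf16
    "A100": "bf16",  # no hardware FP8; bf16 keeps fp32's exponent range
}

def preferred_precision(sku, fallback="fp16"):
    """Resolve a deployment SKU to its preferred serving precision."""
    return SKU_TO_PREFERRED_PRECISION.get(sku, fallback)
```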

What do the grading checks actually enforce?

Step 1 requires ≥4 batch-size points, distinct and sorted, with a ≥2× max/min throughput speedup and visible VRAM growth, plus a knee_batch_size that's one of the sampled batches. Step 2 requires entries for fp32, fp16, and bf16 with fp16 ≥1.15× fp32 throughput and cosine similarity >0.95 for the non-fp32 modes. Step 3 requires ≥6 combined points, an fp32 @ batch=1 baseline, a best_config that fits its stated VRAM budget and delivers ≥3× over the baseline. Step 4 requires a complete production_recommendation, a non-trivial SKU map, and an accuracy_gate dict.