Batch Size & Precision Sweep: Finding Your Sweet Spot
GPU sandbox · jupyter


Sweep batch sizes and numerical precisions (fp32, fp16, bf16) on a real model to find the throughput/VRAM knee, then ship a production recommendation with SKU-aware precision picks and an accuracy gate.

40 min · 4 steps · 2 domains · Intermediate · ncp-genl · nca-aiio · ncp-ads · nca-genl · nca-genm

What you'll learn

  1. Batch size sweep
  2. Precision sweep
  3. Combined sweep — pick the actual production config
  4. Ship the production recommendation

Prerequisites

  • Comfortable with PyTorch model forward passes
  • Basic understanding of GPU memory and throughput
  • Familiarity with fp16/bf16/fp32 at a conceptual level

Exam domains covered

Model Deployment & Inference Optimization · GPU Acceleration & Distributed Training

Skills & technologies you'll practice

This intermediate-level GPU lab gives you real-world reps across:

Inference · Throughput · VRAM · Mixed Precision · fp16 · bf16 · Benchmarking · Capacity Planning

What you'll sweep in this batch-and-precision lab

'What batch size and precision should we deploy?' is the question every inference team gets asked the week before launch, and the wrong answer shows up as either a 3 AM OOM or a half-GPU burned for nothing. This lab replaces guessing with a measured sweep. You'll walk away with a throughput-vs-VRAM knee for batch size, a precision comparison across fp32/fp16/bf16 with real output-drift numbers, a cross-product sweep that picks the actual deployment config inside a VRAM budget, and a production_recommendation dict plus an accuracy_gate spec your CI can enforce to block silent precision regressions. About 40 minutes on a live NVIDIA GPU pod — PyTorch, CUDA, and autocast are preinstalled.

The substance is where the real insights compound.

  • Batch sweep: throughput flattens before VRAM runs out, because once you saturate compute units, bigger batches buy you latency, not samples/sec — the knee is where marginal throughput per marginal VRAM drops below a useful rate.
  • Precision sweep: fp16 typically lands ≥1.15× fp32 throughput with cosine similarity >0.95 on well-conditioned transformers, but bf16 keeps fp32's 8-bit exponent range (at only 7 mantissa bits), which resists overflow on models with wide activations where fp16's 5-bit exponent struggles.
  • Combined sweep: changing precision and raising batch size simultaneously usually delivers a 3×+ lift vs fp32 @ batch=1 inside the same memory envelope — that's where the real deployment win lives.
  • The SKU twist: H100 has native FP8 Tensor Cores (E4M3/E5M2) and markedly stronger bf16 throughput, so FP8 via TensorRT-LLM or vLLM is often the H100 choice; A100 lacks hardware FP8, so bf16 dominates. Picking the same precision for both SKUs wastes H100 capacity or destabilizes A100.
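The shape of one sweep point can be sketched as below. This is a minimal illustration, not the lab's graded code: the tiny `Linear` model, the iteration count, and the result-dict keys are all placeholder assumptions.

```python
import contextlib
import time
import torch

def measure(model, batch_size, dtype, device="cpu", iters=3):
    """One sweep point: samples/sec plus peak VRAM (reported as 0 on CPU)."""
    x = torch.randn(batch_size, 16, device=device)
    # autocast only when running a reduced precision; fp32 is the baseline path
    amp = (torch.autocast(device_type=device, dtype=dtype)
           if dtype != torch.float32 else contextlib.nullcontext())
    if device == "cuda":
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad(), amp:
        for _ in range(iters):
            model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    peak_mb = torch.cuda.max_memory_allocated() / 2**20 if device == "cuda" else 0.0
    return {"throughput": iters * batch_size / elapsed, "peak_vram_mb": peak_mb}

# stand-in model; the lab runs this against a real transformer on the GPU pod
model = torch.nn.Linear(16, 16).eval()
sweep = {b: measure(model, b, torch.float32) for b in (1, 4, 16, 64)}
```

On the pod you'd run the same loop with `device="cuda"` and `dtype` in `{torch.float32, torch.float16, torch.bfloat16}` and cross-product it with batch sizes.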

The judgment trap: cosine similarity is a sanity check, not a quality gate. Two outputs can be cos-sim 0.99 and still produce meaningfully different generations because activation vectors are close in Euclidean terms but decode to different tokens. The real release gate is a task-level metric — accuracy, BLEU, perplexity, grounded-answer rate — measured on a representative set. The second trap: measured peak VRAM on a synthetic input isn't production peak. KV-cache spikes from long sequences, fragmentation from variable-length requests, or a batch that lands with unusually tall activations can push a tightly-fitted config OOM. The 5-40% oom_margin_pct band exists because below 5% you'll see OOMs in production and above 40% you're burning a half-GPU.

Prereqs: PyTorch forward-pass comfort, a rough feel for GPU memory, conceptual familiarity with fp16/bf16/fp32. Preinstalled: PyTorch, CUDA, torch.autocast, JupyterLab. Grading enforces real structure: the batch sweep needs ≥4 distinct ordered batch sizes with ≥2× max/min throughput speedup, the precision sweep requires all three precisions with fp16 ≥1.15× fp32 and cos-sim >0.95 for fp16/bf16, the combined sweep must deliver ≥3× over fp32 @ batch=1, the sku_to_preferred_precision map must treat A100 and H100 distinctly, and accuracy_gate must have metric + threshold + measurement method.
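The Step 4 artifacts might look roughly like the sketch below. Field values, the extra keys, and the `gate_passes` helper are illustrative assumptions; only `oom_margin_pct`, `sku_to_preferred_precision`, and the metric/threshold/measurement requirements come from the lab spec.

```python
# Hypothetical deployment recommendation shaped to satisfy the grading checks.
production_recommendation = {
    "batch_size": 16,
    "precision": "bf16",
    "oom_margin_pct": 20,  # must land in the enforced 5-40% band
    "sku_to_preferred_precision": {"A100": "bf16", "H100": "fp8"},
}

# Task-level quality gate: metric + threshold + measurement method.
accuracy_gate = {
    "metric": "exact_match_accuracy",
    "threshold": 0.98,  # candidate must keep >=98% of the fp32 score
    "measurement": "held-out eval set, fp32 baseline vs candidate precision",
}

def gate_passes(candidate_score, fp32_score, gate):
    """CI check: block the rollout when the downstream metric regresses."""
    return candidate_score >= gate["threshold"] * fp32_score
```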

Frequently asked questions

Why sweep batch size instead of just picking the biggest that fits?

Because throughput flattens before VRAM runs out. Once you saturate the GPU's compute units, bigger batches buy you mostly latency, not samples/second — and they move your p99 further away from your SLA. The 'knee' is the point where marginal throughput per marginal VRAM drops below a useful rate. In Step 1 you'll plot the curve and pick it by eye; in production that curve also needs to respect p99 latency, which is why Step 4 asks for tradeoffs, not just the peak point.
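One way to make "marginal throughput per marginal VRAM drops below a useful rate" concrete is the heuristic below. The threshold value and the tuple layout are illustrative, not the lab's definition:

```python
def pick_knee(points, min_gain_per_mb=0.05):
    """points: (batch_size, samples_per_sec, peak_vram_mb) tuples, sorted
    by batch size. Walk the curve and keep growing the batch while each
    extra MB of VRAM still buys at least `min_gain_per_mb` samples/sec."""
    knee = points[0][0]
    for (b0, t0, m0), (b1, t1, m1) in zip(points, points[1:]):
        marginal = (t1 - t0) / max(m1 - m0, 1e-9)
        if marginal < min_gain_per_mb:
            break  # curve has flattened: paying VRAM for almost no throughput
        knee = b1
    return knee

# made-up curve with the typical shape: near-linear early, flat late
curve = [(1, 100, 500), (2, 190, 900), (4, 340, 1700),
         (8, 380, 3300), (16, 390, 6500)]
```

In Step 1 you pick the knee by eye from the plot; an automated rule like this is just one way to encode the same judgment.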

fp16 vs bf16 — when does the choice actually matter?

bf16 keeps fp32's exponent range (8 bits) but only has 7 bits of mantissa, so it resists overflow/underflow better than fp16 (5-bit exponent, 10-bit mantissa) in training and for models with wide activation ranges. For inference on well-conditioned transformers, fp16 often wins on throughput on A100/consumer cards. On H100, bf16 and fp16 are both TC-accelerated and the gap narrows. The lab has you measure cosine similarity for both — that's the sanity check before you commit.
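The exponent-range difference is easy to see with a standalone check (the value 70000 is just an example chosen to sit above fp16's max finite value of 65504):

```python
import torch

x = torch.tensor(70000.0)            # above fp16's max finite value (65504)
assert torch.isinf(x.half())         # fp16: 5-bit exponent -> overflows to inf
assert torch.isfinite(x.bfloat16())  # bf16: fp32's 8-bit exponent -> still finite

print(torch.finfo(torch.float16).max)   # 65504.0
print(torch.finfo(torch.bfloat16).max)  # ~3.39e38, same range as fp32
```

This is exactly the failure mode wide-activation models hit under fp16 autocast, and why bf16 is the safer default when the activation range is unknown.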

Why does the production recommendation include an OOM safety margin?

Because measured peak VRAM on a synthetic input isn't production peak VRAM. KV-cache spikes from long sequences, fragmentation from variable-length requests, a batch that happens to land with unusually tall activations — any of these can push a tightly-fitted config over the edge. The lab enforces a 5–40% margin in oom_margin_pct because that's the band where you're paranoid but not wasteful. Below 5% you will see OOMs in production; above 40% you're burning a half-GPU for nothing.
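A sketch of the margin check, assuming headroom is measured as a percentage of total device VRAM (the function names and GB figures are illustrative):

```python
def headroom_pct(measured_peak_gb, device_vram_gb):
    """Percent of device VRAM left free at the measured synthetic peak."""
    return 100.0 * (device_vram_gb - measured_peak_gb) / device_vram_gb

def passes_margin(measured_peak_gb, device_vram_gb, lo=5.0, hi=40.0):
    """True when the config sits inside the enforced 5-40% oom_margin_pct band:
    paranoid enough to absorb KV-cache spikes, not so padded it wastes the GPU."""
    return lo <= headroom_pct(measured_peak_gb, device_vram_gb) <= hi

# e.g. 68 GB measured peak on an 80 GB card -> 15% headroom, inside the band
```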

Is cosine similarity a good enough quality gate?

No, and that's exactly why Step 4 asks you to author an accuracy_gate with a task-level metric instead. Two model outputs can be cosine similarity 0.99 and still produce meaningfully different generations — the activation vectors are close in Euclidean terms but decode to different tokens. Cosine similarity is a useful sanity check for finding gross regressions (a precision mode that broke the model); the real release gate is the downstream metric users actually care about — accuracy, BLEU, perplexity, grounded-answer rate — measured on a representative set.
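The failure mode is easy to reproduce with made-up logits: two vectors can be nearly parallel yet argmax-decode to different tokens.

```python
import torch

a = torch.tensor([2.00, 1.99, 0.10])  # fp32 logits (illustrative numbers)
b = torch.tensor([1.99, 2.00, 0.10])  # reduced-precision logits, tiny drift

cos = torch.nn.functional.cosine_similarity(a, b, dim=0)
assert cos > 0.99                # "passes" the cosine sanity check...
assert a.argmax() != b.argmax()  # ...but greedy decoding picks a different token
```

Scaled up to a vocabulary of tens of thousands of logits over hundreds of decode steps, this is how a cos-sim-0.99 precision mode still changes generations, and why the release gate has to be a task-level metric.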

Why should my A100 and H100 preferred precisions differ?

H100 has native FP8 Tensor Cores (E4M3/E5M2) and markedly stronger bf16 throughput, so FP8 via TensorRT-LLM or vLLM is often the production choice there. A100 lacks hardware FP8, so bf16 (and fp16 on older deployments) dominate. Picking the same precision for both wastes H100 capacity or destabilizes A100 — the sku_to_preferred_precision map in Step 4 is how you surface that difference to whoever runs the deployment.
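A minimal sketch of the SKU map and lookup, assuming the precision strings and the fallback choice shown here; the lab only requires that A100 and H100 be treated distinctly.

```python
# Illustrative defaults, not the lab's graded answer.
SKU_TO_PREFERRED_PRECISION = {
    "H100": "fp8",   # native FP8 Tensor Cores (E4M3/E5M2), strong bf16
    "A100": "bf16",  # no hardware FP8; bf16 keeps fp32's exponent range
}

def preferred_precision(sku, fallback="fp16"):
    """Resolve a deployment SKU to its preferred serving precision."""
    return SKU_TO_PREFERRED_PRECISION.get(sku, fallback)
```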

What do the grading checks actually enforce?

Step 1 requires ≥4 batch-size points, distinct and sorted, with a ≥2× max/min throughput speedup and visible VRAM growth, plus a knee_batch_size that's one of the sampled batches. Step 2 requires entries for fp32, fp16, and bf16 with fp16 ≥1.15× fp32 throughput and cosine similarity >0.95 for the non-fp32 modes. Step 3 requires ≥6 combined points, an fp32 @ batch=1 baseline, a best_config that fits its stated VRAM budget and delivers ≥3× over the baseline. Step 4 requires a complete production_recommendation, a non-trivial SKU map, and an accuracy_gate dict.