Inference Serving Patterns: Dynamic Batching, Throughput, and the Triton Mental Model
Build a mini-Triton inference server in ~30 lines of Python: a dynamic batcher with max_batch_size and max_queue_delay knobs, load-tested against a naive baseline, swept for the throughput-latency tradeoff, and bridged to a real Triton config.pbtxt.
What you'll learn
1. The naive baseline: one request at a time
2. Dynamic batching: the one technique that 3-10×'s throughput
3. The throughput-latency tradeoff: sweep max_batch_size
4. Production bridge: the Triton config.pbtxt equivalent
Prerequisites
- PyTorch tensors and model forward pass
- Python threading / queue basics
- Concept of QPS, p50, p99 latency
Skills & technologies you'll practice
This intermediate-level GPU lab gives you real-world reps across dynamic batching, load testing, latency-percentile analysis, and Triton configuration.
What you'll build in this inference-serving lab
Dynamic batching is the single technique that 3-10×'s inference throughput, and most engineers using Triton or TorchServe have never actually built one. This lab fixes that: you'll write a working mini-Triton server in ~30 lines of Python — a DynamicBatcher class with a background thread, an in-memory queue, max_batch_size and max_queue_delay_ms knobs, per-request Futures — then load-test it, sweep the tradeoff curve, and translate the whole thing into a real Triton config.pbtxt. You'll walk away with a mental model of the throughput-vs-p99 curve that transfers directly to Triton, TorchServe, vLLM, or any serving stack, plus the judgment to pick max_batch_size and max_queue_delay_microseconds against an actual SLA instead of maxing QPS in isolation. About 40 minutes on a live NVIDIA GPU pod — PyTorch, futures, and the load-test harness are ready.
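The core loop described above can be sketched in roughly the advertised line count. This is a minimal illustration, not the lab's exact API: the names `DynamicBatcher`, `submit`, and `model_fn` are assumptions, and a real version would add error propagation and shutdown handling.

```python
import queue
import threading
import time
from concurrent.futures import Future

class DynamicBatcher:
    """Groups N in-flight requests into one forward pass (sketch)."""

    def __init__(self, model_fn, max_batch_size=8, max_queue_delay_ms=5.0):
        self.model_fn = model_fn                # callable: list of inputs -> list of outputs
        self.max_batch_size = max_batch_size
        self.max_queue_delay = max_queue_delay_ms / 1000.0
        self.q = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, x):
        fut = Future()
        self.q.put((x, fut))
        return fut                              # caller blocks on fut.result()

    def _loop(self):
        while True:
            x, fut = self.q.get()               # block until the first request arrives
            batch, futs = [x], [fut]
            deadline = time.monotonic() + self.max_queue_delay
            # Keep draining the queue until the batch is full or the delay expires.
            while len(batch) < self.max_batch_size:
                timeout = deadline - time.monotonic()
                if timeout <= 0:
                    break
                try:
                    x, fut = self.q.get(timeout=timeout)
                except queue.Empty:
                    break
                batch.append(x)
                futs.append(fut)
            # One forward pass for the whole batch; fan results back out.
            for f, out in zip(futs, self.model_fn(batch)):
                f.set_result(out)
```

The deadline logic is the whole trick: the first request starts a clock, and every later arrival rides along for free until the batch fills or the clock runs out.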
The substance is the knobs and what they cost. max_batch_size sets the ceiling; pushing it higher lifts QPS until larger batches' latency blows past your SLA — that's why one number isn't the answer. max_queue_delay_microseconds is the maximum extra wait you'll inflict on an early arrival in exchange for catching more requests in the same batch, and it sets a floor on the tail latency any single request can see: configure 10 ms of queue delay and a request arriving alone still waits 10 ms for no benefit. Set it too low, the batcher starves and you're back to naive serving; set it too high, tail requests exceed SLA. The production rule: subtract your network/queue budget from the user-facing SLA, and the highest batch size that stays under that wall is the answer — not the point with globally highest QPS.
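That production rule fits in a few lines. A hedged sketch: the helper name `pick_batch_size` and the shape of the sweep records (`batch_size`, `p99_ms`, `qps` keys) are illustrative assumptions, not the lab's actual data format.

```python
def pick_batch_size(sweep_results, sla_ms, network_budget_ms):
    """Highest max_batch_size whose p99 stays under the SLA wall,
    NOT the point with the globally highest QPS."""
    wall = sla_ms - network_budget_ms           # subtract network/queue budget from the SLA
    ok = [r for r in sweep_results if r["p99_ms"] <= wall]
    if not ok:
        raise ValueError("no batch size meets the SLA; use a smaller model or more GPUs")
    return max(ok, key=lambda r: r["batch_size"])["batch_size"]
```

Note that the selected point may have lower QPS than the sweep's peak; that is the point of the rule.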
The mental-model distinction engineers get wrong: dynamic (request-level) batching groups N separate incoming requests into one forward pass and returns N results — whole batch starts and ends together. Continuous (iteration-level) batching, which vLLM implements, operates at the token-generation step: at every decode step it re-packs the active sequence set, adds new ones that arrived mid-generation, evicts finished ones. You need continuous batching for LLM decoding because responses have wildly different lengths; you don't need it for a classifier that outputs one tensor per request. This lab covers the first pattern; the vLLM lab covers the second. Your mini-server maps almost 1:1 to Triton — max_batch_size → max_batch_size, max_queue_delay_ms → max_queue_delay_microseconds, thread-count → instance_group.count — which is why writing the bare-metal version once makes every production serving config readable.
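The knob-for-knob mapping above can be read straight off a minimal `config.pbtxt`. The model name, tensor names, dims, and values below are illustrative placeholders, not a config from the lab:

```
name: "classifier"
platform: "pytorch_libtorch"
max_batch_size: 8                          # same ceiling as the mini-server
input [ { name: "INPUT__0", data_type: TYPE_FP32, dims: [ 3, 224, 224 ] } ]
output [ { name: "OUTPUT__0", data_type: TYPE_FP32, dims: [ 1000 ] } ]
dynamic_batching {
  max_queue_delay_microseconds: 5000       # mini-server's max_queue_delay_ms = 5
}
instance_group [ { count: 2, kind: KIND_GPU } ]   # mini-server's thread count
```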
Prereqs: PyTorch forward-pass basics, Python threading + queue.Queue familiarity, a grasp of QPS/p50/p99. Preinstalled: PyTorch, load-test harness, JupyterLab. Grading checks real measurements: naive baseline has positive QPS with p99 ≥ p50, batched QPS beats naive by at least 1.05× (proving the batcher thread is firing), sweep_results has ≥3 points with a best QPS at least 1.1× the smallest batch (proving a real sweet spot exists), and model_config_pbtxt contains every Triton-required field plus a valid platform (pytorch_libtorch, onnxruntime_onnx, tensorrt_plan, python, or vllm).
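The QPS/p50/p99 measurements those checks run against can be produced by a small concurrent harness. A sketch under stated assumptions: the preinstalled harness is not shown here, so `load_test` and its result keys are illustrative; concurrency matters because a sequential client would never give the batcher more than one request to group.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_test(call, n_requests=200, concurrency=16):
    """Fire requests from concurrent clients; report QPS and latency percentiles."""
    def timed(_):
        t0 = time.perf_counter()
        call()
        return (time.perf_counter() - t0) * 1000.0      # per-request latency, ms
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed, range(n_requests)))
    elapsed = time.perf_counter() - start
    return {
        "qps": n_requests / elapsed,
        "p50_ms": latencies[len(latencies) // 2],
        "p99_ms": latencies[int(0.99 * (len(latencies) - 1))],
    }
```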
Frequently asked questions
How is this different from the vLLM lab?
Why does max_queue_delay_microseconds matter so much?
What's the difference between dynamic batching and continuous batching?
What does a real Triton config.pbtxt add on top of my DynamicBatcher?
A real config.pbtxt adds version_policy to control which model version is live, preferred_batch_size with a list of sizes to warm up the cuDNN autotuner on, instance_group to run multiple model replicas per GPU, metrics export for Prometheus, gRPC and HTTP endpoints, model warmup sequences, ensemble pipelines, and explicit GPU selection. The DynamicBatcher teaches the core loop; Triton adds everything around it. Step 4 specifically calls out the pieces the mini-server doesn't have.
How do I pick max_batch_size from the sweep data?
Use the sweep to find the wall, not the peak. Walk up the curve until p99 brushes your SLA headroom — subtract the network/queue budget from your user-facing SLA, and that's your wall. The highest batch size that stays under the wall is your answer, not the one with the globally highest QPS. If the curve peaks early, you're compute-bound at that batch size and you need a smaller model or more GPUs, not a bigger batch.
What does each step's grader check?
Step 1 checks naive_qps > 0, 0 < naive_p50_ms < 10_000, and naive_p99_ms >= naive_p50_ms. Step 2 requires batched_qps > naive_qps * 1.05 — proving the batcher thread is actually firing. Step 3 validates sweep_results has ≥3 entries with the required keys and that the best QPS exceeds the smallest-batch run by ≥1.1×, catching broken sweeps. Step 4 enforces that model_config_pbtxt contains name:, platform:, max_batch_size:, input, output, and dynamic_batching, and that the declared platform is one of Triton's valid backends.