Inference Serving Patterns: Dynamic Batching, Throughput, and the Triton Mental Model
Build a mini-Triton inference server in ~30 lines of Python: a dynamic batcher with max_batch_size and max_queue_delay knobs, load-tested against a naive baseline, swept for the throughput-latency tradeoff, and bridged to a real Triton config.pbtxt.
What you'll learn
1. The naive baseline: one request at a time
2. Dynamic batching: the one technique that 3-10×'s throughput
3. The throughput-latency tradeoff: sweep max_batch_size
4. Production bridge: the Triton config.pbtxt equivalent
Prerequisites
- PyTorch tensors and model forward pass
- Python threading / queue basics
- Concept of QPS, p50, p99 latency
Skills & technologies you'll practice
This intermediate-level GPU lab gives you real-world reps across dynamic batching, load testing, latency-percentile analysis, and Triton configuration.
What you'll build in this inference-serving lab
Dynamic batching is the single technique that 3-10×'s inference throughput, and most engineers using Triton or TorchServe have never actually built one. This lab fixes that: you'll write a working mini-Triton server in ~30 lines of Python — a DynamicBatcher class with a background thread, an in-memory queue, max_batch_size and max_queue_delay_ms knobs, per-request Futures — then load-test it, sweep the tradeoff curve, and translate the whole thing into a real Triton config.pbtxt. You'll walk away with a mental model of the throughput-vs-p99 curve that transfers directly to Triton, TorchServe, vLLM, or any serving stack, plus the judgment to pick max_batch_size and max_queue_delay_microseconds against an actual SLA instead of maxing QPS in isolation. About 40 minutes on a live NVIDIA GPU pod — PyTorch, futures, and the load-test harness are ready.
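The core loop described above can be sketched in roughly the advertised line count. This is a minimal illustration, not the lab's exact API: the names `DynamicBatcher`, `submit`, and `model_fn` are assumptions, and a real version would add error propagation and shutdown handling.

```python
import queue
import threading
import time
from concurrent.futures import Future

class DynamicBatcher:
    """Groups N in-flight requests into one forward pass (sketch)."""

    def __init__(self, model_fn, max_batch_size=8, max_queue_delay_ms=5.0):
        self.model_fn = model_fn                # callable: list of inputs -> list of outputs
        self.max_batch_size = max_batch_size
        self.max_queue_delay = max_queue_delay_ms / 1000.0
        self.q = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, x):
        fut = Future()
        self.q.put((x, fut))
        return fut                              # caller blocks on fut.result()

    def _loop(self):
        while True:
            x, fut = self.q.get()               # block until the first request arrives
            batch, futs = [x], [fut]
            deadline = time.monotonic() + self.max_queue_delay
            # Keep draining the queue until the batch is full or the delay expires.
            while len(batch) < self.max_batch_size:
                timeout = deadline - time.monotonic()
                if timeout <= 0:
                    break
                try:
                    x, fut = self.q.get(timeout=timeout)
                except queue.Empty:
                    break
                batch.append(x)
                futs.append(fut)
            # One forward pass for the whole batch; fan results back out.
            for f, out in zip(futs, self.model_fn(batch)):
                f.set_result(out)
```

The deadline logic is the whole trick: the first request starts a clock, and every later arrival rides along for free until the batch fills or the clock runs out.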
The substance is the knobs and what they cost. max_batch_size sets the ceiling; pushing it higher lifts QPS until larger batches' latency blows past your SLA — that's why one number isn't the answer. max_queue_delay_microseconds is the maximum extra wait you'll inflict on an early arrival in exchange for catching more requests in the same batch, and it sets a floor on the tail latency any single request can see: configure 10 ms of queue delay and a request arriving alone still waits 10 ms for no benefit. Set it too low, the batcher starves and you're back to naive serving; set it too high, tail requests exceed SLA. The production rule: subtract your network/queue budget from the user-facing SLA, and the highest batch size that stays under that wall is the answer — not the point with globally highest QPS.
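That production rule fits in a few lines. A hedged sketch: the helper name `pick_batch_size` and the shape of the sweep records (`batch_size`, `p99_ms`, `qps` keys) are illustrative assumptions, not the lab's actual data format.

```python
def pick_batch_size(sweep_results, sla_ms, network_budget_ms):
    """Highest max_batch_size whose p99 stays under the SLA wall,
    NOT the point with the globally highest QPS."""
    wall = sla_ms - network_budget_ms           # subtract network/queue budget from the SLA
    ok = [r for r in sweep_results if r["p99_ms"] <= wall]
    if not ok:
        raise ValueError("no batch size meets the SLA; use a smaller model or more GPUs")
    return max(ok, key=lambda r: r["batch_size"])["batch_size"]
```

Note that the selected point may have lower QPS than the sweep's peak; that is the point of the rule.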
The mental-model distinction engineers get wrong: dynamic (request-level) batching groups N separate incoming requests into one forward pass and returns N results — whole batch starts and ends together. Continuous (iteration-level) batching, which vLLM implements, operates at the token-generation step: at every decode step it re-packs the active sequence set, adds new ones that arrived mid-generation, evicts finished ones. You need continuous batching for LLM decoding because responses have wildly different lengths; you don't need it for a classifier that outputs one tensor per request. This lab covers the first pattern; the vLLM lab covers the second. Your mini-server maps almost 1:1 to Triton — max_batch_size → max_batch_size, max_queue_delay_ms → max_queue_delay_microseconds, thread-count → instance_group.count — which is why writing the bare-metal version once makes every production serving config readable.
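The knob-for-knob mapping above can be read straight off a minimal `config.pbtxt`. The model name, tensor names, dims, and values below are illustrative placeholders, not a config from the lab:

```
name: "classifier"
platform: "pytorch_libtorch"
max_batch_size: 8                          # same ceiling as the mini-server
input [ { name: "INPUT__0", data_type: TYPE_FP32, dims: [ 3, 224, 224 ] } ]
output [ { name: "OUTPUT__0", data_type: TYPE_FP32, dims: [ 1000 ] } ]
dynamic_batching {
  max_queue_delay_microseconds: 5000       # mini-server's max_queue_delay_ms = 5
}
instance_group [ { count: 2, kind: KIND_GPU } ]   # mini-server's thread count
```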
Prereqs: PyTorch forward-pass basics, Python threading + queue.Queue familiarity, a grasp of QPS/p50/p99. Preinstalled: PyTorch, load-test harness, JupyterLab. Grading checks real measurements: naive baseline has positive QPS with p99 ≥ p50, batched QPS beats naive by at least 1.05× (proving the batcher thread is firing), sweep_results has ≥3 points with a best QPS at least 1.1× the smallest batch (proving a real sweet spot exists), and model_config_pbtxt contains every Triton-required field plus a valid platform (pytorch_libtorch, onnxruntime_onnx, tensorrt_plan, python, or vllm).
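The QPS/p50/p99 measurements those checks run against can be produced by a small concurrent harness. A sketch under stated assumptions: the preinstalled harness is not shown here, so `load_test` and its result keys are illustrative; concurrency matters because a sequential client would never give the batcher more than one request to group.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_test(call, n_requests=200, concurrency=16):
    """Fire requests from concurrent clients; report QPS and latency percentiles."""
    def timed(_):
        t0 = time.perf_counter()
        call()
        return (time.perf_counter() - t0) * 1000.0      # per-request latency, ms
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed, range(n_requests)))
    elapsed = time.perf_counter() - start
    return {
        "qps": n_requests / elapsed,
        "p50_ms": latencies[len(latencies) // 2],
        "p99_ms": latencies[int(0.99 * (len(latencies) - 1))],
    }
```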
Frequently asked questions
How is this different from the vLLM lab?
Why does max_queue_delay_microseconds matter so much?
What's the difference between dynamic batching and continuous batching?
What does a real Triton config.pbtxt add on top of my DynamicBatcher?
A real config.pbtxt adds version_policy to control which model version is live, preferred_batch_size with a list of sizes to warm up the cuDNN autotuner on, instance_group to run multiple model replicas per GPU, metrics export for Prometheus, gRPC and HTTP endpoints, model warmup sequences, ensemble pipelines, and explicit GPU selection. The DynamicBatcher teaches the core loop; Triton adds everything around it. Step 4 specifically calls out the pieces the mini-server doesn't have.
How do I pick max_batch_size from the sweep data?
Use the sweep to find the wall, not the peak. Walk up the curve until p99 brushes your SLA headroom — subtract the network/queue budget from your user-facing SLA, and that's your wall. The highest batch size that stays under the wall is your answer, not the one with the globally highest QPS. If the curve peaks early, you're compute-bound at that batch size and you need a smaller model or more GPUs, not a bigger batch.
What does each step's grader check?
Step 1 checks naive_qps > 0, 0 < naive_p50_ms < 10_000, and naive_p99_ms >= naive_p50_ms. Step 2 requires batched_qps > naive_qps * 1.05 — proving the batcher thread is actually firing. Step 3 validates sweep_results has ≥3 entries with the required keys and that the best QPS exceeds the smallest-batch run by ≥1.1×, catching broken sweeps. Step 4 enforces that model_config_pbtxt contains name:, platform:, max_batch_size:, input, output, and dynamic_batching, and that the declared platform is one of Triton's valid backends.