Deploy & Serve LLMs in Production
GPU sandbox · IDE · Beta

Go from slow single-request inference to production-ready LLM serving with vLLM. Benchmark throughput, tune settings, and learn when to use vLLM vs Triton vs TGI.

45 min · 5 steps · 3 domains · Intermediate · NCP-GenL · NCA-GenL

What you'll learn

  1. The Naive Approach
     Loading a model with transformers and calling model.generate() works for experimentation, but it's terrible for production.
  2. Launch vLLM & Query It
     vLLM is the most popular open-source LLM serving engine, built around continuous batching and PagedAttention.
  3. Benchmark Under Load
     The real test isn't single-request speed — it's concurrent requests. This is where vLLM's continuous batching shines.
  4. Tune vLLM Settings
     vLLM has settings that directly affect performance, such as --max-model-len and --gpu-memory-utilization.
  5. Production Patterns
     Users expect to see tokens appear as they're generated. vLLM supports streaming via the standard OpenAI stream=True parameter.

Prerequisites

  • Basic Python (functions, async/await)
  • Understanding of what LLMs are and how they generate text
  • Familiarity with REST APIs

Exam domains covered

Model Deployment & Inference Optimization · GPU Acceleration & Distributed Training · LLM Architecture & Infrastructure

Skills & technologies you'll practice

This intermediate-level GPU lab gives you real-world reps across:

vLLM · Inference Serving · Continuous Batching · PagedAttention · Throughput · Latency · OpenAI API

What you'll build in this vLLM serving lab

Across five steps you'll rebuild an LLM inference stack from naive model.generate() up to a production-grade vLLM server, and you'll benchmark the gap at every stage. Step 1 loads Llama 3 8B Instruct with Hugging Face transformers and fires five sequential requests from naive_benchmark.py so you can see GPU underutilization in seconds-per-request. Step 2 boots vllm.entrypoints.openai.api_server on port 9000 and wires the OpenAI SDK at base_url=http://localhost:9000/v1. Step 3 hammers the server with ten concurrent prompts via asyncio.gather and AsyncOpenAI. Step 4 restarts vLLM with a shorter --max-model-len and a higher --gpu-memory-utilization to tune KV-cache-per-request. Step 5 turns on stream=True and measures time-to-first-token against total wall time.
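The Step 3 pattern, firing requests concurrently and comparing wall time to the sum of per-request latencies, can be sketched without a GPU at all. In the lab the coroutine would be an AsyncOpenAI chat-completion call against your vLLM server; simulate_request below is a hypothetical stub so the harness runs anywhere:

```python
import asyncio
import time

async def simulate_request(delay: float) -> float:
    """Hypothetical stand-in for an AsyncOpenAI chat-completion call."""
    start = time.perf_counter()
    await asyncio.sleep(delay)  # pretend this is network + decode time
    return time.perf_counter() - start

async def run_benchmark(n: int = 10, delay: float = 0.1) -> tuple[float, float]:
    """Fire n requests concurrently; return (wall time, summed latencies)."""
    start = time.perf_counter()
    latencies = await asyncio.gather(*(simulate_request(delay) for _ in range(n)))
    return time.perf_counter() - start, sum(latencies)

wall, total = asyncio.run(run_benchmark())
# Wall time stays near a single request's latency while the sum is roughly
# 10x larger: the same gap continuous batching opens up on a real GPU.
print(f"wall={wall:.2f}s  sum={total:.2f}s")
```

Swapping simulate_request for a real AsyncOpenAI call keeps the harness identical; only the awaited coroutine changes.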

The through-line is that LLM serving is a KV-cache-scheduling problem, not a kernel problem. PagedAttention treats the KV cache like an OS treats virtual memory — pages of 16 tokens, no contiguous allocation, no fragmentation waste — which is what lets continuous batching inject new requests mid-decode instead of waiting for a static batch to drain. You'll watch gpu_cache_usage_perc in /metrics climb as concurrency rises, and feel directly why max_model_len is the tuning knob most operators reach for: halve it and you double the number of sequences that fit in the KV cache. By the end you can reason about vLLM vs Triton vs TGI vs TensorRT-LLM without hand-waving.

Prerequisites are light — Python async/await, an understanding of what an LLM token is, and basic REST familiarity. Everything else ships in the sandbox we provision: a real NVIDIA GPU pod with vLLM preinstalled, Llama 3 8B Instruct cached at /models/meta-llama--Meta-Llama-3-8B-Instruct, and an IDE terminal that keeps your vLLM process running between steps. Grading is per-step: each check re-runs your workspace script (naive_benchmark.py, query_vllm.py, concurrent_benchmark.py, tune_benchmark.py, streaming_demo.py), verifies the server responded, and confirms you measured the metric the step asked for — TTFT, tokens-per-second, or total wall time.

Frequently asked questions

Why is transformers.generate() so much slower than vLLM for the same model?

Two reasons. First, generate() processes requests sequentially: one forward pass per new token, one request at a time, GPU idle between steps. Second, it has no paged KV cache — every sequence reserves a contiguous block sized for its maximum length, wasting 60-80% of VRAM on padding. vLLM fixes both with continuous batching (new requests join an in-flight batch mid-decode) and PagedAttention (KV cache stored as non-contiguous 16-token blocks). Same weights, same math, 10-24× throughput in practice.
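The padding-waste claim is easy to sanity-check with rough numbers (illustrative values, not measurements): if every sequence reserves a contiguous KV block sized for its maximum length but generations use far fewer tokens on average, the unused tail is pure waste.

```python
def contiguous_kv_waste(max_len: int, avg_used: int) -> float:
    """Fraction of a contiguous, max-length KV reservation left unused."""
    return 1 - avg_used / max_len

# A 2048-token reservation with generations averaging ~500 used tokens:
print(f"{contiguous_kv_waste(2048, 500):.0%} of the reservation is wasted")
```

With those assumed numbers about three quarters of the reservation sits idle, squarely inside the 60-80% band cited above; PagedAttention's 16-token blocks shrink that waste to at most one partially filled block per sequence.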

What does --gpu-memory-utilization actually control?

It's the fraction of total VRAM vLLM is allowed to claim for weights plus KV cache. At 0.85 on a 24 GB card, vLLM treats 20.4 GB as its budget: model weights take their fixed cut (~16 GB for Llama 3 8B in FP16), and the rest becomes the KV cache pool shared across all concurrent sequences. Pushing it to 0.90 or 0.92 squeezes out more concurrent capacity; going too high risks OOM when a spike of long prompts arrives. It does not change per-token speed — only how many sequences can coexist.
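The 24 GB example works out like this, as a back-of-envelope sketch using the numbers above:

```python
def kv_pool_gb(vram_gb: float, utilization: float, weights_gb: float) -> float:
    """VRAM left for the shared KV-cache pool after weights take their cut."""
    return vram_gb * utilization - weights_gb

# 24 GB card, ~16 GB of FP16 weights for Llama 3 8B:
for util in (0.85, 0.90):
    print(f"--gpu-memory-utilization {util}: "
          f"{kv_pool_gb(24, util, 16):.1f} GB KV-cache pool")
```

The jump from 0.85 to 0.90 adds about 1.2 GB of KV-cache pool, which is why a small nudge to this flag can noticeably raise concurrent capacity without touching per-token speed.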

Why does a shorter --max-model-len speed up throughput if the math per token is identical?

It doesn't speed up individual tokens; it speeds up the system. KV cache memory per sequence scales linearly with max_model_len, so halving it from 2048 to 512 lets roughly 4× more sequences fit in the same cache budget. More concurrent sequences means more useful work per forward pass means higher aggregate tokens/sec. The tradeoff is real: users who send prompts longer than your cap get truncated or rejected, so the right value is governed by your workload's P99 prompt length, not by benchmark numbers.
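In per-sequence terms, using Llama 3 8B's published shape (32 layers, 8 KV heads via GQA, head dim 128, FP16) the worst-case KV footprint and resulting capacity can be estimated as follows; this is rough arithmetic that ignores block-granularity rounding and vLLM's internal overheads:

```python
def kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    # K and V tensors, per layer: 2 * kv_heads * head_dim * dtype_bytes
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def max_sequences(pool_gb: float, max_model_len: int) -> int:
    """Worst-case sequences that fit if each reserves max_model_len tokens."""
    per_seq = kv_bytes_per_token() * max_model_len
    return int(pool_gb * 1024**3 // per_seq)

# Assuming a ~4.4 GB KV-cache pool (24 GB card at 0.85 utilization):
for mml in (2048, 512):
    print(f"max_model_len={mml}: {max_sequences(4.4, mml)} worst-case sequences")
```

The 2048-to-512 drop roughly quadruples worst-case capacity, matching the scaling argument above.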

Why measure TTFT instead of just total latency?

Because end-users perceive streaming latency as TTFT. A 4-second response that streams its first token in 200 ms feels snappy; the same 4-second response delivered as one JSON blob feels broken. TTFT is dominated by the prefill pass (processing the prompt) plus queueing; decode-tokens-per-second dominates afterward. They're separately tunable — prefill benefits from chunked prefill and tensor parallelism, decode benefits from KV cache efficiency and batching — so any serious serving dashboard tracks them as two distinct SLOs.

When should I use Triton or TensorRT-LLM instead of vLLM?

Triton shines when you serve heterogeneous models — an LLM plus an embedding model plus a reranker in one pipeline, or when you need ensemble/BLS flows. vLLM is LLM-only. TensorRT-LLM squeezes out the absolute best single-model latency on NVIDIA hardware because it pre-compiles optimized kernels per GPU arch, but it costs engine-build time and flexibility. In practice most teams start on vLLM because the OpenAI-compatible API is a drop-in, and only switch when they hit a specific ceiling that vLLM provably can't clear.

How is each step graded?

Each check script runs your workspace file with a 60-120 second timeout and verifies the side effects. Step 1 parses naive_benchmark.py and confirms it calls model.generate() and times requests. Step 2 actually hits http://localhost:9000/health and runs query_vllm.py end-to-end. Step 3 runs the concurrent benchmark and checks wall time vs aggregate latency. Step 4 confirms vLLM is reachable after your restart and prints throughput. Step 5 verifies ttft was measured during streaming. If a step needs vLLM running, starting it in the terminal is part of the exercise.