vLLM Production Serving: PagedAttention, Continuous Batching, Prefix Caching
GPU sandbox · jupyter


Stand up vLLM and measure the three features that make it the de facto inference server: PagedAttention's KV-cache capacity, continuous batching throughput, and prefix caching speedups. Then write the production spec — server args, Kubernetes deployment, monitoring, autoscaling.

55 min · 4 steps · 3 domains · Advanced · ncp-genl · ncp-aio · nca-genl

What you'll learn

  1. Load vLLM and inspect PagedAttention capacity
  2. Continuous batching throughput
  3. Prefix caching
  4. Production: server args, Kubernetes, monitoring, autoscaling

Prerequisites

  • Comfortable running Python scripts on a GPU
  • Familiarity with LLM inference concepts (KV cache, batching, decoding)
  • Basic Kubernetes YAML knowledge

Exam domains covered

Model Deployment & Inference Optimization · LLM Integration and Development · GPU Acceleration & Distributed Training

Skills & technologies you'll practice

This advanced-level GPU lab gives you real-world reps across:

vLLM · PagedAttention · Continuous Batching · Prefix Caching · Inference · Kubernetes · Monitoring · Autoscaling

What you'll measure in this vLLM production-serving lab

vLLM is the default LLM inference server in 2026 for one reason: PagedAttention plus continuous batching plus prefix caching let one GPU serve an order of magnitude more concurrent users than a naive model.generate() loop. In 55 minutes you'll stand up a real vLLM engine, measure each of those three features individually against a baseline, and leave with the complete production spec — CLI args, Kubernetes Deployment YAML with nvidia.com/gpu scheduling and probes, a monitoring stack that includes vLLM-specific signals (TTFT, KV-cache utilization, queue depth), and an autoscaling policy that doesn't fall over on mixed prompt lengths. The mental model you walk away with is that LLM serving is dominated by the KV cache, not compute — and that every knob in your production config is really a KV-cache knob in disguise.

The technical substance is three measurements and one spec. PagedAttention stores per-token K/V in fixed-size blocks (typically 16 tokens) and maps logical positions to physical blocks through a table, so a 50-token generation consumes 4 blocks (three full plus one partial) instead of a pre-allocated 2048-token slot — you'll pull block_size and max_concurrency off llm.llm_engine.cache_config and see the multiplier directly. Continuous (iteration-level) batching repacks the active set at every decode step, admits newly-arrived requests mid-flight, and evicts finished ones, giving ≥1.3× throughput over a single-request baseline even on a tiny model. Prefix caching reuses KV blocks for shared prompt prefixes — system prompts, RAG retrieved context, few-shot example blocks — which is why chat-heavy and RAG workloads see the biggest wins. By the end you're defending the autoscaling metric by hand: raw QPS is misleading (an 8k-token request is not equivalent to a 100-token one), GPU utilization saturates at 100% long before throughput plateaus, TTFT is a lagging latency signal good only for scale-up triggers, and KV-cache utilization or queue depth is the real saturation signal PagedAttention exposes.
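The block accounting is simple enough to check by hand. A back-of-envelope sketch, using the 16-token block size and 2048-token slot from the text:

```python
import math

def kv_blocks_needed(tokens: int, block_size: int = 16) -> int:
    # Blocks a sequence occupies under paged allocation; the last block may be partial.
    return math.ceil(tokens / block_size)

paged = kv_blocks_needed(50)     # short generation: 3 full blocks + 1 partial = 4
naive = kv_blocks_needed(2048)   # a naive allocator reserves the full slot: 128 blocks
print(paged, naive, naive // paged)   # 4 128 32
```

Same VRAM, ~32× more short sequences resident — that multiplier is what max_concurrency surfaces.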

You need comfort running Python on a GPU, familiarity with KV cache / batching / decoding concepts, and basic Kubernetes YAML — no prior vLLM experience assumed. The sandbox is a real NVIDIA GPU pod we provision per session with vLLM, CUDA, and a chat-capable small model preinstalled. Checks enforce live behavior: the engine must actually load and produce a non-trivial response, max_concurrency ≥4, continuous batching throughput ≥1.3× the single-request baseline on ≥16 prompts, prefix-caching warm run ≥1.05× faster than cold with ≥50 shared tokens, and the production spec must include every required CLI flag (--model, --max-model-len, --gpu-memory-utilization), Kubernetes primitive (nvidia.com/gpu, readiness/liveness probes, /v1/completions), ≥4 monitoring metrics with at least one vLLM-specific, and complete autoscaling bounds.

Frequently asked questions

Why does PagedAttention matter for throughput?

Because classical KV caches pre-allocate the maximum sequence length per slot, wasting enormous amounts of VRAM on short generations and fragmenting the free list. PagedAttention stores K/V in fixed-size blocks (typically 16 tokens) and uses a block table to map logical token positions to physical blocks — so a 50-token generation consumes 4 blocks instead of a full 128-block (2048-token) slot. The KV cache then holds far more concurrent sequences with the same VRAM, and max_concurrency in the lab's output shows the multiplier directly.

Continuous batching vs dynamic batching — what's the actual difference?

Dynamic batching groups N independent requests into one forward pass and returns N results — the whole batch starts and finishes together. Continuous (iteration-level) batching operates at each decode step: after generating one token for every active sequence, vLLM re-packs the batch, admits newly-arrived requests mid-stream, and evicts sequences that hit EOS. It only makes sense for autoregressive decoding, where responses have wildly different lengths — continuous batching is why LLM serving throughput scales with concurrency instead of collapsing on tail-length variance.
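A minimal harness for that comparison, assuming a `generate` callable that returns per-prompt generated-token counts; the vLLM wiring in the comments, including the model name, is an assumption:

```python
import time

def tokens_per_second(generate, prompts):
    # Wall-clock throughput for any callable returning generated-token counts.
    start = time.perf_counter()
    counts = generate(prompts)
    return sum(counts) / (time.perf_counter() - start)

# Hypothetical vLLM wiring -- engine construction and model name are assumptions:
#   llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
#   batched = tokens_per_second(
#       lambda ps: [len(o.outputs[0].token_ids) for o in llm.generate(ps)], prompts)
#   sequential = tokens_per_second(
#       lambda ps: [len(llm.generate([p])[0].outputs[0].token_ids) for p in ps], prompts)
#   # Continuous batching should give batched / sequential >= 1.3 on >= 16 prompts.
```

The sequential lambda deliberately feeds one prompt per call, which is exactly the naive loop the lab baselines against.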

When does prefix caching actually pay off?

When you have a long shared prefix across requests and the model re-runs it each time: long system prompts ('You are a helpful assistant who...'), RAG where the retrieved context dominates, few-shot prompting with a fixed block of examples, and agents that replay a shared policy preamble. vLLM hashes the token sequence and reuses the KV blocks computed for matching prefixes. The lab's Step 3 populates cache_use_cases with exactly these scenarios — it's the list you'd use to decide whether to turn the flag on in your own deployment.
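A sketch of the cold/warm measurement, assuming an engine built with `enable_prefix_caching=True`; the model name and sampling setup in the comments are assumptions:

```python
import time

def timed(run, prompts):
    # Wall-clock seconds for one generation pass over the prompt list.
    start = time.perf_counter()
    run(prompts)
    return time.perf_counter() - start

SYSTEM = "You are a meticulous assistant who answers concisely. " * 10  # well over 50 shared tokens
questions = ["What is a KV cache?", "Define TTFT.", "What does --max-num-seqs do?"]
prompts = [SYSTEM + "\n\nUser: " + q for q in questions]

# Hypothetical vLLM wiring -- construction details are assumptions:
#   llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", enable_prefix_caching=True)
#   run = lambda ps: llm.generate(ps, SamplingParams(max_tokens=64))
#   cold = timed(run, prompts)   # first pass computes the shared-prefix KV blocks
#   warm = timed(run, prompts)   # second pass reuses them; expect cold / warm >= 1.05
```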

What makes a good vLLM autoscaling metric?

Not raw QPS — an 8k-token request consumes vastly more KV cache than a 100-token one, so counting requests misses the actual saturation signal. KV-cache utilization or queue depth capture the real pressure on PagedAttention, TTFT (time-to-first-token) is a lagging latency signal good for scale-up triggers, and GPU utilization alone is misleading because it saturates at 100% long before throughput plateaus. The lab's reflection at the end asks you to defend the choice you made, with specific failure modes for each alternative.
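One way to encode that priority ordering as a scale-up trigger; every threshold here is an illustrative assumption, not a recommendation:

```python
def should_scale_up(kv_cache_util: float, queue_depth: int, ttft_p95_s: float,
                    kv_high: float = 0.90, queue_high: int = 8, ttft_high: float = 2.0) -> bool:
    # KV-cache utilization and queue depth are the primary saturation signals;
    # p95 TTFT is a lagging confirmation kept only as a scale-up backstop.
    return (kv_cache_util >= kv_high) or (queue_depth >= queue_high) or (ttft_p95_s >= ttft_high)

print(should_scale_up(0.95, 0, 0.3))   # True: KV cache nearly full, even with an empty queue
print(should_scale_up(0.40, 2, 0.3))   # False: plenty of headroom on every signal
```

Note what's absent: QPS and GPU utilization don't appear at all, for the reasons above.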

Do I need to configure vllm_server_args by hand or is there a default?

There are defaults but they're rarely what you want. --gpu-memory-utilization (how much VRAM vLLM claims for weights + KV cache, usually 0.85–0.95) and --max-model-len (context window ceiling, which trades per-request length for total concurrency) are the two knobs with the biggest impact. --max-num-seqs caps parallel sequences. The lab requires --model, --host, --port, --max-model-len, and --gpu-memory-utilization in your CLI list because those are the non-default-safe ones for production.
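Those flags assemble into a `vllm serve` invocation like the one below; the model name and every value are illustrative assumptions, not the lab's checked answers:

```python
# Assembling a production launch command; all values here are assumptions.
args = [
    "vllm", "serve", "meta-llama/Llama-3.1-8B-Instruct",
    "--host", "0.0.0.0",
    "--port", "8000",
    "--max-model-len", "8192",            # context ceiling: per-request length vs. total concurrency
    "--gpu-memory-utilization", "0.90",   # VRAM share claimed for weights + KV cache
    "--max-num-seqs", "256",              # cap on simultaneously scheduled sequences
]
print(" ".join(args))
```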

What does each grader enforce?

Step 1 requires first_output to be a non-trivial string, kv_cache_info to include max_tokens/max_concurrency/block_size, with block_size >= 8 and max_concurrency >= 4. Step 2 requires ≥16 prompts in the batch, throughput ≥50 tok/s, and ≥1.3× speedup over single-request. Step 3 needs ≥50 shared prefix tokens, warm < cold, speedup ≥1.05×, and ≥3 cache_use_cases. Step 4 validates every required CLI flag, Kubernetes Deployment primitives (nvidia.com/gpu, readiness/liveness probes, /v1/completions endpoint), ≥4 monitoring metrics with at least one vLLM-specific, and a complete autoscaling policy.