Deploy & Serve LLMs in Production
Go from slow single-request inference to production-ready LLM serving with vLLM. Benchmark throughput, tune settings, and learn when to use vLLM vs Triton vs TGI.
What you'll learn
1. The Naive Approach: loading a model with transformers and calling model.generate() works for experimentation, but it's terrible for production.
2. Launch vLLM & Query It: vLLM is the most popular open-source LLM serving engine.
3. Benchmark Under Load: the real test isn't single-request speed — it's concurrent requests. This is where vLLM's continuous batching shines.
4. Tune vLLM Settings: vLLM has settings that directly affect performance.
5. Production Patterns: users expect to see tokens appear as they're generated. vLLM supports streaming via the standard OpenAI stream=True parameter.
Prerequisites
- Basic Python (functions, async/await)
- Understanding of what LLMs are and how they generate text
- Familiarity with REST APIs
Skills & technologies you'll practice
This intermediate-level GPU lab gives you real-world reps across vLLM serving, continuous batching and PagedAttention, KV-cache tuning, concurrent benchmarking with asyncio, and token streaming via the OpenAI SDK.
What you'll build in this vLLM serving lab
Across five steps you'll rebuild an LLM inference stack from naive model.generate() up to a production-grade vLLM server, and you'll benchmark the gap at every stage. Step 1 loads Llama 3 8B Instruct with Hugging Face transformers and fires five sequential requests from naive_benchmark.py so you can see GPU underutilisation in seconds-per-request. Step 2 boots vllm.entrypoints.openai.api_server on port 9000 and wires the OpenAI SDK at base_url=http://localhost:9000/v1. Step 3 hammers the server with ten concurrent prompts via asyncio.gather and AsyncOpenAI. Step 4 restarts vLLM with a shorter --max-model-len and a higher --gpu-memory-utilization to tune KV-cache-per-request. Step 5 turns on stream=True and measures time-to-first-token against total wall time.
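The step-3 measurement (wall time vs aggregate latency under asyncio.gather) can be prototyped without a live server. Below is a minimal sketch in which asyncio.sleep stands in for an AsyncOpenAI chat completion; fake_request and the latency numbers are invented for illustration and are not part of the lab scripts:

```python
import asyncio
import time

async def fake_request(latency_s: float) -> float:
    """Stand-in for one AsyncOpenAI chat completion; returns its own latency."""
    start = time.perf_counter()
    await asyncio.sleep(latency_s)  # real script: await client.chat.completions.create(...)
    return time.perf_counter() - start

async def main() -> None:
    latencies = [0.1] * 10  # ten "requests", ~100 ms each
    start = time.perf_counter()
    results = await asyncio.gather(*(fake_request(s) for s in latencies))
    wall = time.perf_counter() - start
    aggregate = sum(results)
    # With true concurrency, wall time tracks the slowest request,
    # not the sum of all of them.
    print(f"wall={wall:.2f}s aggregate={aggregate:.2f}s speedup={aggregate / wall:.1f}x")

asyncio.run(main())
```

Swapping the stub for real requests changes only the body of fake_request; the wall-vs-aggregate comparison is the same one concurrent_benchmark.py reports.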
The through-line is that LLM serving is a KV-cache-scheduling problem, not a kernel problem. PagedAttention treats the KV cache like an OS treats virtual memory — pages of 16 tokens, no contiguous allocation, no fragmentation waste — which is what lets continuous batching inject new requests mid-decode instead of waiting for a static batch to drain. You'll watch gpu_cache_usage_perc in /metrics climb as concurrency rises, and feel directly why max_model_len is the tuning knob most operators reach for: halve it and you double the number of sequences that fit in the KV cache. By the end you can reason about vLLM vs Triton vs TGI vs TensorRT-LLM without hand-waving.
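The halve-it-double-it arithmetic can be checked on the back of an envelope. This sketch assumes the published Llama-3-8B attention shape (32 layers, 8 KV heads under GQA, head dim 128) and an fp16 cache; the 16 GiB budget is illustrative, and vLLM's real accounting differs in the details:

```python
# Back-of-envelope KV-cache budgeting for a Llama-3-8B-shaped model.
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 32, 8, 128, 2  # fp16 assumed

def kv_bytes_per_token() -> int:
    # Factor of 2: one K tensor and one V tensor per layer.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES

def max_sequences(kv_budget_gib: float, max_model_len: int) -> int:
    budget = int(kv_budget_gib * 1024**3)
    return budget // (max_model_len * kv_bytes_per_token())

print(kv_bytes_per_token())     # 131072 bytes = 128 KiB per token
print(max_sequences(16, 2048))  # 64 sequences fit
print(max_sequences(16, 1024))  # 128: halving max_model_len doubles capacity
```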
Prerequisites are light — Python async/await, an understanding of what an LLM token is, and basic REST familiarity. Everything else ships in the sandbox we provision: a real NVIDIA GPU pod with vLLM preinstalled, Llama 3 8B Instruct cached at /models/meta-llama--Meta-Llama-3-8B-Instruct, and an IDE terminal that keeps your vLLM process running between steps. Grading is per-step: each check re-runs your workspace script (naive_benchmark.py, query_vllm.py, concurrent_benchmark.py, tune_benchmark.py, streaming_demo.py), verifies the server responded, and confirms you measured the metric the step asked for — TTFT, tokens-per-second, or total wall time.
Frequently asked questions
Why is transformers.generate() so much slower than vLLM for the same model?
Two reasons. First, generate() processes requests sequentially: one forward pass per new token, one request at a time, with the GPU idle between steps. Second, it has no paged KV cache — every sequence reserves a contiguous block sized for its maximum length, wasting 60-80% of VRAM on padding. vLLM fixes both with continuous batching (new requests join an in-flight batch mid-decode) and PagedAttention (KV cache stored as non-contiguous 16-token blocks). Same weights, same math, 10-24× throughput in practice.
What does --gpu-memory-utilization actually control?
The fraction of total GPU VRAM vLLM claims at startup for model weights plus the KV-cache pool. Raising it leaves more room for KV-cache blocks, and therefore more concurrent sequences, at the cost of headroom for anything else running on the GPU.
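The 60-80% padding figure is easy to reproduce with toy numbers. The sketch below uses made-up sequence lengths; nothing here comes from the lab itself:

```python
def padding_waste(seq_lens, reserved_len):
    """Fraction of reserved KV slots that hold padding, not real tokens."""
    used = sum(min(n, reserved_len) for n in seq_lens)
    reserved = reserved_len * len(seq_lens)
    return 1 - used / reserved

# A batch of chat turns of varying length, each reserving 2048 slots up front:
lens = [180, 420, 95, 1300, 260, 610, 150, 800]
print(f"{padding_waste(lens, 2048):.0%} of the reservation is padding")  # 77%
```

PagedAttention sidesteps this entirely: 16-token blocks are allocated only as a sequence actually grows.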
Why does a shorter --max-model-len speed up throughput if the math per token is identical?
Because vLLM budgets KV-cache capacity per sequence against max_model_len, so dropping it from 2048 to 512 lets roughly 4× more sequences fit in the same cache budget. More concurrent sequences means more useful work per forward pass means higher aggregate tokens/sec. The tradeoff is real: users who send prompts longer than your cap get truncated or rejected, so the right value is governed by your workload's P99 prompt length, not by benchmark numbers.
Why measure TTFT instead of just total latency?
Because in a streaming UI, time-to-first-token is what users actually feel. Two servers with identical total latency behave very differently if one starts streaming after 200 ms and the other sits silent for three seconds. TTFT isolates queueing and prefill delay from decode time, which grows with output length, so it is the metric to watch as concurrency rises.
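The TTFT-vs-total split can be prototyped against a fake stream before pointing it at a real stream=True response. In this sketch, fake_stream and its delays are invented for illustration; the real version would iterate over the chunks of an OpenAI streaming completion:

```python
import asyncio
import time

async def measure_ttft(stream):
    """Consume a token stream; return (ttft, total) in seconds."""
    start = time.perf_counter()
    ttft = None
    async for _token in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first token landed
    return ttft, time.perf_counter() - start

async def fake_stream():
    # Stand-in for a stream=True response: slow prefill, then fast decode.
    await asyncio.sleep(0.20)
    yield "Hello"
    for _ in range(20):
        await asyncio.sleep(0.01)
        yield " tok"

ttft, total = asyncio.run(measure_ttft(fake_stream()))
print(f"TTFT={ttft * 1000:.0f} ms of {total * 1000:.0f} ms total")
```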
When should I use Triton or TensorRT-LLM instead of vLLM?
vLLM is the usual default for self-hosted serving: strong throughput, an OpenAI-compatible API, and easy setup. TensorRT-LLM compiles model-specific kernels for NVIDIA GPUs and can win on raw latency when you are willing to pay the build and tuning cost. Triton Inference Server is a general serving layer rather than an LLM engine; reach for it when you need many heterogeneous models behind one endpoint (it can run TensorRT-LLM or vLLM as backends). TGI is Hugging Face's engine and fits naturally in HF-centric stacks.
How is each step graded?
Step 1 runs naive_benchmark.py and confirms it calls model.generate() and times requests. Step 2 actually hits http://localhost:9000/health and runs query_vllm.py end-to-end. Step 3 runs the concurrent benchmark and checks wall time vs aggregate latency. Step 4 confirms vLLM is reachable after your restart and prints throughput. Step 5 verifies TTFT was measured during streaming. If a step needs vLLM running, starting it in the terminal is part of the exercise.