Deploy & Serve LLMs in Production
GPU sandbox · IDE · Beta

Go from slow single-request inference to production-ready LLM serving with vLLM. Benchmark throughput, tune settings, and learn when to use vLLM vs Triton vs TGI.

45 min · 5 steps · 3 domains · Intermediate · NCP-GenL · NCA-GenL

What you'll learn

  1. The Naive Approach
     Loading a model with transformers and calling model.generate() works for experimentation, but it's terrible for production.
  2. Launch vLLM & Query It
     vLLM is the most popular open-source LLM serving engine, built around continuous batching and PagedAttention.
  3. Benchmark Under Load
     The real test isn't single-request speed — it's concurrent requests. This is where vLLM's continuous batching shines.
  4. Tune vLLM Settings
     vLLM has settings that directly affect performance, such as --max-model-len and --gpu-memory-utilization.
  5. Production Patterns
     Users expect to see tokens appear as they're generated. vLLM supports streaming via the standard OpenAI stream=True parameter.

Prerequisites

  • Basic Python (functions, async/await)
  • Understanding of what LLMs are and how they generate text
  • Familiarity with REST APIs

Exam domains covered

Model Deployment & Inference Optimization · GPU Acceleration & Distributed Training · LLM Architecture & Infrastructure

Skills & technologies you'll practice

This intermediate-level GPU lab gives you real-world reps across:

vLLM · Inference Serving · Continuous Batching · PagedAttention · Throughput · Latency · OpenAI API

What you'll build in this vLLM serving lab

Across five steps you'll rebuild an LLM inference stack from naive model.generate() up to a production-grade vLLM server, and you'll benchmark the gap at every stage. Step 1 loads Llama 3 8B Instruct with Hugging Face transformers and fires five sequential requests from naive_benchmark.py so you can see GPU underutilization in seconds-per-request. Step 2 boots vllm.entrypoints.openai.api_server on port 9000 and wires the OpenAI SDK at base_url=http://localhost:9000/v1. Step 3 hammers the server with ten concurrent prompts via asyncio.gather and AsyncOpenAI. Step 4 restarts vLLM with a shorter --max-model-len and a higher --gpu-memory-utilization to tune KV-cache-per-request. Step 5 turns on stream=True and measures time-to-first-token against total wall time.
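The Step 3 pattern, firing requests concurrently and comparing wall time to the sum of per-request latencies, can be sketched without a GPU at all. In the lab the coroutine would be an AsyncOpenAI chat-completion call against your vLLM server; simulate_request below is a hypothetical stub so the harness runs anywhere:

```python
import asyncio
import time

async def simulate_request(delay: float) -> float:
    """Hypothetical stand-in for an AsyncOpenAI chat-completion call."""
    start = time.perf_counter()
    await asyncio.sleep(delay)  # pretend this is network + decode time
    return time.perf_counter() - start

async def run_benchmark(n: int = 10, delay: float = 0.1) -> tuple[float, float]:
    """Fire n requests concurrently; return (wall time, summed latencies)."""
    start = time.perf_counter()
    latencies = await asyncio.gather(*(simulate_request(delay) for _ in range(n)))
    return time.perf_counter() - start, sum(latencies)

wall, total = asyncio.run(run_benchmark())
# Wall time stays near a single request's latency while the sum is roughly
# 10x larger: the same gap continuous batching opens up on a real GPU.
print(f"wall={wall:.2f}s  sum={total:.2f}s")
```

Swapping simulate_request for a real AsyncOpenAI call keeps the harness identical; only the awaited coroutine changes.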

The through-line is that LLM serving is a KV-cache-scheduling problem, not a kernel problem. PagedAttention treats the KV cache like an OS treats virtual memory — pages of 16 tokens, no contiguous allocation, no fragmentation waste — which is what lets continuous batching inject new requests mid-decode instead of waiting for a static batch to drain. You'll watch gpu_cache_usage_perc in /metrics climb as concurrency rises, and feel directly why max_model_len is the tuning knob most operators reach for: halve it and you double the number of sequences that fit in the KV cache. By the end you can reason about vLLM vs Triton vs TGI vs TensorRT-LLM without hand-waving.

Prerequisites are light — Python async/await, an understanding of what an LLM token is, and basic REST familiarity. Everything else ships in the sandbox we provision: a real NVIDIA GPU pod with vLLM preinstalled, Llama 3 8B Instruct cached at /models/meta-llama--Meta-Llama-3-8B-Instruct, and an IDE terminal that keeps your vLLM process running between steps. Grading is per-step: each check re-runs your workspace script (naive_benchmark.py, query_vllm.py, concurrent_benchmark.py, tune_benchmark.py, streaming_demo.py), verifies the server responded, and confirms you measured the metric the step asked for — TTFT, tokens-per-second, or total wall time.

Frequently asked questions

Why is transformers.generate() so much slower than vLLM for the same model?

Two reasons. First, generate() processes requests sequentially: one forward pass per new token, one request at a time, GPU idle between steps. Second, it has no paged KV cache — every sequence reserves a contiguous block sized for its maximum length, wasting 60-80% of VRAM on padding. vLLM fixes both with continuous batching (new requests join an in-flight batch mid-decode) and PagedAttention (KV cache stored as non-contiguous 16-token blocks). Same weights, same math, 10-24× throughput in practice.
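The padding-waste claim is easy to sanity-check with rough numbers (illustrative values, not measurements): if every sequence reserves a contiguous KV block sized for its maximum length but generations use far fewer tokens on average, the unused tail is pure waste.

```python
def contiguous_kv_waste(max_len: int, avg_used: int) -> float:
    """Fraction of a contiguous, max-length KV reservation left unused."""
    return 1 - avg_used / max_len

# A 2048-token reservation with generations averaging ~500 used tokens:
print(f"{contiguous_kv_waste(2048, 500):.0%} of the reservation is wasted")
```

With those assumed numbers about three quarters of the reservation sits idle, squarely inside the 60-80% band cited above; PagedAttention's 16-token blocks shrink that waste to at most one partially filled block per sequence.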

What does --gpu-memory-utilization actually control?

It's the fraction of total VRAM vLLM is allowed to claim for weights plus KV cache. At 0.85 on a 24 GB card, vLLM treats 20.4 GB as its budget: model weights take their fixed cut (~16 GB for Llama 3 8B in FP16), and the rest becomes the KV cache pool shared across all concurrent sequences. Pushing it to 0.90 or 0.92 squeezes out more concurrent capacity; going too high risks OOM when a spike of long prompts arrives. It does not change per-token speed — only how many sequences can coexist.
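The 24 GB example works out like this, as a back-of-envelope sketch using the numbers above:

```python
def kv_pool_gb(vram_gb: float, utilization: float, weights_gb: float) -> float:
    """VRAM left for the shared KV-cache pool after weights take their cut."""
    return vram_gb * utilization - weights_gb

# 24 GB card, ~16 GB of FP16 weights for Llama 3 8B:
for util in (0.85, 0.90):
    print(f"--gpu-memory-utilization {util}: "
          f"{kv_pool_gb(24, util, 16):.1f} GB KV-cache pool")
```

The jump from 0.85 to 0.90 adds about 1.2 GB of KV-cache pool, which is why a small nudge to this flag can noticeably raise concurrent capacity without touching per-token speed.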

Why does a shorter --max-model-len speed up throughput if the math per token is identical?

It doesn't speed up individual tokens; it speeds up the system. KV cache memory per sequence scales linearly with max_model_len, so halving it from 2048 to 512 lets roughly 4× more sequences fit in the same cache budget. More concurrent sequences means more useful work per forward pass means higher aggregate tokens/sec. The tradeoff is real: users who send prompts longer than your cap get truncated or rejected, so the right value is governed by your workload's P99 prompt length, not by benchmark numbers.
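In per-sequence terms, using Llama 3 8B's published shape (32 layers, 8 KV heads via GQA, head dim 128, FP16) the worst-case KV footprint and resulting capacity can be estimated as follows; this is rough arithmetic that ignores block-granularity rounding and vLLM's internal overheads:

```python
def kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    # K and V tensors, per layer: 2 * kv_heads * head_dim * dtype_bytes
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def max_sequences(pool_gb: float, max_model_len: int) -> int:
    """Worst-case sequences that fit if each reserves max_model_len tokens."""
    per_seq = kv_bytes_per_token() * max_model_len
    return int(pool_gb * 1024**3 // per_seq)

# Assuming a ~4.4 GB KV-cache pool (24 GB card at 0.85 utilization):
for mml in (2048, 512):
    print(f"max_model_len={mml}: {max_sequences(4.4, mml)} worst-case sequences")
```

The 2048-to-512 drop roughly quadruples worst-case capacity, matching the scaling argument above.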

Why measure TTFT instead of just total latency?

Because end-users perceive streaming latency as TTFT. A 4-second response that streams its first token in 200 ms feels snappy; the same 4-second response delivered as one JSON blob feels broken. TTFT is dominated by the prefill pass (processing the prompt) plus queueing; decode-tokens-per-second dominates afterward. They're separately tunable — prefill benefits from chunked prefill and tensor parallelism, decode benefits from KV cache efficiency and batching — so any serious serving dashboard tracks them as two distinct SLOs.

When should I use Triton or TensorRT-LLM instead of vLLM?

Triton shines when you serve heterogeneous models — an LLM plus an embedding model plus a reranker in one pipeline, or when you need ensemble/BLS flows. vLLM is LLM-only. TensorRT-LLM squeezes out the absolute best single-model latency on NVIDIA hardware because it pre-compiles optimized kernels per GPU arch, but it costs engine-build time and flexibility. In practice most teams start on vLLM because the OpenAI-compatible API is a drop-in, and only switch when they hit a specific ceiling that vLLM provably can't clear.

How is each step graded?

Each check script runs your workspace file with a 60-120 second timeout and verifies the side effects. Step 1 parses naive_benchmark.py and confirms it calls model.generate() and times requests. Step 2 actually hits http://localhost:9000/health and runs query_vllm.py end-to-end. Step 3 runs the concurrent benchmark and checks wall time vs aggregate latency. Step 4 confirms vLLM is reachable after your restart and prints throughput. Step 5 verifies ttft was measured during streaming. If a step needs vLLM running, starting it in the terminal is part of the exercise.