Deploy & Serve LLMs in Production (Jupyter)
GPU sandbox · jupyter
Beta

Go from slow single-request inference to production-ready LLM serving with vLLM. Benchmark throughput, tune settings, and learn when to use vLLM vs Triton vs TGI.

45 min · 5 steps · 3 domains · Intermediate · ncp-genl · nca-genl

What you'll learn

  1. The Naive Approach
  2. Launch vLLM & Query It
  3. Benchmark Under Load
  4. Tune vLLM Settings
  5. Production Patterns

Prerequisites

  • Basic Python (functions, async/await)
  • Understanding of what LLMs are and how they generate text
  • Familiarity with REST APIs

Exam domains covered

Model Deployment & Inference Optimization · GPU Acceleration & Distributed Training · LLM Architecture & Infrastructure

Skills & technologies you'll practice

This intermediate-level GPU lab gives you real-world reps across:

vLLM · Inference · Serving · Continuous Batching · PagedAttention · JupyterLab

What you'll build in this vLLM-in-JupyterLab serving lab

The gap between "the model runs on my GPU" and "the model serves real users" is almost entirely about inference infrastructure, and vLLM is the default stack engineers reach for in 2026. In 45 minutes you'll benchmark the whole path cell by cell in JupyterLab: a naive model.generate() baseline, a vLLM OpenAI-compatible server, concurrent requests via AsyncOpenAI, throughput under two different --max-model-len settings, and streaming with TTFT measured. You'll walk away with a direct visual comparison of latency and throughput between naive and optimized serving, a feel for what PagedAttention and continuous batching actually buy you in numbers, an understanding of why TTFT matters more than total latency for chat UX (a silent first 2 seconds feels broken; a 200 ms TTFT followed by 60 tok/sec decode feels responsive), and a defensible answer to the production tradeoff: when to ship with max_model_len=2048 versus 512 for the capacity win.
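To see why max_model_len is a capacity knob, a back-of-envelope sketch helps. The numbers below use Llama 3 8B's published configuration (32 layers, 8 KV heads under grouped-query attention, head dim 128) and fp16 KV entries; the function name is ours, not the lab's:

```python
# Back-of-envelope KV-cache sizing for Llama 3 8B in fp16.
# Config from the model card: 32 layers, 8 KV heads (GQA), head dim 128;
# fp16 stores 2 bytes per cached value, and each token caches both K and V.
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 32, 8, 128, 2

kv_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES  # bytes per token
print(kv_per_token / 1024)  # → 128.0 (KiB per token)

def worst_case_mib(max_model_len):
    """Maximum KV-cache footprint one request can grow to, in MiB."""
    return max_model_len * kv_per_token / 2**20

# A request admitted at max_model_len=2048 can grow to 256 MiB of KV cache;
# capped at 512 it tops out at 64 MiB, so ~4x more requests fit in the
# same cache pool (PagedAttention allocates on demand, so this worst case
# only binds for requests that actually reach full context).
print(worst_case_mib(2048), worst_case_mib(512))  # → 256.0 64.0
```

This is exactly the tradeoff Step 4 measures: per-token speed is unchanged, but the admission ceiling moves.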

The technical substance is in what each measurement teaches. The naive baseline uses AutoModelForCausalLM.from_pretrained and five sequential model.generate() calls, stored in naive_times so every later cell plots against it. The vLLM server is launched from a JupyterLab terminal rather than a cell (it's a long-running HTTP server that would block the kernel) via vllm.entrypoints.openai.api_server on port 9000, and you poll /health from a cell until it answers. Concurrent benchmarking uses asyncio.gather with ten parallel requests through AsyncOpenAI, showing that one GPU serves 10 users at once at throughput that would be impossible with serial generate() calls.

Tuning with --max-model-len 512 vs 2048 surfaces the real tradeoff: shorter context doesn't make inference faster per token; it lets more concurrent requests fit in the KV cache. It's a per-request memory knob, not a speed knob. Streaming with stream=True captures each token's arrival timestamp into token_times and plots the timeline with TTFT marked as a dashed line, which is the same instrumentation real serving dashboards expose. You'll also hit the Jupyter-specific VRAM trap: del model alone doesn't free the weights, because IPython's Out[] history and the _, __, ___ aliases keep references alive. The correct cleanup is %reset out -f, then gc.collect(), then torch.cuda.empty_cache(), in that order, a sequence you'll reuse in every GPU notebook you write afterward.
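The concurrent-benchmark pattern is worth seeing in miniature. This sketch separates the timing harness from the request function; the AsyncOpenAI call shown in the comment is the shape the lab uses, while fake_request is a stand-in so the sketch runs without a GPU or server:

```python
import asyncio
import time

async def benchmark_concurrent(make_request, n=10):
    """Fire n requests at once and measure total wall time.

    With a serial loop, total time is roughly n x per-request latency;
    with asyncio.gather against a continuous-batching server like vLLM,
    it approaches the latency of the single slowest request.
    """
    start = time.perf_counter()
    results = await asyncio.gather(*(make_request(i) for i in range(n)))
    wall = time.perf_counter() - start
    return results, wall

# Against the lab's server (illustrative; adapt model name and port):
#   from openai import AsyncOpenAI
#   client = AsyncOpenAI(base_url="http://localhost:9000/v1", api_key="EMPTY")
#   async def ask(i):
#       resp = await client.chat.completions.create(
#           model="meta-llama/Meta-Llama-3-8B-Instruct",
#           messages=[{"role": "user", "content": f"Question {i}"}],
#       )
#       return resp.choices[0].message.content

async def fake_request(i):     # stand-in: simulates a 50 ms server round-trip
    await asyncio.sleep(0.05)
    return f"completion-{i}"

results, wall = asyncio.run(benchmark_concurrent(fake_request, n=10))
# Ten concurrent 50 ms requests finish in ~50 ms total, not ~500 ms.
```

The same harness, pointed at the real server, is what fills concurrent_results in Step 3.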

Prerequisites are async Python, a rough sense of how LLMs produce tokens, and REST basics; no prior vLLM experience is assumed. The sandbox is a real NVIDIA GPU pod provisioned per session, with JupyterLab, vLLM, Llama 3 8B Instruct, the OpenAI SDK, and matplotlib preinstalled. Checks run inline from cells via from preporato_labs import Lab; lab.check(N), and they inspect kernel-resident state rather than re-executing scripts: naive_times must hold 5 positive timings, vllm_times must hold 5 positive timings after the server comes up, concurrent_results must hold 10 completions with measured wall time, tune_results must hold 10 after the restart, and the final streaming step requires ttft > 0 plus a non-empty streamed_text.

Frequently asked questions

Why does del model not free VRAM in a Jupyter notebook?

Because IPython silently keeps several references you forgot about. Every cell's return value is cached in Out[<n>], and the last three outputs are aliased to _, __, and ___. If any of those point at a tensor, an inputs dict, or a generation output, the model graph they belong to stays pinned. Step 1 has you clear direct references, run %reset out -f plus %reset array -f, call gc.collect(), and only then torch.cuda.empty_cache(). After that, nvidia-smi confirms the VRAM is actually back — which is required before vLLM boots in Step 2.
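The pinning mechanism itself is plain Python reference counting. This sketch uses a dict to stand in for IPython's Out[] cache and a weakref to observe when the object actually becomes collectable; no GPU or IPython is required:

```python
import gc
import weakref

class Weights:
    """Stand-in for a model whose tensors pin VRAM while referenced."""
    pass

model = Weights()
tracker = weakref.ref(model)  # lets us observe when the object is freed
Out = {1: model}              # stands in for IPython's Out[] output cache
alias = model                 # stands in for the _ / __ / ___ aliases

del model                     # the "obvious" cleanup...
gc.collect()
still_pinned = tracker() is not None  # True: Out[1] and the alias pin it

Out.clear()                   # what `%reset out -f` does to the real Out[]
alias = None                  # drop the last-output alias
gc.collect()
freed = tracker() is None     # True: now the weights can actually be freed
```

In a real notebook the final step is torch.cuda.empty_cache(), which returns the now-unreferenced blocks from PyTorch's caching allocator to the driver so nvidia-smi reflects the release.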

Why does this lab run vLLM in a terminal instead of launching it from a cell?

Because vLLM is a long-running HTTP server, not a library call. Launching it from a notebook cell would block the kernel for the lab's duration. The JupyterLab terminal gives you a dedicated shell next to the notebook where vLLM streams its startup logs, shows the ready banner, and can be restarted with Ctrl+C between Step 3 and Step 4 — all while your notebook kernel keeps your naive_times and vllm_times lists alive for the final comparison plot.
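Concretely, the terminal session looks something like this. The module path, --model, --port, and --max-model-len flags are standard vLLM OpenAI-server options, but treat the exact model path and flag values as illustrative rather than the lab's verbatim invocation:

```shell
# In the JupyterLab terminal: start the OpenAI-compatible server.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --port 9000 \
    --max-model-len 2048

# From a notebook cell (prefix with !) or a second shell: poll until ready.
until curl -sf http://localhost:9000/health; do sleep 2; done
```

Between Step 3 and Step 4 you Ctrl+C this process and relaunch it with the other --max-model-len value; the notebook kernel, and its timing lists, never restarts.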

What does PagedAttention actually page?

The KV cache. Without paging, every sequence reserves a contiguous VRAM block sized for its max_model_len, which wastes 60-80% of the cache on unused padding. PagedAttention splits each sequence's KV cache into fixed 16-token blocks that are allocated on demand from a shared pool, indexed through a block table — structurally identical to how an OS does virtual memory. The payoff is that continuous batching can add a new request mid-decode without needing to find a contiguous slot for it, and that gpu_cache_usage_perc in the /metrics endpoint you inspect in Step 4 actually reaches 90%+ under load.
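The block-table bookkeeping is simple enough to model in a few lines. This toy (our naming, not vLLM's internals) shows the key property: a sequence holds a list of block IDs drawn on demand from a shared pool, so its footprint tracks its actual length rather than max_model_len:

```python
BLOCK = 16  # tokens per KV-cache block; 16 is vLLM's default block size

class BlockPool:
    """Toy shared pool of fixed-size KV-cache blocks."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        return self.free.pop()  # any free block will do; no contiguity needed

class Sequence:
    """A request's block table: logical position -> physical block ID."""
    def __init__(self, pool):
        self.pool, self.table, self.length = pool, [], 0

    def append_token(self):
        if self.length % BLOCK == 0:   # current block full: grab another
            self.table.append(self.pool.alloc())
        self.length += 1

pool = BlockPool(num_blocks=64)
seq = Sequence(pool)
for _ in range(40):   # a 40-token sequence...
    seq.append_token()
# ...holds ceil(40/16) = 3 blocks, not a max_model_len-sized reservation,
# and a new request can be admitted mid-decode from whatever blocks remain.
```

The missing-from-the-toy parts, block-table indirection inside the attention kernel and freeing blocks when sequences finish, are what the real implementation adds; the allocation story is the same.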

Why plot tokens-over-time in Step 5 instead of just reporting total latency?

Because for chat UX, the shape of the arrival curve matters more than the endpoint. A flat first 2 seconds followed by a burst of tokens feels broken; a 200 ms TTFT followed by a steady 60 tok/sec decode feels responsive even if total time is identical. The Step 5 plot draws token_indices against token_times and overlays TTFT and total as dashed lines, so you can literally see the prefill/decode split. That's the same instrumentation real serving dashboards expose, just in-notebook.
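The instrumentation behind that plot fits in one function. Here is a sketch of the timing harness; the commented-out lines show the shape of the OpenAI SDK's streaming response, while fake_stream is a stand-in with a deliberately slow "prefill" so the sketch runs anywhere:

```python
import time

def measure_stream(token_iter):
    """Timestamp each token's arrival relative to the request start.

    token_times[0] is the TTFT (the prefill cost); the slope of the
    remaining points is the steady-state decode rate.
    """
    start = time.perf_counter()
    token_times, pieces = [], []
    for tok in token_iter:
        token_times.append(time.perf_counter() - start)
        pieces.append(tok)
    return "".join(pieces), token_times

# Against a real server the iterator would be the streamed deltas, e.g.:
#   stream = client.chat.completions.create(..., stream=True)
#   tokens = (c.choices[0].delta.content or "" for c in stream)

def fake_stream():          # stand-in: slow prefill, then fast decode
    time.sleep(0.05)        # "prefill" delay before the first token
    for word in ["Hello", " ", "world"]:
        yield word
        time.sleep(0.005)

streamed_text, token_times = measure_stream(fake_stream())
ttft = token_times[0]       # ~0.05 s with the fake stream above
```

Plot token indices against token_times and the prefill/decode elbow appears immediately; that elbow is what the Step 5 checker's ttft > 0 requirement is really asking you to have measured.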

Can I finish the lab and come back later without re-running everything?

Yes. The sandbox pod, its kernel state, and any files you've written to the workspace persist between sessions. Close the tab mid-way through Step 3 and when you return the vLLM server may need restarting (it lives in the terminal process, not the kernel), but your naive_times, vllm_times, and any plots you've drawn are still in the notebook, and lab.check(N) will re-inspect them without re-running any cells.

How does lab.check(N) in a cell differ from the IDE version of this lab?

The IDE variant grades by re-executing standalone .py files from the workspace; the Jupyter variant grades kernel-resident variables. It's a stricter signal — the checker sees exactly what your last-run cells produced, not what a fresh interpreter would produce. It also means the assertion messages point back to specific notebook variables (naive_times, tune_tokens, streamed_text, ttft), which tend to be more helpful than file-path errors when you're iterating on a cell and want to re-check without a context switch.