Deploy & Serve LLMs in Production (Jupyter)
GPU sandbox · jupyter
Beta

Go from slow single-request inference to production-ready LLM serving with vLLM. Benchmark throughput, tune settings, and learn when to use vLLM vs Triton vs TGI.

45 min · 5 steps · 3 domains · Intermediate · ncp-genl · nca-genl

What you'll learn

  1. The Naive Approach
  2. Launch vLLM & Query It
  3. Benchmark Under Load
  4. Tune vLLM Settings
  5. Production Patterns

Prerequisites

  • Basic Python (functions, async/await)
  • Understanding of what LLMs are and how they generate text
  • Familiarity with REST APIs

Exam domains covered

Model Deployment & Inference Optimization · GPU Acceleration & Distributed Training · LLM Architecture & Infrastructure

Skills & technologies you'll practice

This intermediate-level GPU lab gives you real-world reps across:

vLLM · Inference · Serving · Continuous Batching · PagedAttention · JupyterLab

What you'll build in this vLLM-in-JupyterLab serving lab

The gap between "the model runs on my GPU" and "the model serves real users" is almost entirely about inference infrastructure, and vLLM is the default stack engineers reach for in 2026. In 45 minutes you'll benchmark the whole path cell by cell in JupyterLab: a naive model.generate() baseline, a vLLM OpenAI-compatible server, concurrent requests via AsyncOpenAI, throughput under two different --max-model-len settings, and streaming with TTFT measured. You'll walk away with a direct visual comparison of latency and throughput between naive and optimized serving, a feel for what PagedAttention and continuous batching actually buy you in numbers, an understanding of why TTFT matters more than total latency for chat UX (a silent first 2 seconds feels broken; a 200 ms TTFT followed by 60 tok/sec decode feels responsive), and a defensible answer to the production tradeoff: when to ship with max_model_len=2048 versus 512 for the capacity win.
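To see why max_model_len is a capacity knob, a back-of-envelope sketch helps. The numbers below use Llama 3 8B's published configuration (32 layers, 8 KV heads under grouped-query attention, head dim 128) and fp16 KV entries; the function name is ours, not the lab's:

```python
# Back-of-envelope KV-cache sizing for Llama 3 8B in fp16.
# Config from the model card: 32 layers, 8 KV heads (GQA), head dim 128;
# fp16 stores 2 bytes per cached value, and each token caches both K and V.
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 32, 8, 128, 2

kv_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES  # bytes per token
print(kv_per_token / 1024)  # → 128.0 (KiB per token)

def worst_case_mib(max_model_len):
    """Maximum KV-cache footprint one request can grow to, in MiB."""
    return max_model_len * kv_per_token / 2**20

# A request admitted at max_model_len=2048 can grow to 256 MiB of KV cache;
# capped at 512 it tops out at 64 MiB, so ~4x more requests fit in the
# same cache pool (PagedAttention allocates on demand, so this worst case
# only binds for requests that actually reach full context).
print(worst_case_mib(2048), worst_case_mib(512))  # → 256.0 64.0
```

This is exactly the tradeoff Step 4 measures: per-token speed is unchanged, but the admission ceiling moves.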

The technical substance is in what each measurement teaches. The naive baseline uses AutoModelForCausalLM.from_pretrained and five sequential model.generate() calls, stored in naive_times so every later cell plots against it. The vLLM server is launched from a JupyterLab terminal rather than a cell (it's a long-running HTTP server that would block the kernel) via vllm.entrypoints.openai.api_server on port 9000, and you poll /health from a cell until it answers. Concurrent benchmarking uses asyncio.gather with ten parallel requests through AsyncOpenAI, showing that one GPU serves 10 users at once at throughput that would be impossible with serial generate() calls.

Tuning with --max-model-len 512 vs 2048 surfaces the real tradeoff: shorter context doesn't make inference faster per token; it lets more concurrent requests fit in the KV cache. It's a per-request memory knob, not a speed knob. Streaming with stream=True captures each token's arrival timestamp into token_times and plots the timeline with TTFT marked as a dashed line, which is the same instrumentation real serving dashboards expose. You'll also hit the Jupyter-specific VRAM trap: del model alone doesn't free the weights, because IPython's Out[] history and the _, __, ___ aliases keep references alive. The correct cleanup is %reset out -f, then gc.collect(), then torch.cuda.empty_cache(), in that order, a sequence you'll reuse in every GPU notebook you write afterward.
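The concurrent-benchmark pattern is worth seeing in miniature. This sketch separates the timing harness from the request function; the AsyncOpenAI call shown in the comment is the shape the lab uses, while fake_request is a stand-in so the sketch runs without a GPU or server:

```python
import asyncio
import time

async def benchmark_concurrent(make_request, n=10):
    """Fire n requests at once and measure total wall time.

    With a serial loop, total time is roughly n x per-request latency;
    with asyncio.gather against a continuous-batching server like vLLM,
    it approaches the latency of the single slowest request.
    """
    start = time.perf_counter()
    results = await asyncio.gather(*(make_request(i) for i in range(n)))
    wall = time.perf_counter() - start
    return results, wall

# Against the lab's server (illustrative; adapt model name and port):
#   from openai import AsyncOpenAI
#   client = AsyncOpenAI(base_url="http://localhost:9000/v1", api_key="EMPTY")
#   async def ask(i):
#       resp = await client.chat.completions.create(
#           model="meta-llama/Meta-Llama-3-8B-Instruct",
#           messages=[{"role": "user", "content": f"Question {i}"}],
#       )
#       return resp.choices[0].message.content

async def fake_request(i):     # stand-in: simulates a 50 ms server round-trip
    await asyncio.sleep(0.05)
    return f"completion-{i}"

results, wall = asyncio.run(benchmark_concurrent(fake_request, n=10))
# Ten concurrent 50 ms requests finish in ~50 ms total, not ~500 ms.
```

The same harness, pointed at the real server, is what fills concurrent_results in Step 3.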

Prerequisites are async Python, a rough sense of how LLMs produce tokens, and REST basics; no prior vLLM experience is assumed. The sandbox is a real NVIDIA GPU pod provisioned per session, with JupyterLab, vLLM, Llama 3 8B Instruct, the OpenAI SDK, and matplotlib preinstalled. Checks run inline from cells via from preporato_labs import Lab; lab.check(N), and they inspect kernel-resident state rather than re-executing scripts: naive_times must hold 5 positive timings, vllm_times must hold 5 positive timings after the server comes up, concurrent_results must hold 10 completions with measured wall time, tune_results must hold 10 after the restart, and the final streaming step requires ttft > 0 plus a non-empty streamed_text.

Frequently asked questions

Why does del model not free VRAM in a Jupyter notebook?

Because IPython silently keeps several references you forgot about. Every cell's return value is cached in Out[<n>], and the last three outputs are aliased to _, __, and ___. If any of those point at a tensor, an inputs dict, or a generation output, the model graph they belong to stays pinned. Step 1 has you clear direct references, run %reset out -f plus %reset array -f, call gc.collect(), and only then torch.cuda.empty_cache(). After that, nvidia-smi confirms the VRAM is actually back — which is required before vLLM boots in Step 2.
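The pinning mechanism itself is plain Python reference counting. This sketch uses a dict to stand in for IPython's Out[] cache and a weakref to observe when the object actually becomes collectable; no GPU or IPython is required:

```python
import gc
import weakref

class Weights:
    """Stand-in for a model whose tensors pin VRAM while referenced."""
    pass

model = Weights()
tracker = weakref.ref(model)  # lets us observe when the object is freed
Out = {1: model}              # stands in for IPython's Out[] output cache
alias = model                 # stands in for the _ / __ / ___ aliases

del model                     # the "obvious" cleanup...
gc.collect()
still_pinned = tracker() is not None  # True: Out[1] and the alias pin it

Out.clear()                   # what `%reset out -f` does to the real Out[]
alias = None                  # drop the last-output alias
gc.collect()
freed = tracker() is None     # True: now the weights can actually be freed
```

In a real notebook the final step is torch.cuda.empty_cache(), which returns the now-unreferenced blocks from PyTorch's caching allocator to the driver so nvidia-smi reflects the release.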

Why does this lab run vLLM in a terminal instead of launching it from a cell?

Because vLLM is a long-running HTTP server, not a library call. Launching it from a notebook cell would block the kernel for the lab's duration. The JupyterLab terminal gives you a dedicated shell next to the notebook where vLLM streams its startup logs, shows the ready banner, and can be restarted with Ctrl+C between Step 3 and Step 4 — all while your notebook kernel keeps your naive_times and vllm_times lists alive for the final comparison plot.
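Concretely, the terminal session looks something like this. The module path, --model, --port, and --max-model-len flags are standard vLLM OpenAI-server options, but treat the exact model path and flag values as illustrative rather than the lab's verbatim invocation:

```shell
# In the JupyterLab terminal: start the OpenAI-compatible server.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --port 9000 \
    --max-model-len 2048

# From a notebook cell (prefix with !) or a second shell: poll until ready.
until curl -sf http://localhost:9000/health; do sleep 2; done
```

Between Step 3 and Step 4 you Ctrl+C this process and relaunch it with the other --max-model-len value; the notebook kernel, and its timing lists, never restarts.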

What does PagedAttention actually page?

The KV cache. Without paging, every sequence reserves a contiguous VRAM block sized for its max_model_len, which wastes 60-80% of the cache on unused padding. PagedAttention splits each sequence's KV cache into fixed 16-token blocks that are allocated on demand from a shared pool, indexed through a block table — structurally identical to how an OS does virtual memory. The payoff is that continuous batching can add a new request mid-decode without needing to find a contiguous slot for it, and that gpu_cache_usage_perc in the /metrics endpoint you inspect in Step 4 actually reaches 90%+ under load.
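The block-table bookkeeping is simple enough to model in a few lines. This toy (our naming, not vLLM's internals) shows the key property: a sequence holds a list of block IDs drawn on demand from a shared pool, so its footprint tracks its actual length rather than max_model_len:

```python
BLOCK = 16  # tokens per KV-cache block; 16 is vLLM's default block size

class BlockPool:
    """Toy shared pool of fixed-size KV-cache blocks."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        return self.free.pop()  # any free block will do; no contiguity needed

class Sequence:
    """A request's block table: logical position -> physical block ID."""
    def __init__(self, pool):
        self.pool, self.table, self.length = pool, [], 0

    def append_token(self):
        if self.length % BLOCK == 0:   # current block full: grab another
            self.table.append(self.pool.alloc())
        self.length += 1

pool = BlockPool(num_blocks=64)
seq = Sequence(pool)
for _ in range(40):   # a 40-token sequence...
    seq.append_token()
# ...holds ceil(40/16) = 3 blocks, not a max_model_len-sized reservation,
# and a new request can be admitted mid-decode from whatever blocks remain.
```

The missing-from-the-toy parts, block-table indirection inside the attention kernel and freeing blocks when sequences finish, are what the real implementation adds; the allocation story is the same.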

Why plot tokens-over-time in Step 5 instead of just reporting total latency?

Because for chat UX, the shape of the arrival curve matters more than the endpoint. A flat first 2 seconds followed by a burst of tokens feels broken; a 200 ms TTFT followed by a steady 60 tok/sec decode feels responsive even if total time is identical. The Step 5 plot draws token_indices against token_times and overlays TTFT and total as dashed lines, so you can literally see the prefill/decode split. That's the same instrumentation real serving dashboards expose, just in-notebook.
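The instrumentation behind that plot fits in one function. Here is a sketch of the timing harness; the commented-out lines show the shape of the OpenAI SDK's streaming response, while fake_stream is a stand-in with a deliberately slow "prefill" so the sketch runs anywhere:

```python
import time

def measure_stream(token_iter):
    """Timestamp each token's arrival relative to the request start.

    token_times[0] is the TTFT (the prefill cost); the slope of the
    remaining points is the steady-state decode rate.
    """
    start = time.perf_counter()
    token_times, pieces = [], []
    for tok in token_iter:
        token_times.append(time.perf_counter() - start)
        pieces.append(tok)
    return "".join(pieces), token_times

# Against a real server the iterator would be the streamed deltas, e.g.:
#   stream = client.chat.completions.create(..., stream=True)
#   tokens = (c.choices[0].delta.content or "" for c in stream)

def fake_stream():          # stand-in: slow prefill, then fast decode
    time.sleep(0.05)        # "prefill" delay before the first token
    for word in ["Hello", " ", "world"]:
        yield word
        time.sleep(0.005)

streamed_text, token_times = measure_stream(fake_stream())
ttft = token_times[0]       # ~0.05 s with the fake stream above
```

Plot token indices against token_times and the prefill/decode elbow appears immediately; that elbow is what the Step 5 checker's ttft > 0 requirement is really asking you to have measured.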

Can I finish the lab and come back later without re-running everything?

Yes. The sandbox pod, its kernel state, and any files you've written to the workspace persist between sessions. Close the tab mid-way through Step 3 and when you return the vLLM server may need restarting (it lives in the terminal process, not the kernel), but your naive_times, vllm_times, and any plots you've drawn are still in the notebook, and lab.check(N) will re-inspect them without re-running any cells.

How does lab.check(N) in a cell differ from the IDE version of this lab?

The IDE variant grades by re-executing standalone .py files from the workspace; the Jupyter variant grades kernel-resident variables. It's a stricter signal — the checker sees exactly what your last-run cells produced, not what a fresh interpreter would produce. It also means the assertion messages point back to specific notebook variables (naive_times, tune_tokens, streamed_text, ttft), which tend to be more helpful than file-path errors when you're iterating on a cell and want to re-check without a context switch.