GPU Sharing: Streams, MPS, MIG, and the Real Cost of Contention
Measure four ways to share a single GPU — CUDA streams, multi-process time-slicing, MPS, and MIG — and write the production artifacts (start scripts, k8s device-plugin ConfigMaps, MIG geometries) that turn 15%-utilized fleets into 80%-utilized ones.
What you'll learn
- In-process sharing: CUDA streams
- Multi-process sharing WITHOUT MPS: the hidden penalty
- MPS: the production artifact
- Kubernetes: time-slicing, MPS, and MIG configs
Prerequisites
- Comfortable with CUDA basics and PyTorch tensors
- Basic Kubernetes — Deployments, ConfigMaps, resource requests
- Understanding of processes vs threads and context switching
What you'll build in this GPU sharing lab
GPU sharing is the lever every AI platform team pulls when fleet utilization sits at 15% and the CFO asks why. A single H100 or A100-80GB is aggressively overbuilt for most individual inference replicas, and the gap between 'one pod, one GPU' and 'forty small models on one card at 80% utilization' is worth six figures a month on any serious fleet. But 'GPU sharing' is actually four different products — CUDA streams, naive multi-process time-slicing, MPS (Multi-Process Service), and MIG (Multi-Instance GPU) — each with different isolation guarantees, different failure modes, and different Kubernetes wiring. This lab makes you measure all four on the same card: a two-stream GEMM test showing real concurrency overlap, a two-subprocess contention test showing ~2× per-process slowdown without MPS, a full MPS control-daemon configuration, and Kubernetes artifacts for time-slicing ConfigMaps and MIG geometries. You leave with a decision framework — 'pack 40 small inference models onto one A100-80GB: which mode, and what breaks if you pick wrong?' — and the production YAML to implement whichever answer you reached. 45 minutes on a real NVIDIA GPU pod we provision.
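The first of those four measurements, the two-stream GEMM test, can be sketched in PyTorch. This is a minimal sketch, not the lab's exact harness: the matrix size, iteration count, and function name are illustrative, and it assumes torch plus a CUDA device.

```python
import time

import torch


def gemm_pair(n: int = 4096, iters: int = 20, use_streams: bool = False) -> float:
    """Time two independent GEMM workloads, either serialized on the
    default stream or issued onto two separate CUDA streams."""
    a = torch.randn(n, n, device="cuda")
    b = torch.randn(n, n, device="cuda")
    s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        if use_streams:
            with torch.cuda.stream(s1):
                a @ a  # kernel launches are async; these two can overlap
            with torch.cuda.stream(s2):
                b @ b
        else:
            a @ a  # same work, all serialized on the default stream
            b @ b
    torch.cuda.synchronize()
    return time.perf_counter() - t0


if torch.cuda.is_available():
    serial = gemm_pair(use_streams=False)
    overlapped = gemm_pair(use_streams=True)
    # The lab's pass bar for stream concurrency is a >= 1.05x speedup.
    print(f"speedup: {serial / overlapped:.2f}x")
```

Note that large GEMMs already saturate the SMs, so overlap gains here are modest by design; smaller kernels, or compute/transfer overlap, show larger wins.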
The measurements matter because the intuitions are wrong. Time-slicing via nvidia-device-plugin with replicas: N does not give concurrent GPU execution — it round-robins CUDA contexts at the driver with ~10 ms context-switch jitter per swap, and kubectl describe node showing full GPU subscription is cosmetic, not real. You'll see a 1.5-2.5× per-process wall-clock slowdown when two processes share the card without MPS. MPS is the production answer inside one trust boundary: nvidia-cuda-mps-control -d runs a daemon that merges all client CUDA contexts into one, the hardware scheduler then interleaves kernels at the SM level, and CUDA_MPS_ACTIVE_THREAD_PERCENTAGE caps each client's SM share. MPS requires Linux with the GPU in EXCLUSIVE_PROCESS compute mode — WSL2, Docker Desktop, and some container runtimes are out. MIG is the hard-isolation answer, available on A100 / H100 / H200 / B100 / B200 / GH200: fixed-shape hardware partitions (1g.5gb, 1g.10gb, 2g.20gb, 3g.20gb, 3g.40gb, 7g.80gb) each with their own L2, memory, and SMs. A noisy neighbor literally cannot steal from you, at the cost of static geometries and a disruptive reset to change them. And the quiet win of CUDA streams is compute/transfer overlap — H2D copies, compute, and D2H on separate streams let the copy engines work in parallel with the SMs, recovering 20-30% of wall-clock on batched inference servers.
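The replicas: N time-slicing described at the top of this section is configured through the nvidia-device-plugin's sharing config. A sketch of the ConfigMap shape, with the namespace, ConfigMap name, and replica count as illustrative choices:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # 4 pods can request this one physical GPU
```

With this applied, four pods requesting nvidia.com/gpu: 1 land on one card — but, as the paragraph above stresses, they round-robin rather than run concurrently.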
Prereqs: CUDA basics and PyTorch tensors, basic Kubernetes (Deployments, ConfigMaps, resources.requests/limits), and a mental model of processes vs threads. Preinstalled on the lab pod: PyTorch, the CUDA toolkit, nvidia-smi, the NVIDIA driver. Grading is concrete at every step: stream concurrency must show ≥1.05× speedup, multi-process contention must show ≥1.2× per-process slowdown (capped at 3.5× to catch implausible numbers), MPS configs must reference nvidia-cuda-mps-control -d plus the CUDA_MPS_PIPE_DIRECTORY / CUDA_MPS_LOG_DIRECTORY / CUDA_MPS_ACTIVE_THREAD_PERCENTAGE env vars and the echo quit | nvidia-cuda-mps-control shutdown pattern, and MIG profiles must reference at least one canonical NVIDIA slice shape. The reflection step forces the real-world call: pack 40 small models onto one A100-80GB — time-slicing fails on latency, MPS scales throughput but gives no memory isolation, MIG gives hard isolation at the cost of fixed partitions. There's a right answer, and the numbers in the lab prove it.
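The MPS artifact the grading checks for can be sketched as a start script. This is a hedged sketch, assuming root on bare-metal Linux with GPU 0; the /tmp directory paths are illustrative:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Clients find the daemon through these two directories.
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log
mkdir -p "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"

# MPS requires the GPU in EXCLUSIVE_PROCESS compute mode.
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS

# Start the control daemon; client processes launched with the same
# env vars attach to it and share one merged CUDA context.
nvidia-cuda-mps-control -d

# Optional: cap each client's SM share before launching clients.
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=25

# Shutdown pattern, for later:
#   echo quit | nvidia-cuda-mps-control
#   nvidia-smi -i 0 -c DEFAULT
```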
Frequently asked questions
Does time-slicing give me concurrent GPU execution?
nvidia-device-plugin with replicas: N simply lets N pods schedule onto the same GPU; the driver still runs their CUDA contexts one at a time in a round-robin with ~10 ms switch overhead. You get higher GPU reservation utilization (kubectl describe node shows full subscription) but you do not get parallel SM execution. For actual concurrency inside a single tenant you need MPS; for concurrency with isolation across tenants you need MIG.
When does MPS beat MIG and when does MIG beat MPS?
MPS wins inside a single trust boundary, where you want maximum throughput and can tolerate shared memory: all client contexts merge into one, the hardware scheduler interleaves kernels at the SM level, and CUDA_MPS_ACTIVE_THREAD_PERCENTAGE caps each client's SM share without a hard partition. MIG wins when you need hardware isolation for SLA or compliance reasons — each slice has its own L2, memory, and SMs, so a noisy neighbor literally cannot steal from you. The cost is fixed-shape partitions (1g.5gb, 3g.40gb, 7g.80gb) chosen at provisioning time; you cannot resize a MIG geometry on the fly without a reset.
Why does default multi-process sharing without MPS cause a 2× slowdown?
Without MPS, each process owns its own CUDA context and the driver runs only one context on the GPU at a time. Two processes therefore round-robin the whole card: each gets roughly half the wall-clock plus ~10 ms of context-switch overhead per swap, which is exactly the 1.5-2.5× per-process slowdown the lab's two-subprocess test measures.
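That contention measurement can be reproduced by timing the same workload solo and then with two concurrent processes. A minimal sketch, not the lab's harness: the worker body, sizes, and iteration count are illustrative, and it assumes torch plus a CUDA device at runtime.

```python
import subprocess
import sys
import time

# Illustrative worker: a fixed amount of GEMM work, run in its own
# process and therefore with its own CUDA context.
WORKER = """
import torch
a = torch.randn(4096, 4096, device="cuda")
for _ in range(50):
    a = a @ a
    a = a / a.norm()  # keep values finite across iterations
torch.cuda.synchronize()
"""


def run(n_procs: int) -> float:
    """Wall-clock time for n_procs copies of WORKER running concurrently."""
    t0 = time.perf_counter()
    procs = [subprocess.Popen([sys.executable, "-c", WORKER])
             for _ in range(n_procs)]
    for p in procs:
        p.wait()
    return time.perf_counter() - t0


if __name__ == "__main__":
    try:
        import torch
        has_gpu = torch.cuda.is_available()
    except ImportError:
        has_gpu = False
    if has_gpu:
        solo = run(1)
        contended = run(2)
        # Without MPS the two contexts time-slice, so per-process
        # wall-clock roughly doubles; the lab's pass band is 1.2x-3.5x.
        print(f"per-process slowdown: {contended / solo:.2f}x")
```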
Does MPS work on WSL2, Docker Desktop, or consumer RTX cards?
MPS requires Linux with the GPU in EXCLUSIVE_PROCESS compute mode (set via nvidia-smi -c EXCLUSIVE_PROCESS), plus enough privilege to start the MPS control daemon. That rules out WSL2 and Docker Desktop. It works fine on bare-metal Linux and on Linux containers with --ipc=host access to the MPS pipe directory. Consumer cards (RTX 3090, 4090, 5090) do support MPS in theory, but MIG is a datacenter-only feature — only A100, H100, H200, B100, B200, and GH200 variants support hardware partitioning.
How do I pick a MIG geometry for a mixed inference fleet?
For a fleet of uniform small models, run 1g.10gb x7 on an 80 GB A100; for a mix of small and mid-sized models 3g.40gb + 2g.20gb + 2g.20gb makes sense; for a couple of large LLM shards 3g.40gb x2 or 7g.80gb x1. The lab's mig_profiles dict encodes exactly this shape — name, target hardware, partitions list, and the use case each one was designed for. Remember a MIG reset is disruptive, so over-partition slightly in anticipation of workload growth.
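A plausible sketch of the shape that mig_profiles dict describes — the field names and a memory-summing helper are assumptions for illustration; only the slice shapes themselves are canonical NVIDIA profiles:

```python
# Hypothetical MIG geometry catalog: name, target hardware, partition
# list, and the use case each geometry is designed for.
mig_profiles = {
    "many-small": {
        "hardware": "A100-80GB",
        "partitions": ["1g.10gb"] * 7,
        "use_case": "seven uniform small inference models",
    },
    "mixed": {
        "hardware": "A100-80GB",
        "partitions": ["3g.40gb", "2g.20gb", "2g.20gb"],
        "use_case": "one mid-sized model plus two small ones",
    },
    "large-shards": {
        "hardware": "A100-80GB",
        "partitions": ["3g.40gb", "3g.40gb"],
        "use_case": "two large LLM shards",
    },
}


def total_memory_gb(profile: str) -> int:
    """Sum the memory implied by each slice's 'Ng.Mgb' name, as a
    sanity check that a geometry fits the card."""
    return sum(int(p.split(".")[1][:-2])
               for p in mig_profiles[profile]["partitions"])


print(total_memory_gb("many-small"))  # 7 slices x 10 GB -> 70
```

A check like total_memory_gb catches the classic provisioning mistake of composing slices that exceed the card's 80 GB before you pay for the disruptive reset.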