GPU Sharing: Streams, MPS, MIG, and the Real Cost of Contention
GPU sandbox · jupyter

Measure four ways to share a single GPU — CUDA streams, multi-process time-slicing, MPS, and MIG — and write the production artifacts (start scripts, k8s device-plugin ConfigMaps, MIG geometries) that turn 15%-utilized fleets into 80%-utilized ones.

45 min · 4 steps · 3 domains · Advanced · ncp-aio · nca-aiio · ncp-ads · ncp-aii

What you'll learn

  1. In-process sharing: CUDA streams
  2. Multi-process sharing WITHOUT MPS: the hidden penalty
  3. MPS: the production artifact
  4. Kubernetes: time-slicing, MPS, and MIG configs

Prerequisites

  • Comfortable with CUDA basics and PyTorch tensors
  • Basic Kubernetes — Deployments, ConfigMaps, resource requests
  • Understanding of processes vs threads and context switching

Exam domains covered

GPU Acceleration & Distributed Training · Model Deployment & Inference Optimization · AI Infrastructure & Operations

Skills & technologies you'll practice

This advanced-level GPU lab gives you real-world reps across:

CUDA Streams · MPS · MIG · Kubernetes · Device Plugin · GPU Utilization · Multi-tenancy · nvidia-smi

What you'll build in this GPU sharing lab

GPU sharing is the lever every AI platform team pulls when fleet utilization sits at 15% and the CFO asks why. A single H100 or A100-80GB is aggressively overbuilt for most individual inference replicas, and the gap between 'one pod, one GPU' and 'forty small models on one card at 80% utilization' is worth six figures a month on any serious fleet. But 'GPU sharing' is actually four different products — CUDA streams, naive multi-process time-slicing, MPS (Multi-Process Service), and MIG (Multi-Instance GPU) — each with different isolation guarantees, different failure modes, and different Kubernetes wiring. This lab makes you measure all four on the same card: a two-stream GEMM test showing real concurrency overlap, a two-subprocess contention test showing ~2× per-process slowdown without MPS, a full MPS control-daemon configuration, and Kubernetes artifacts for time-slicing ConfigMaps and MIG geometries. You leave with a decision framework — 'pack 40 small inference models onto one A100-80GB: which mode, and what breaks if you pick wrong?' — and the production YAML to implement whichever answer you reached. 45 minutes on a real NVIDIA GPU pod we provision.

The measurements matter because the intuitions are wrong. Time-slicing via nvidia-device-plugin with replicas: N does not give concurrent GPU execution — it round-robins CUDA contexts at the driver with ~10 ms context-switch jitter per swap, and kubectl describe node showing full GPU subscription is cosmetic, not real. You'll see a 1.5-2.5× per-process wall-clock slowdown when two processes share the card without MPS. MPS is the production answer inside one trust boundary: nvidia-cuda-mps-control -d runs a daemon that merges all client CUDA contexts into one, the hardware scheduler then interleaves kernels at the SM level, and CUDA_MPS_ACTIVE_THREAD_PERCENTAGE caps each client's SM share. MPS requires Linux with the GPU in EXCLUSIVE_PROCESS compute mode — WSL2, Docker Desktop, and some container runtimes are out. MIG is the hard-isolation answer, available on A100 / A30 / H100 / H200 / B100 / B200 / GH200: fixed-shape hardware partitions (1g.5gb, 1g.10gb, 2g.20gb, 3g.20gb, 3g.40gb, 7g.80gb) each with their own L2, memory, and SMs. A noisy neighbor literally cannot steal from you, at the cost of static geometries and a disruptive reset to change them. And the quiet win of CUDA streams is compute/transfer overlap — H2D copies, compute, and D2H on separate streams let the copy engines work in parallel with the SMs, recovering 20-30% of wall-clock on batched inference servers.
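The two-stream GEMM test can be sketched in a few lines of PyTorch. This is a hedged sketch, not the lab's grader: the matrix size, iteration count, and function name are illustrative, and multi-stream speedup only appears when a single kernel doesn't saturate the SMs.

```python
import time
import torch

def gemm_wall_clock(n: int = 1024, iters: int = 40, streams: int = 1) -> float:
    """Time `iters` GEMMs round-robined across `streams` CUDA streams; returns seconds."""
    a = [torch.randn(n, n, device="cuda") for _ in range(streams)]
    b = [torch.randn(n, n, device="cuda") for _ in range(streams)]
    ss = [torch.cuda.Stream() for _ in range(streams)]
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for i in range(iters):
        with torch.cuda.stream(ss[i % streams]):
            a[i % streams] @ b[i % streams]   # independent work per stream
    torch.cuda.synchronize()                  # wait for all streams to drain
    return time.perf_counter() - t0

if torch.cuda.is_available():
    serial = gemm_wall_clock(streams=1)
    overlapped = gemm_wall_clock(streams=2)
    print(f"1 stream: {serial:.3f}s  2 streams: {overlapped:.3f}s  "
          f"speedup: {serial / overlapped:.2f}x")
```

On the lab's card the grader expects the two-stream run to come in at least 1.05× faster; on a card where one GEMM already fills every SM the ratio will hover near 1.0, which is itself the lesson.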

Prereqs: CUDA basics and PyTorch tensors, basic Kubernetes (Deployments, ConfigMaps, resources.requests/limits), and a mental model of processes vs threads. Preinstalled on the lab pod: PyTorch, the CUDA toolkit, nvidia-smi, the NVIDIA driver. Grading is concrete at every step: stream concurrency must show ≥1.05× speedup, multi-process contention must show ≥1.2× per-process slowdown (capped at 3.5× to catch implausible numbers), MPS configs must reference nvidia-cuda-mps-control -d plus the CUDA_MPS_PIPE_DIRECTORY / CUDA_MPS_LOG_DIRECTORY / CUDA_MPS_ACTIVE_THREAD_PERCENTAGE env vars and the echo quit | nvidia-cuda-mps-control shutdown pattern, and MIG profiles must reference at least one canonical NVIDIA slice shape. The reflection step forces the real-world call: pack 40 small models onto one A100-80GB — time-slicing fails on latency, MPS scales throughput but gives no memory isolation, MIG gives hard isolation at the cost of fixed partitions. There's a right answer, and the numbers in the lab prove it.
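The MPS artifact the grading looks for is essentially a start script. Here is a sketch of one: the pipe/log paths and the 25% SM cap are assumptions, and the GPU-touching commands are guarded so the script is inert on a machine without the NVIDIA tooling.

```shell
#!/usr/bin/env bash
# Sketch of an MPS start script — pipe/log paths and the SM cap are assumptions.
set -euo pipefail

export CUDA_VISIBLE_DEVICES=0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps      # clients locate the daemon here
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=25         # cap each client at ~25% of SMs
mkdir -p "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"

# Put the GPU in EXCLUSIVE_PROCESS mode and start the control daemon
# (skipped when no NVIDIA driver is present).
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
    nvidia-cuda-mps-control -d
fi

# Shutdown pattern the grader checks for:
#   echo quit | nvidia-cuda-mps-control
```

Clients launched with the same three CUDA_MPS_* variables in their environment attach to the daemon automatically; no code change is needed.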

Frequently asked questions

Does time-slicing give me concurrent GPU execution?

No — this is the most common misconception in the space. Time-slicing via the nvidia-device-plugin with replicas: N simply lets N pods schedule onto the same GPU; the driver still runs their CUDA contexts one at a time in a round-robin with ~10 ms switch overhead. You get higher GPU reservation utilization (kubectl describe node shows full subscription) but you do not get parallel SM execution. For actual concurrency inside a single tenant you need MPS; for concurrency with isolation across tenants you need MIG.
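The time-slicing behavior described above is configured through the device plugin's ConfigMap. A minimal sketch, with an illustrative ConfigMap name and replica count:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4   # node advertises 4x nvidia.com/gpu per physical card
```

With this applied, four pods each requesting one `nvidia.com/gpu` land on the same card — and then round-robin exactly as described above.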

When does MPS beat MIG and when does MIG beat MPS?

MPS wins on total throughput when tenants trust each other and workloads have different bottlenecks — MPS merges contexts so the scheduler can interleave memory-bound and compute-bound kernels at the SM level, and CUDA_MPS_ACTIVE_THREAD_PERCENTAGE caps each client's SM share without a hard partition. MIG wins when you need hardware isolation for SLA or compliance reasons — each slice has its own L2, memory, and SMs, so a noisy neighbor literally cannot steal from you. The cost is fixed-shape partitions (1g.5gb, 3g.40gb, 7g.80gb) chosen at provisioning time; you cannot resize a MIG geometry on the fly without reset.

Why does default multi-process sharing without MPS cause a 2× slowdown?

Because two CUDA processes sharing a GPU each need their own CUDA context, and the driver must switch contexts to service each one. Context switches flush caches, save and restore register state, and invalidate SM allocations — so on average each process sees roughly half the wall-clock throughput it would running alone. MPS eliminates this by merging all client contexts into one daemon-owned context, so the hardware scheduler sees a single stream of kernels and can actually overlap them on different SMs.
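The contention number is easy to reproduce. Below is a hedged sketch of the two-subprocess test, assuming PyTorch with CUDA on the pod; matrix size and iteration count are illustrative.

```python
import subprocess
import sys

# One worker: a fixed GEMM workload, timed end to end (assumes PyTorch + CUDA).
WORKER = r"""
import time, torch
a = torch.randn(2048, 2048, device="cuda")
t0 = time.perf_counter()
for _ in range(200):
    a = (a @ a).clamp_(-1, 1)   # keep values bounded across iterations
torch.cuda.synchronize()
print(time.perf_counter() - t0)
"""

def run(n_procs: int) -> list[float]:
    """Launch n identical CUDA workers concurrently; return per-process seconds."""
    procs = [subprocess.Popen([sys.executable, "-c", WORKER],
                              stdout=subprocess.PIPE, text=True)
             for _ in range(n_procs)]
    return [float(p.communicate()[0]) for p in procs]

if __name__ == "__main__":
    solo = run(1)[0]
    shared = run(2)
    print(f"solo: {solo:.2f}s  shared: {[round(t, 2) for t in shared]}  "
          f"slowdown: {max(shared) / solo:.2f}x")
```

Run it once with the MPS daemon stopped and once with it running: the per-process slowdown should collapse from roughly 2× toward 1× because the daemon owns the single merged context.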

Does MPS work on WSL2, Docker Desktop, or consumer RTX cards?

MPS needs a Linux host with the GPU in Exclusive_Process compute mode (nvidia-smi -c EXCLUSIVE_PROCESS), plus enough privilege to start the MPS control daemon. That rules out WSL2 and Docker Desktop. It works fine on bare-metal Linux and on Linux containers with --ipc=host access to the MPS pipe directory. Consumer cards (RTX 3090, 4090, 5090) do support MPS in theory, but MIG is a datacenter-only feature — only A100, A30, H100, H200, B100, B200, and GH200 support hardware partitioning.

How do I pick a MIG geometry for a mixed inference fleet?

Size each partition by the largest model-replica pair you need to fit, then choose the canonical geometry whose partitions match. For a 7-instance layout of small models you'd pick 1g.10gb x7 on an 80 GB A100; for a mix of small and mid-sized models 3g.40gb + 2g.20gb + 2g.20gb makes sense; for a couple of large LLM shards 3g.40gb x2 or 7g.80gb x1. The lab's mig_profiles dict encodes exactly this shape — name, target hardware, partitions list, and the use case each one was designed for. Remember a MIG reset is disruptive, so over-partition slightly in anticipation of workload growth.
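The mig_profiles dict mentioned above can be sketched like this. The profile names and use-case strings are illustrative; the slice shapes are canonical A100-80GB geometries, and the fit check below encodes the card's 80 GB / 7-slice budget.

```python
# Sketch of a mig_profiles dict — keys and use-case text are illustrative.
mig_profiles = {
    "seven-small": {
        "hardware": "A100-80GB",
        "partitions": ["1g.10gb"] * 7,
        "use_case": "seven small inference replicas, hard-isolated",
    },
    "mixed-fleet": {
        "hardware": "A100-80GB",
        "partitions": ["3g.40gb", "2g.20gb", "2g.20gb"],
        "use_case": "one mid-sized model plus two small ones",
    },
    "two-large": {
        "hardware": "A100-80GB",
        "partitions": ["3g.40gb", "3g.40gb"],
        "use_case": "two large LLM shards",
    },
}

def geometry_fits(profile: dict, total_gb: int = 80, total_slices: int = 7) -> bool:
    """Check a geometry against the card's memory and compute-slice budget."""
    mem = sum(int(p.split(".")[1].rstrip("gb")) for p in profile["partitions"])
    slices = sum(int(p.split("g.")[0]) for p in profile["partitions"])
    return mem <= total_gb and slices <= total_slices

for name, prof in mig_profiles.items():
    print(name, geometry_fits(prof))
```

A check like `geometry_fits` is worth keeping next to the dict, because it is exactly the invariant a disruptive MIG reset forces you to get right the first time.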

Are CUDA streams useful if my process only runs one kernel at a time?

Yes — the hidden win is overlap between compute and memory transfer, not between two compute kernels. Assigning H2D copies, compute, and D2H copies to different streams lets the copy engines and the SMs work in parallel on the same GPU. For an inference server that's batching requests, that overlap alone can recover 20–30% of wall-clock. Multiple compute streams only help when the kernels are small enough that a single one doesn't saturate all SMs — which is exactly the case for small transformer layers at low batch size.