GPU Sharing: Streams, MPS, MIG, and the Real Cost of Contention
GPU sandbox · jupyter

Measure four ways to share a single GPU — CUDA streams, multi-process time-slicing, MPS, and MIG — and write the production artifacts (start scripts, k8s device-plugin ConfigMaps, MIG geometries) that turn 15%-utilized fleets into 80%-utilized ones.

45 min · 4 steps · 3 domains · Advanced · ncp-aio · nca-aiio · ncp-ads · ncp-aii

What you'll learn

  1. In-process sharing: CUDA streams
  2. Multi-process sharing WITHOUT MPS: the hidden penalty
  3. MPS: the production artifact
  4. Kubernetes: time-slicing, MPS, and MIG configs

Prerequisites

  • Comfortable with CUDA basics and PyTorch tensors
  • Basic Kubernetes — Deployments, ConfigMaps, resource requests
  • Understanding of processes vs threads and context switching

Exam domains covered

GPU Acceleration & Distributed Training · Model Deployment & Inference Optimization · AI Infrastructure & Operations

Skills & technologies you'll practice

This advanced-level GPU lab gives you real-world reps across:

CUDA Streams · MPS · MIG · Kubernetes · Device Plugin · GPU Utilization · Multi-tenancy · nvidia-smi

What you'll build in this GPU sharing lab

GPU sharing is the lever every AI platform team pulls when fleet utilization sits at 15% and the CFO asks why. A single H100 or A100-80GB is aggressively overbuilt for most individual inference replicas, and the gap between 'one pod, one GPU' and 'forty small models on one card at 80% utilization' is worth six figures a month on any serious fleet. But 'GPU sharing' is actually four different products — CUDA streams, naive multi-process time-slicing, MPS (Multi-Process Service), and MIG (Multi-Instance GPU) — each with different isolation guarantees, different failure modes, and different Kubernetes wiring. This lab makes you measure all four on the same card: a two-stream GEMM test showing real concurrency overlap, a two-subprocess contention test showing ~2× per-process slowdown without MPS, a full MPS control-daemon configuration, and Kubernetes artifacts for time-slicing ConfigMaps and MIG geometries. You leave with a decision framework — 'pack 40 small inference models onto one A100-80GB: which mode, and what breaks if you pick wrong?' — and the production YAML to implement whichever answer you reached. 45 minutes on a real NVIDIA GPU pod we provision.

The measurements matter because the intuitions are wrong. Time-slicing via nvidia-device-plugin with replicas: N does not give concurrent GPU execution — it round-robins CUDA contexts at the driver with ~10 ms context-switch jitter per swap, and kubectl describe node showing full GPU subscription is cosmetic, not real. You'll see a 1.5-2.5× per-process wall-clock slowdown when two processes share the card without MPS. MPS is the production answer inside one trust boundary: nvidia-cuda-mps-control -d runs a daemon that merges all client CUDA contexts into one, the hardware scheduler then interleaves kernels at the SM level, and CUDA_MPS_ACTIVE_THREAD_PERCENTAGE caps each client's SM share. MPS requires Linux with the GPU in EXCLUSIVE_PROCESS compute mode — WSL2, Docker Desktop, and some container runtimes are out. MIG is the hard-isolation answer, available on A100 / A30 / H100 / H200 / B100 / B200 / GH200: fixed-shape hardware partitions (1g.5gb, 1g.10gb, 2g.20gb, 3g.20gb, 3g.40gb, 7g.80gb) each with their own L2, memory, and SMs. A noisy neighbor literally cannot steal from you, at the cost of static geometries and a disruptive reset to change them. And the quiet win of CUDA streams is compute/transfer overlap — H2D copies, compute, and D2H on separate streams let the copy engines work in parallel with the SMs, recovering 20-30% of wall-clock on batched inference servers.
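The two-stream GEMM test can be sketched in a few lines of PyTorch. This is a hedged sketch, not the lab's grader: the matrix size, iteration count, and function name are illustrative, and multi-stream speedup only appears when a single kernel doesn't saturate the SMs.

```python
import time
import torch

def gemm_wall_clock(n: int = 1024, iters: int = 40, streams: int = 1) -> float:
    """Time `iters` GEMMs round-robined across `streams` CUDA streams; returns seconds."""
    a = [torch.randn(n, n, device="cuda") for _ in range(streams)]
    b = [torch.randn(n, n, device="cuda") for _ in range(streams)]
    ss = [torch.cuda.Stream() for _ in range(streams)]
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for i in range(iters):
        with torch.cuda.stream(ss[i % streams]):
            a[i % streams] @ b[i % streams]   # independent work per stream
    torch.cuda.synchronize()                  # wait for all streams to drain
    return time.perf_counter() - t0

if torch.cuda.is_available():
    serial = gemm_wall_clock(streams=1)
    overlapped = gemm_wall_clock(streams=2)
    print(f"1 stream: {serial:.3f}s  2 streams: {overlapped:.3f}s  "
          f"speedup: {serial / overlapped:.2f}x")
```

On the lab's card the grader expects the two-stream run to come in at least 1.05× faster; on a card where one GEMM already fills every SM the ratio will hover near 1.0, which is itself the lesson.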

Prereqs: CUDA basics and PyTorch tensors, basic Kubernetes (Deployments, ConfigMaps, resources.requests/limits), and a mental model of processes vs threads. Preinstalled on the lab pod: PyTorch, the CUDA toolkit, nvidia-smi, the NVIDIA driver. Grading is concrete at every step: stream concurrency must show ≥1.05× speedup, multi-process contention must show ≥1.2× per-process slowdown (capped at 3.5× to catch implausible numbers), MPS configs must reference nvidia-cuda-mps-control -d plus the CUDA_MPS_PIPE_DIRECTORY / CUDA_MPS_LOG_DIRECTORY / CUDA_MPS_ACTIVE_THREAD_PERCENTAGE env vars and the echo quit | nvidia-cuda-mps-control shutdown pattern, and MIG profiles must reference at least one canonical NVIDIA slice shape. The reflection step forces the real-world call: pack 40 small models onto one A100-80GB — time-slicing fails on latency, MPS scales throughput but gives no memory isolation, MIG gives hard isolation at the cost of fixed partitions. There's a right answer, and the numbers in the lab prove it.
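The MPS artifact the grading looks for is essentially a start script. Here is a sketch of one: the pipe/log paths and the 25% SM cap are assumptions, and the GPU-touching commands are guarded so the script is inert on a machine without the NVIDIA tooling.

```shell
#!/usr/bin/env bash
# Sketch of an MPS start script — pipe/log paths and the SM cap are assumptions.
set -euo pipefail

export CUDA_VISIBLE_DEVICES=0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps      # clients locate the daemon here
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=25         # cap each client at ~25% of SMs
mkdir -p "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"

# Put the GPU in EXCLUSIVE_PROCESS mode and start the control daemon
# (skipped when no NVIDIA driver is present).
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
    nvidia-cuda-mps-control -d
fi

# Shutdown pattern the grader checks for:
#   echo quit | nvidia-cuda-mps-control
```

Clients launched with the same three CUDA_MPS_* variables in their environment attach to the daemon automatically; no code change is needed.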

Frequently asked questions

Does time-slicing give me concurrent GPU execution?

No — this is the most common misconception in the space. Time-slicing via the nvidia-device-plugin with replicas: N simply lets N pods schedule onto the same GPU; the driver still runs their CUDA contexts one at a time in a round-robin with ~10 ms switch overhead. You get higher GPU reservation utilization (kubectl describe node shows full subscription) but you do not get parallel SM execution. For actual concurrency inside a single tenant you need MPS; for concurrency with isolation across tenants you need MIG.
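The time-slicing behavior described above is configured through the device plugin's ConfigMap. A minimal sketch, with an illustrative ConfigMap name and replica count:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4   # node advertises 4x nvidia.com/gpu per physical card
```

With this applied, four pods each requesting one `nvidia.com/gpu` land on the same card — and then round-robin exactly as described above.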

When does MPS beat MIG and when does MIG beat MPS?

MPS wins on total throughput when tenants trust each other and workloads have different bottlenecks — MPS merges contexts so the scheduler can interleave memory-bound and compute-bound kernels at the SM level, and CUDA_MPS_ACTIVE_THREAD_PERCENTAGE caps each client's SM share without a hard partition. MIG wins when you need hardware isolation for SLA or compliance reasons — each slice has its own L2, memory, and SMs, so a noisy neighbor literally cannot steal from you. The cost is fixed-shape partitions (1g.5gb, 3g.40gb, 7g.80gb) chosen at provisioning time; you cannot resize a MIG geometry on the fly without reset.

Why does default multi-process sharing without MPS cause a 2× slowdown?

Because two CUDA processes sharing a GPU each need their own CUDA context, and the driver must switch contexts to service each one. Context switches flush caches, save and restore register state, and invalidate SM allocations — so on average each process sees roughly half the wall-clock throughput it would running alone. MPS eliminates this by merging all client contexts into one daemon-owned context, so the hardware scheduler sees a single stream of kernels and can actually overlap them on different SMs.
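The contention number is easy to reproduce. Below is a hedged sketch of the two-subprocess test, assuming PyTorch with CUDA on the pod; matrix size and iteration count are illustrative.

```python
import subprocess
import sys

# One worker: a fixed GEMM workload, timed end to end (assumes PyTorch + CUDA).
WORKER = r"""
import time, torch
a = torch.randn(2048, 2048, device="cuda")
t0 = time.perf_counter()
for _ in range(200):
    a = (a @ a).clamp_(-1, 1)   # keep values bounded across iterations
torch.cuda.synchronize()
print(time.perf_counter() - t0)
"""

def run(n_procs: int) -> list[float]:
    """Launch n identical CUDA workers concurrently; return per-process seconds."""
    procs = [subprocess.Popen([sys.executable, "-c", WORKER],
                              stdout=subprocess.PIPE, text=True)
             for _ in range(n_procs)]
    return [float(p.communicate()[0]) for p in procs]

if __name__ == "__main__":
    solo = run(1)[0]
    shared = run(2)
    print(f"solo: {solo:.2f}s  shared: {[round(t, 2) for t in shared]}  "
          f"slowdown: {max(shared) / solo:.2f}x")
```

Run it once with the MPS daemon stopped and once with it running: the per-process slowdown should collapse from roughly 2× toward 1× because the daemon owns the single merged context.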

Does MPS work on WSL2, Docker Desktop, or consumer RTX cards?

MPS needs a Linux host with the GPU in Exclusive_Process compute mode (nvidia-smi -c EXCLUSIVE_PROCESS), plus enough privilege to start the MPS control daemon. That rules out WSL2 and Docker Desktop. It works fine on bare-metal Linux and on Linux containers with --ipc=host access to the MPS pipe directory. Consumer cards (RTX 3090, 4090, 5090) do support MPS in theory, but MIG is a datacenter-only feature — only A100, A30, H100, H200, B100, B200, and GH200 support hardware partitioning.

How do I pick a MIG geometry for a mixed inference fleet?

Size each partition by the largest model-replica pair you need to fit, then choose the canonical geometry whose partitions match. For a 7-instance layout of small models you'd pick 1g.10gb x7 on an 80 GB A100; for a mix of small and mid-sized models 3g.40gb + 2g.20gb + 2g.20gb makes sense; for a couple of large LLM shards 3g.40gb x2 or 7g.80gb x1. The lab's mig_profiles dict encodes exactly this shape — name, target hardware, partitions list, and the use case each one was designed for. Remember a MIG reset is disruptive, so over-partition slightly in anticipation of workload growth.
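The mig_profiles dict mentioned above can be sketched like this. The profile names and use-case strings are illustrative; the slice shapes are canonical A100-80GB geometries, and the fit check below encodes the card's 80 GB / 7-slice budget.

```python
# Sketch of a mig_profiles dict — keys and use-case text are illustrative.
mig_profiles = {
    "seven-small": {
        "hardware": "A100-80GB",
        "partitions": ["1g.10gb"] * 7,
        "use_case": "seven small inference replicas, hard-isolated",
    },
    "mixed-fleet": {
        "hardware": "A100-80GB",
        "partitions": ["3g.40gb", "2g.20gb", "2g.20gb"],
        "use_case": "one mid-sized model plus two small ones",
    },
    "two-large": {
        "hardware": "A100-80GB",
        "partitions": ["3g.40gb", "3g.40gb"],
        "use_case": "two large LLM shards",
    },
}

def geometry_fits(profile: dict, total_gb: int = 80, total_slices: int = 7) -> bool:
    """Check a geometry against the card's memory and compute-slice budget."""
    mem = sum(int(p.split(".")[1].rstrip("gb")) for p in profile["partitions"])
    slices = sum(int(p.split("g.")[0]) for p in profile["partitions"])
    return mem <= total_gb and slices <= total_slices

for name, prof in mig_profiles.items():
    print(name, geometry_fits(prof))
```

A check like `geometry_fits` is worth keeping next to the dict, because it is exactly the invariant a disruptive MIG reset forces you to get right the first time.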

Are CUDA streams useful if my process only runs one kernel at a time?

Yes — the hidden win is overlap between compute and memory transfer, not between two compute kernels. Assigning H2D copies, compute, and D2H copies to different streams lets the copy engines and the SMs work in parallel on the same GPU. For an inference server that's batching requests, that overlap alone can recover 20–30% of wall-clock. Multiple compute streams only help when the kernels are small enough that a single one doesn't saturate all SMs — which is exactly the case for small transformer layers at low batch size.