GPU Cost & Efficiency Audit
Build a four-stage cost-audit pipeline — measure, classify, price, recommend — that turns raw NVML samples into dollar-denominated waste and specific remediation actions. The skeleton behind every enterprise GPU cost product.
What you'll learn
1. Sample a live workload
2. Classify every second into a phase
3. Attach dollars to the waste
4. Build the recommender
Prerequisites
- Basic PyTorch training loop
- Familiarity with nvidia-smi / NVML concepts (utilization, memory, power)
- Comfortable with Python threading and decorators
Skills & technologies you'll practice
This intermediate-level GPU lab gives you real-world reps across NVML sampling, utilization classification into waste buckets, dollar-denominated cost modeling, and rule-based remediation planning.
What you'll build in this GPU cost audit lab
GPU bills dominate AI infra budgets — H100s at $3–5/hour, A100s at $1.50–$3, and fleets that sit at 15–30% average utilization are the norm, not the exception. That gap is where GPU FinOps lives, and this lab builds the skeleton every enterprise GPU cost product (Kubecost, CoreWeave dashboards, Run:ai reports) runs under the hood: measure with NVML, classify time into waste buckets, price the waste in dollars, recommend remediations. You walk away with a 10 Hz sampler against a live PyTorch workload, a four-bucket utilization classifier, a cost_report with self-consistent hourly/monthly/yearly arithmetic, a rule-based recommender that speaks the real remediation vocabulary (DALI, num_workers, AMP, MPS, MIG, spot), and — most importantly — a mental model for why stacked savings percentages over 100% are the signature of a bad report. Roughly 35 minutes on a real NVIDIA GPU pod we provision.
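The sampler stage described above might look like the following sketch. The names (`read_gpu_util`, `sample`) are illustrative assumptions, not the lab's graded API, and the synthetic fallback exists only so the sketch runs on machines without an NVIDIA GPU; on the lab pod the pynvml branch does the real reads.

```python
import random
import threading
import time

# Assumption: fall back to a synthetic bursty trace when pynvml or a GPU
# is unavailable, so the sketch runs anywhere.
try:
    import pynvml
    pynvml.nvmlInit()
    _handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    def read_gpu_util():
        """Instantaneous GPU utilization (%) from NVML."""
        return pynvml.nvmlDeviceGetUtilizationRates(_handle).gpu
except Exception:
    def read_gpu_util():
        """Synthetic stand-in: a workload bouncing between idle and busy."""
        return random.choice([0, 0, 35, 60, 100])

def sample(hz=10, seconds=1.0):
    """Poll utilization at `hz` from a background thread; return the trace."""
    samples, stop = [], threading.Event()

    def loop():
        period = 1.0 / hz
        while not stop.is_set():
            samples.append(read_gpu_util())
            time.sleep(period)

    t = threading.Thread(target=loop, daemon=True)
    t.start()
    time.sleep(seconds)
    stop.set()
    t.join()
    return samples

trace = sample(hz=10, seconds=0.5)  # roughly 5 observations at 10 Hz
```

The background thread matters: sampling from the training loop itself would pause exactly when the GPU is busiest and bias the trace toward idle.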
The technical substance is in the classifier and the combining math, not the plumbing. A single mean utilization number lies: 50% mean could be a dataloader-starved workload bouncing between 100% and 0% (fix with DALI / pin_memory / num_workers) or a steady 50% kernel-bound workload (fix with larger batches, mixed precision, or a smaller SKU) — same mean, completely different remediation. The four-bucket taxonomy — <10% idle, 10-70% underutilized, 70-90% productive, >=90% compute-bound — separates those cases cleanly, and the recommender branches on them: idle-heavy maps to MPS / MIG; underutilized + low memory maps to DALI / num_workers; compute-bound maps to AMP / FP8 / Tensor Cores. Pricing is where the real engineering shows up — spot discounts compose multiplicatively with utilization fixes, MPS and DALI overlap (both recover the same wall-clock waste), and the correct combining rule is 1 - product(1 - f_i) capped by measured baseline waste. If you've ever been asked 'how much can we save on GPUs next quarter', this is the defensible number. NVML sits at the right level for an embedded per-pod probe; when you need fleet telemetry with SM_ACTIVE, DRAM_ACTIVE, TENSOR_ACTIVE, you swap in dcgm-exporter and the rest of the pipeline is unchanged.
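Under the thresholds quoted above, the classifier reduces to a few comparisons plus a normalization that must sum to 100. A minimal sketch, with function and bucket names assumed for illustration:

```python
from collections import Counter

def classify(util_pct):
    """Map one utilization sample into the four-bucket taxonomy:
    <10% idle, 10-70% underutilized, 70-90% productive, >=90% compute-bound."""
    if util_pct < 10:
        return "idle"
    if util_pct < 70:
        return "underutilized"
    if util_pct < 90:
        return "productive"
    return "compute_bound"

def phase_percentages(trace):
    """Percent of wall-clock time spent in each phase; must sum to 100."""
    counts = Counter(classify(u) for u in trace)
    total = len(trace)
    return {phase: 100.0 * n / total for phase, n in counts.items()}

# A dataloader-starved trace: mean is ~48%, but the buckets tell the story.
trace = [0, 0, 100, 100, 50, 80, 95, 0, 30, 100]
pcts = phase_percentages(trace)
# → {'idle': 30.0, 'compute_bound': 40.0, 'underutilized': 20.0, 'productive': 10.0}
```

The example trace is exactly the lying-mean case: a single mean near 50% would suggest a steady half-loaded GPU, while the buckets expose 30% idle time that points at the input pipeline.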
Prereqs: a basic PyTorch training loop, comfort with Python threading and decorators, and familiarity with nvidia-smi-style metrics (utilization, memory, power, temperature). Preinstalled on the lab pod: pynvml, PyTorch, the NVIDIA driver, and the CUDA toolkit. Grading is concrete at every cell — the sampler must collect ≥100 observations with real variance; phase percentages must sum to 100; the cost report's monthly projection must equal hourly × 24 × 30 to the cent; the recommender must reference ≥2 entries from a whitelist of real remediation patterns — so you finish with artifacts that would pass a real review, not toy outputs.
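A `cost_report` that passes the to-the-cent check derives every projection from the same rounded hourly figure, never from independent computations that can drift. A sketch, with fleet size, rate, and key names assumed for illustration:

```python
def cost_report(waste_fraction, gpus=8, rate_per_gpu_hour=3.50):
    """Dollar-denominate measured waste. All projections derive from the
    one rounded hourly number, so monthly == hourly * 24 * 30 to the cent."""
    hourly = round(gpus * rate_per_gpu_hour * waste_fraction, 2)
    return {
        "wasted_hourly_usd": hourly,
        "wasted_monthly_usd": round(hourly * 24 * 30, 2),
        "wasted_yearly_usd": round(hourly * 24 * 365, 2),
    }

report = cost_report(waste_fraction=0.45)
# hourly: 8 x $3.50 x 0.45 = $12.60, so monthly projects to $9,072.00
```

Rounding once at the hourly level and multiplying from there is the design choice; rounding each projection independently from the raw float is how reports end up a few cents off and fail exact-arithmetic validators.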
Frequently asked questions
Why 10 Hz sampling? Won't that miss bursty kernels?
Yes, and that's fine. Individual kernels run in microseconds, far below any polling rate; the cost-accounting question is where whole seconds of paid wall-clock time go, and 10 Hz resolves that comfortably. For kernel-level timing you'd reach for torch.profiler or Nsight Systems instead; NVML is the right tool for the cost-accounting question, not for kernel-level tuning.
Why does the lab split utilization into four buckets instead of a single mean?
Because a single mean lies: 50% mean could be a dataloader-starved workload bouncing between 100% and 0% or a steady kernel-bound workload at 50%. Same mean, completely different remediation. The four buckets separate those cases so the recommender can branch on them.
My recommender output total_savings > 100%. Is the validator wrong?
No. Summing savings fractions double-counts fixes that recover the same wall-clock waste. Combine overlapping fixes as 1 - product(1 - f_i), cap by the measured baseline waste, and then multiply by the rate-reduction from spot. The lab intentionally lets you emit the naive sum so you notice the mistake.
Do spot-instance savings stack with utilization fixes?
Yes, they compose multiplicatively: a 60% rate discount on top of a fix that eliminates 40% of GPU-hours yields 1 - (0.4 × 0.6) = 76%. Spot reduces the rate you pay per GPU-hour; utilization fixes reduce the GPU-hours you need. Different axes. The reflection rubric specifically asks you to separate 'compounding dimensions' (rate × utilization) from 'overlapping dimensions' (MPS vs DALI).
Why use pynvml directly instead of DCGM or dcgm-exporter?
DCGM is the production answer for fleet telemetry: it adds profiling metrics (SM_ACTIVE, DRAM_ACTIVE, TENSOR_CORE_ACTIVE, DCGM_FI_DEV_XID_ERRORS) and a structured metrics pipeline. This lab stays at NVML so you focus on the cost-accounting logic, not on standing up an exporter — the concepts port directly when you swap the sampler for dcgm-exporter queries later.
What counts as 'productive' utilization in the classifier?
The 70–90% band; the classifier tags >=90% as compute-bound. The distinction matters: a workload pinned at 90%+ is already saturating the SMs and your remaining levers are precision (FP16/BF16/FP8), Tensor Core adoption, and fused kernels — not dataloader tweaks. A workload consistently in the 70–90% band has real headroom from bigger batches, overlap between H2D copies and compute, or moving to a smaller GPU SKU. The recommender branches on this: compute-bound rows suggest AMP; underutilized rows suggest num_workers and DALI.
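The combining rule and the branching the FAQ describes might be sketched together as follows. Function names, thresholds, and the `mem_util` signal are illustrative assumptions, not the lab's graded implementation.

```python
def combined_savings(overlapping, baseline_waste, spot_discount=0.0):
    """Overlapping fixes (MPS and DALI recover the same idle seconds) compose
    as 1 - prod(1 - f_i), capped by the waste actually measured; a spot
    discount then multiplies the rate you still pay -- a compounding axis."""
    surviving = 1.0
    for f in overlapping:
        surviving *= (1.0 - f)
    util_savings = min(1.0 - surviving, baseline_waste)
    # total savings = 1 - (GPU-hours you still need) x (rate you still pay)
    return 1.0 - (1.0 - util_savings) * (1.0 - spot_discount)

def recommend(phase_pcts, mem_util=0.3):
    """Rule-based branching on the measured phases, using the remediation
    vocabulary from the lab. `mem_util` is an assumed extra signal."""
    recs = []
    if phase_pcts.get("idle", 0) > 30:
        recs += ["MPS", "MIG"]           # share or partition an idle-heavy GPU
    if phase_pcts.get("underutilized", 0) > 30 and mem_util < 0.5:
        recs += ["DALI", "num_workers"]  # input pipeline is the bottleneck
    if phase_pcts.get("compute_bound", 0) > 50:
        recs += ["AMP"]                  # precision is the remaining lever
    return recs

# Naive sum of 0.5 + 0.4 claims 90%; the capped product rule stays honest:
# 1 - (0.5 x 0.6) = 0.70, capped at the measured 0.55 baseline waste,
# then compounded with a 60% spot discount: 1 - 0.45 x 0.4 = 0.82.
savings = combined_savings([0.5, 0.4], baseline_waste=0.55, spot_discount=0.6)
```

The cap is the part most naive reports skip: no combination of utilization fixes can recover more wall-clock waste than the sampler actually measured.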