GPU Cost & Efficiency Audit
GPU sandbox · jupyter
Beta


Build a four-stage cost-audit pipeline — measure, classify, price, recommend — that turns raw NVML samples into dollar-denominated waste and specific remediation actions. The skeleton behind every enterprise GPU cost product.

35 min · 4 steps · 2 domains · Intermediate · NCP-AIO · NCA-AIIO · NCA-GENL · NCP-AII

What you'll learn

  1. Sample a live workload
  2. Classify every second into a phase
  3. Attach dollars to the waste
  4. The recommender

Prerequisites

  • Basic PyTorch training loop
  • Familiarity with nvidia-smi / NVML concepts (utilization, memory, power)
  • Comfortable with Python threading and decorators

Exam domains covered

GPU Infrastructure & Operations · Model Deployment & Inference Optimization

Skills & technologies you'll practice

This intermediate-level GPU lab gives you real-world reps across:

pynvml · Cost Optimization · Utilization · DCGM · MPS · MIG · Spot Instances · Profiling

What you'll build in this GPU cost audit lab

GPU bills dominate AI infra budgets — H100s at $3–5/hour, A100s at $1.50–$3, and fleets that sit at 15–30% average utilization are the norm, not the exception. That gap is where GPU FinOps lives, and this lab builds the skeleton every enterprise GPU cost product (Kubecost, CoreWeave dashboards, Run:ai reports) runs under the hood: measure with NVML, classify time into waste buckets, price the waste in dollars, recommend remediations. You walk away with a 10 Hz sampler against a live PyTorch workload, a four-bucket utilization classifier, a cost_report with self-consistent hourly/monthly/yearly arithmetic, a rule-based recommender that speaks the real remediation vocabulary (DALI, num_workers, AMP, MPS, MIG, spot), and — most importantly — a mental model for why stacked savings percentages over 100% are the signature of a bad report. Roughly 35 minutes on a real NVIDIA GPU pod we provision.

The technical substance is in the classifier and the combining math, not the plumbing. A single mean utilization number lies: 50% mean could be a dataloader-starved workload bouncing between 100% and 0% (fix with DALI / pin_memory / num_workers) or a steady 50% kernel-bound workload (fix with larger batches, mixed precision, or a smaller SKU) — same mean, completely different remediation. The four-bucket taxonomy — <10% idle, 10–70% underutilized, 70–90% productive, ≥90% compute-bound — separates those cases cleanly, and the recommender branches on them: idle-heavy maps to MPS / MIG; underutilized + low memory maps to DALI / num_workers; compute-bound maps to AMP / FP8 / Tensor Cores. Pricing is where the real engineering shows up — spot discounts compose multiplicatively with utilization fixes, MPS and DALI overlap (both recover the same wall-clock waste), and the correct combining rule is 1 - product(1 - f_i) capped by measured baseline waste. If you've ever been asked 'how much can we save on GPUs next quarter', this is the defensible number. NVML sits at the right level for an embedded per-pod probe; when you need fleet telemetry with SM_ACTIVE, DRAM_ACTIVE, TENSOR_ACTIVE, you swap in dcgm-exporter and the rest of the pipeline is unchanged.
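The bucket taxonomy and the combining rule above fit in a few lines. This is a minimal sketch, not the lab's graded implementation — the thresholds match the taxonomy in the text, but the function names and the sample data are illustrative:

```python
from math import prod

def classify(util_pct: float) -> str:
    """Map one utilization sample to the lab's four-bucket taxonomy."""
    if util_pct < 10:
        return "idle"
    if util_pct < 70:
        return "underutilized"
    if util_pct < 90:
        return "productive"
    return "compute_bound"

def phase_fractions(samples):
    """Fraction of samples falling into each bucket."""
    counts = {"idle": 0, "underutilized": 0, "productive": 0, "compute_bound": 0}
    for s in samples:
        counts[classify(s)] += 1
    return {k: v / len(samples) for k, v in counts.items()}

def combined_savings(fractions, baseline_waste):
    """Union of overlapping reclaimed-wall-clock fractions,
    capped by the measured baseline waste: min(1 - prod(1 - f_i), waste)."""
    return min(1 - prod(1 - f for f in fractions), baseline_waste)

# Two workloads with the same 50% mean but different shapes:
bursty = [100, 0] * 50   # dataloader-starved: half idle, half compute-bound
steady = [50] * 100      # kernel throughput-bound: all underutilized
print(phase_fractions(bursty))
print(phase_fractions(steady))
```

The same-mean example makes the point concrete: `bursty` lands half in `idle` and half in `compute_bound`, while `steady` is 100% `underutilized` — identical means, disjoint remediation rows.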

Prereqs: a basic PyTorch training loop, comfort with Python threading and decorators, and familiarity with nvidia-smi-style metrics (utilization, memory, power, temperature). Preinstalled on the lab pod: pynvml, PyTorch, the NVIDIA driver, and the CUDA toolkit. Grading is concrete at every cell — the sampler must collect ≥100 observations with real variance; phase percentages must sum to 100; the cost report's monthly projection must equal hourly × 24 × 30 to the cent; the recommender must reference ≥2 entries from a whitelist of real remediation patterns — so you finish with artifacts that would pass a real review, not toy outputs.
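The monthly-projection rule the grader applies can be sketched as a self-consistency check. The field names here (`hourly_waste_usd`, etc.) are illustrative rather than the lab's required schema, and yearly = monthly × 12 is an assumption; the stated rule is monthly = hourly × 24 × 30 to the cent:

```python
def is_consistent(report: dict) -> bool:
    """Check the cost report's hourly/monthly/yearly arithmetic to the cent."""
    hourly = report["hourly_waste_usd"]
    return (
        round(report["monthly_waste_usd"], 2) == round(hourly * 24 * 30, 2)
        and round(report["yearly_waste_usd"], 2) == round(hourly * 24 * 30 * 12, 2)
    )

# e.g. an H100 at $3.50/hr with 40% measured waste:
report = {
    "hourly_waste_usd": 3.50 * 0.40,                # $1.40/hr wasted
    "monthly_waste_usd": 3.50 * 0.40 * 24 * 30,     # $1008.00/month
    "yearly_waste_usd": 3.50 * 0.40 * 24 * 30 * 12,
}
print(is_consistent(report))  # True
```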

Frequently asked questions

Why 10 Hz sampling? Won't that miss bursty kernels?

10 Hz is NVML's comfortable operating range — NVML queries have per-call overhead and coarser averaging windows inside the driver, so sampling much faster mostly buys you correlated samples rather than resolution. For the classification task in step 2 — 'what fraction of seconds were idle vs productive?' — 10 Hz is plenty. If you want sub-millisecond visibility into kernel-level stalls you use torch.profiler or Nsight Systems instead; NVML is the right tool for the cost-accounting question, not for kernel-level tuning.
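A 10 Hz sampler along these lines is small enough to sketch. This is an illustrative shape, not the lab's graded cell: the probe is split out so the loop is testable without a GPU, the sample field names are assumptions, and `train_one_epoch()` in the usage comment is a hypothetical stand-in for your workload:

```python
import threading
import time

def run_sampler(probe, samples: list, stop: threading.Event, hz: float = 10.0):
    """Call probe() at roughly `hz` until `stop` is set, appending results."""
    period = 1.0 / hz
    while not stop.is_set():
        samples.append(probe())
        stop.wait(period)  # interruptible sleep

def make_nvml_probe(device_index: int = 0):
    """One-time NVML init; returns a closure that reads one sample.
    Requires pynvml and an NVIDIA driver."""
    import pynvml
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    def probe():
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        return {
            "t": time.time(),
            "gpu_util_pct": util.gpu,
            "mem_used_mb": mem.used / 2**20,
            "power_w": pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0,  # mW -> W
        }
    return probe

# Usage on a GPU pod (train_one_epoch is a hypothetical workload):
# samples, stop = [], threading.Event()
# t = threading.Thread(target=run_sampler,
#                      args=(make_nvml_probe(), samples, stop), daemon=True)
# t.start(); train_one_epoch(); stop.set(); t.join()
```

Using `stop.wait(period)` instead of `time.sleep(period)` means the thread exits promptly when the workload finishes rather than overshooting by up to one sample period.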

Why does the lab split utilization into four buckets instead of a single mean?

A single mean hides the shape. A GPU at 50% mean utilization could be bouncing between 100% and 0% every other second (dataloader-starved, fixable with DALI) or holding steady at 50% (kernel throughput-bound, fixable with larger batches or mixed precision) — same mean, completely different remediation. The four-bucket taxonomy — idle / underutilized / productive / compute-bound — separates those cases cleanly and maps each one to a different row in the recommender.

My recommender output total_savings > 100%. Is the validator wrong?

No, the validator is deliberately lenient and that's exactly the antipattern the reflection step calls out. MPS (reclaim idle time) and a DALI upgrade (reclaim underutilized time) partially overlap because they both recover the same wall-clock waste. You can't sum them. The principled combining rule is to convert each remediation to a reclaimed-fraction-of-wall-clock, union them (1 - product(1 - f_i)), cap by the measured baseline waste, and then multiply by the rate-reduction from spot. The lab intentionally lets you emit the naive sum so you notice the mistake.
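Worked numbers make the overlap visible. Assuming, for illustration only, that MPS reclaims 25% of wall clock, DALI reclaims 30%, and the measured baseline waste is 40%:

```python
from math import prod

fractions = [0.25, 0.30]    # reclaimed wall-clock fraction per remediation
naive = sum(fractions)      # 0.55 — double-counts the overlapping seconds
union = 1 - prod(1 - f for f in fractions)   # 1 - 0.75 * 0.70 = 0.475
capped = min(union, 0.40)   # 0.40 — you can't reclaim more than you waste
print(naive, union, capped)
```

The naive sum overstates savings by 15 points here; with more remediations it drifts past 100%, which is the signature the reflection step asks you to catch.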

Do spot-instance savings stack with utilization fixes?

They compose multiplicatively, not additively. A 60% spot discount on top of a fix that reclaims 40% of wasted wall-clock doesn't give you 100% savings — you pay 40% of the rate for 60% of the GPU-hours, so the combined savings are 1 - (1 - 0.60) × (1 - 0.40) = 76%. Spot reduces the rate you pay per GPU-hour; utilization fixes reduce the GPU-hours you need. Different axes. The reflection rubric specifically asks you to separate 'compounding dimensions' (rate × utilization) from 'overlapping dimensions' (MPS vs DALI).
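The rate-times-hours arithmetic in this answer checks out in a few lines (numbers illustrative):

```python
spot_discount = 0.60     # pay (1 - 0.60) = 40% of the on-demand rate
hours_reclaimed = 0.40   # pay for (1 - 0.40) = 60% of the GPU-hours
# Rate and hours are independent axes, so the remaining cost multiplies:
cost_fraction = (1 - spot_discount) * (1 - hours_reclaimed)  # 0.4 * 0.6 = 0.24
print(f"total savings: {1 - cost_fraction:.0%}")  # total savings: 76%
```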

Why use pynvml directly instead of DCGM or dcgm-exporter?

Nothing about this pipeline requires DCGM, and NVML is the right level of abstraction for a per-pod cost probe — no extra daemon, no Prometheus dependency, trivial to embed in a training entrypoint. DCGM wins when you want richer fleet-level fields (SM_ACTIVE, DRAM_ACTIVE, TENSOR_CORE_ACTIVE, DCGM_FI_DEV_XID_ERRORS) and a structured metrics pipeline. This lab stays at NVML so you focus on the cost-accounting logic, not on standing up an exporter — the concepts port directly when you swap the sampler for dcgm-exporter queries later.

What counts as 'productive' utilization in the classifier?

The lab uses 70–90% as the productive band and ≥90% as compute-bound. The distinction matters: a workload pinned at 90%+ is already saturating the SMs and your remaining levers are precision (FP16/BF16/FP8), Tensor Core adoption, and fused kernels — not dataloader tweaks. A workload consistently in the 70–90% band has real headroom from bigger batches, overlap between H2D copies and compute, or moving to a smaller GPU SKU. The recommender branches on this: compute-bound rows suggest AMP; underutilized rows suggest num_workers and DALI.
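The branching described here can be sketched as a handful of rules. The thresholds (0.3, 0.5) and the `mem_util` signal are illustrative assumptions, not the lab's graded values; only the remediation vocabulary comes from the lab:

```python
def recommend(phases: dict, mem_util: float) -> list:
    """Rule-based recommender branching on phase fractions (0..1)."""
    recs = []
    if phases.get("idle", 0) > 0.3:
        recs.append("pack concurrent jobs with MPS, or partition with MIG")
    if phases.get("underutilized", 0) > 0.3 and mem_util < 0.5:
        recs.append("feed the GPU: raise num_workers, enable pin_memory, or adopt DALI")
    if phases.get("compute_bound", 0) > 0.5:
        recs.append("enable AMP (FP16/BF16) and verify Tensor Core usage")
    if phases.get("productive", 0) > 0.5:
        recs.append("grow batch size or overlap H2D copies with compute")
    return recs

print(recommend(
    {"idle": 0.1, "underutilized": 0.6, "productive": 0.2, "compute_bound": 0.1},
    mem_util=0.3,
))
```

A dataloader-starved profile (mostly underutilized, low memory pressure) triggers the DALI / num_workers row, while an idle-heavy profile triggers MPS / MIG — matching the branch table in the answer above.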