GPU Cost & Efficiency Audit
GPU sandbox · jupyter
Beta


Build a four-stage cost-audit pipeline — measure, classify, price, recommend — that turns raw NVML samples into dollar-denominated waste and specific remediation actions. The skeleton behind every enterprise GPU cost product.

35 min · 4 steps · 2 domains · Intermediate · NCP-AIO · NCA-AIIO · NCA-GENL · NCP-AII

What you'll learn

  1. Sample a live workload
  2. Classify every second into a phase
  3. Attach dollars to the waste
  4. The recommender

Prerequisites

  • Basic PyTorch training loop
  • Familiarity with nvidia-smi / NVML concepts (utilization, memory, power)
  • Comfortable with Python threading and decorators

Exam domains covered

GPU Infrastructure & Operations · Model Deployment & Inference Optimization

Skills & technologies you'll practice

This intermediate-level GPU lab gives you real-world reps across:

pynvml · Cost Optimization · Utilization · DCGM · MPS · MIG · Spot Instances · Profiling

What you'll build in this GPU cost audit lab

GPU bills dominate AI infra budgets — H100s at $3–5/hour, A100s at $1.50–$3, and fleets that sit at 15–30% average utilization are the norm, not the exception. That gap is where GPU FinOps lives, and this lab builds the skeleton every enterprise GPU cost product (Kubecost, CoreWeave dashboards, Run:ai reports) runs under the hood: measure with NVML, classify time into waste buckets, price the waste in dollars, recommend remediations. You walk away with a 10 Hz sampler against a live PyTorch workload, a four-bucket utilization classifier, a cost_report with self-consistent hourly/monthly/yearly arithmetic, a rule-based recommender that speaks the real remediation vocabulary (DALI, num_workers, AMP, MPS, MIG, spot), and — most importantly — a mental model for why stacked savings percentages over 100% are the signature of a bad report. Roughly 35 minutes on a real NVIDIA GPU pod we provision.

The technical substance is in the classifier and the combining math, not the plumbing. A single mean utilization number lies: 50% mean could be a dataloader-starved workload bouncing between 100% and 0% (fix with DALI / pin_memory / num_workers) or a steady 50% kernel-bound workload (fix with larger batches, mixed precision, or a smaller SKU) — same mean, completely different remediation. The four-bucket taxonomy — <10% idle, 10–70% underutilized, 70–90% productive, ≥90% compute-bound — separates those cases cleanly, and the recommender branches on them: idle-heavy maps to MPS / MIG; underutilized + low memory maps to DALI / num_workers; compute-bound maps to AMP / FP8 / Tensor Cores. Pricing is where the real engineering shows up — spot discounts compose multiplicatively with utilization fixes, MPS and DALI overlap (both recover the same wall-clock waste), and the correct combining rule is 1 - product(1 - f_i) capped by measured baseline waste. If you've ever been asked 'how much can we save on GPUs next quarter', this is the defensible number. NVML sits at the right level for an embedded per-pod probe; when you need fleet telemetry with SM_ACTIVE, DRAM_ACTIVE, TENSOR_ACTIVE, you swap in dcgm-exporter and the rest of the pipeline is unchanged.
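The bucket taxonomy and the combining rule above fit in a few lines. This is a minimal sketch, not the lab's graded implementation — the thresholds match the taxonomy in the text, but the function names and the sample data are illustrative:

```python
from math import prod

def classify(util_pct: float) -> str:
    """Map one utilization sample to the lab's four-bucket taxonomy."""
    if util_pct < 10:
        return "idle"
    if util_pct < 70:
        return "underutilized"
    if util_pct < 90:
        return "productive"
    return "compute_bound"

def phase_fractions(samples):
    """Fraction of samples falling into each bucket."""
    counts = {"idle": 0, "underutilized": 0, "productive": 0, "compute_bound": 0}
    for s in samples:
        counts[classify(s)] += 1
    return {k: v / len(samples) for k, v in counts.items()}

def combined_savings(fractions, baseline_waste):
    """Union of overlapping reclaimed-wall-clock fractions,
    capped by the measured baseline waste: min(1 - prod(1 - f_i), waste)."""
    return min(1 - prod(1 - f for f in fractions), baseline_waste)

# Two workloads with the same 50% mean but different shapes:
bursty = [100, 0] * 50   # dataloader-starved: half idle, half compute-bound
steady = [50] * 100      # kernel throughput-bound: all underutilized
print(phase_fractions(bursty))
print(phase_fractions(steady))
```

The same-mean example makes the point concrete: `bursty` lands half in `idle` and half in `compute_bound`, while `steady` is 100% `underutilized` — identical means, disjoint remediation rows.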

Prereqs: a basic PyTorch training loop, comfort with Python threading and decorators, and familiarity with nvidia-smi-style metrics (utilization, memory, power, temperature). Preinstalled on the lab pod: pynvml, PyTorch, the NVIDIA driver, and the CUDA toolkit. Grading is concrete at every cell — the sampler must collect ≥100 observations with real variance; phase percentages must sum to 100; the cost report's monthly projection must equal hourly × 24 × 30 to the cent; the recommender must reference ≥2 entries from a whitelist of real remediation patterns — so you finish with artifacts that would pass a real review, not toy outputs.
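The monthly-projection rule the grader applies can be sketched as a self-consistency check. The field names here (`hourly_waste_usd`, etc.) are illustrative rather than the lab's required schema, and yearly = monthly × 12 is an assumption; the stated rule is monthly = hourly × 24 × 30 to the cent:

```python
def is_consistent(report: dict) -> bool:
    """Check the cost report's hourly/monthly/yearly arithmetic to the cent."""
    hourly = report["hourly_waste_usd"]
    return (
        round(report["monthly_waste_usd"], 2) == round(hourly * 24 * 30, 2)
        and round(report["yearly_waste_usd"], 2) == round(hourly * 24 * 30 * 12, 2)
    )

# e.g. an H100 at $3.50/hr with 40% measured waste:
report = {
    "hourly_waste_usd": 3.50 * 0.40,                # $1.40/hr wasted
    "monthly_waste_usd": 3.50 * 0.40 * 24 * 30,     # $1008.00/month
    "yearly_waste_usd": 3.50 * 0.40 * 24 * 30 * 12,
}
print(is_consistent(report))  # True
```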

Frequently asked questions

Why 10 Hz sampling? Won't that miss bursty kernels?

10 Hz is NVML's comfortable operating range — NVML queries have per-call overhead and coarser averaging windows inside the driver, so sampling much faster mostly buys you correlated samples rather than resolution. For the classification task in step 2 — 'what fraction of seconds were idle vs productive?' — 10 Hz is plenty. If you want sub-millisecond visibility into kernel-level stalls you use torch.profiler or Nsight Systems instead; NVML is the right tool for the cost-accounting question, not for kernel-level tuning.
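A 10 Hz sampler along these lines is small enough to sketch. This is an illustrative shape, not the lab's graded cell: the probe is split out so the loop is testable without a GPU, the sample field names are assumptions, and `train_one_epoch()` in the usage comment is a hypothetical stand-in for your workload:

```python
import threading
import time

def run_sampler(probe, samples: list, stop: threading.Event, hz: float = 10.0):
    """Call probe() at roughly `hz` until `stop` is set, appending results."""
    period = 1.0 / hz
    while not stop.is_set():
        samples.append(probe())
        stop.wait(period)  # interruptible sleep

def make_nvml_probe(device_index: int = 0):
    """One-time NVML init; returns a closure that reads one sample.
    Requires pynvml and an NVIDIA driver."""
    import pynvml
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    def probe():
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        return {
            "t": time.time(),
            "gpu_util_pct": util.gpu,
            "mem_used_mb": mem.used / 2**20,
            "power_w": pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0,  # mW -> W
        }
    return probe

# Usage on a GPU pod (train_one_epoch is a hypothetical workload):
# samples, stop = [], threading.Event()
# t = threading.Thread(target=run_sampler,
#                      args=(make_nvml_probe(), samples, stop), daemon=True)
# t.start(); train_one_epoch(); stop.set(); t.join()
```

Using `stop.wait(period)` instead of `time.sleep(period)` means the thread exits promptly when the workload finishes rather than overshooting by up to one sample period.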

Why does the lab split utilization into four buckets instead of a single mean?

A single mean hides the shape. A GPU at 50% mean utilization could be bouncing between 100% and 0% every other second (dataloader-starved, fixable with DALI) or holding steady at 50% (kernel throughput-bound, fixable with larger batches or mixed precision) — same mean, completely different remediation. The four-bucket taxonomy — idle / underutilized / productive / compute-bound — separates those cases cleanly and maps each one to a different row in the recommender.

My recommender output total_savings > 100%. Is the validator wrong?

No, the validator is deliberately lenient and that's exactly the antipattern the reflection step calls out. MPS (reclaim idle time) and a DALI upgrade (reclaim underutilized time) partially overlap because they both recover the same wall-clock waste. You can't sum them. The principled combining rule is to convert each remediation to a reclaimed-fraction-of-wall-clock, union them (1 - product(1 - f_i)), cap by the measured baseline waste, and then multiply by the rate-reduction from spot. The lab intentionally lets you emit the naive sum so you notice the mistake.
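Worked numbers make the overlap visible. Assuming, for illustration only, that MPS reclaims 25% of wall clock, DALI reclaims 30%, and the measured baseline waste is 40%:

```python
from math import prod

fractions = [0.25, 0.30]    # reclaimed wall-clock fraction per remediation
naive = sum(fractions)      # 0.55 — double-counts the overlapping seconds
union = 1 - prod(1 - f for f in fractions)   # 1 - 0.75 * 0.70 = 0.475
capped = min(union, 0.40)   # 0.40 — you can't reclaim more than you waste
print(naive, union, capped)
```

The naive sum overstates savings by 15 points here; with more remediations it drifts past 100%, which is the signature the reflection step asks you to catch.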

Do spot-instance savings stack with utilization fixes?

They compose multiplicatively, not additively. A 60% spot discount on top of a fix that reclaims 40% of wasted wall-clock doesn't give you 100% savings — you pay 40% of the rate for 60% of the GPU-hours, so the combined savings are 1 - (1 - 0.60) × (1 - 0.40) = 76%. Spot reduces the rate you pay per GPU-hour; utilization fixes reduce the GPU-hours you need. Different axes. The reflection rubric specifically asks you to separate 'compounding dimensions' (rate × utilization) from 'overlapping dimensions' (MPS vs DALI).
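The rate-times-hours arithmetic in this answer checks out in a few lines (numbers illustrative):

```python
spot_discount = 0.60     # pay (1 - 0.60) = 40% of the on-demand rate
hours_reclaimed = 0.40   # pay for (1 - 0.40) = 60% of the GPU-hours
# Rate and hours are independent axes, so the remaining cost multiplies:
cost_fraction = (1 - spot_discount) * (1 - hours_reclaimed)  # 0.4 * 0.6 = 0.24
print(f"total savings: {1 - cost_fraction:.0%}")  # total savings: 76%
```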

Why use pynvml directly instead of DCGM or dcgm-exporter?

Nothing about this pipeline requires DCGM, and NVML is the right level of abstraction for a per-pod cost probe — no extra daemon, no Prometheus dependency, trivial to embed in a training entrypoint. DCGM wins when you want richer fleet-level fields (SM_ACTIVE, DRAM_ACTIVE, TENSOR_CORE_ACTIVE, DCGM_FI_DEV_XID_ERRORS) and a structured metrics pipeline. This lab stays at NVML so you focus on the cost-accounting logic, not on standing up an exporter — the concepts port directly when you swap the sampler for dcgm-exporter queries later.

What counts as 'productive' utilization in the classifier?

The lab uses 70–90% as the productive band and ≥90% as compute-bound. The distinction matters: a workload pinned at 90%+ is already saturating the SMs and your remaining levers are precision (FP16/BF16/FP8), Tensor Core adoption, and fused kernels — not dataloader tweaks. A workload consistently in the 70–90% band has real headroom from bigger batches, overlap between H2D copies and compute, or moving to a smaller GPU SKU. The recommender branches on this: compute-bound rows suggest AMP; underutilized rows suggest num_workers and DALI.
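The branching described here can be sketched as a handful of rules. The thresholds (0.3, 0.5) and the `mem_util` signal are illustrative assumptions, not the lab's graded values; only the remediation vocabulary comes from the lab:

```python
def recommend(phases: dict, mem_util: float) -> list:
    """Rule-based recommender branching on phase fractions (0..1)."""
    recs = []
    if phases.get("idle", 0) > 0.3:
        recs.append("pack concurrent jobs with MPS, or partition with MIG")
    if phases.get("underutilized", 0) > 0.3 and mem_util < 0.5:
        recs.append("feed the GPU: raise num_workers, enable pin_memory, or adopt DALI")
    if phases.get("compute_bound", 0) > 0.5:
        recs.append("enable AMP (FP16/BF16) and verify Tensor Core usage")
    if phases.get("productive", 0) > 0.5:
        recs.append("grow batch size or overlap H2D copies with compute")
    return recs

print(recommend(
    {"idle": 0.1, "underutilized": 0.6, "productive": 0.2, "compute_bound": 0.1},
    mem_util=0.3,
))
```

A dataloader-starved profile (mostly underutilized, low memory pressure) triggers the DALI / num_workers row, while an idle-heavy profile triggers MPS / MIG — matching the branch table in the answer above.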