GPU Observability: From nvidia-smi to a Production Monitoring Stack
Go from a raw NVML snapshot to a real monitoring pipeline: capture live GPU telemetry during a workload, diagnose a dataloader bottleneck from the utilization trace, and expose everything as a Prometheus /metrics endpoint.
What you'll learn
1. Raw GPU telemetry via NVML
2. Time-series sampler during a real workload
3. Diagnose the classic data-loader bottleneck
4. Production bridge: expose metrics in Prometheus format
Prerequisites
- Comfortable with Python threads and background loops
- Basic PyTorch (tensors, dataloaders)
- Conceptual familiarity with Prometheus metrics format
Skills & technologies you'll practice
This intermediate-level GPU lab gives you real-world reps across NVML telemetry, time-series sampling, dataloader-bottleneck diagnosis, and Prometheus exporters.
What you'll build in this GPU observability lab
GPU observability is table stakes for anyone running AI in production — if you can't tell the difference between a card at 90% 'utilization' that's actually Tensor-Core-bound and a card at 90% that's running a badly-coalesced kernel on a single SM, you can't size your fleet, can't diagnose a regression, and can't justify the spend. In 40 minutes, this lab takes you from a single nvidia-smi snapshot to a live Prometheus /metrics endpoint that a Grafana dashboard can scrape. You'll leave with a pynvml-based telemetry helper, a 100 ms background sampler running against a live matmul workload, the canonical dataloader-bottleneck diagnostic (CPU-side preprocessing vs GPU-resident data, with the 10+ percentage-point utilization gap made visible), and a prometheus_client Gauge exporter returning valid exposition format with # HELP and # TYPE lines. Once those gauges update correctly, dcgm-exporter + Grafana is a drop-in upgrade — this lab teaches you what it's actually doing.
The core technical lesson is that NVML's nvmlDeviceGetUtilizationRates().gpu — the number nvidia-smi prints — is 'percent of time at least one kernel was resident', not SM occupancy, not Tensor Core activity, not DRAM bandwidth. You can pin it at 100% with a deliberately bad kernel that uses one SM. Real production dashboards layer second-order metrics on top: DCGM_FI_DEV_SM_ACTIVE (SM activity), DCGM_FI_DEV_TENSOR_ACTIVE (Tensor Core utilization), DCGM_FI_DEV_DRAM_ACTIVE (memory bandwidth), DCGM_FI_DEV_PCIE_TX_BYTES / NVLINK_TX_BYTES (interconnect throughput), DCGM_FI_DEV_XID_ERRORS (hardware faults), plus step time, tokens/sec, and MFU — the only metric that tells you whether you're actually using the silicon you're paying for. 100 ms sampling is the right cadence: fast enough to catch dataloader gaps on 30-200 ms batch boundaries, slow enough that the sampler thread doesn't perturb the workload it measures. For cluster-wide monitoring you scrape dcgm-exporter at 1-5 s; for profiling a specific run, 10-100 ms is where teams land. Prometheus exposition format is the universal currency — once your endpoint is valid, Grafana, Alertmanager, VictoriaMetrics, Thanos, Mimir, and Datadog's OpenMetrics ingest all speak the same dialect, and you get alerting, long-term storage, and federation for free.
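The 100 ms background-sampler pattern can be sketched as follows. The class name and the sample_fn indirection are illustrative, not the lab's code; on a real GPU you would pass a closure over nvmlDeviceGetUtilizationRates(handle).gpu, while the fake lambda below lets the pattern run anywhere.

```python
# Background sampler: one daemon thread appends (timestamp, value)
# pairs at a fixed cadence until stop() is called.
import threading
import time

class UtilSampler:
    def __init__(self, sample_fn, interval_s=0.1):
        self.sample_fn = sample_fn    # () -> utilization percent
        self.interval_s = interval_s  # 100 ms default cadence
        self.samples = []             # (timestamp, value) pairs
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def _loop(self):
        while not self._stop.is_set():
            self.samples.append((time.monotonic(), self.sample_fn()))
            self._stop.wait(self.interval_s)  # sleeps, but wakes early on stop()

    def start(self):
        self._thread.start()
        return self

    def stop(self):
        self._stop.set()
        self._thread.join()
        return self.samples

# Fake sample function stands in for NVML on a CPU-only box.
sampler = UtilSampler(lambda: 42.0, interval_s=0.01).start()
time.sleep(0.5)
samples = sampler.stop()
print(len(samples) >= 20)  # → True
```

Using Event.wait instead of time.sleep is the design choice that matters: stop() returns promptly instead of waiting out the last interval.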
Prereqs: Python threads and background loops, basic PyTorch (tensors, dataloaders), and conceptual familiarity with the Prometheus metrics format. Preinstalled on the lab pod: pynvml, prometheus_client, PyTorch, the NVIDIA driver. Grading is concrete at every cell — gpu_info must contain all six named fields in plausible ranges, the sampler must collect ≥20 time-series samples with mean utilization >20%, the dataloader fix must close the gap by ≥10 percentage points, and the scraped /metrics must contain all three custom gauges plus valid # HELP / # TYPE headers — so what you build is a functioning exporter, not a toy.
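The exporter step reduces to a few prometheus_client calls. A hedged sketch, with made-up gauge names and fixed values standing in for the NVML sampler feed — the lab's graded metric names may differ:

```python
# Minimal Prometheus gauge exporter. In a real run, start_http_server()
# would serve this text on /metrics; here we just render it.
from prometheus_client import CollectorRegistry, Gauge, generate_latest

registry = CollectorRegistry()
g_util = Gauge("gpu_utilization_percent", "NVML GPU utilization", registry=registry)
g_mem = Gauge("gpu_memory_used_mib", "GPU memory used (MiB)", registry=registry)
g_temp = Gauge("gpu_temperature_celsius", "GPU temperature", registry=registry)

# These would be fed from the NVML sampler; fixed values for the sketch.
g_util.set(87.0)
g_mem.set(10240.0)
g_temp.set(61.0)

text = generate_latest(registry).decode()
print("# HELP gpu_utilization_percent" in text)       # → True
print("# TYPE gpu_utilization_percent gauge" in text)  # → True
```

generate_latest emits the same # HELP / # TYPE exposition dialect the grader checks, which is why any Prometheus-compatible scraper can consume it unchanged.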
Frequently asked questions
Is NVML GPU_UTIL the same as SM occupancy?
No. nvmlDeviceGetUtilizationRates().gpu reports the percentage of time in the last sample window during which at least one kernel was executing. It tells you nothing about how many SMs were active, how saturated the Tensor Cores were, or whether the kernel was memory-bound waiting on DRAM. You can hit 100% on a deliberately bad kernel that uses a single SM. For true occupancy you need DCGM's SM_ACTIVE, SM_OCCUPANCY, and TENSOR_ACTIVE fields, or Nsight Compute / PyTorch Profiler on the kernel of interest.
Why expose metrics in Prometheus exposition format instead of pushing JSON?
Because the exposition format is the plain-text standard the whole monitoring ecosystem speaks: # HELP and # TYPE headers followed by metric_name{labels} value timestamp lines. Once your gauge endpoint is valid, you get alerting, long-term storage, recording rules, and multi-cluster federation for free. The lab writes a minimal exposition endpoint on purpose, so you see that dcgm-exporter is not magic: it's the same pattern with a larger metric catalogue.
When should I reach for DCGM or dcgm-exporter instead of raw NVML?
When you need the second-order metrics raw NVML doesn't surface: DCGM_FI_DEV_XID_ERRORS, DCGM_FI_DEV_SM_ACTIVE, DCGM_FI_DEV_DRAM_ACTIVE, DCGM_FI_DEV_TENSOR_ACTIVE, DCGM_FI_DEV_PCIE_TX_BYTES, and so on — exactly the metrics the reflection step pushes you to add. dcgm-exporter is the productionised wrapper: it runs as a DaemonSet on every GPU node, exposes a scrape target, and auto-labels metrics with gpu, modelName, and UUID. For a single workload or a development pod, raw pynvml is faster and lighter; for a cluster, run dcgm-exporter.
Why is 100 ms the right sampling cadence?
It's fast enough to catch dataloader gaps on 30-200 ms batch boundaries, yet slow enough that the sampler thread doesn't perturb the workload it measures. For cluster-wide monitoring you scrape dcgm-exporter at 1-5 s; for profiling a specific run, 10-100 ms is where teams land.
How does the dataloader fix actually close the utilization gap?
In the bad_history run, each batch is preprocessed on the CPU and copied to the GPU synchronously before the forward pass — the GPU sits idle waiting on H2D. The trace shows classic 'sawtooth' utilization: spikes during compute, drops to zero during copy. Moving the preprocessing to the GPU (or using pin_memory=True + num_workers>0 + non_blocking=True for async copies, or DALI for GPU-resident decode and augment) overlaps transfer with compute so the GPU stays hot. The 10+ percentage-point gain in the good run is the recovered idle time. The validator just needs good_mean - bad_mean >= 10.
What other metrics belong on a real training dashboard besides GPU util?
Step time, tokens/sec, MFU, SM_ACTIVE from DCGM, Tensor Core activity, CPU-side dataloader queue depth, and host-to-device bandwidth. The reflection step specifically asks you to pick the ones that would catch 'high utilization but the job is still slow' — the answer usually starts with MFU and step time.
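The pin_memory / num_workers / non_blocking fix discussed in the dataloader answer above can be sketched as follows. Synthetic tensors stand in for the lab's dataset, and the snippet falls back to CPU when no CUDA device is present, so the structure is the point, not the numbers:

```python
# Async-copy dataloader sketch: background workers prepare batches
# while pinned memory makes the H2D copy overlap with GPU compute.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for an image dataset.
dataset = TensorDataset(torch.randn(1024, 3, 32, 32), torch.randint(0, 10, (1024,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=2,    # preprocess batches in background worker processes
    pin_memory=True,  # page-locked host buffers enable truly async H2D copies
)

device = "cuda" if torch.cuda.is_available() else "cpu"
for x, y in loader:
    # non_blocking=True only pays off from pinned memory: the copy is
    # queued on the stream and overlaps with compute already in flight.
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    # ... forward/backward would go here ...
    break  # one batch is enough for the sketch
```

The synchronous bad_history version is the same loop with num_workers=0, pin_memory=False, and a blocking .to(device) — the difference between the two traces is the sawtooth.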