GPU Observability: From nvidia-smi to a Production Monitoring Stack
Go from a raw NVML snapshot to a real monitoring pipeline: capture live GPU telemetry during a workload, diagnose a dataloader bottleneck from the utilization trace, and expose everything as a Prometheus /metrics endpoint.
What you'll learn
1. Raw GPU telemetry via NVML
2. Time-series sampler during a real workload
3. Diagnose the classic data-loader bottleneck
4. Production bridge: expose metrics in Prometheus format
Prerequisites
- Comfortable with Python threads and background loops
- Basic PyTorch (tensors, dataloaders)
- Conceptual familiarity with Prometheus metrics format
Skills & technologies you'll practice
This intermediate-level GPU lab gives you real-world reps across NVML telemetry, time-series sampling, dataloader-bottleneck diagnosis, and Prometheus exporters.
What you'll build in this GPU observability lab
GPU observability is table stakes for anyone running AI in production — if you can't tell the difference between a card at 90% 'utilization' that's actually Tensor-Core-bound and a card at 90% that's running a badly-coalesced kernel on a single SM, you can't size your fleet, can't diagnose a regression, and can't justify the spend. In 40 minutes, this lab takes you from a single nvidia-smi snapshot to a live Prometheus /metrics endpoint that a Grafana dashboard can scrape. You'll leave with a pynvml-based telemetry helper, a 100 ms background sampler running against a live matmul workload, the canonical dataloader-bottleneck diagnostic (CPU-side preprocessing vs GPU-resident data, with the 10+ percentage-point utilization gap made visible), and a prometheus_client Gauge exporter returning valid exposition format with # HELP and # TYPE lines. Once those gauges update correctly, dcgm-exporter + Grafana is a drop-in upgrade — this lab teaches you what it's actually doing.
The core technical lesson is that NVML's nvmlDeviceGetUtilizationRates().gpu — the number nvidia-smi prints — is 'percent of time at least one kernel was resident', not SM occupancy, not Tensor Core activity, not DRAM bandwidth. You can pin it at 100% with a deliberately bad kernel that uses one SM. Real production dashboards layer second-order metrics on top: DCGM_FI_DEV_SM_ACTIVE (SM activity), DCGM_FI_DEV_TENSOR_ACTIVE (Tensor Core utilization), DCGM_FI_DEV_DRAM_ACTIVE (memory bandwidth), DCGM_FI_DEV_PCIE_TX_BYTES / NVLINK_TX_BYTES (interconnect throughput), DCGM_FI_DEV_XID_ERRORS (hardware faults), plus step time, tokens/sec, and MFU — the only metric that tells you whether you're actually using the silicon you're paying for. 100 ms sampling is the right cadence: fast enough to catch dataloader gaps on 30-200 ms batch boundaries, slow enough that the sampler thread doesn't perturb the workload it measures. For cluster-wide monitoring you scrape dcgm-exporter at 1-5 s; for profiling a specific run, 10-100 ms is where teams land. Prometheus exposition format is the universal currency — once your endpoint is valid, Grafana, Alertmanager, VictoriaMetrics, Thanos, Mimir, and Datadog's OpenMetrics ingest all speak the same dialect, and you get alerting, long-term storage, and federation for free.
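The 100 ms background-sampler pattern can be sketched as follows. The class name and the sample_fn indirection are illustrative, not the lab's code; on a real GPU you would pass a closure over nvmlDeviceGetUtilizationRates(handle).gpu, while the fake lambda below lets the pattern run anywhere.

```python
# Background sampler: one daemon thread appends (timestamp, value)
# pairs at a fixed cadence until stop() is called.
import threading
import time

class UtilSampler:
    def __init__(self, sample_fn, interval_s=0.1):
        self.sample_fn = sample_fn    # () -> utilization percent
        self.interval_s = interval_s  # 100 ms default cadence
        self.samples = []             # (timestamp, value) pairs
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def _loop(self):
        while not self._stop.is_set():
            self.samples.append((time.monotonic(), self.sample_fn()))
            self._stop.wait(self.interval_s)  # sleeps, but wakes early on stop()

    def start(self):
        self._thread.start()
        return self

    def stop(self):
        self._stop.set()
        self._thread.join()
        return self.samples

# Fake sample function stands in for NVML on a CPU-only box.
sampler = UtilSampler(lambda: 42.0, interval_s=0.01).start()
time.sleep(0.5)
samples = sampler.stop()
print(len(samples) >= 20)  # → True
```

Using Event.wait instead of time.sleep is the design choice that matters: stop() returns promptly instead of waiting out the last interval.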
Prereqs: Python threads and background loops, basic PyTorch (tensors, dataloaders), and conceptual familiarity with the Prometheus metrics format. Preinstalled on the lab pod: pynvml, prometheus_client, PyTorch, the NVIDIA driver. Grading is concrete at every cell — gpu_info must contain all six named fields in plausible ranges, the sampler must collect ≥20 time-series samples with mean utilization >20%, the dataloader fix must close the gap by ≥10 percentage points, and the scraped /metrics must contain all three custom gauges plus valid # HELP / # TYPE headers — so what you build is a functioning exporter, not a toy.
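The exporter step reduces to a few prometheus_client calls. A hedged sketch, with made-up gauge names and fixed values standing in for the NVML sampler feed — the lab's graded metric names may differ:

```python
# Minimal Prometheus gauge exporter. In a real run, start_http_server()
# would serve this text on /metrics; here we just render it.
from prometheus_client import CollectorRegistry, Gauge, generate_latest

registry = CollectorRegistry()
g_util = Gauge("gpu_utilization_percent", "NVML GPU utilization", registry=registry)
g_mem = Gauge("gpu_memory_used_mib", "GPU memory used (MiB)", registry=registry)
g_temp = Gauge("gpu_temperature_celsius", "GPU temperature", registry=registry)

# These would be fed from the NVML sampler; fixed values for the sketch.
g_util.set(87.0)
g_mem.set(10240.0)
g_temp.set(61.0)

text = generate_latest(registry).decode()
print("# HELP gpu_utilization_percent" in text)       # → True
print("# TYPE gpu_utilization_percent gauge" in text)  # → True
```

generate_latest emits the same # HELP / # TYPE exposition dialect the grader checks, which is why any Prometheus-compatible scraper can consume it unchanged.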
Frequently asked questions
Is NVML GPU_UTIL the same as SM occupancy?
No. nvmlDeviceGetUtilizationRates().gpu reports the percentage of time in the last sample window during which at least one kernel was executing. It tells you nothing about how many SMs were active, how saturated the Tensor Cores were, or whether the kernel was memory-bound waiting on DRAM. You can hit 100% on a deliberately bad kernel that uses a single SM. For true occupancy you need DCGM's SM_ACTIVE, SM_OCCUPANCY, and TENSOR_ACTIVE fields, or Nsight Compute / PyTorch Profiler on the kernel of interest.
Why expose metrics in Prometheus exposition format instead of pushing JSON?
Because the exposition format is the plain-text standard the whole monitoring ecosystem speaks: # HELP and # TYPE headers followed by metric_name{labels} value timestamp lines. Once your gauge endpoint is valid, you get alerting, long-term storage, recording rules, and multi-cluster federation for free. The lab writes a minimal exposition endpoint on purpose, so you see that dcgm-exporter is not magic: it's the same pattern with a larger metric catalogue.
When should I reach for DCGM or dcgm-exporter instead of raw NVML?
When you need the second-order metrics raw NVML doesn't surface: DCGM_FI_DEV_XID_ERRORS, DCGM_FI_DEV_SM_ACTIVE, DCGM_FI_DEV_DRAM_ACTIVE, DCGM_FI_DEV_TENSOR_ACTIVE, DCGM_FI_DEV_PCIE_TX_BYTES, and so on — exactly the metrics the reflection step pushes you to add. dcgm-exporter is the productionised wrapper: it runs as a DaemonSet on every GPU node, exposes a scrape target, and auto-labels metrics with gpu, modelName, and UUID. For a single workload or a development pod, raw pynvml is faster and lighter; for a cluster, run dcgm-exporter.
Why is 100 ms the right sampling cadence?
It's fast enough to catch dataloader gaps on 30-200 ms batch boundaries, yet slow enough that the sampler thread doesn't perturb the workload it measures. For cluster-wide monitoring you scrape dcgm-exporter at 1-5 s; for profiling a specific run, 10-100 ms is where teams land.
How does the dataloader fix actually close the utilization gap?
In the bad_history run, each batch is preprocessed on the CPU and copied to the GPU synchronously before the forward pass — the GPU sits idle waiting on H2D. The trace shows classic 'sawtooth' utilization: spikes during compute, drops to zero during copy. Moving the preprocessing to the GPU (or using pin_memory=True + num_workers>0 + non_blocking=True for async copies, or DALI for GPU-resident decode and augment) overlaps transfer with compute so the GPU stays hot. The 10+ percentage-point gain in the good run is the recovered idle time. The validator just needs good_mean - bad_mean >= 10.
What other metrics belong on a real training dashboard besides GPU util?
Step time, tokens/sec, MFU, SM_ACTIVE from DCGM, Tensor Core activity, CPU-side dataloader queue depth, and host-to-device bandwidth. The reflection step specifically asks you to pick the ones that would catch 'high utilization but the job is still slow' — the answer usually starts with MFU and step time.
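The pin_memory / num_workers / non_blocking fix discussed in the dataloader answer above can be sketched as follows. Synthetic tensors stand in for the lab's dataset, and the snippet falls back to CPU when no CUDA device is present, so the structure is the point, not the numbers:

```python
# Async-copy dataloader sketch: background workers prepare batches
# while pinned memory makes the H2D copy overlap with GPU compute.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for an image dataset.
dataset = TensorDataset(torch.randn(1024, 3, 32, 32), torch.randint(0, 10, (1024,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=2,    # preprocess batches in background worker processes
    pin_memory=True,  # page-locked host buffers enable truly async H2D copies
)

device = "cuda" if torch.cuda.is_available() else "cpu"
for x, y in loader:
    # non_blocking=True only pays off from pinned memory: the copy is
    # queued on the stream and overlaps with compute already in flight.
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    # ... forward/backward would go here ...
    break  # one batch is enough for the sketch
```

The synchronous bad_history version is the same loop with num_workers=0, pin_memory=False, and a blocking .to(device) — the difference between the two traces is the sawtooth.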