GPU Health Checks + Auto-Remediation
Build a production-grade GPU watchdog: multi-dimensional NVML health probe, rogue-process detection, auto-remediation that kills the offender and verifies recovery, then wire it up with Prometheus alerts and Kubernetes liveness probes.
What you'll learn
1. Build the health probe
2. Simulate the #1 production failure: a PID hoarding VRAM
3. Write the auto-remediator
4. Wire it into production: Prometheus + Kubernetes + watchdog
Prerequisites
- Comfortable with Python subprocesses and signals (SIGTERM/SIGKILL)
- Basic familiarity with NVML or `nvidia-smi` output
- Conceptual knowledge of Prometheus alerting and Kubernetes probes
What you'll build in this GPU health and auto-remediation lab
GPU dashboards show you that something is wrong; a watchdog actually does something about it. This lab builds the second one — a production-grade GPU auto-remediator that detects a rogue process hoarding VRAM, kills it safely, verifies real recovery by re-querying NVML, and emits Prometheus alerts and Kubernetes liveness signals so the rest of your SRE stack can respond.

The failure mode it's built for is the single most common one in real fleets: a Python training script crashes holding a 2 GiB torch.cuda allocation, never releases it, and the next job on the card fails with CUDA out of memory while nvidia-smi shows a perfectly calm GPU.

You leave with a working health_check() covering memory/temperature/ECC/power/Xid, a remediation function that measures actual freed VRAM, prometheus_rules YAML with real alert expressions, richer liveness exit codes (0 healthy, 2 software fault, 3 hardware fault), and a watchdog loop you can drop into a sidecar. ~50 minutes on a real NVIDIA GPU pod we provision.
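The core of that health_check() can be kept pure so it is unit-testable without a GPU. A minimal sketch, assuming illustrative thresholds and a `classify_gpu_health()` helper name that is not part of the lab's graded interface — on the lab pod the `readings` dict would be filled from pynvml calls such as `nvmlDeviceGetMemoryInfo` and `nvmlDeviceGetTemperature`:

```python
# Illustrative sketch: thresholds and field names are assumptions, not
# the lab's exact grading contract.
HEALTHY, DEGRADED, CRITICAL = "healthy", "degraded", "critical"

def classify_gpu_health(readings: dict) -> tuple[str, list[str]]:
    """Map raw NVML readings to (status, issues). Pure: no NVML calls here."""
    issues: list[str] = []
    status = HEALTHY

    mem_pct = 100.0 * readings["memory_used_mib"] / readings["memory_total_mib"]
    if mem_pct > 95:
        issues.append(f"memory {mem_pct:.0f}% used")
        status = DEGRADED

    if readings["temperature_c"] >= 90:          # near the slowdown threshold
        issues.append(f"temperature {readings['temperature_c']} C")
        status = CRITICAL

    if readings.get("ecc_uncorrected", 0) > 0:   # hardware-fault territory
        issues.append("uncorrected ECC errors")
        status = CRITICAL

    if readings.get("xid_errors"):
        issues.append(f"Xid errors: {readings['xid_errors']}")
        status = CRITICAL

    return status, issues
```

Separating the classification from the NVML queries is what lets the validator (and your own tests) exercise every escalation path without needing to fault real hardware.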
The technical substance lives in the details that separate probes that work from probes that lie. The IDLE throttle bit (0x1) in currentClocksThrottleReasons is always set on an idle GPU, so alerting on any non-zero throttle mask flags 90% of healthy inference boxes; the right mask is throttle & ~0x1 — alert on HW_SLOWDOWN, HW_THERMAL_SLOWDOWN, SW_THERMAL_SLOWDOWN, HW_POWER_BRAKE. Rogue-process enumeration uses nvmlDeviceGetComputeRunningProcesses so you get real per-PID accounting, not a guess from ps.

Remediation sends SIGTERM with a grace window (the process gets to flush checkpoints, close NCCL communicators, release the CUDA context) before escalating to SIGKILL — the same pattern Kubernetes terminationGracePeriodSeconds encodes at pod level. The trap the lab intentionally trips: if you don't re-query nvmlDeviceGetMemoryInfo after the kill, memory_freed_mib comes out zero or negative, and the validator catches it with a >500 MiB assertion.

The reflection step is the real lesson — auto-remediation is dangerous for ECC errors, Xid hardware faults, thermal runaway, and anything involving multi-node NCCL state, so production policy needs cooldowns, rate limits, allowlists, and human-in-the-loop gates.
Prereqs: Python subprocesses and signals (SIGTERM / SIGKILL), basic familiarity with NVML or nvidia-smi output, and conceptual knowledge of Prometheus alerting and Kubernetes probes. Preinstalled on the lab pod: pynvml, PyTorch, the CUDA toolkit, the NVIDIA driver. Grading is concrete at every step — baseline probe returns healthy with zero issues; after the simulated leak it escalates to degraded/critical and lists the rogue PID via per-process accounting; the remediator frees >500 MiB and returns the card to healthy; the alert YAML defines ≥2 rules with the canonical Prometheus alert: / expr: / for: / severity fields. Pair this with the GPU monitoring lab (telemetry at rest) and you have the two halves of production GPU SRE: detection plus response.
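A minimal prometheus_rules sketch of the shape the validator looks for — two rules with the canonical alert: / expr: / for: / severity fields. The metric names (gpu_health_status, gpu_memory_used_percent) and the job label are assumptions about what your watchdog exports, not fixed by the lab:

```yaml
groups:
  - name: gpu-health
    rules:
      - alert: GPUUnhealthy
        expr: gpu_health_status{job="gpu-watchdog"} >= 2
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "GPU {{ $labels.gpu }} reported critical by the watchdog"
      - alert: GPUMemoryHoarded
        expr: gpu_memory_used_percent{job="gpu-watchdog"} > 95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} VRAM above 95% for 5 minutes"
```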
Frequently asked questions
Why does the lab insist you re-query NVML after the kill?
nvmlDeviceGetMemoryInfo reports the leaked memory as 'used'. If you skip the post-kill query, you effectively compute memory_freed_mib = pre_memory_mib - pre_memory_mib: you get zero and report the remediation as a no-op. If you cache the pre-kill value and subtract a stale post-kill value, you can even report negative savings. The only trustworthy order of operations is: read pre, kill, wait for NVML to update (the device context takes a moment to tear down), read post, then subtract. The validator specifically checks that memory_freed_mib > 500 — anything lower almost always means you skipped the post-kill query.

Why ignore the IDLE throttle bit (0x1) in the health probe?
If you treat any non-zero currentClocksThrottleReasons value as degraded, your probe will be 'critical' 90% of the time on a healthy inference box, because the IDLE bit (0x1) is always set when the GPU has no work. The production pattern is a mask: throttle & ~0x1 — alert on HW_SLOWDOWN, HW_THERMAL_SLOWDOWN, SW_THERMAL_SLOWDOWN, or HW_POWER_BRAKE; ignore IDLE and (usually) APPLICATIONS_CLOCKS_SETTING.

SIGTERM then SIGKILL — why not just SIGKILL right away?
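That mask can be written as a small pure function. The bit values below are the NVML nvmlClocksThrottleReason* constants; the `actionable_throttle_reasons()` helper name is illustrative, not part of pynvml:

```python
# NVML throttle-reason bits (nvmlClocksThrottleReason* in nvml.h).
IDLE                 = 0x01  # always set on an idle GPU -- ignore
APP_CLOCKS_SETTING   = 0x02  # operator-configured clocks -- usually ignore
SW_POWER_CAP         = 0x04
HW_SLOWDOWN          = 0x08
SYNC_BOOST           = 0x10
SW_THERMAL_SLOWDOWN  = 0x20
HW_THERMAL_SLOWDOWN  = 0x40
HW_POWER_BRAKE       = 0x80

ACTIONABLE = {
    HW_SLOWDOWN: "HW_SLOWDOWN",
    SW_THERMAL_SLOWDOWN: "SW_THERMAL_SLOWDOWN",
    HW_THERMAL_SLOWDOWN: "HW_THERMAL_SLOWDOWN",
    HW_POWER_BRAKE: "HW_POWER_BRAKE",
}

def actionable_throttle_reasons(mask: int) -> list[str]:
    """Return only the throttle reasons worth alerting on."""
    return [name for bit, name in ACTIONABLE.items() if mask & bit]
```

An idle card reporting mask 0x1 yields an empty list, while a thermally throttled one (say 0x41) flags HW_THERMAL_SLOWDOWN — which is exactly the distinction the naive "any non-zero mask" probe misses.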
SIGTERM gives the process (and any atexit / signal handlers it registered) a grace window — typically 10 to 30 seconds — to flush state, release GPU memory, and exit. Only if the grace window expires do you escalate to SIGKILL. Kubernetes' terminationGracePeriodSeconds encodes exactly this idea at the pod level.

Is auto-remediation ever a bad idea?

Yes. Killing a process is the wrong response to ECC errors, Xid hardware faults, thermal runaway, and anything involving multi-node NCCL state — those call for draining, resets, or a human. That's why production policy layers on cooldowns, rate limits, allowlists, and human-in-the-loop gates.
How does this differ from the GPU monitoring lab?

The monitoring lab covers telemetry at rest — detection. This lab covers response: acting on what the telemetry says, verifying real recovery, and feeding the outcome back to Prometheus and Kubernetes. Together they are the two halves of production GPU SRE: detection plus response.
What exit codes should the Kubernetes liveness probe use?
The lab only requires healthy -> 0 and unhealthy -> non-zero, but richer codes pay off in production: with 0 = healthy, 2 = software fault, 3 = hardware fault, kubectl logs on the crashlooping pod plus the last exit code tells you which kind of outage you're dealing with.
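The mapping from probe result to exit code can be sketched as a tiny pure function — the `liveness_exit_code()` name and the substring-based hardware check are illustrative assumptions; in the lab the (status, issues) pair would come from your health_check():

```python
# Richer liveness exit codes: 0 healthy, 2 software fault, 3 hardware fault.
EXIT_HEALTHY, EXIT_SOFT_FAULT, EXIT_HARD_FAULT = 0, 2, 3

# Substrings assumed to mark hardware-level issues in the probe's output.
HARDWARE_MARKERS = ("uncorrected ECC", "Xid", "thermal")

def liveness_exit_code(status: str, issues: list[str]) -> int:
    if status == "healthy":
        return EXIT_HEALTHY
    if any(m in issue for issue in issues for m in HARDWARE_MARKERS):
        return EXIT_HARD_FAULT   # don't auto-remediate; page a human
    return EXIT_SOFT_FAULT       # e.g. a rogue PID hoarding VRAM
```

A probe script would end with `sys.exit(liveness_exit_code(status, issues))`, so the kubelet sees the non-zero code and restarts the pod, while the code itself survives in the pod's last-state record for the on-call engineer.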