GPU Health Checks + Auto-Remediation
GPU sandbox · jupyter
Beta

Build a production-grade GPU watchdog: multi-dimensional NVML health probe, rogue-process detection, auto-remediation that kills the offender and verifies recovery, then wire it up with Prometheus alerts and Kubernetes liveness probes.

50 min · 4 steps · 2 domains · Advanced · nca-aiio · ncp-aio · ncp-aii

What you'll learn

  1. Build the health probe
  2. Simulate the #1 production failure: a PID hoarding VRAM
  3. Write the auto-remediator
  4. Wire it into production: Prometheus + Kubernetes + watchdog

Prerequisites

  • Comfortable with Python subprocesses and signals (SIGTERM/SIGKILL)
  • Basic familiarity with NVML or `nvidia-smi` output
  • Conceptual knowledge of Prometheus alerting and Kubernetes probes

Exam domains covered

GPU Acceleration & Distributed Training · Model Deployment & Inference Optimization

Skills & technologies you'll practice

This advanced-level GPU lab gives you real-world reps across:

NVML · pynvml · GPU Health · Remediation · Prometheus · Kubernetes · Watchdog · SRE

What you'll build in this GPU health and auto-remediation lab

GPU dashboards show you that something is wrong; a watchdog actually does something about it. This lab builds the second one — a production-grade GPU auto-remediator that detects a rogue process hoarding VRAM, kills it safely, verifies real recovery by re-querying NVML, and emits Prometheus alerts and Kubernetes liveness signals so the rest of your SRE stack can respond. The failure mode it's built for is the single most common one in real fleets: a Python training script crashes holding a 2 GiB torch.cuda allocation, never releases it, and the next job on the card fails with CUDA out of memory while nvidia-smi shows a perfectly calm GPU. You leave with a working health_check() covering memory/temperature/ECC/power/Xid, a remediation function that measures actual freed VRAM, prometheus_rules YAML with real alert expressions, richer liveness exit codes (0 healthy, 2 software fault, 3 hardware fault), and a watchdog loop you can drop into a sidecar. ~50 minutes on a real NVIDIA GPU pod we provision.

The technical substance lives in the details that separate probes that work from probes that lie. The IDLE throttle bit (0x1) in currentClocksThrottleReasons is always set on an idle GPU, so alerting on any non-zero throttle mask flags 90% of healthy inference boxes; the right mask is throttle & ~0x1 — alert on HW_SLOWDOWN, HW_THERMAL_SLOWDOWN, SW_THERMAL_SLOWDOWN, HW_POWER_BRAKE. Rogue-process enumeration uses nvmlDeviceGetComputeRunningProcesses so you get real per-PID accounting, not a guess from ps. Remediation sends SIGTERM with a grace window (the process gets to flush checkpoints, close NCCL communicators, release the CUDA context) before escalating to SIGKILL — the same pattern Kubernetes terminationGracePeriodSeconds encodes at pod level. The trap the lab intentionally trips: if you don't re-query nvmlDeviceGetMemoryInfo after the kill, memory_freed_mib comes out zero or negative, and the validator catches it with a >500 MiB assertion. The reflection step is the real lesson — auto-remediation is dangerous for ECC errors, Xid hardware faults, thermal runaway, and anything involving multi-node NCCL state, so production policy needs cooldowns, rate limits, allowlists, and human-in-the-loop gates.
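To make the probe's decision logic concrete, here is a minimal sketch of the classification half of such a `health_check()`. The NVML reads are factored out into a plain `sample` dict so the logic is visible in isolation; the field names and thresholds below are illustrative assumptions, not the lab's graded values.

```python
# Sketch of the decision half of a GPU health probe. The `sample` dict
# stands in for real NVML reads (nvmlDeviceGetMemoryInfo,
# nvmlDeviceGetTemperature, nvmlDeviceGetTotalEccErrors, ...);
# thresholds and field names are illustrative, not the lab's graded values.

def classify_gpu(sample):
    """Return (status, issues) for one GPU's readings."""
    issues = []

    if sample["mem_used_frac"] > 0.95:
        issues.append(("software", "VRAM nearly exhausted"))
    if sample["temp_c"] >= 90:
        issues.append(("hardware", f"temperature {sample['temp_c']}C"))
    if sample["ecc_uncorrected"] > 0:
        issues.append(("hardware", "uncorrected ECC errors"))
    if sample["xid_errors"]:
        issues.append(("hardware", f"Xid events: {sample['xid_errors']}"))

    # Any hardware-class issue escalates straight to critical.
    if any(kind == "hardware" for kind, _ in issues):
        return "critical", issues
    if issues:
        return "degraded", issues
    return "healthy", issues
```

Separating the reads from the classification also makes the probe unit-testable without a GPU, which matters once this runs in CI.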

Prereqs: Python subprocesses and signals (SIGTERM / SIGKILL), basic familiarity with NVML or nvidia-smi output, and conceptual knowledge of Prometheus alerting and Kubernetes probes. Preinstalled on the lab pod: pynvml, PyTorch, the CUDA toolkit, the NVIDIA driver. Grading is concrete at every step — baseline probe returns healthy with zero issues; after the simulated leak it escalates to degraded/critical and lists the rogue PID via per-process accounting; the remediator frees >500 MiB and returns the card to healthy; the alert YAML defines ≥2 rules with the canonical Prometheus alert: / expr: / for: / severity fields. Pair this with the GPU monitoring lab (telemetry at rest) and you have the two halves of production GPU SRE: detection plus response.
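For orientation, a prometheus_rules fragment with the canonical alert / expr / for / severity shape might look like the following. The metric names (`gpu_health_status`, `gpu_temperature_celsius`) are assumptions about what your probe exports, not the lab's graded names.

```yaml
# Illustrative alerting-rules fragment; metric names are assumptions.
groups:
  - name: gpu-health
    rules:
      - alert: GPUDegraded
        expr: gpu_health_status{status="degraded"} == 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} degraded on {{ $labels.instance }}"
      - alert: GPUThermal
        expr: gpu_temperature_celsius > 90
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "GPU {{ $labels.gpu }} running hot on {{ $labels.instance }}"
```

The `for:` clause is what keeps a one-sample blip from paging anyone; the lab's validator only requires that at least two rules carry these four fields.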

Frequently asked questions

Why does the lab insist you re-query NVML after the kill?

Because before the kill, nvmlDeviceGetMemoryInfo reports the leaked memory as 'used'. If you never take a fresh post-kill reading — effectively computing memory_freed_mib = pre_memory_mib - pre_memory_mib — you'll get zero and report the remediation as a no-op; mix up the read order or reuse a stale value and you can even report negative savings. The only trustworthy order of operations is: read pre, kill, wait for NVML to update (the device context takes a moment to tear down), read post, then subtract. The validator specifically checks that memory_freed_mib > 500 — anything lower almost always means you skipped the post-kill query.
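That order of operations can be pinned down in a few lines. In this sketch the memory reader and the kill step are injected as callables so the sequencing is testable; in the lab they would wrap nvmlDeviceGetMemoryInfo and the actual SIGTERM/SIGKILL logic.

```python
import time

def remediate(pid, read_used_mib, kill_fn, settle_s=2.0):
    """Kill `pid` and measure how much VRAM was actually freed.

    `read_used_mib` / `kill_fn` are injected for clarity; in the lab they
    would wrap nvmlDeviceGetMemoryInfo and the kill escalation. The
    ordering is the whole point: pre-read, kill, settle, post-read.
    """
    pre_mib = read_used_mib()    # 1. read BEFORE the kill
    kill_fn(pid)                 # 2. terminate the offender
    time.sleep(settle_s)         # 3. let the CUDA context tear down
    post_mib = read_used_mib()   # 4. re-query -- never reuse pre_mib
    return {"pre_mib": pre_mib, "post_mib": post_mib,
            "memory_freed_mib": pre_mib - post_mib}
```

With fake readers returning 4096 then 1024, `memory_freed_mib` comes out 3072 — the validator's >500 MiB assertion is checking exactly this delta.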

Why ignore the IDLE throttle bit (0x1) in the health probe?

Every idle GPU reports the IDLE throttle reason — that's literally what 'the GPU is not currently under load' means to NVML. If you flag every non-zero currentClocksThrottleReasons value as degraded, your probe will be 'critical' 90% of the time on a healthy inference box. The production pattern is a mask: throttle & ~0x1 — alert on HW_SLOWDOWN, HW_THERMAL_SLOWDOWN, SW_THERMAL_SLOWDOWN, or HW_POWER_BRAKE, ignore IDLE and (usually) APPLICATIONS_CLOCKS_SETTING.
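A sketch of that mask, with the bit values mirroring the nvmlClocksThrottleReason* constants in nvml.h (defined locally here so the logic stands alone):

```python
# Throttle-reason bits, mirroring nvmlClocksThrottleReason* in nvml.h.
IDLE                = 0x1
APP_CLOCKS_SETTING  = 0x2
SW_POWER_CAP        = 0x4
HW_SLOWDOWN         = 0x8
SW_THERMAL_SLOWDOWN = 0x20
HW_THERMAL_SLOWDOWN = 0x40
HW_POWER_BRAKE      = 0x80

# Bits worth alerting on; IDLE and (usually) APP_CLOCKS_SETTING are benign.
ALERT_MASK = (HW_SLOWDOWN | SW_THERMAL_SLOWDOWN
              | HW_THERMAL_SLOWDOWN | HW_POWER_BRAKE)

def throttle_alerts(reasons):
    """Return only the alert-worthy bits from a raw NVML throttle mask."""
    return reasons & ALERT_MASK
```

Naming the bits and alerting on an explicit allowlist (rather than `& ~0x1`) also keeps the probe honest when a driver update adds new benign reasons.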

SIGTERM then SIGKILL — why not just SIGKILL right away?

Because SIGKILL cannot be caught and gives the process zero chance to release resources cleanly. A PyTorch training job killed mid-step can leave corrupted checkpoint files, dangling NCCL communicators, or a half-written optimizer state that takes down the next replica when it tries to resume. SIGTERM gives the process (and any atexit / signal handlers it registered) a grace window — typically 10 to 30 seconds — to flush state, release GPU memory, and exit. Only if the grace window expires do you escalate to SIGKILL. Kubernetes' terminationGracePeriodSeconds encodes exactly this idea at the pod level.
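The escalation pattern fits in one function. This sketch operates on a `subprocess.Popen` handle for clarity (and assumes a POSIX host, as GPU pods are); a real remediator working from a bare PID pulled out of NVML would use `os.kill` plus an existence check instead.

```python
import signal
import subprocess
import sys
import time

def kill_with_grace(proc, grace_s=15.0, poll_s=0.25):
    """SIGTERM first, then SIGKILL if the process outlives the grace window."""
    if proc.poll() is not None:
        return "already-gone"
    proc.send_signal(signal.SIGTERM)       # polite: handlers/atexit get to run
    deadline = time.monotonic() + grace_s
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return "terminated"            # exited within the grace window
        time.sleep(poll_s)
    proc.send_signal(signal.SIGKILL)       # escalate: cannot be caught
    proc.wait()
    return "killed"
```

A well-behaved process — one whose SIGTERM handler flushes checkpoints and releases the CUDA context — returns "terminated"; only a wedged one ever sees the SIGKILL branch.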

Is auto-remediation ever a bad idea?

Yes, and the reflection step is built around exactly this question. ECC errors, Xid hardware faults, thermal runaway, and anything involving multi-node training state (NCCL collectives, checkpoint coordination, allreduce stragglers) are all cases where the safest action is to page a human and stop the bleed, not kill the PID and pretend it's fixed. The pattern is: observe for well-characterized failure modes first, auto-remediate only for the ones where the failure mode is understood and the blast radius is bounded. Everything else is a PagerDuty event.
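The cooldowns, rate limits, and allowlists that paragraph calls for can be captured in a small policy gate. This is an illustrative sketch — the names, limits, and the software-only allowlist are assumptions, not the lab's graded interface.

```python
import time

class RemediationGate:
    """Cooldown + rate limit checked before any automatic kill.

    Illustrative policy sketch: limits and the software-only allowlist
    are assumptions. `clock` is injectable so the gate is testable.
    """

    def __init__(self, cooldown_s=300, max_per_hour=3, clock=time.monotonic):
        self.cooldown_s = cooldown_s
        self.max_per_hour = max_per_hour
        self.clock = clock
        self.history = []  # timestamps of past auto-remediations

    def allow(self, issue_kind):
        now = self.clock()
        # Keep only the last hour of history for the rate limit.
        self.history = [t for t in self.history if now - t < 3600]
        if issue_kind != "software":
            return False, "hardware/unknown faults page a human"
        if self.history and now - self.history[-1] < self.cooldown_s:
            return False, "in cooldown"
        if len(self.history) >= self.max_per_hour:
            return False, "hourly rate limit hit"
        self.history.append(now)
        return True, "ok"
```

Anything the gate refuses becomes the PagerDuty event the paragraph describes — the watchdog observes and alerts, but does not act.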

How does this differ from the GPU monitoring lab?

The monitoring lab is about telemetry at rest — DCGM-style metric pipelines, Prometheus gauges, dashboards that help you notice a problem. This lab is about incident response — detecting a broken GPU, making a decision, taking an action, and proving the action worked. You'll often deploy both: dcgm-exporter surfaces the raw signal on a dashboard, and this watchdog is the component that translates a degraded signal into 'kill the rogue PID, alert on the hardware fault, restart the pod' without a human in the loop.

What exit codes should the Kubernetes liveness probe use?

Exit 0 on healthy, any non-zero on unhealthy — kubelet interprets any non-zero exit as a probe failure. The convention many teams use is to reserve a namespace: exit 2 for software-level failure (can't import torch, NVML init failed), exit 3 for hardware-level failure (ECC, Xid, temperature), exit 0 for healthy. That way kubectl logs on the crashlooping pod plus the last exit code tells you which kind of outage you're dealing with. The lab just requires healthy -> 0 and unhealthy -> non-zero, but it's worth picking richer codes once you go to production.
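That convention reduces to a small mapping at the end of the probe script. The `(kind, detail)` issue-tuple shape here is an assumption about how the health check reports problems.

```python
import sys

# Exit-code convention from the lab: 0 healthy, 2 software fault, 3 hardware
# fault. kubelet only distinguishes zero vs. non-zero; the richer codes are
# for humans reading the crashloop afterwards.
EXIT_HEALTHY, EXIT_SOFTWARE, EXIT_HARDWARE = 0, 2, 3

def exit_code_for(issues):
    """Map a probe's issue list [(kind, detail), ...] to a liveness exit code."""
    kinds = {kind for kind, _ in issues}
    if "hardware" in kinds:
        return EXIT_HARDWARE   # ECC, Xid, temperature
    if kinds:
        return EXIT_SOFTWARE   # NVML init failure, import errors, VRAM leaks
    return EXIT_HEALTHY

if __name__ == "__main__":
    # In the real probe script, `issues` would come from the health check.
    sys.exit(exit_code_for([]))
```

Hardware wins ties deliberately: if a card has both an Xid fault and a rogue PID, the crashlooping pod should read as a hardware outage first.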