GPU Health Checks + Auto-Remediation
GPU sandbox · jupyter
Beta

Build a production-grade GPU watchdog: multi-dimensional NVML health probe, rogue-process detection, auto-remediation that kills the offender and verifies recovery, then wire it up with Prometheus alerts and Kubernetes liveness probes.

50 min · 4 steps · 2 domains · Advanced · nca-aiio · ncp-aio · ncp-aii

What you'll learn

  1. Build the health probe
  2. Simulate the #1 production failure: a PID hoarding VRAM
  3. Write the auto-remediator
  4. Wire it into production: Prometheus + Kubernetes + watchdog

Prerequisites

  • Comfortable with Python subprocesses and signals (SIGTERM/SIGKILL)
  • Basic familiarity with NVML or `nvidia-smi` output
  • Conceptual knowledge of Prometheus alerting and Kubernetes probes

Exam domains covered

GPU Acceleration & Distributed Training · Model Deployment & Inference Optimization

Skills & technologies you'll practice

This advanced-level GPU lab gives you real-world reps across:

NVML · pynvml · GPU Health · Remediation · Prometheus · Kubernetes · Watchdog · SRE

What you'll build in this GPU health and auto-remediation lab

GPU dashboards show you that something is wrong; a watchdog actually does something about it. This lab builds the second one — a production-grade GPU auto-remediator that detects a rogue process hoarding VRAM, kills it safely, verifies real recovery by re-querying NVML, and emits Prometheus alerts and Kubernetes liveness signals so the rest of your SRE stack can respond. The failure mode it's built for is the single most common one in real fleets: a Python training script crashes holding a 2 GiB torch.cuda allocation, never releases it, and the next job on the card fails with CUDA out of memory while nvidia-smi shows a perfectly calm GPU. You leave with a working health_check() covering memory/temperature/ECC/power/Xid, a remediation function that measures actual freed VRAM, prometheus_rules YAML with real alert expressions, richer liveness exit codes (0 healthy, 2 software fault, 3 hardware fault), and a watchdog loop you can drop into a sidecar. ~50 minutes on a real NVIDIA GPU pod we provision.

The technical substance lives in the details that separate probes that work from probes that lie. The IDLE throttle bit (0x1) in currentClocksThrottleReasons is always set on an idle GPU, so alerting on any non-zero throttle mask flags 90% of healthy inference boxes; the right mask is throttle & ~0x1 — alert on HW_SLOWDOWN, HW_THERMAL_SLOWDOWN, SW_THERMAL_SLOWDOWN, HW_POWER_BRAKE. Rogue-process enumeration uses nvmlDeviceGetComputeRunningProcesses so you get real per-PID accounting, not a guess from ps. Remediation sends SIGTERM with a grace window (the process gets to flush checkpoints, close NCCL communicators, release the CUDA context) before escalating to SIGKILL — the same pattern Kubernetes terminationGracePeriodSeconds encodes at pod level. The trap the lab intentionally trips: if you don't re-query nvmlDeviceGetMemoryInfo after the kill, memory_freed_mib comes out zero or negative, and the validator catches it with a >500 MiB assertion. The reflection step is the real lesson — auto-remediation is dangerous for ECC errors, Xid hardware faults, thermal runaway, and anything involving multi-node NCCL state, so production policy needs cooldowns, rate limits, allowlists, and human-in-the-loop gates.
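To make the probe's decision logic concrete, here is a minimal sketch of the classification half of such a `health_check()`. The NVML reads are factored out into a plain `sample` dict so the logic is visible in isolation; the field names and thresholds below are illustrative assumptions, not the lab's graded values.

```python
# Sketch of the decision half of a GPU health probe. The `sample` dict
# stands in for real NVML reads (nvmlDeviceGetMemoryInfo,
# nvmlDeviceGetTemperature, nvmlDeviceGetTotalEccErrors, ...);
# thresholds and field names are illustrative, not the lab's graded values.

def classify_gpu(sample):
    """Return (status, issues) for one GPU's readings."""
    issues = []

    if sample["mem_used_frac"] > 0.95:
        issues.append(("software", "VRAM nearly exhausted"))
    if sample["temp_c"] >= 90:
        issues.append(("hardware", f"temperature {sample['temp_c']}C"))
    if sample["ecc_uncorrected"] > 0:
        issues.append(("hardware", "uncorrected ECC errors"))
    if sample["xid_errors"]:
        issues.append(("hardware", f"Xid events: {sample['xid_errors']}"))

    # Any hardware-class issue escalates straight to critical.
    if any(kind == "hardware" for kind, _ in issues):
        return "critical", issues
    if issues:
        return "degraded", issues
    return "healthy", issues
```

Separating the reads from the classification also makes the probe unit-testable without a GPU, which matters once this runs in CI.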

Prereqs: Python subprocesses and signals (SIGTERM / SIGKILL), basic familiarity with NVML or nvidia-smi output, and conceptual knowledge of Prometheus alerting and Kubernetes probes. Preinstalled on the lab pod: pynvml, PyTorch, the CUDA toolkit, the NVIDIA driver. Grading is concrete at every step — baseline probe returns healthy with zero issues; after the simulated leak it escalates to degraded/critical and lists the rogue PID via per-process accounting; the remediator frees >500 MiB and returns the card to healthy; the alert YAML defines ≥2 rules with the canonical Prometheus alert: / expr: / for: / severity fields. Pair this with the GPU monitoring lab (telemetry at rest) and you have the two halves of production GPU SRE: detection plus response.
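For orientation, a prometheus_rules fragment with the canonical alert / expr / for / severity shape might look like the following. The metric names (`gpu_health_status`, `gpu_temperature_celsius`) are assumptions about what your probe exports, not the lab's graded names.

```yaml
# Illustrative alerting-rules fragment; metric names are assumptions.
groups:
  - name: gpu-health
    rules:
      - alert: GPUDegraded
        expr: gpu_health_status{status="degraded"} == 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} degraded on {{ $labels.instance }}"
      - alert: GPUThermal
        expr: gpu_temperature_celsius > 90
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "GPU {{ $labels.gpu }} running hot on {{ $labels.instance }}"
```

The `for:` clause is what keeps a one-sample blip from paging anyone; the lab's validator only requires that at least two rules carry these four fields.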

Frequently asked questions

Why does the lab insist you re-query NVML after the kill?

Because before the kill, nvmlDeviceGetMemoryInfo reports the leaked memory as 'used'. If you never take a fresh post-kill reading — effectively computing memory_freed_mib = pre_memory_mib - pre_memory_mib — you'll get zero and report the remediation as a no-op; mix up the read order or reuse a stale value and you can even report negative savings. The only trustworthy order of operations is: read pre, kill, wait for NVML to update (the device context takes a moment to tear down), read post, then subtract. The validator specifically checks that memory_freed_mib > 500 — anything lower almost always means you skipped the post-kill query.
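That order of operations can be pinned down in a few lines. In this sketch the memory reader and the kill step are injected as callables so the sequencing is testable; in the lab they would wrap nvmlDeviceGetMemoryInfo and the actual SIGTERM/SIGKILL logic.

```python
import time

def remediate(pid, read_used_mib, kill_fn, settle_s=2.0):
    """Kill `pid` and measure how much VRAM was actually freed.

    `read_used_mib` / `kill_fn` are injected for clarity; in the lab they
    would wrap nvmlDeviceGetMemoryInfo and the kill escalation. The
    ordering is the whole point: pre-read, kill, settle, post-read.
    """
    pre_mib = read_used_mib()    # 1. read BEFORE the kill
    kill_fn(pid)                 # 2. terminate the offender
    time.sleep(settle_s)         # 3. let the CUDA context tear down
    post_mib = read_used_mib()   # 4. re-query -- never reuse pre_mib
    return {"pre_mib": pre_mib, "post_mib": post_mib,
            "memory_freed_mib": pre_mib - post_mib}
```

With fake readers returning 4096 then 1024, `memory_freed_mib` comes out 3072 — the validator's >500 MiB assertion is checking exactly this delta.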

Why ignore the IDLE throttle bit (0x1) in the health probe?

Every idle GPU reports the IDLE throttle reason — that's literally what 'the GPU is not currently under load' means to NVML. If you flag every non-zero currentClocksThrottleReasons value as degraded, your probe will be 'critical' 90% of the time on a healthy inference box. The production pattern is a mask: throttle & ~0x1 — alert on HW_SLOWDOWN, HW_THERMAL_SLOWDOWN, SW_THERMAL_SLOWDOWN, or HW_POWER_BRAKE, ignore IDLE and (usually) APPLICATIONS_CLOCKS_SETTING.
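A sketch of that mask, with the bit values mirroring the nvmlClocksThrottleReason* constants in nvml.h (defined locally here so the logic stands alone):

```python
# Throttle-reason bits, mirroring nvmlClocksThrottleReason* in nvml.h.
IDLE                = 0x1
APP_CLOCKS_SETTING  = 0x2
SW_POWER_CAP        = 0x4
HW_SLOWDOWN         = 0x8
SW_THERMAL_SLOWDOWN = 0x20
HW_THERMAL_SLOWDOWN = 0x40
HW_POWER_BRAKE      = 0x80

# Bits worth alerting on; IDLE and (usually) APP_CLOCKS_SETTING are benign.
ALERT_MASK = (HW_SLOWDOWN | SW_THERMAL_SLOWDOWN
              | HW_THERMAL_SLOWDOWN | HW_POWER_BRAKE)

def throttle_alerts(reasons):
    """Return only the alert-worthy bits from a raw NVML throttle mask."""
    return reasons & ALERT_MASK
```

Naming the bits and alerting on an explicit allowlist (rather than `& ~0x1`) also keeps the probe honest when a driver update adds new benign reasons.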

SIGTERM then SIGKILL — why not just SIGKILL right away?

Because SIGKILL cannot be caught and gives the process zero chance to release resources cleanly. A PyTorch training job killed mid-step can leave corrupted checkpoint files, dangling NCCL communicators, or a half-written optimizer state that takes down the next replica when it tries to resume. SIGTERM gives the process (and any atexit / signal handlers it registered) a grace window — typically 10 to 30 seconds — to flush state, release GPU memory, and exit. Only if the grace window expires do you escalate to SIGKILL. Kubernetes' terminationGracePeriodSeconds encodes exactly this idea at the pod level.
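The escalation pattern fits in one function. This sketch operates on a `subprocess.Popen` handle for clarity (and assumes a POSIX host, as GPU pods are); a real remediator working from a bare PID pulled out of NVML would use `os.kill` plus an existence check instead.

```python
import signal
import subprocess
import sys
import time

def kill_with_grace(proc, grace_s=15.0, poll_s=0.25):
    """SIGTERM first, then SIGKILL if the process outlives the grace window."""
    if proc.poll() is not None:
        return "already-gone"
    proc.send_signal(signal.SIGTERM)       # polite: handlers/atexit get to run
    deadline = time.monotonic() + grace_s
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return "terminated"            # exited within the grace window
        time.sleep(poll_s)
    proc.send_signal(signal.SIGKILL)       # escalate: cannot be caught
    proc.wait()
    return "killed"
```

A well-behaved process — one whose SIGTERM handler flushes checkpoints and releases the CUDA context — returns "terminated"; only a wedged one ever sees the SIGKILL branch.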

Is auto-remediation ever a bad idea?

Yes, and the reflection step is built around exactly this question. ECC errors, Xid hardware faults, thermal runaway, and anything involving multi-node training state (NCCL collectives, checkpoint coordination, allreduce stragglers) are all cases where the safest action is to page a human and stop the bleed, not kill the PID and pretend it's fixed. The pattern is: observe for well-characterized failure modes first, auto-remediate only for the ones where the failure mode is understood and the blast radius is bounded. Everything else is a PagerDuty event.
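The cooldowns, rate limits, and allowlists that paragraph calls for can be captured in a small policy gate. This is an illustrative sketch — the names, limits, and the software-only allowlist are assumptions, not the lab's graded interface.

```python
import time

class RemediationGate:
    """Cooldown + rate limit checked before any automatic kill.

    Illustrative policy sketch: limits and the software-only allowlist
    are assumptions. `clock` is injectable so the gate is testable.
    """

    def __init__(self, cooldown_s=300, max_per_hour=3, clock=time.monotonic):
        self.cooldown_s = cooldown_s
        self.max_per_hour = max_per_hour
        self.clock = clock
        self.history = []  # timestamps of past auto-remediations

    def allow(self, issue_kind):
        now = self.clock()
        # Keep only the last hour of history for the rate limit.
        self.history = [t for t in self.history if now - t < 3600]
        if issue_kind != "software":
            return False, "hardware/unknown faults page a human"
        if self.history and now - self.history[-1] < self.cooldown_s:
            return False, "in cooldown"
        if len(self.history) >= self.max_per_hour:
            return False, "hourly rate limit hit"
        self.history.append(now)
        return True, "ok"
```

Anything the gate refuses becomes the PagerDuty event the paragraph describes — the watchdog observes and alerts, but does not act.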

How does this differ from the GPU monitoring lab?

The monitoring lab is about telemetry at rest — DCGM-style metric pipelines, Prometheus gauges, dashboards that help you notice a problem. This lab is about incident response — detecting a broken GPU, making a decision, taking an action, and proving the action worked. You'll often deploy both: dcgm-exporter surfaces the raw signal on a dashboard, and this watchdog is the component that translates a degraded signal into 'kill the rogue PID, alert on the hardware fault, restart the pod' without a human in the loop.

What exit codes should the Kubernetes liveness probe use?

Exit 0 on healthy, any non-zero on unhealthy — kubelet interprets any non-zero exit as a probe failure. The convention many teams use is to reserve a namespace: exit 2 for software-level failure (can't import torch, NVML init failed), exit 3 for hardware-level failure (ECC, Xid, temperature), exit 0 for healthy. That way kubectl logs on the crashlooping pod plus the last exit code tells you which kind of outage you're dealing with. The lab just requires healthy -> 0 and unhealthy -> non-zero, but it's worth picking richer codes once you go to production.
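That convention reduces to a small mapping at the end of the probe script. The `(kind, detail)` issue-tuple shape here is an assumption about how the health check reports problems.

```python
import sys

# Exit-code convention from the lab: 0 healthy, 2 software fault, 3 hardware
# fault. kubelet only distinguishes zero vs. non-zero; the richer codes are
# for humans reading the crashloop afterwards.
EXIT_HEALTHY, EXIT_SOFTWARE, EXIT_HARDWARE = 0, 2, 3

def exit_code_for(issues):
    """Map a probe's issue list [(kind, detail), ...] to a liveness exit code."""
    kinds = {kind for kind, _ in issues}
    if "hardware" in kinds:
        return EXIT_HARDWARE   # ECC, Xid, temperature
    if kinds:
        return EXIT_SOFTWARE   # NVML init failure, import errors, VRAM leaks
    return EXIT_HEALTHY

if __name__ == "__main__":
    # In the real probe script, `issues` would come from the health check.
    sys.exit(exit_code_for([]))
```

Hardware wins ties deliberately: if a card has both an Xid fault and a rogue PID, the crashlooping pod should read as a hardware outage first.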