Stuck-Pending Triage Day — Diagnose Any GPU Pod That Won't Run

The capstone NCA-AIIO operations lab. Walk the five-stage pod lifecycle (admission → scheduling → image pull → runtime → readiness), learn which `kubectl describe` field signals each stuck point, and finish by fixing four broken GPU pods, each broken at a different stage.

40 min · 5 steps · 2 domains · Intermediate · nca-aiio · ncp-aii · ncp-aio

What you'll learn

  1. The five-gate pod lifecycle and where pods get stuck
     Every pod has to clear five gates between kubectl apply and answering its first request. A pod that's "stuck" is wedged at exactly one of them. Knowing which gate is the difference between fixing it in 30 seconds and chasing the wrong layer for 30 minutes.
  2. Gate 1: Admission rejection (LimitRange max)
     The first gate every pod hits is API server admission. Before the pod is even persisted to etcd, every admission controller in the cluster gets a chance to reject it. The most common rejections in production GPU clusters:
  3. Gate 2: Scheduling stuck (FailedScheduling)
     The pod cleared admission (the API server accepted the spec, it's persisted in etcd, you can see it with kubectl get pods), but the kube-scheduler can't place it on any node. The pod sits in Pending with .spec.nodeName unset indefinitely.
  4. Gate 3: Image pull stuck (ImagePullBackOff)
     The pod cleared admission AND the scheduler (.spec.nodeName is set, so it's been placed on a node), but the kubelet can't pull the container image. The pod stays in ContainerCreating until the first pull fails, then shows ErrImagePull and ImagePullBackOff while the kubelet retries with increasing backoff.
  5. Capstone: four broken GPU pods, four different gates
     You're on call. Pager fires. Four pods are broken, one stuck at each of the four lifecycle gates you've worked through:

Prerequisites

  • Completed earlier NCA-AIIO labs (resource-requests-limits, gpu-operator-chain, priority-preemption, storage-pvc, workload-controllers)
  • Comfortable reading `kubectl describe pod` events and tracing the pod lifecycle

Exam domains covered

AI Infrastructure & Operations · Workload Management

Skills & technologies you'll practice

This intermediate-level AI/ML lab gives you real-world reps across:

Triage · Pending · ContainerCreating · ResourceQuota · RuntimeClass · ImagePullBackOff · NCA-AIIO · Capstone · Kubernetes

What you'll learn

Every Kubernetes pod travels through five gates between kubectl apply and serving traffic: API-server admission, scheduler placement, image pull, container runtime startup, and readiness gating. A pod that's "Pending" or "ContainerCreating" is stuck at one of those five gates — but the right diagnostic command is different for each, and the failure event is buried in a different place in kubectl describe pod. This lab teaches the chain top-to-bottom, then drops you into a Triage Day where four broken GPU pods each break a different gate and you fix them by reading the cluster's response.

The diagnostic patterns covered are the most-frequent production tickets on any GPU cluster: ResourceQuota / LimitRange admission rejection, nodeSelector mismatch, ImagePullBackOff, RuntimeClass-handler-not-configured, missing PVCs, OOMKilled at runtime. Each maps to a specific kubectl describe line you can recognize at a glance — and once you can name the gate, the fix follows directly. NCA-AIIO frames these as "given this stuck pod, name the gate and the fix."

Frequently asked questions

What's the difference between Pending and ContainerCreating?

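Pending means the pod object exists but the scheduler hasn't placed it on a node, either because no node satisfies the pod's requirements (nodeSelector / capacity / taints) or because it's still being evaluated. A pod rejected at admission (quota / limits) never reaches Pending: the API server refuses the create, so the error shows up in the kubectl apply output or, for controller-managed pods, as a FailedCreate event on the owning ReplicaSet or Job. ContainerCreating means the scheduler placed the pod (you'll see .spec.nodeName set), but the kubelet hasn't yet started the container, most often because of an image pull failure, a volume attach failure, or a runtime handler issue (runtimeClassName references a handler the node's container runtime doesn't know about). At that point .status.phase is still Pending; ContainerCreating is the container's waiting reason as shown in the STATUS column. The right diagnostic differs: for Pending, read the kubectl describe pod events for the scheduler's reason; for ContainerCreating, read the kubelet events in the same describe output and the per-container State blocks.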
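The distinction above can be sketched as a small classifier over the JSON that kubectl get pod <name> -o json returns. This is a rough illustration, not a complete triage tool: the function name stuck_gate and its return labels are made up for this example, and the admission gate is absent because a pod rejected at admission has no pod object to inspect.

```python
IMAGE_PULL_REASONS = ("ErrImagePull", "ImagePullBackOff")


def stuck_gate(pod):
    """Guess which lifecycle gate a pod is stuck at, from its parsed JSON."""
    spec = pod.get("spec", {})
    status = pod.get("status", {})

    # No node assigned: the scheduler has not placed the pod yet.
    if not spec.get("nodeName"):
        return "scheduling"

    statuses = status.get("containerStatuses", [])
    for cs in statuses:
        state = cs.get("state", {})
        reason = state.get("waiting", {}).get("reason", "")
        if reason in IMAGE_PULL_REASONS:
            return "image pull"
        # Simplification: any other waiting reason (ContainerCreating,
        # CrashLoopBackOff, ...) or a terminated container counts as runtime.
        if "waiting" in state or "terminated" in state:
            return "runtime"

    # Scheduled, but the kubelet has not reported any container status yet.
    if not statuses:
        return "runtime"

    # Containers are running; check the pod-level Ready condition.
    for cond in status.get("conditions", []):
        if cond.get("type") == "Ready" and cond.get("status") == "True":
            return "none"
    return "readiness"
```

In practice you would feed it json.loads() of the kubectl output and still read the events, since the waiting reason alone does not say why a sandbox or volume failed.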

Why is my GPU pod stuck Pending with Insufficient nvidia.com/gpu?

The scheduler couldn't find a node with enough free nvidia.com/gpu capacity. Three layers to check: (1) the namespace's ResourceQuota: kubectl describe quota shows used vs hard caps. If requests.nvidia.com/gpu is already at the cap you get an admission rejection, not Pending; but the quota can be set higher than the physical capacity (8 in this lab vs 4 GPUs on the node), so a pod can pass the quota check and still fail to schedule. (2) the actual node capacity: kubectl get node -o jsonpath='{.items[0].status.allocatable.nvidia\.com/gpu}'; if other pods are holding all 4 GPUs, your pod is genuinely stuck. (3) preemption candidacy: your pod's PriorityClass determines whether the scheduler can evict lower-priority pods to make room; without a high enough priority, it just waits. The FailedScheduling event in the describe output names which constraint was violated.
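The quota-versus-capacity arithmetic can be made concrete with a minimal sketch. The function gpu_request_outcome and its parameters are hypothetical names for this example; it ignores preemption entirely and assumes you have already collected the quota numbers and the free GPU count per node by hand.

```python
def gpu_request_outcome(request, quota_hard, quota_used, free_per_node):
    """Predict what happens to a pod requesting `request` GPUs.

    quota_hard / quota_used: the namespace's requests.nvidia.com/gpu quota.
    free_per_node: unallocated nvidia.com/gpu on each node (list of ints).
    """
    # Gate 1: ResourceQuota is enforced at admission, before scheduling.
    if quota_used + request > quota_hard:
        return "rejected at admission"
    # Gate 2: the whole request has to fit on a single node.
    if any(free >= request for free in free_per_node):
        return "schedulable"
    return "Pending: Insufficient nvidia.com/gpu"
```

With the lab's numbers (quota of 8, one node with 4 GPUs all in use), gpu_request_outcome(1, 8, 4, [0]) lands in the Pending branch: the quota has headroom but the node does not.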

How do I diagnose ImagePullBackOff?

Run kubectl describe pod <name> and read the events. The kubelet emits Failed to pull image "X": ... followed by the registry's error, usually an HTTP status: 404 Not Found means the tag doesn't exist (typo or never pushed), 401 Unauthorized means private registry auth failed (missing imagePullSecrets), 429 Too Many Requests means the Docker Hub rate limit was hit, and connection refused / dial tcp means a network or DNS issue. For private registries, also run kubectl get pod <name> -o yaml | grep imagePullSecrets to confirm the secret reference is correct. ImagePullBackOff is the waiting state the kubelet reports between retries after a pull has failed (the failure itself shows as ErrImagePull). Fix the underlying issue and either the next backoff cycle picks it up or you kubectl delete pod to force a fresh attempt.
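The status-to-cause mapping above can be written down as a lookup, which is handy when scanning many events at once. This is only a sketch: the substrings come from the paragraph above, the helper name explain_pull_failure is invented for this example, and the exact wording of real kubelet events varies by container runtime and registry, so treat a miss as "read the full event".

```python
# (substring to look for, likely cause); matched case-insensitively.
PULL_FAILURE_HINTS = [
    ("not found", "image tag does not exist (typo or never pushed)"),
    ("unauthorized", "registry auth failed (check imagePullSecrets)"),
    ("too many requests", "registry rate limit hit"),
    ("connection refused", "network or DNS issue"),
    ("dial tcp", "network or DNS issue"),
]


def explain_pull_failure(message):
    """Map a 'Failed to pull image' event message to a likely cause."""
    lowered = message.lower()
    for needle, cause in PULL_FAILURE_HINTS:
        if needle in lowered:
            return cause
    return "unrecognized: read the full event text"
```

Matching on phrases rather than bare status numbers avoids false hits on digits that happen to appear in image digests or IP addresses.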

What's the difference between RuntimeClass nvidia not configured and missing nvidia.com/gpu request?

Two different failure shapes that both look like "no GPU access" but break at different stages. First: runtimeClassName: nvidia references a Kubernetes RuntimeClass, which in turn names a handler in the node's container runtime (for example containerd). If the runtime has no nvidia handler configured, sandbox creation fails with failed to create pod sandbox: ... no runtime for "nvidia" is configured, and the pod stays in ContainerCreating permanently. Second: the pod is missing resources.limits.nvidia.com/gpu: 1. Without that, the device plugin doesn't know to inject the /dev/nvidia* devices. The pod runs (the sandbox is created via the default runtime), but running nvidia-smi inside it fails with nvidia-smi: command not found. Same ultimate symptom (no GPU), opposite location in the chain. NCA-AIIO tests this distinction directly: the first shows up in the pod events, the second only in the runtime logs.
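The two failure shapes can be checked statically from a pod spec before it is ever applied. The sketch below is illustrative only: gpu_config_findings and node_handlers are hypothetical names, and it assumes the RuntimeClass name equals the runtime handler name (true for the common nvidia setup, but not required by Kubernetes).

```python
GPU_RESOURCE = "nvidia.com/gpu"
MISSING_HANDLER = "runtime handler missing: pod stays in ContainerCreating"
MISSING_LIMIT = "no nvidia.com/gpu limit: pod runs without GPU devices"


def gpu_config_findings(pod_spec, node_handlers):
    """Return the GPU misconfigurations visible in a pod spec.

    node_handlers: handler names configured in the node's container
    runtime (assumed to match RuntimeClass names in this sketch).
    """
    findings = []

    # Shape 1: runtimeClassName points at a handler the node lacks.
    runtime_class = pod_spec.get("runtimeClassName")
    if runtime_class and runtime_class not in node_handlers:
        findings.append(MISSING_HANDLER)

    # Shape 2: no container asks for a GPU, so no devices get injected.
    has_gpu_limit = any(
        int(c.get("resources", {}).get("limits", {}).get(GPU_RESOURCE, 0)) > 0
        for c in pod_spec.get("containers", [])
    )
    if not has_gpu_limit:
        findings.append(MISSING_LIMIT)

    return findings
```

An empty list means neither problem is visible in the spec; it does not prove the GPU Operator chain is healthy on the node.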