Question 1

What's the difference between Pending and ContainerCreating?

Accepted Answer

`Pending` means the *scheduler* hasn't placed the pod on a node: either no node satisfies the pod's requirements (nodeSelector / capacity / taints) or it's still being evaluated. A pod rejected at admission (quota/limits) is never created, so it doesn't show up as Pending at all; the rejection surfaces as an API error or as a `FailedCreate` event on the owning ReplicaSet or Job. `ContainerCreating` means the scheduler placed it (you'll see `.spec.nodeName` set), but the kubelet can't yet start the container, most often because of an image pull that hasn't completed, a volume attach or mount failure, or a runtime handler issue (`runtimeClassName` references something the node's container runtime doesn't know about). Note that `.status.phase` is still `Pending` in both cases; `ContainerCreating` is the container's waiting reason as shown by `kubectl get pods`. The right diagnostic differs: for Pending, read `kubectl describe pod` events for the scheduler's reason; for ContainerCreating, read the kubelet events on the same describe output and the per-container `State` blocks.
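The distinction can be sketched as a small check over the JSON that `kubectl get pod <name> -o json` returns. This is a minimal illustration, not a complete triage tool: the function name `triage_stage`, its return labels, and the sample pods are all hypothetical, and only the fields `spec.nodeName`, `status.phase`, and `status.containerStatuses[].state.waiting.reason` are read.

```python
def triage_stage(pod: dict) -> str:
    """Return the stage at which a not-yet-running pod is stuck.

    `pod` is assumed to be the parsed output of
    `kubectl get pod <name> -o json`.
    """
    spec = pod.get("spec", {})
    status = pod.get("status", {})

    if status.get("phase") != "Pending":
        return "not-pending"

    # No nodeName: the scheduler has not bound the pod to a node yet.
    if not spec.get("nodeName"):
        return "unscheduled"

    # Scheduled: collect the waiting reason reported for each container.
    reasons = [
        cs.get("state", {}).get("waiting", {}).get("reason", "")
        for cs in status.get("containerStatuses", [])
    ]
    if any(r in ("ErrImagePull", "ImagePullBackOff") for r in reasons):
        return "image-pull"
    return "container-creating"


# Hypothetical sample pods, trimmed to the fields the function reads.
unscheduled = {"spec": {}, "status": {"phase": "Pending"}}
creating = {
    "spec": {"nodeName": "gpu-node-1"},
    "status": {
        "phase": "Pending",
        "containerStatuses": [
            {"state": {"waiting": {"reason": "ContainerCreating"}}}
        ],
    },
}

print(triage_stage(unscheduled))
print(triage_stage(creating))
```

An `unscheduled` result points at the scheduler's `FailedScheduling` events; the other two results point at the kubelet events on the node named in `.spec.nodeName`.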

Question 2

Why is my GPU pod stuck Pending with `Insufficient nvidia.com/gpu`?

Accepted Answer

The scheduler couldn't find a node with enough free `nvidia.com/gpu` capacity. Three layers to check: (1) the namespace's ResourceQuota: `kubectl describe quota` shows used vs hard caps. If `requests.nvidia.com/gpu` is at the cap, you'll get an admission rejection, not Pending; but the cap can be set higher than what the nodes actually have (8 in this lab vs 4 node capacity), so passing quota doesn't mean a GPU is free. (2) the actual node capacity: `kubectl get node -o jsonpath='{.items[0].status.allocatable.nvidia\.com/gpu}'` shows what the first node can allocate, and the `Allocated resources` section of `kubectl describe node` shows how much is already claimed; if other pods are holding all 4 GPUs, your pod is genuinely stuck. (3) preemption candidacy: your pod's PriorityClass determines whether the scheduler can evict lower-priority pods to make room; without a high enough priority, it just waits. The describe output's `FailedScheduling` event will name which constraint was violated.
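The quota-versus-capacity gap in layers (1) and (2) comes down to simple arithmetic, sketched below over parsed `kubectl ... -o json` objects. The helper names and the sample node, pods, and quota are hypothetical; the numbers mirror the lab's figures (quota of 8, node capacity of 4). The sketch assumes GPU counts appear under `resources.limits` and ignores init containers.

```python
GPU = "nvidia.com/gpu"


def gpus_requested(pod: dict) -> int:
    """Sum the nvidia.com/gpu limits across a pod's containers."""
    total = 0
    for container in pod.get("spec", {}).get("containers", []):
        limits = container.get("resources", {}).get("limits", {})
        total += int(limits.get(GPU, 0))
    return total


def free_gpus(node: dict, pods_on_node: list) -> int:
    """Allocatable GPUs minus those held by pods that have not finished."""
    allocatable = int(node["status"]["allocatable"].get(GPU, 0))
    in_use = sum(
        gpus_requested(p)
        for p in pods_on_node
        if p.get("status", {}).get("phase") not in ("Succeeded", "Failed")
    )
    return allocatable - in_use


def quota_headroom(quota: dict) -> int:
    """Hard cap minus current usage for requests.nvidia.com/gpu."""
    key = "requests." + GPU
    hard = int(quota["status"]["hard"].get(key, 0))
    used = int(quota["status"]["used"].get(key, 0))
    return hard - used


# Hypothetical lab state: one 4-GPU node, two running pods holding 2 GPUs each.
node = {"status": {"allocatable": {GPU: "4"}}}
holder = {
    "spec": {"containers": [{"resources": {"limits": {GPU: "2"}}}]},
    "status": {"phase": "Running"},
}
quota = {"status": {"hard": {"requests." + GPU: "8"},
                    "used": {"requests." + GPU: "4"}}}

print("quota headroom:", quota_headroom(quota))
print("free on node:", free_gpus(node, [holder, holder]))
```

With these sample values the quota still has headroom while the node has none free, which is the combination that produces Pending rather than an admission rejection.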

Question 3

How do I diagnose ImagePullBackOff?

Accepted Answer

Run `kubectl describe pod <name>` and read the events. The kubelet emits `Failed to pull image "X": ...` followed by a specific error, usually an HTTP status: `404 Not Found` means the tag doesn't exist (typo or never pushed), `401 Unauthorized` means private registry auth failed (missing or wrong `imagePullSecrets`), `429 Too Many Requests` means the Docker Hub rate limit was hit, and `connection refused` / `dial tcp` means a network / DNS issue. For private registries, also run `kubectl get pod -o yaml | grep imagePullSecrets` to confirm the secret reference is correct. ImagePullBackOff is the kubelet's *backoff* state after failed pull attempts (the first failure shows as `ErrImagePull`). Fix the underlying issue and either the next backoff cycle picks it up or you `kubectl delete pod` to force a fresh attempt.
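The error-to-cause mapping above can be written as a small lookup over the event message text. This is a heuristic sketch only: the function name, the cause strings, and the sample messages are hypothetical, real kubelet messages vary by container runtime and registry, and the `no such host` pattern is an added assumption about how DNS failures are commonly worded.

```python
def classify_pull_error(message: str) -> str:
    """Map a 'Failed to pull image' event message to a likely cause."""
    text = message.lower()
    # Network patterns are checked first so that port numbers or addresses
    # in a dial error are not mistaken for HTTP status codes.
    rules = [
        (("connection refused", "dial tcp", "no such host"),
         "network or DNS issue"),
        (("404", "not found"),
         "tag or repository does not exist"),
        (("401", "unauthorized"),
         "registry auth failed; check imagePullSecrets"),
        (("429", "too many requests"),
         "registry rate limit hit"),
    ]
    for needles, cause in rules:
        if any(needle in text for needle in needles):
            return cause
    return "unrecognized; read the full event message"


# Hypothetical event messages for illustration.
samples = [
    'Failed to pull image "registry.example.com/app:v9": 404 Not Found',
    'Failed to pull image "registry.example.com/app:v1": 401 Unauthorized',
    'Failed to pull image "app:v1": dial tcp: lookup registry.example.com: no such host',
]
for sample in samples:
    print(classify_pull_error(sample))
```

Substring matching like this is only a first pass; the full event text remains the authoritative source when a message does not match cleanly.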

Question 4

What's the difference between RuntimeClass `nvidia` not configured and missing nvidia.com/gpu request?

Accepted Answer

Two different failure shapes that both look like "no GPU access" but break at different stages. `runtimeClassName: nvidia` references a Kubernetes RuntimeClass, which maps to a handler in the node's container runtime; if that runtime's config doesn't have an `nvidia` handler entry, sandbox creation fails on the node: `failed to create pod sandbox: ... no runtime for "nvidia" is configured`. The pod stays in `ContainerCreating` until the handler is configured. Different problem: the pod is missing `resources.limits.nvidia.com/gpu: 1`. Without that, the device plugin is never asked to allocate a GPU, so no `/dev/nvidia*` devices are injected. The pod runs (the sandbox is created via the default runtime), but `nvidia-smi` fails with `command not found`. Same ultimate symptom (no GPU), opposite location in the chain. NCA-AIIO tests this distinction directly: the first shows up in the pod events, the second only in the container's own output and logs.
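Both failure shapes can be spotted from the pod spec before anything is deployed. The sketch below is illustrative only: `gpu_spec_findings` and the `configured_handlers` set are hypothetical (in practice the handler names would come from the node's container runtime configuration), and it assumes the RuntimeClass name equals its handler name, which need not hold in every cluster.

```python
def gpu_spec_findings(pod: dict, configured_handlers: set) -> list:
    """List the GPU-access problems visible in a pod spec."""
    spec = pod.get("spec", {})
    findings = []

    # Shape 1: runtimeClassName points at a handler the node lacks.
    runtime_class = spec.get("runtimeClassName")
    if runtime_class and runtime_class not in configured_handlers:
        findings.append(
            f'handler "{runtime_class}" not configured: '
            "expect ContainerCreating with a sandbox error"
        )

    # Shape 2: no container asks for nvidia.com/gpu in its limits.
    has_gpu_limit = any(
        "nvidia.com/gpu" in c.get("resources", {}).get("limits", {})
        for c in spec.get("containers", [])
    )
    if not has_gpu_limit:
        findings.append(
            "no nvidia.com/gpu limit: pod may run but without GPU devices"
        )
    return findings


# Hypothetical pod specs, trimmed to the fields the function reads.
bad_runtime = {"spec": {
    "runtimeClassName": "nvidia",
    "containers": [{"resources": {"limits": {"nvidia.com/gpu": "1"}}}],
}}
no_limit = {"spec": {"containers": [{"resources": {}}]}}

print(gpu_spec_findings(bad_runtime, configured_handlers={"runc"}))
print(gpu_spec_findings(no_limit, configured_handlers={"runc", "nvidia"}))
```

A pod can trigger both findings at once; fixing the runtime handler first makes the second problem observable, since the container has to start before `nvidia-smi` can fail.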

Stuck-Pending Triage Day — Diagnose Any GPU Pod That Won't Run
