Inside the NVIDIA GPU Operator — From Helm to Workload-Ready

Walk the chain that turns a Helm install into the `nvidia.com/gpu` your workload requests. Inspect the cluster the way platform engineers do — node labels, capacity, RuntimeClass — and learn to attribute every piece of evidence to the GPU Operator component that produced it. Finishes with a Triage Day where three broken GPU pods each break a different chain link.

35 min · 4 steps · 2 domains · Intermediate · nca-aiio · ncp-aii · ncp-aio

What you'll learn

  1. Anatomy of the GPU Operator chain
     The NVIDIA GPU Operator is a single Helm install that deploys a half-dozen cooperating components. Once installed, you don't *interact* with it directly; you interact with the *evidence* it leaves on your cluster: node labels, Capacity values, a RuntimeClass, Prometheus metrics endpoints. As a platform engineer, your job is to read that evidence and trace each piece back to the component that produced it. That's the muscle memory this lab builds.
  2. gpu-feature-discovery: labels for GPU-aware scheduling
     In step 1 you saw a stack of nvidia.com/* labels on the node. They're not decoration; they're a scheduling API. Every label is something a workload manifest can target via nodeSelector or nodeAffinity to say "I need *this kind* of GPU." Understanding them is the difference between "any GPU will do" workloads and production deployments where an fp16 inference job has to land on a specific generation of hardware.
  3. RuntimeClass and the runtime path
     You've now seen what gpu-feature-discovery and the device plugin contribute. There's a third critical piece of the chain: RuntimeClass. Without it, even a perfectly scheduled pod with nvidia.com/gpu: 1 runs without GPU access: the /dev/nvidia* device files aren't mounted, nvidia-smi isn't on PATH, and CUDA fails on the first call.
  4. Triage Day: three pods, three broken chain links
     You've now seen each link in the chain individually. Time to use it. Three GPU pods have just been deployed into your cluster, and each one is broken at a *different* link. Your job: read the cluster, attribute each symptom to a chain link, and fix all three.

Prerequisites

  • Completed `nca-aiio-resource-requests-limits` (or comfortable with kubectl + GPU pod manifests)
  • Familiar with `kubectl describe`, events, and Pending/ContainerCreating diagnosis

Exam domains covered

AI Infrastructure & Operations · GPU Acceleration & Distributed Training

Skills & technologies you'll practice

This intermediate-level AI/ML lab gives you real-world reps across:

GPU Operator · gpu-feature-discovery · device plugin · RuntimeClass · dcgm-exporter · NCA-AIIO · Triage · Kubernetes

What you'll learn

The NVIDIA GPU Operator is the single most important piece of the NCA-AIIO certification's infrastructure section, and the most common source of confusion when a GPU workload won't schedule. This lab teaches it from the angle that actually matters in production: not the Helm install, but the evidence the Operator leaves on your cluster — the node labels written by gpu-feature-discovery, the Capacity contributed by the device plugin, the RuntimeClass that selects the NVIDIA Container Runtime, the metrics emitted by dcgm-exporter. You learn to attribute every piece of that evidence to the component that produced it, then walk three broken GPU pods backward through the chain to find the source of the failure.

The lab finishes with a Triage Day where three GPU pods are deliberately broken — one in scheduling, one in runtime selection, one in the request shape — and you fix each one by reading the cluster's response. This is the daily work of an NVIDIA platform engineer and the exact pattern the NCA-AIIO exam tests under the AI Infrastructure & Operations domain.
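The lab page does not spell out what a broken "request shape" looks like. Assuming it refers to the standard Kubernetes rules for extended resources (GPU counts are whole numbers, a GPU request needs a matching limit, and if both are set they must be equal), a minimal check might look like the sketch below. The function name and message strings are hypothetical.

```python
GPU = "nvidia.com/gpu"


def gpu_request_problems(resources: dict) -> list:
    """Return human-readable problems with a container's GPU resources.

    `resources` is the `resources` mapping from a container spec.
    The rules checked here are an assumption about what "request shape"
    means; they follow the Kubernetes rules for extended resources.
    """
    limits = resources.get("limits", {})
    requests = resources.get("requests", {})
    problems = []

    # A GPU may not appear under requests unless it also appears under limits.
    if GPU in requests and GPU not in limits:
        problems.append("GPU listed under requests without a matching limit")

    # When both are present they must be equal.
    if GPU in requests and GPU in limits and str(requests[GPU]) != str(limits[GPU]):
        problems.append("GPU request and limit differ")

    # Extended resources are whole numbers: no fractions, no millis.
    for section_name, section in (("limits", limits), ("requests", requests)):
        value = section.get(GPU)
        if value is not None and not str(value).isdigit():
            problems.append(f"GPU count {value!r} under {section_name} is not a whole number")

    return problems


if __name__ == "__main__":
    print(gpu_request_problems({"limits": {GPU: 1}}))
    print(gpu_request_problems({"requests": {GPU: 1}}))
    print(gpu_request_problems({"limits": {GPU: "0.5"}}))
```

A well-formed spec yields an empty list; anything else points at the request itself rather than at scheduling or runtime selection.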

Frequently asked questions

What components does the NVIDIA GPU Operator deploy?

The GPU Operator is a Helm-installed bundle that deploys: the NVIDIA driver (unless it relies on a driver already installed on the host), the NVIDIA Container Toolkit (modifies OCI specs to inject GPU mounts), the device plugin (advertises nvidia.com/gpu capacity to the kubelet), gpu-feature-discovery (writes node labels such as nvidia.com/gpu.product and nvidia.com/gpu.memory), dcgm-exporter (Prometheus metrics for utilization, errors, and temperature), the MIG manager (when MIG is enabled), and a validator (post-install sanity check). The lab walks through what each one contributes to the cluster you can see with kubectl, so you can debug workloads by reading those contributions rather than by exec-ing into operator pods.
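The attribution habit described above can be sketched as a small lookup. This is only an illustration: the pairings come from the component list in this answer, while the function name, the `kind` strings, and the `DCGM_` metric-name prefix check are assumptions made for the example.

```python
# Labels this page names explicitly as written by gpu-feature-discovery.
# Other nvidia.com/* labels exist and may come from other components, so the
# sketch deliberately does not match on a prefix.
GFD_LABELS = {"nvidia.com/gpu.product", "nvidia.com/gpu.memory"}


def attribute(kind: str, name: str) -> str:
    """Map one piece of cluster evidence to the component that produced it.

    `kind` is a hypothetical tag: "label", "capacity", "metric", or
    "device-mount". Returns "unknown" for anything the table does not cover.
    """
    if kind == "label" and name in GFD_LABELS:
        return "gpu-feature-discovery"
    if kind == "capacity" and name == "nvidia.com/gpu":
        return "device plugin"
    if kind == "metric" and name.startswith("DCGM_"):
        return "dcgm-exporter"
    if kind == "device-mount" and name.startswith("/dev/nvidia"):
        return "NVIDIA Container Toolkit"
    return "unknown"


if __name__ == "__main__":
    evidence = [
        ("label", "nvidia.com/gpu.product"),
        ("capacity", "nvidia.com/gpu"),
        ("metric", "DCGM_FI_DEV_GPU_UTIL"),
        ("device-mount", "/dev/nvidia0"),
    ]
    for kind, name in evidence:
        print(f"{kind:13} {name:26} <- {attribute(kind, name)}")
```

When a piece of evidence is missing from the cluster, the same table read in reverse tells you which component to inspect first.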

What's a RuntimeClass and how does it relate to GPUs?

RuntimeClass is the Kubernetes API for selecting a non-default container runtime. Each RuntimeClass names a handler, and that handler must be defined in the node's container runtime configuration (containerd or CRI-O). The nvidia RuntimeClass typically maps to the NVIDIA Container Runtime, which intercepts container starts to inject the GPU device files (/dev/nvidia0, /dev/nvidiactl) and driver libraries such as libcuda. On a cluster where the NVIDIA runtime is not the default handler, a pod without runtimeClassName: nvidia runs under the default runtime (runc) and never gets GPU access, even if it requests nvidia.com/gpu: 1: nvidia-smi returns "command not found." This is one of the most common GPU-pod misconfigurations in production.
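As a rough illustration of where runtimeClassName sits in a manifest, the sketch below builds a pod spec as a Python dict and prints it as JSON, which kubectl accepts in place of YAML. The helper names, the pod name, and the image reference are placeholders invented for the example, and the RuntimeClass is assumed to be called nvidia as described above.

```python
import json


def gpu_pod(name: str, image: str, gpus: int = 1, runtime_class: str = "nvidia") -> dict:
    """Build a minimal GPU pod manifest (illustrative helper, not a real API)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Selects the NVIDIA runtime handler instead of the default one.
            "runtimeClassName": runtime_class,
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    "command": ["nvidia-smi"],
                    # GPUs are requested through limits on the extended resource.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


def missing_runtime_class(pod: dict, expected: str = "nvidia") -> bool:
    """True when the pod would fall back to a runtime other than `expected`."""
    return pod["spec"].get("runtimeClassName") != expected


if __name__ == "__main__":
    # The image below is a placeholder; substitute any CUDA-capable image.
    pod = gpu_pod("smi-check", "example.invalid/cuda-base:latest")
    print(json.dumps(pod, indent=2))
```

Piping the printed JSON into `kubectl apply -f -` would create the pod. Running `missing_runtime_class` over existing pod specs is one way to spot workloads that request a GPU but omit the runtime selection.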

Why does my GPU pod stay Pending with 'didn't match Pod's node affinity/selector'?

Because no node carries every label your pod's nodeSelector or node affinity requires. The most common variant: nodeSelector: nvidia.com/gpu.product: NVIDIA-A100 when your fleet only has Tesla-K80s. gpu-feature-discovery writes whatever the actual hardware reports; if your nodeSelector targets a different model, the scheduler can't find a candidate node. The fix is to either match the actual nvidia.com/gpu.product value or remove the selector entirely. Run kubectl get node -o jsonpath='{.items[*].metadata.labels}' | tr ',' '\n' | grep nvidia to see what's actually labeled.
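The matching rule behind that message is simple: a node is a candidate only if it carries every key in the nodeSelector with exactly the requested value. The sketch below models that rule on plain dicts to show why the A100 selector finds nothing on a K80 fleet. The node name and function names are made up for the example, and node affinity expressions are not modelled.

```python
def unmatched_selector_keys(node_selector: dict, node_labels: dict) -> dict:
    """Return selector keys the node fails, mapped to the node's actual value.

    A value of None means the node does not carry the label at all.
    """
    return {
        key: node_labels.get(key)
        for key, wanted in node_selector.items()
        if node_labels.get(key) != wanted
    }


def candidate_nodes(node_selector: dict, nodes: dict) -> list:
    """Names of nodes whose labels satisfy every nodeSelector entry."""
    return [
        name
        for name, labels in nodes.items()
        if not unmatched_selector_keys(node_selector, labels)
    ]


if __name__ == "__main__":
    # Hypothetical fleet: one node labelled by gpu-feature-discovery.
    nodes = {"gpu-node-1": {"nvidia.com/gpu.product": "Tesla-K80"}}
    selector = {"nvidia.com/gpu.product": "NVIDIA-A100"}

    print(candidate_nodes(selector, nodes))  # []
    print(unmatched_selector_keys(selector, nodes["gpu-node-1"]))
    # {'nvidia.com/gpu.product': 'Tesla-K80'}
```

The second call shows the value the node actually reports, which is the value the selector would need in order to match.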