Question 1

What components does the NVIDIA GPU Operator deploy?

Accepted Answer

The GPU Operator is a Helm-installed bundle that deploys: the NVIDIA driver (as a containerized driver, unless it is configured to use a driver already installed on the host), the NVIDIA Container Toolkit (modifies OCI specs to inject GPU mounts), the device plugin (advertises `nvidia.com/gpu` capacity to the kubelet), gpu-feature-discovery (writes `nvidia.com/gpu.product`, `nvidia.com/gpu.memory`, etc. node labels), dcgm-exporter (Prometheus metrics for utilization, errors, and temperature), the MIG manager (when MIG is enabled), and a validator (post-install sanity check). The lab walks through what each one *contributes* to the cluster you can see with kubectl, so you can debug workloads by reading those contributions rather than by exec'ing into operator pods.
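As a rough illustration of "reading those contributions", the sketch below pulls the GPU-related fields out of a Node object shaped like one item from `kubectl get nodes -o json`. The function name `gpu_contributions` and the sample node are hypothetical and exist only for this example; a real node carries many more labels.

```python
from typing import Any


def gpu_contributions(node: dict[str, Any]) -> dict[str, Any]:
    """Collect the GPU-related fields visible on a Node object."""
    metadata = node.get("metadata", {})
    labels = metadata.get("labels", {})
    allocatable = node.get("status", {}).get("allocatable", {})
    return {
        "name": metadata.get("name", ""),
        # nvidia.com/* labels, e.g. those written by gpu-feature-discovery.
        "labels": {k: v for k, v in sorted(labels.items()) if k.startswith("nvidia.com/")},
        # Advertised by the device plugin; the API reports quantities as strings.
        "gpus": int(allocatable.get("nvidia.com/gpu", "0")),
    }


# Hypothetical node, shaped like one item from `kubectl get nodes -o json`.
SAMPLE_NODE = {
    "metadata": {
        "name": "gpu-node-1",
        "labels": {
            "kubernetes.io/hostname": "gpu-node-1",
            "nvidia.com/gpu.product": "Tesla-K80",
            "nvidia.com/gpu.memory": "11441",
        },
    },
    "status": {"allocatable": {"cpu": "8", "nvidia.com/gpu": "2"}},
}

if __name__ == "__main__":
    print(gpu_contributions(SAMPLE_NODE))
```

A node with no `nvidia.com/` labels and zero `nvidia.com/gpu` allocatable suggests that gpu-feature-discovery or the device plugin has not run there yet, which is usually a better first clue than anything inside the operator pods.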

Question 2

What's a RuntimeClass and how does it relate to GPUs?

Accepted Answer

RuntimeClass is the Kubernetes API for selecting a non-default container runtime. Each RuntimeClass names a `handler`; the kubelet passes that handler to the CRI runtime (containerd or CRI-O), which resolves it against the runtime handlers defined in its own configuration. The `nvidia` RuntimeClass typically maps to the NVIDIA Container Runtime, which intercepts container starts to inject GPU device files (`/dev/nvidia0`, `/dev/nvidiactl`) and driver libraries such as libcuda. On a cluster where `nvidia` is not configured as the default runtime, a pod without `runtimeClassName: nvidia` runs under the default runtime (runc) even if it requests `nvidia.com/gpu: 1`, and it never gets GPU access: `nvidia-smi` returns "command not found." This is one of the most common GPU-pod misconfigurations in production.
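The misconfiguration described above can be caught before a pod is submitted. The sketch below builds a minimal Pod manifest as a Python dict and flags manifests that request a GPU without selecting the `nvidia` RuntimeClass. The helper names, pod names, and image reference are placeholders invented for this example, and the check assumes a cluster where `nvidia` is not the default runtime.

```python
from typing import Any

GPU_RESOURCE = "nvidia.com/gpu"


def requests_gpu(pod: dict[str, Any]) -> bool:
    """True if any container sets a nonzero nvidia.com/gpu limit."""
    for container in pod.get("spec", {}).get("containers", []):
        limits = container.get("resources", {}).get("limits", {})
        if int(limits.get(GPU_RESOURCE, 0)) > 0:
            return True
    return False


def missing_runtime_class(pod: dict[str, Any], expected: str = "nvidia") -> bool:
    """True if the pod asks for a GPU but does not select the expected RuntimeClass."""
    return requests_gpu(pod) and pod.get("spec", {}).get("runtimeClassName") != expected


def gpu_pod(name: str, image: str, with_runtime_class: bool) -> dict[str, Any]:
    """Build a minimal Pod manifest as a dict (name and image are placeholders)."""
    spec: dict[str, Any] = {
        "containers": [{
            "name": "main",
            "image": image,
            "command": ["nvidia-smi"],
            "resources": {"limits": {GPU_RESOURCE: 1}},
        }],
        "restartPolicy": "Never",
    }
    if with_runtime_class:
        spec["runtimeClassName"] = "nvidia"
    return {"apiVersion": "v1", "kind": "Pod", "metadata": {"name": name}, "spec": spec}


if __name__ == "__main__":
    for pod in (gpu_pod("smi-ok", "example.invalid/cuda:latest", True),
                gpu_pod("smi-bad", "example.invalid/cuda:latest", False)):
        print(pod["metadata"]["name"], "missing runtimeClassName:", missing_runtime_class(pod))
```

The check only inspects `resources.limits`, which is where extended resources such as `nvidia.com/gpu` are normally declared; it does not know how the cluster's container runtime is actually configured.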

Question 3

Why does my GPU pod stay Pending with 'didn't match Pod's node affinity/selector'?

Accepted Answer

Because some node label your pod requested doesn't exist on any node. The most common variant: `nodeSelector: nvidia.com/gpu.product: NVIDIA-A100` when your fleet only has Tesla-K80s. gpu-feature-discovery writes whatever the actual hardware reports; if your nodeSelector targets a different model, the scheduler can't find a candidate node. The fix is to either match the actual `nvidia.com/gpu.product` value or remove the selector entirely. Run `kubectl get node -o jsonpath='{.items[*].metadata.labels}' | tr ',' '\n' | grep nvidia` to see what's actually labeled.
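The matching rule behind this message is simple enough to sketch: every key/value pair in `nodeSelector` must appear verbatim in a node's labels. The code below imitates only that part of scheduling (not affinity expressions, taints, or resource fit), and the function names and the two-node fleet are hypothetical.

```python
def unmatched_selector(node_selector: dict[str, str], node_labels: dict[str, str]) -> dict[str, str]:
    """Return the selector entries this node fails to satisfy (empty means it matches)."""
    return {k: v for k, v in node_selector.items() if node_labels.get(k) != v}


def explain_pending(node_selector: dict[str, str], nodes: dict[str, dict[str, str]]) -> dict:
    """Map node name -> {label key: (wanted value, actual value)} for non-matching nodes."""
    report = {}
    for name, labels in nodes.items():
        missing = unmatched_selector(node_selector, labels)
        if missing:
            report[name] = {k: (v, labels.get(k)) for k, v in missing.items()}
    return report


# Hypothetical fleet: node name -> labels, as gpu-feature-discovery might leave them.
FLEET = {
    "gpu-node-1": {"nvidia.com/gpu.product": "Tesla-K80"},
    "cpu-node-1": {},
}

if __name__ == "__main__":
    selector = {"nvidia.com/gpu.product": "NVIDIA-A100"}
    for node, problems in explain_pending(selector, FLEET).items():
        print(node, problems)
```

If every node shows up in the report, the pod stays Pending; the "actual value" column tells you which label value to put in the selector instead.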

Inside the NVIDIA GPU Operator — From Helm to Workload-Ready
