NVIDIA GPU Operator on k3s: Single-Node Kubernetes for GPU Workloads
Bring up a lightweight single-node Kubernetes cluster with the NVIDIA GPU Operator — k3s install, containerd wiring, Helm values, workload manifests with RBAC and ResourceQuota, plus a full runbook (validation plan, troubleshooting matrix, day-2 ops).
What you'll learn
1. Install k3s + register the NVIDIA runtime
2. Install the NVIDIA GPU Operator via Helm
3. Workload manifests: Namespace, RuntimeClass, Pod, RBAC, ResourceQuota
4. The runbook: validation, troubleshooting, day-2 ops
Prerequisites
- Kubernetes basics — Pods, Deployments, Namespaces, RBAC
- Helm charts and values.yaml overrides
- Linux CLI, systemd, containerd fundamentals
What you'll build in this k3s + NVIDIA GPU Operator lab
Running GPUs on Kubernetes is where AI platform teams earn their keep — the NVIDIA GPU Operator is the glue, and understanding its chain (host driver → containerd → NVIDIA container toolkit → device plugin → NFD → dcgm-exporter → your pod) separates operators who can debug 0/1 nodes available: insufficient nvidia.com/gpu in five minutes from teams that lose an afternoon to it. This lab gives you a complete, production-shaped artifact set for a single-node k3s cluster: a k3s install script with the right INSTALL_K3S_EXEC flags, a containerd runtime snippet registering nvidia-container-runtime, a gpu-operator Helm values.yaml tuned for k3s, workload manifests with RuntimeClass and ResourceQuota, and a runbook with validation, troubleshooting, and day-2 ops. Roughly 40 minutes on a real NVIDIA GPU pod we provision — no local cluster, no driver install, no CNI wrangling.
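The install step described above can be sketched roughly like this — a hedged example, not the lab's graded script; the specific `INSTALL_K3S_EXEC` flags and version pin are illustrative assumptions:

```shell
# Sketch: install k3s for a single-node GPU lab.
# Flag choices are assumptions — trim components a GPU lab doesn't need.
curl -sfL https://get.k3s.io | \
  INSTALL_K3S_EXEC="--disable traefik --disable servicelb" \
  sh -

# Confirm the node registers and kubectl works via k3s's kubeconfig.
sudo k3s kubectl get nodes
```

On a real install you would also verify that k3s generated its own containerd config under `/var/lib/rancher/k3s/agent/etc/containerd/` before moving on to the operator.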
The k3s + GPU Operator pairing is a specific production choice and the lab treats it that way. You'll set driver.enabled: false because the host already has the driver (the right default on dev boxes and cloud GPU VMs), point toolkit.env at /var/lib/rancher/k3s/agent/etc/containerd — not /etc/containerd/ — because k3s ships its own containerd at a non-standard path, enable NFD + device plugin + dcgm-exporter, and define a RuntimeClass named nvidia so pods explicitly opt into the GPU runtime. The ResourceQuota on requests.nvidia.com/gpu is the multi-tenant guardrail that stops one namespace monopolising the card. The runbook walks the layered chain bottom-up — host nvidia-smi → containerd config on disk → operator pods healthy → NFD labels → device plugin advertising nvidia.com/gpu → workload runtimeClassName — because every mysterious Pending pod on a GPU cluster is a failure somewhere in that chain, and the fastest path to root cause is always bottom-up.
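A minimal `gpu-operator` values.yaml matching the choices above might look like the following sketch. The `CONTAINERD_CONFIG`/`CONTAINERD_SOCKET` env names follow the operator's documented pattern for k3s; treat the exact socket path as an assumption to verify on your host:

```yaml
# Sketch: gpu-operator Helm values for k3s with a preinstalled host driver.
driver:
  enabled: false          # trust the host NVIDIA driver
toolkit:
  enabled: true
  env:
    # k3s ships its own containerd at a non-standard path.
    - name: CONTAINERD_CONFIG
      value: /var/lib/rancher/k3s/agent/etc/containerd/config.toml
    - name: CONTAINERD_SOCKET
      value: /run/k3s/containerd/containerd.sock   # assumed k3s default
nfd:
  enabled: true
dcgmExporter:
  enabled: true
```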
Prereqs: Kubernetes fundamentals (Pods, Deployments, Namespaces, RBAC), Helm values overrides, and Linux basics (systemd, containerd, TOML config). Preinstalled on the lab pod: k3s, Helm 3, kubectl, the NVIDIA driver, containerd, PyYAML for the grader. The notebook is a writing exercise — grading parses your YAML and asserts structural correctness (multi-doc manifests, required fields, canonical MIG/containerd paths, runbook field schema), so you practice the artifacts exactly as you'd ship them to a real cluster.
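The bottom-up validation chain described above can be walked with a handful of commands — a sketch assuming the operator's conventional `gpu-operator` namespace; adjust names to your install:

```shell
# 1. Host driver loaded?
nvidia-smi

# 2. nvidia runtime registered in k3s's containerd config (not /etc/containerd)?
grep -n 'nvidia' /var/lib/rancher/k3s/agent/etc/containerd/config.toml

# 3. Operator pods healthy?
kubectl get pods -n gpu-operator

# 4. NFD labels on the node?
kubectl get node -o json | grep 'nvidia.com/gpu.present'

# 5. Device plugin advertising the resource?
kubectl get node -o jsonpath='{.items[0].status.allocatable.nvidia\.com/gpu}'
```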
Frequently asked questions
Why k3s instead of full upstream Kubernetes for a GPU lab?
k3s is a certified Kubernetes distribution in a single binary, so a single-node GPU cluster comes up (and tears down) in minutes — and because it ships its own containerd at a non-standard path, it forces you to practice the exact runtime wiring that trips teams up in production.
Why set driver.enabled: false in the GPU Operator values?
driver.enabled: false tells the operator to trust the host driver and only manage the pieces above it: the container toolkit, device plugin, NFD, and dcgm-exporter.
Why does the toolkit.env path need to point at /var/lib/rancher/k3s/agent/etc/containerd?
Because k3s reads its containerd config from /var/lib/rancher/k3s/agent/etc/containerd/config.toml, generated from config.toml.tmpl. The NVIDIA container toolkit's job is to patch the containerd config so the nvidia runtime is registered; if it patches the standard /etc/containerd/config.toml (which is what the default operator values target), k3s never reads those changes and the GPU resource is never advertised to the device plugin.
What does the RuntimeClass named nvidia actually do?
When a pod sets runtimeClassName: nvidia, the kubelet tells containerd to start the container using the nvidia runtime entry (the one you configured in Step 1), which wraps runc with the NVIDIA container toolkit and injects the GPU device nodes into the container. Without the RuntimeClass, pods get the default runtime and the GPU isn't visible, even if nvidia.com/gpu: 1 is requested correctly.
Why is 0/1 nodes available: insufficient nvidia.com/gpu almost never a workload bug?
Because that message only means the scheduler found no node advertising enough nvidia.com/gpu, and any break lower in the chain produces it. Debug bottom-up: host nvidia-smi → containerd config on disk → operator pods healthy → NFD labels on the node → device plugin advertising the resource. The workload's runtimeClassName is the last thing to check, not the first.
Why cap GPUs with a ResourceQuota if the node only has a fixed number anyway?
Because on a shared node, one namespace requesting nvidia.com/gpu: 4 can leave every other namespace waiting. With spec.hard.requests.nvidia.com/gpu: 2 you pin each namespace's budget, and over-quota requests are rejected at admission time rather than piling up Pending. It's cheap insurance against noisy neighbors even on a single-node dev cluster.
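The RuntimeClass, quota, and workload pieces discussed above can be sketched as one multi-doc manifest. This is an illustrative example, not the lab's graded answer — the namespace name, quota value, and CUDA image tag are assumptions:

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia            # must match the runtime entry in containerd's config
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: gpu-workloads # assumed namespace
spec:
  hard:
    requests.nvidia.com/gpu: "2"   # per-namespace GPU budget
---
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test
  namespace: gpu-workloads
spec:
  runtimeClassName: nvidia # explicit opt-in to the GPU runtime
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # assumed tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

If the pod runs `nvidia-smi` successfully, every layer of the chain — driver, containerd wiring, toolkit, device plugin, RuntimeClass — is working.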