NVIDIA GPU Operator on k3s: Single-Node Kubernetes for GPU Workloads

Bring up a lightweight single-node Kubernetes cluster with the NVIDIA GPU Operator — k3s install, containerd wiring, Helm values, workload manifests with RBAC and ResourceQuota, plus a full runbook (validation plan, troubleshooting matrix, day-2 ops).

40 min · 4 steps · 2 domains · Intermediate · ncp-aio · ncp-ain · ncp-aii

What you'll learn

  1. Install k3s + register the NVIDIA runtime
  2. Install the NVIDIA GPU Operator via Helm
  3. Workload manifests: Namespace, RuntimeClass, Pod, RBAC, ResourceQuota
  4. The runbook: validation, troubleshooting, day-2 ops

Prerequisites

  • Kubernetes basics — Pods, Deployments, Namespaces, RBAC
  • Helm charts and values.yaml overrides
  • Linux CLI, systemd, containerd fundamentals

Exam domains covered

AI Infrastructure & Operations · GPU Acceleration & Distributed Training

Skills & technologies you'll practice

This intermediate-level GPU lab gives you real-world reps across:

Kubernetes · k3s · GPU Operator · Helm · containerd · RuntimeClass · ResourceQuota · dcgm-exporter

What you'll build in this k3s + NVIDIA GPU Operator lab

Running GPUs on Kubernetes is where AI platform teams earn their keep — the NVIDIA GPU Operator is the glue, and understanding its chain (host driver → containerd → NVIDIA container toolkit → device plugin → NFD → dcgm-exporter → your pod) separates operators who can debug 0/1 nodes available: insufficient nvidia.com/gpu in five minutes from teams that lose an afternoon to it. This lab gives you a complete, production-shaped artifact set for a single-node k3s cluster: a k3s install script with the right INSTALL_K3S_EXEC flags, a containerd runtime snippet registering nvidia-container-runtime, a gpu-operator Helm values.yaml tuned for k3s, workload manifests with RuntimeClass and ResourceQuota, and a runbook with validation, troubleshooting, and day-2 ops. Roughly 40 minutes on a real NVIDIA GPU pod we provision — no local cluster, no driver install, no CNI wrangling.

The k3s + GPU Operator pairing is a specific production choice and the lab treats it that way. You'll set driver.enabled: false because the host already has the driver (the right default on dev boxes and cloud GPU VMs), point toolkit.env at /var/lib/rancher/k3s/agent/etc/containerd — not /etc/containerd/ — because k3s ships its own containerd at a non-standard path, enable NFD + device plugin + dcgm-exporter, and define a RuntimeClass named nvidia so pods explicitly opt into the GPU runtime. The ResourceQuota on requests.nvidia.com/gpu is the multi-tenant guardrail that stops one namespace monopolizing the card. The runbook walks the layered chain bottom-up — host nvidia-smi → containerd config on disk → operator pods healthy → NFD labels → device plugin advertising nvidia.com/gpu → workload runtimeClassName — because every mysterious Pending pod on a GPU cluster is a failure somewhere in that chain, and the fastest path to root cause is always bottom-up.
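To make the containerd wiring concrete: the runtime entry that ends up in k3s's containerd config looks roughly like the sketch below. This is illustrative — the toolkit generates the real entry, and exact plugin paths vary by containerd version — but it shows the shape of what "registering nvidia-container-runtime" means on disk.

```toml
# Sketch of the nvidia runtime entry inside
# /var/lib/rancher/k3s/agent/etc/containerd/config.toml
# (generated from config.toml.tmpl; illustrative, not verbatim)

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  # The shim that wraps runc with the NVIDIA container toolkit
  BinaryName = "/usr/bin/nvidia-container-runtime"
```

This is the entry the RuntimeClass named nvidia ultimately selects at pod start.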

Prereqs: Kubernetes fundamentals (Pods, Deployments, Namespaces, RBAC), Helm values overrides, and Linux basics (systemd, containerd, TOML config). Preinstalled on the lab pod: k3s, Helm 3, kubectl, the NVIDIA driver, containerd, PyYAML for the grader. The notebook is a writing exercise — grading parses your YAML and asserts structural correctness (multi-doc manifests, required fields, canonical MIG/containerd paths, runbook field schema), so you practice the artifacts exactly as you'd ship them to a real cluster.
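As a flavor of what structural grading can look like, here is a toy check — my own sketch, not the lab's actual grader — that parses a multi-document manifest with PyYAML and asserts on the kinds and fields it expects:

```python
import yaml

# Hypothetical multi-doc manifest, in the spirit of what the lab asks you to write.
manifest = """\
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-workloads
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
"""

# Parse every document in the stream, then assert structural expectations.
docs = list(yaml.safe_load_all(manifest))
kinds = [d["kind"] for d in docs]
assert kinds == ["Namespace", "RuntimeClass"], kinds
assert docs[1]["handler"] == "nvidia"  # pods opt into the GPU runtime via this name
print("manifest structure OK")
```

The real grader's schema is richer (canonical paths, required fields, runbook layout), but the mechanism is the same: load the YAML, walk the structure, assert.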

Frequently asked questions

Why k3s instead of full upstream Kubernetes for a GPU lab?

Because k3s is the shortest path to a working single-node cluster on a dev box — one binary, one curl-to-install, sensible defaults, and everything runs under one systemd unit. Upstream Kubernetes needs kubeadm, a separate CRI, a separate CNI, and a lot more that has to go right on the first try. For production multi-node GPU clusters you'd reach for RKE2, OpenShift, or managed EKS/AKS/GKE, but for learning the operator-and-runtime chain, k3s is the right default.

Why set driver.enabled: false in the GPU Operator values?

Because the lab assumes the NVIDIA driver is already installed on the host (which is how most dev boxes and cloud GPU VMs ship). The GPU Operator can install and manage the driver itself in a container, but doing that requires unloading the running driver first and is the wrong default on a machine you use for other things. driver.enabled: false tells the operator to trust the host driver and only manage the pieces above it: toolkit, device plugin, NFD, dcgm-exporter.
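In values.yaml terms, that decision is a single switch (excerpt; the key name follows the gpu-operator chart):

```yaml
# gpu-operator values.yaml excerpt: trust the preinstalled host driver
driver:
  enabled: false
```

With this set, the operator still deploys the toolkit, device plugin, NFD, and dcgm-exporter — it just skips the driver DaemonSet.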

Why does the toolkit.env path need to point at /var/lib/rancher/k3s/agent/etc/containerd?

Because k3s ships its own containerd and puts its config at a non-standard path under /var/lib/rancher/k3s/agent/etc/containerd/config.toml, generated from config.toml.tmpl. The NVIDIA container toolkit's job is to patch the containerd config so the nvidia runtime is registered — if it patches the standard /etc/containerd/config.toml (which is what the default operator values target), k3s never reads those changes and the GPU resource is never advertised to the device plugin.
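A toolkit.env excerpt that points the toolkit at k3s's containerd looks roughly like this. The env names follow the gpu-operator chart's documented k3s/RKE2 pattern, but double-check them against your chart version:

```yaml
# gpu-operator values.yaml excerpt, tuned for k3s's non-standard containerd
toolkit:
  env:
    - name: CONTAINERD_CONFIG
      value: /var/lib/rancher/k3s/agent/etc/containerd/config.toml
    - name: CONTAINERD_SOCKET
      value: /run/k3s/containerd/containerd.sock
    - name: CONTAINERD_RUNTIME_CLASS
      value: nvidia
```

If these stay at the chart defaults, the toolkit patches /etc/containerd/config.toml — a file k3s never reads.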

What does the RuntimeClass named nvidia actually do?

It lets Kubernetes select a specific container runtime on a per-pod basis. When a pod sets runtimeClassName: nvidia, kubelet tells containerd to start the container using the nvidia runtime entry (the one you configured in Step 1), which wraps runc with the NVIDIA container toolkit and injects the GPU device nodes into the container. Without the RuntimeClass, pods get the default runtime and the GPU isn't visible — even if nvidia.com/gpu: 1 is requested correctly.
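Concretely, the pair looks like this (image tag and pod name are illustrative):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia   # must match the runtime name registered in containerd
---
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test
spec:
  runtimeClassName: nvidia   # explicit opt-in to the GPU runtime
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

If the pod runs and nvidia-smi prints the card, the whole chain — driver, containerd entry, toolkit, device plugin — is working.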

Why is 0/1 nodes available: insufficient nvidia.com/gpu almost never a workload bug?

Because that message means the device plugin never advertised the resource to the Kubernetes scheduler at all. The fix is in the lower layers: the toolkit DaemonSet didn't patch containerd, or the patch landed at the wrong path, or the driver isn't loaded, or NFD didn't label the node. The reflection step walks you bottom-up through the runbook: host nvidia-smi → containerd config on disk → operator pods healthy → NFD labels on the node → device plugin advertising the resource. The workload's runtimeClassName is the last thing to check, not the first.
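Expressed as runbook steps (a hypothetical schema — the lab defines its own field layout), the bottom-up chain might look like:

```yaml
# Hypothetical runbook excerpt: bottom-up checks for insufficient nvidia.com/gpu
steps:
  - layer: host-driver
    check: nvidia-smi
    expect: driver version and GPU listed
  - layer: containerd-config
    check: grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
    expect: nvidia runtime entry present
  - layer: operator-pods
    check: kubectl get pods -n gpu-operator
    expect: all pods Running or Completed
  - layer: nfd-labels
    check: kubectl get node --show-labels
    expect: nvidia.com/* feature labels on the node
  - layer: device-plugin
    check: kubectl describe node
    expect: nvidia.com/gpu allocatable greater than 0
  - layer: workload
    check: kubectl get pod cuda-smoke-test -o jsonpath='{.spec.runtimeClassName}'
    expect: nvidia
```

Stop at the first layer that fails; everything above it is guaranteed broken too.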

Why cap GPUs with a ResourceQuota if the node only has a fixed number anyway?

Because namespace-level ResourceQuotas prevent one tenant (a team, a project, a CI pipeline) from monopolizing the cluster's GPUs. Without a quota, a misbehaving workload that requests nvidia.com/gpu: 4 can leave every other namespace waiting. With spec.hard.requests.nvidia.com/gpu: 2 you pin each namespace's budget, and the quota admission controller will reject over-quota requests at admission time rather than letting them pile up Pending. It's cheap insurance against noisy neighbors even on a single-node dev cluster.
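A minimal example (namespace and object names are illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a        # illustrative tenant namespace
spec:
  hard:
    requests.nvidia.com/gpu: "2"   # admission rejects requests beyond 2 GPUs
```

Pods in team-a that would push the namespace's total GPU requests past 2 are rejected at creation time with a quota error, instead of sitting Pending.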