NVIDIA GPU Operator on k3s: Single-Node Kubernetes for GPU Workloads

Bring up a lightweight single-node Kubernetes cluster with the NVIDIA GPU Operator — k3s install, containerd wiring, Helm values, workload manifests with RBAC and ResourceQuota, plus a full runbook (validation plan, troubleshooting matrix, day-2 ops).

40 min · 4 steps · 2 domains · Intermediate · ncp-aio · ncp-ain · ncp-aii

What you'll learn

  1. Install k3s + register the NVIDIA runtime
  2. Install the NVIDIA GPU Operator via Helm
  3. Workload manifests: Namespace, RuntimeClass, Pod, RBAC, ResourceQuota
  4. The runbook: validation, troubleshooting, day-2 ops

Prerequisites

  • Kubernetes basics — Pods, Deployments, Namespaces, RBAC
  • Helm charts and values.yaml overrides
  • Linux CLI, systemd, containerd fundamentals

Exam domains covered

AI Infrastructure & Operations · GPU Acceleration & Distributed Training

Skills & technologies you'll practice

This intermediate-level GPU lab gives you real-world reps across:

Kubernetes · k3s · GPU Operator · Helm · containerd · RuntimeClass · ResourceQuota · dcgm-exporter

What you'll build in this k3s + NVIDIA GPU Operator lab

Running GPUs on Kubernetes is where AI platform teams earn their keep — the NVIDIA GPU Operator is the glue, and understanding its chain (host driver → containerd → NVIDIA container toolkit → device plugin → NFD → dcgm-exporter → your pod) separates operators who can debug 0/1 nodes available: insufficient nvidia.com/gpu in five minutes from teams that lose an afternoon to it. This lab gives you a complete, production-shaped artifact set for a single-node k3s cluster: a k3s install script with the right INSTALL_K3S_EXEC flags, a containerd runtime snippet registering nvidia-container-runtime, a gpu-operator Helm values.yaml tuned for k3s, workload manifests with RuntimeClass and ResourceQuota, and a runbook with validation, troubleshooting, and day-2 ops. Roughly 40 minutes on a real NVIDIA GPU pod we provision — no local cluster, no driver install, no CNI wrangling.

The k3s + GPU Operator pairing is a specific production choice and the lab treats it that way. You'll set driver.enabled: false because the host already has the driver (the right default on dev boxes and cloud GPU VMs), point toolkit.env at /var/lib/rancher/k3s/agent/etc/containerd — not /etc/containerd/ — because k3s ships its own containerd at a non-standard path, enable NFD + device plugin + dcgm-exporter, and define a RuntimeClass named nvidia so pods explicitly opt into the GPU runtime. The ResourceQuota on requests.nvidia.com/gpu is the multi-tenant guardrail that stops one namespace monopolizing the card. The runbook walks the layered chain bottom-up — host nvidia-smi → containerd config on disk → operator pods healthy → NFD labels → device plugin advertising nvidia.com/gpu → workload runtimeClassName — because every mysterious Pending pod on a GPU cluster is a failure somewhere in that chain, and the fastest path to root cause is always bottom-up.
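To make the containerd wiring concrete: the runtime entry that ends up in k3s's containerd config looks roughly like the sketch below. This is illustrative — the toolkit generates the real entry, and exact plugin paths vary by containerd version — but it shows the shape of what "registering nvidia-container-runtime" means on disk.

```toml
# Sketch of the nvidia runtime entry inside
# /var/lib/rancher/k3s/agent/etc/containerd/config.toml
# (generated from config.toml.tmpl; illustrative, not verbatim)

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  # The shim that wraps runc with the NVIDIA container toolkit
  BinaryName = "/usr/bin/nvidia-container-runtime"
```

This is the entry the RuntimeClass named nvidia ultimately selects at pod start.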

Prereqs: Kubernetes fundamentals (Pods, Deployments, Namespaces, RBAC), Helm values overrides, and Linux basics (systemd, containerd, TOML config). Preinstalled on the lab pod: k3s, Helm 3, kubectl, the NVIDIA driver, containerd, PyYAML for the grader. The notebook is a writing exercise — grading parses your YAML and asserts structural correctness (multi-doc manifests, required fields, canonical MIG/containerd paths, runbook field schema), so you practice the artifacts exactly as you'd ship them to a real cluster.
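As a flavor of what structural grading can look like, here is a toy check — my own sketch, not the lab's actual grader — that parses a multi-document manifest with PyYAML and asserts on the kinds and fields it expects:

```python
import yaml

# Hypothetical multi-doc manifest, in the spirit of what the lab asks you to write.
manifest = """\
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-workloads
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
"""

# Parse every document in the stream, then assert structural expectations.
docs = list(yaml.safe_load_all(manifest))
kinds = [d["kind"] for d in docs]
assert kinds == ["Namespace", "RuntimeClass"], kinds
assert docs[1]["handler"] == "nvidia"  # pods opt into the GPU runtime via this name
print("manifest structure OK")
```

The real grader's schema is richer (canonical paths, required fields, runbook layout), but the mechanism is the same: load the YAML, walk the structure, assert.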

Frequently asked questions

Why k3s instead of full upstream Kubernetes for a GPU lab?

Because k3s is the shortest path to a working single-node cluster on a dev box — one binary, one curl-to-install, sensible defaults, and everything runs under one systemd unit. Upstream Kubernetes needs kubeadm, a separate CRI, a separate CNI, and a lot more that has to go right on the first try. For production multi-node GPU clusters you'd reach for RKE2, OpenShift, or managed EKS/AKS/GKE, but for learning the operator-and-runtime chain, k3s is the right default.

Why set driver.enabled: false in the GPU Operator values?

Because the lab assumes the NVIDIA driver is already installed on the host (which is how most dev boxes and cloud GPU VMs ship). The GPU Operator can install and manage the driver itself in a container, but doing that requires unloading the running driver first and is the wrong default on a machine you use for other things. driver.enabled: false tells the operator to trust the host driver and only manage the pieces above it: toolkit, device plugin, NFD, dcgm-exporter.
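In values.yaml terms, that decision is a single switch (excerpt; the key name follows the gpu-operator chart):

```yaml
# gpu-operator values.yaml excerpt: trust the preinstalled host driver
driver:
  enabled: false
```

With this set, the operator still deploys the toolkit, device plugin, NFD, and dcgm-exporter — it just skips the driver DaemonSet.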

Why does the toolkit.env path need to point at /var/lib/rancher/k3s/agent/etc/containerd?

Because k3s ships its own containerd and puts its config at a non-standard path under /var/lib/rancher/k3s/agent/etc/containerd/config.toml, generated from config.toml.tmpl. The NVIDIA container toolkit's job is to patch the containerd config so the nvidia runtime is registered — if it patches the standard /etc/containerd/config.toml (which is what the default operator values target), k3s never reads those changes and the GPU resource is never advertised to the device plugin.
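A toolkit.env excerpt that points the toolkit at k3s's containerd looks roughly like this. The env names follow the gpu-operator chart's documented k3s/RKE2 pattern, but double-check them against your chart version:

```yaml
# gpu-operator values.yaml excerpt, tuned for k3s's non-standard containerd
toolkit:
  env:
    - name: CONTAINERD_CONFIG
      value: /var/lib/rancher/k3s/agent/etc/containerd/config.toml
    - name: CONTAINERD_SOCKET
      value: /run/k3s/containerd/containerd.sock
    - name: CONTAINERD_RUNTIME_CLASS
      value: nvidia
```

If these stay at the chart defaults, the toolkit patches /etc/containerd/config.toml — a file k3s never reads.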

What does the RuntimeClass named nvidia actually do?

It lets Kubernetes select a specific container runtime on a per-pod basis. When a pod sets runtimeClassName: nvidia, kubelet tells containerd to start the container using the nvidia runtime entry (the one you configured in Step 1), which wraps runc with the NVIDIA container toolkit and injects the GPU device nodes into the container. Without the RuntimeClass, pods get the default runtime and the GPU isn't visible — even if nvidia.com/gpu: 1 is requested correctly.
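Concretely, the pair looks like this (image tag and pod name are illustrative):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia   # must match the runtime name registered in containerd
---
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test
spec:
  runtimeClassName: nvidia   # explicit opt-in to the GPU runtime
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

If the pod runs and nvidia-smi prints the card, the whole chain — driver, containerd entry, toolkit, device plugin — is working.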

Why is 0/1 nodes available: insufficient nvidia.com/gpu almost never a workload bug?

Because that message means the device plugin never advertised the resource to the Kubernetes scheduler at all. The fix is in the lower layers: the toolkit DaemonSet didn't patch containerd, or the patch landed at the wrong path, or the driver isn't loaded, or NFD didn't label the node. The reflection step walks you bottom-up through the runbook: host nvidia-smi → containerd config on disk → operator pods healthy → NFD labels on the node → device plugin advertising the resource. The workload's runtimeClassName is the last thing to check, not the first.
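Expressed as runbook steps (a hypothetical schema — the lab defines its own field layout), the bottom-up chain might look like:

```yaml
# Hypothetical runbook excerpt: bottom-up checks for insufficient nvidia.com/gpu
steps:
  - layer: host-driver
    check: nvidia-smi
    expect: driver version and GPU listed
  - layer: containerd-config
    check: grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
    expect: nvidia runtime entry present
  - layer: operator-pods
    check: kubectl get pods -n gpu-operator
    expect: all pods Running or Completed
  - layer: nfd-labels
    check: kubectl get node --show-labels
    expect: nvidia.com/* feature labels on the node
  - layer: device-plugin
    check: kubectl describe node
    expect: nvidia.com/gpu allocatable greater than 0
  - layer: workload
    check: kubectl get pod cuda-smoke-test -o jsonpath='{.spec.runtimeClassName}'
    expect: nvidia
```

Stop at the first layer that fails; everything above it is guaranteed broken too.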

Why cap GPUs with a ResourceQuota if the node only has a fixed number anyway?

Because namespace-level ResourceQuotas prevent one tenant (a team, a project, a CI pipeline) from monopolizing the cluster's GPUs. Without a quota, a misbehaving workload that requests nvidia.com/gpu: 4 can leave every other namespace waiting. With spec.hard.requests.nvidia.com/gpu: 2 you pin each namespace's budget, and the quota admission controller will reject over-quota requests at admission time rather than letting them pile up Pending. It's cheap insurance against noisy neighbors even on a single-node dev cluster.
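A minimal example (namespace and object names are illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a        # illustrative tenant namespace
spec:
  hard:
    requests.nvidia.com/gpu: "2"   # admission rejects requests beyond 2 GPUs
```

Pods in team-a that would push the namespace's total GPU requests past 2 are rejected at creation time with a quota error, instead of sitting Pending.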