Persistent Storage for AI Workloads — PVCs, StorageClass & the Checkpoint Pattern

Stop losing your training checkpoints when pods restart. Learn the PersistentVolumeClaim model end-to-end — StorageClass selection, accessModes (and the RWO-is-per-node trap), the bind/mount/persist lifecycle, and three triage scenarios where storage chains break.

35 min · 5 steps · 2 domains · Intermediate · nca-aiio · ncp-aii · ncp-aio

What you'll learn

  1.
    The ephemeral storage problem — why pods need PVCs
    A pod's filesystem is ephemeral by default. Anything written to /tmp, /data, /checkpoints, or any other path inside the container lives in the container runtime's *writable layer* — a copy-on-write overlay on top of the image. When the pod is deleted, the layer is destroyed. When the *container* restarts inside an existing pod (e.g., from a crash), the writable layer is also discarded; you come back to the image's original state.
  2.
    Author your first PVC and bind it to a pod
    Your task: create a PersistentVolumeClaim for a 1Gi training-checkpoints volume, plus a writer pod that mounts it at /data and drops a marker file. By the end you'll have a Bound PVC and a Running pod that's writing to durable storage.
  3.
    Persistence across pod restarts — the checkpoint pattern
    The whole point of using a PVC is that the storage outlives the pod. The writer pod from step 2 has been cleaned up (its marker.txt is the last thing it wrote to /data). The PVC checkpoint-pvc is still Bound — and the underlying disk still has marker.txt on it.
  4.
    AccessModes — RWO is per-node, RWOP is per-pod
    This is the step where most engineers' mental model of accessModes breaks. The lesson is short and counter-intuitive:
  5.
    Triage Day — three pods broken at three different storage layers
    The lab platform has just deployed three pods that should be Running but aren't. Each one is broken at a *different link* in the storage chain — Pod → PVC → StorageClass — so each needs a different diagnostic flow. By the end you'll have all three Running and the cluster will look healthy.
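
Seen from the training script's side, the checkpoint pattern in step 3 could look like the sketch below. It assumes the PVC is mounted at a directory the script can write to (/data in this lab), and that a checkpoint is small enough to serialize as JSON; the file naming scheme and the function names (save_checkpoint, load_latest_checkpoint, train) are hypothetical, not part of the lab. A real job would write its framework's own checkpoint format in place of the JSON payload.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(ckpt_dir: Path, step: int, state: dict) -> Path:
    """Write a checkpoint atomically: temp file, fsync, then rename."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    final_path = ckpt_dir / f"step-{step:08d}.json"
    # The temp file lives in the same directory so the rename never crosses filesystems.
    fd, tmp_name = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
        f.flush()
        os.fsync(f.fileno())
    # A pod killed before this line leaves only a stray .tmp file, never a half-written checkpoint.
    os.replace(tmp_name, final_path)
    return final_path


def load_latest_checkpoint(ckpt_dir: Path):
    """Return the newest complete checkpoint, or None on a fresh volume."""
    if not ckpt_dir.is_dir():
        return None
    # Zero-padded step numbers make lexicographic order match numeric order.
    candidates = sorted(ckpt_dir.glob("step-*.json"))
    if not candidates:
        return None
    return json.loads(candidates[-1].read_text())


def train(ckpt_dir: Path, total_steps: int, save_every: int = 2) -> int:
    """Toy loop: resume after the last saved step and return the step it started from."""
    latest = load_latest_checkpoint(ckpt_dir)
    start = latest["step"] + 1 if latest else 0
    for step in range(start, total_steps):
        # ... one real training step would go here ...
        if step % save_every == 0:
            save_checkpoint(ckpt_dir, step, {"loss": 1.0 / (step + 1)})
    return start
```

Inside the pod, the script would call something like train(Path("/data"), total_steps=1000). After the pod is deleted and a new pod mounts the same claim, the same call picks up from the last saved step instead of step 0.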

Prerequisites

  • Completed `nca-aiio-resource-requests-limits` (or comfortable with kubectl + pod manifests)
  • Familiar with `kubectl describe`, reading pod events, and the get/edit/delete/apply pattern

Exam domains covered

AI Infrastructure & Operations · Storage & Data Management

Skills & technologies you'll practice

This intermediate-level AI/ML lab gives you real-world reps across:

PersistentVolumeClaim · StorageClass · PersistentVolume · AccessModes · ReadWriteOnce · ReadWriteOncePod · Storage · NCA-AIIO · Kubernetes · Triage

What you'll learn

Persistent storage is what separates Kubernetes pods from disposable processes. Without it, your training checkpoints, your model weights, your dataset caches all evaporate at the next pod restart. This lab teaches the PersistentVolumeClaim model the way platform engineers actually use it: as a claim (your declarative request) that the cluster's StorageClass resolves into a PersistentVolume (the actual disk). You'll author claims, read the binding state machine, mount volumes into pods, and test the survival of data across pod deletions.
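
To make the claim-then-mount relationship concrete, here is a minimal sketch that builds a PVC and a writer pod as plain Python dictionaries and prints them as JSON, which kubectl accepts as well as YAML. The names checkpoint-pvc, /data, marker.txt, and the 1Gi size come from the lab steps; the pod name, the busybox image tag, and the helper function names are assumptions made for illustration. Leaving storageClassName unset asks the cluster for its default class.

```python
import json


def pvc_manifest(name, size, access_mode="ReadWriteOnce", storage_class=None):
    """Build a PersistentVolumeClaim; omit storageClassName to use the default class."""
    spec = {
        "accessModes": [access_mode],
        "resources": {"requests": {"storage": size}},
    }
    if storage_class is not None:
        spec["storageClassName"] = storage_class
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": spec,
    }


def writer_pod_manifest(name, claim_name, mount_path="/data"):
    """Build a pod that mounts the claim and drops a marker file on it."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "writer",
                "image": "busybox:1.36",  # assumed image; any shell-capable image works
                "command": ["sh", "-c",
                            f"date > {mount_path}/marker.txt && sleep 3600"],
                # volumeMounts.name must match an entry in spec.volumes below.
                "volumeMounts": [{"name": "checkpoints", "mountPath": mount_path}],
            }],
            "volumes": [{
                "name": "checkpoints",
                # claimName is the only link between the pod and the PVC.
                "persistentVolumeClaim": {"claimName": claim_name},
            }],
        },
    }


def as_list(*items):
    """Wrap several objects in a v1 List so they can be applied in one call."""
    return {"apiVersion": "v1", "kind": "List", "items": list(items)}


if __name__ == "__main__":
    pvc = pvc_manifest("checkpoint-pvc", "1Gi")
    pod = writer_pod_manifest("writer", "checkpoint-pvc")  # "writer" is a hypothetical name
    print(json.dumps(as_list(pvc, pod), indent=2))
```

Saved under a hypothetical name such as make_manifests.py, the output can be piped straight into the cluster with python make_manifests.py | kubectl apply -f -. Generating manifests in code is a design choice for this sketch only; hand-written YAML with the same fields is equivalent.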

The lab also covers what trips engineers up most: accessModes. Kubernetes defines four of them (ReadWriteOnce, ReadOnlyMany, ReadWriteMany, ReadWriteOncePod); this lab focuses on ReadWriteOnce, ReadWriteOncePod, and ReadWriteMany, and on the surprise that ReadWriteOnce is per node, not per pod. Two pods on the same node CAN share an RWO volume, so the assumption that "RWO blocks scale-out" only holds once the pods land on different nodes. You finish with a Triage Day where three pods break at three different links in the storage chain (PVC reference / accessMode collision / missing StorageClass), and you fix each one by reading the cluster's response.

Frequently asked questions

What's the difference between PV, PVC, and StorageClass?

A PersistentVolume (PV) is a piece of storage in the cluster — a disk, a network mount, a cloud volume — represented as an API object. A PersistentVolumeClaim (PVC) is a request for storage written by a pod author: "I need 5Gi, ReadWriteOnce, on this StorageClass." A StorageClass describes how to dynamically provision PVs that satisfy claims — its provisioner field names the CSI driver, and parameters are passed through to that driver. The flow: pod author writes PVC → cluster matches a PV (or asks the StorageClass's provisioner to create one) → PVC binds to PV → pod's volumes.persistentVolumeClaim.claimName mounts the PV inside the pod's filesystem. Most production clusters use dynamic provisioning (PVCs trigger PV creation); static PVs are rare and usually for legacy / shared infrastructure.

What does ReadWriteOnce actually mean?

ReadWriteOnce (RWO) means the volume can be mounted read-write by one node at a time, not by one pod. This is the field that bites engineers most often. On a single-node cluster (or when the scheduler co-locates pods), two pods can share an RWO volume just fine; the volume is attached to the node once and both pods write through the same filesystem. What RWO blocks is pods on different nodes attaching the same volume simultaneously. If you actually need single-pod-only access, use ReadWriteOncePod (introduced as alpha in Kubernetes 1.22, stable since 1.29, CSI volumes only); a second pod that references the same PVC stays Pending because the scheduler refuses to place it. For multi-node concurrent access, use ReadWriteMany (RWX), which requires a StorageClass backed by a shared filesystem (NFS, CephFS, EFS, FSx, GlusterFS). Storage classes backed by a single-attach disk (most cloud disks, local-path) generally cannot satisfy RWX.
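
The per-node versus per-pod distinction can be written down as a small decision function. This is a mental model only, not Kubernetes code: real enforcement is split between the scheduler, the attach/detach logic, and the CSI driver, and the Mount class and can_mount function are hypothetical names introduced for the sketch. ReadOnlyMany is left out because the discussion above is about read-write access.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Mount:
    """One pod using the volume, and the node that pod runs on."""
    pod: str
    node: str


def can_mount(access_mode: str, existing: Iterable[Mount], candidate: Mount) -> bool:
    """Simplified rule: may `candidate` use the volume read-write alongside `existing`?"""
    existing = list(existing)
    if access_mode == "ReadWriteMany":
        # Any number of pods on any number of nodes.
        return True
    if access_mode == "ReadWriteOnce":
        # One node at a time: more pods are fine as long as they share that node.
        return all(m.node == candidate.node for m in existing)
    if access_mode == "ReadWriteOncePod":
        # One pod in the whole cluster, regardless of node.
        return len(existing) == 0
    raise ValueError(f"unsupported access mode in this sketch: {access_mode}")
```

With one pod already running on node-a, can_mount("ReadWriteOnce", ...) allows a second pod on node-a and rejects one on node-b, while can_mount("ReadWriteOncePod", ...) rejects both.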

Why is my PVC stuck Pending?

Three common reasons, and how to tell which one you have. (1) The StorageClass uses volumeBindingMode: WaitForFirstConsumer, so the PVC won't bind until a pod references it. Look at the kubectl describe pvc events for "WaitForFirstConsumer"; that's normal and resolves when the pod is created. (2) The StorageClass's provisioner can't satisfy the claim: wrong accessMode (asking for RWX from a block-device class), an oversized request, or a CSI controller that is down. Look at the events for "ProvisioningFailed". (3) The named storageClassName doesn't exist in the cluster. Run kubectl get storageclass to confirm what's available; the default class is whichever carries the annotation storageclass.kubernetes.io/is-default-class: "true" (kubectl marks it "(default)" in the listing). In every case the PVC's events are the diagnostic: kubectl describe pvc <name> shows whether the cluster has tried to bind, why it failed, and whether it's still waiting.
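
The three-way triage above can be sketched as a function over the event lines copied from kubectl describe pvc. The substrings it matches are assumptions based on how these events commonly read (a missing class typically surfaces as a ProvisioningFailed event whose message says the storageclass was not found); exact wording varies by Kubernetes version and provisioner, so treat the patterns as a starting point. The function name and the category labels are hypothetical.

```python
def classify_pending_pvc(event_lines):
    """Map PVC event text to one of the three common Pending causes."""
    text = "\n".join(event_lines).lower()
    # Check the missing-class case first: it usually also carries ProvisioningFailed.
    if "storageclass" in text and "not found" in text:
        return "missing-storageclass"
    if "provisioningfailed" in text:
        return "provisioning-failed"
    if "waitforfirstconsumer" in text or "waiting for first consumer" in text:
        return "waiting-for-first-consumer"
    return "unknown"


# Suggested follow-up for each category, mirroring the prose above.
NEXT_STEP = {
    "missing-storageclass": "run kubectl get storageclass and fix storageClassName",
    "provisioning-failed": "check accessModes, requested size, and the CSI controller",
    "waiting-for-first-consumer": "normal; create the pod that references the PVC",
    "unknown": "read the full kubectl describe pvc output",
}
```

The order of the checks matters: testing for ProvisioningFailed first would hide the more specific missing-class case behind the generic one.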

What happens to my data when I delete the PVC?

The reclaim policy decides. A dynamically provisioned PV inherits it from the StorageClass's reclaimPolicy field, which defaults to Delete. Delete tears down the PV and the backing storage, so your training checkpoints are gone. Retain leaves the PV in the Released state with the data intact, but the PV can't be bound to a new PVC without manual intervention (you must kubectl edit it to clear the claimRef field). For production, the safe choice is to author StorageClasses with reclaimPolicy: Retain for any data you can't afford to lose, and to add a backup mechanism on top (snapshots, Velero, application-level checkpointing to object storage). For ephemeral caches and scratch space, Delete is correct: you want the disk cleaned up when the experiment ends.
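
As an illustration of the Retain setup, the sketch below builds a StorageClass object and the merge-patch body that clears claimRef on a Released PV. The class name and the provisioner string are placeholders (example.com/hypothetical-csi is not a real driver); substitute the provisioner your cluster actually runs. The function defaults mirror the Kubernetes defaults for the two fields, Delete and Immediate.

```python
import json


def storage_class_manifest(name, provisioner, reclaim_policy="Delete",
                           binding_mode="Immediate"):
    """Build a StorageClass; reclaimPolicy is copied onto each PV it provisions."""
    if reclaim_policy not in ("Delete", "Retain"):
        raise ValueError(f"unexpected reclaimPolicy: {reclaim_policy}")
    if binding_mode not in ("Immediate", "WaitForFirstConsumer"):
        raise ValueError(f"unexpected volumeBindingMode: {binding_mode}")
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": provisioner,
        "reclaimPolicy": reclaim_policy,
        "volumeBindingMode": binding_mode,
    }


def clear_claim_ref_patch():
    """Merge-patch body that removes spec.claimRef from a Released PV."""
    # In a JSON merge patch, null means "delete this field".
    return {"spec": {"claimRef": None}}


if __name__ == "__main__":
    sc = storage_class_manifest(
        "checkpoints-retain",             # hypothetical class name
        "example.com/hypothetical-csi",   # placeholder provisioner, not a real driver
        reclaim_policy="Retain",
        binding_mode="WaitForFirstConsumer",
    )
    print(json.dumps(sc, indent=2))
    print(json.dumps(clear_claim_ref_patch()))
```

The patch body is a non-interactive alternative to kubectl edit: it can be passed to kubectl patch pv <name> --type merge -p with the printed JSON as the argument. Either way, check what is on the volume before making it bindable again, since the next claim that binds to it will see the old data.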