GPU Container Lifecycle: Build, Test, Ship, Rollback

Walk through the full lifecycle of a production GPU container — multi-stage Dockerfile, self-hosted GPU CI, a fail-fast smoke test, and a Kubernetes Deployment with readiness probes gated on real GPU compute. The pipeline that stops bad images before users see a 500.

40 min · 4 steps · 2 domains · Intermediate · NCP-AIO · NCA-AIIO · NCA-GENL · NCP-AII

What you'll learn

  1. The production Dockerfile
  2. The CI workflow
  3. The GPU smoke test
  4. Kubernetes rollout + rollback

Prerequisites

  • Comfortable with Docker multi-stage builds and Dockerfile directives
  • Basic Kubernetes (Deployments, probes, nodeSelector)
  • Familiarity with CI/CD concepts (GitHub Actions or equivalent)

Exam domains covered

GPU Infrastructure & Operations · Model Deployment & Inference Optimization

Skills & technologies you'll practice

This intermediate-level GPU lab gives you real-world reps across:

Docker · Kubernetes · CI/CD · GitHub Actions · nvidia-smi · HEALTHCHECK · Rollout · Rollback

What you'll build in this GPU container lifecycle lab

Shipping a GPU container to production is where most AI teams first discover that docker build + docker push is not a deploy pipeline. This lab gives you the full pattern — multi-stage Dockerfile, self-hosted GPU CI, a GPU-aware smoke test, Kubernetes Deployment with probes gated on real compute, and a rollout/rollback playbook — so the next time a driver mismatch, a cuDNN drift, or a silent VRAM leak tries to reach users, it hits a wall instead. You'll leave with a runnable Dockerfile on nvidia/cuda:*-runtime-ubuntu22.04, a GitHub Actions workflow targeting a self-hosted, gpu runner, a three-check smoke test with distinct exit codes, a Deployment YAML with maxUnavailable: 0 / maxSurge: 1 and both probes, and a concrete mental model of why four layers of essentially-the-same-test is the feature, not the bug. ~40 minutes on a real NVIDIA GPU pod we hand you; no local Docker, no kubectl context juggling.

The technical backbone is defense in depth on the same question — can this container actually use a GPU right now? — asked from four places. CI runs the smoke test on a self-hosted GPU runner (GitHub-hosted runners have no GPUs, so docker run --gpus all silently no-ops and your pipeline lies to you). Dockerfile HEALTHCHECK invokes python3 /app/smoke_test.py — NVML init, torch.cuda.is_available(), tensor round-trip — not curl /healthz, because HTTP 200 proves Python is running, not that CUDA is. The readiness probe reuses the same script to gate traffic on a pod that booted but can't allocate VRAM. The liveness probe reruns it continuously to catch the slow-motion failures — Xid fatal errors, thermal throttling, memory fragmentation, cuDNN version drift, kernel hangs — that CI on a cold GPU in a short run will never see. maxUnavailable: 0 + maxSurge: 1 gives you zero-downtime rolling updates on GPU-constrained clusters without deadlocking.
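A minimal sketch of such a three-check smoke test — the exit-code mapping (1/2/3) and tensor size are illustrative assumptions, not the lab's canonical script:

```python
#!/usr/bin/env python3
"""GPU smoke test with a distinct exit code per failed check.

Illustrative sketch: the 1/2/3 exit-code scheme and the 64x64
tensor shape are assumptions, not the lab's exact script.
"""
import sys


def nvml_ok() -> bool:
    """Check 1: can we talk to the NVIDIA driver via NVML?"""
    try:
        import pynvml
        pynvml.nvmlInit()
        pynvml.nvmlShutdown()
        return True
    except Exception:
        return False


def cuda_ok() -> bool:
    """Check 2: does PyTorch see a usable CUDA device?"""
    try:
        import torch
        return torch.cuda.is_available()
    except Exception:
        return False


def roundtrip_ok() -> bool:
    """Check 3: allocate on the GPU and copy back intact."""
    try:
        import torch
        x = torch.randn(64, 64)
        y = x.cuda().cpu()  # host -> device -> host round-trip
        return bool(torch.equal(x, y))
    except Exception:
        return False


def main() -> int:
    if not nvml_ok():
        return 1  # driver/NVML layer broken
    if not cuda_ok():
        return 2  # CUDA runtime not usable from PyTorch
    if not roundtrip_ok():
        return 3  # device visible but compute/copy failed
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Because each failure mode gets its own exit code, a red health check tells you which layer broke without shelling into the container.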

Prereqs: Docker multi-stage builds, basic Kubernetes (Deployments, probes, nodeSelector, resources.limits), and CI/CD concepts (GitHub Actions or equivalent). Preinstalled on the lab pod: Docker, NVIDIA Container Toolkit, kubectl, PyTorch, and CUDA. Grading checks the artifacts the way a reviewer would: the Dockerfile must have ≥2 FROM stages, a non-root USER, and a GPU-aware HEALTHCHECK; the workflow must declare build + test jobs on a GPU runner with a push stage gated on main; the smoke test must exit 0 on healthy and non-zero when CUDA_VISIBLE_DEVICES=''; the Deployment must declare both probes, nvidia.com/gpu, and the rolling-update knobs. The reflection step asks you to instrument past the four layers — DCGM's DCGM_FI_DEV_XID_ERRORS and SM_ACTIVE fields, a model-shaped forward-pass probe — which is how you graduate from 'ships clean' to 'stays clean at 3am'.
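A Dockerfile skeleton that would satisfy those checks might look like this — the image tags, paths, and server command are placeholders, not the graded solution:

```dockerfile
# --- Stage 1: install Python deps in a fatter devel image ---
FROM nvidia/cuda:12.2.0-devel-ubuntu22.04 AS builder
RUN apt-get update && apt-get install -y --no-install-recommends python3-pip
COPY requirements.txt .
RUN pip3 install --no-cache-dir --prefix=/install -r requirements.txt

# --- Stage 2: slim runtime image ---
FROM nvidia/cuda:12.2.0-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y --no-install-recommends python3 \
    && rm -rf /var/lib/apt/lists/* \
    && useradd --create-home appuser
COPY --from=builder /install /usr/local
COPY smoke_test.py server.py /app/
USER appuser

# GPU-aware health check -- not curl /healthz
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD python3 /app/smoke_test.py

CMD ["python3", "/app/server.py"]
```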

Frequently asked questions

Why target runs-on: [self-hosted, gpu] instead of a GitHub-hosted runner?

GitHub-hosted runners don't have NVIDIA GPUs — your docker run --gpus all test step would either skip or silently pass. You either register a self-hosted runner on a GPU host, use a managed GPU CI provider (BuildJet, Actuated, Namespace), or rely on GitHub's large-runner GPU tier where available. The lab shows the self-hosted, gpu pattern because it's the most portable and because it teaches you to separate the 'where does my image build' question from the 'where does my image test' question — in production those often run on very different hardware.
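A sketch of that split in workflow form — job names, image tags, and registry details are illustrative, not the graded solution:

```yaml
# .github/workflows/gpu-ci.yml -- illustrative sketch
name: gpu-ci
on: [push]

jobs:
  build:
    runs-on: ubuntu-latest            # building the image needs no GPU
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t myapp:${{ github.sha }} .

  test:
    needs: build
    runs-on: [self-hosted, gpu]       # smoke test must see a real GPU
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t myapp:${{ github.sha }} .
      - run: >
          docker run --rm --gpus all
          myapp:${{ github.sha }} python3 /app/smoke_test.py

  push:
    needs: test
    if: github.ref == 'refs/heads/main'   # promotion gated on main
    runs-on: [self-hosted, gpu]
    steps:
      - run: docker push myapp:${{ github.sha }}
```

Note the build/test split: the build job runs on a cheap hosted runner, and only the GPU-dependent test job pays for self-hosted hardware.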

Should the HEALTHCHECK in the Dockerfile call curl http://localhost:8000/healthz?

No — that's the mistake this lab is specifically designed to correct. A successful HTTP 200 from your server process proves Python is running; it proves nothing about the GPU. Dockerfile HEALTHCHECK should invoke a GPU-aware probe — python3 /app/smoke_test.py that calls torch.cuda.is_available(), initializes NVML, and allocates a small tensor — so Docker marks the container unhealthy when the driver, the toolkit, or the card itself fail. Your Kubernetes readiness probe reuses the exact same script.

Why distinguish readiness from liveness when both call the same smoke test?

Readiness answers 'should this pod receive traffic right now?' and failure simply pulls the pod out of the Service endpoint list — no restart. Liveness answers 'is this pod permanently wedged?' and failure triggers a kubelet restart. Using the same script is fine; using the same thresholds is a bug. Readiness should fail fast (a few seconds) during startup while the model warms up; liveness should fail slow (tens of seconds, with a tolerant failureThreshold) so transient Xid errors or a briefly-stuck kernel don't thrash your pods into a CrashLoopBackOff.
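Same script, different knobs — one way to express that in the pod spec, with illustrative starting values:

```yaml
readinessProbe:               # fail fast: pull the pod out of rotation quickly
  exec:
    command: ["python3", "/app/smoke_test.py"]
  periodSeconds: 5
  failureThreshold: 1
livenessProbe:                # fail slow: only restart on sustained failure
  exec:
    command: ["python3", "/app/smoke_test.py"]
  initialDelaySeconds: 60     # let the model warm up before judging
  periodSeconds: 30
  failureThreshold: 5         # tolerate transient Xid errors / stuck kernels
```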

What can go wrong at runtime that CI and HEALTHCHECK won't catch?

Plenty. Xid fatal errors after hours of load. Memory fragmentation that only surfaces when a larger-than-usual batch arrives. Thermal throttling that turns into a kernel timeout. cuDNN version drift between warm-up and real traffic. Driver hangs after a specific sequence of CUDA API calls. Model shard OOMs under production-shaped inputs. CI runs briefly on a cold GPU; real traffic is long, hot, and adversarial. That's exactly why the livenessProbe reruns the smoke test continuously and why you'd graduate to a richer probe that exercises a real model forward pass.
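Graduating to a model-shaped probe can be as simple as timing one real forward pass — a sketch, where the model, input shape, and latency budget are assumptions standing in for your serving setup:

```python
"""Richer liveness probe: exercise a real forward pass on the GPU.

Sketch only -- the Linear model, batch shape, and 5s budget are
assumptions; in practice you load (or share) the serving model.
"""
import sys
import time


def forward_pass_ok(max_seconds: float = 5.0) -> bool:
    try:
        import torch
        # Stand-in for the real serving model.
        model = torch.nn.Linear(512, 512).cuda().eval()
        x = torch.randn(8, 512, device="cuda")  # production-shaped batch
        start = time.monotonic()
        with torch.no_grad():
            y = model(x)
        torch.cuda.synchronize()                # flush async kernels
        elapsed = time.monotonic() - start
        # Catch NaN poisoning and slow kernels, not just hard crashes.
        return elapsed < max_seconds and bool(torch.isfinite(y).all())
    except Exception:
        return False


if __name__ == "__main__":
    sys.exit(0 if forward_pass_ok() else 1)
```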

Why use maxUnavailable: 0, maxSurge: 1 for the rolling update?

GPU pods are expensive and cluster GPU capacity is usually the binding constraint. maxUnavailable: 0 guarantees the old pod keeps serving traffic until the new one passes readiness, so you never drop below full capacity. maxSurge: 1 says you're willing to temporarily run one extra replica during the rollout — critical because without surge, with maxUnavailable: 0, the rollout would deadlock waiting for a pod that can't start until another pod dies. The combination gives you zero-downtime deploys on constrained GPU nodes.
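In Deployment terms, those two knobs live under spec.strategy — replica count, labels, and the image tag below are placeholders:

```yaml
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never drop below full serving capacity
      maxSurge: 1         # allow one extra GPU pod during the rollout
  template:
    spec:
      containers:
        - name: inference
          image: myapp:abc123          # placeholder tag
          resources:
            limits:
              nvidia.com/gpu: 1
```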

How does grading work for the Kubernetes step?

The validator parses your deployment_yaml string and checks for the required fields: apiVersion, kind: Deployment, strategy.rollingUpdate with both maxUnavailable and maxSurge, nvidia.com/gpu in resource limits, and both readinessProbe and livenessProbe. It also inspects your rollback_commands list for kubectl rollout history and undo, and verifies your image_promotion_flow defines at least three stages (dev → staging → prod) each with env, tag_pattern, and gate keys. Nothing is applied to a live cluster — the lab grades the artifacts, not a running rollout.
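A minimal sketch of that style of static check — plain substring matching stands in for real YAML parsing here, and the field names are the ones the grading description lists, not the validator's actual implementation:

```python
"""Static artifact checks in the spirit of the lab's validator.

Sketch with assumed helper names; substring matching keeps the
example stdlib-only instead of depending on a YAML parser.
"""

REQUIRED_DEPLOYMENT_FIELDS = [
    "apiVersion",
    "kind: Deployment",
    "maxUnavailable",
    "maxSurge",
    "nvidia.com/gpu",
    "readinessProbe",
    "livenessProbe",
]


def missing_fields(deployment_yaml: str) -> list[str]:
    """Return the required fields absent from the manifest text."""
    return [f for f in REQUIRED_DEPLOYMENT_FIELDS if f not in deployment_yaml]


def rollback_commands_ok(commands: list[str]) -> bool:
    """Expect both a history inspection and an undo."""
    joined = " ".join(commands)
    return ("kubectl rollout history" in joined
            and "kubectl rollout undo" in joined)
```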