MLflow Experiment Tracking: From Single Run to Team Workflow
GPU sandbox · jupyter
Beta

Wire the four load-bearing pieces of MLflow into a real training loop — tracked runs with params and metrics, a registered model with stage transitions, a multi-run sweep + search, and a production spec (server, k8s Job, tags, autolog).

35 min · 4 steps · 2 domains · Intermediate · NCP-AIO · NCA-GENL · NCP-ADS · NCA-GENM · NCP-AII

What you'll learn

  1. Instrument a training run
  2. Log the model & register it
  3. Search & compare runs
  4. Production: server, Kubernetes, tags, autolog

Prerequisites

  • PyTorch training loop basics
  • Python context managers and pandas DataFrames
  • Concept of experiment tracking / reproducibility

Exam domains covered

AI Infrastructure & Operations · Machine Learning Lifecycle

Skills & technologies you'll practice

This intermediate-level GPU lab gives you real-world reps across:

MLflow · Experiment Tracking · Model Registry · PyTorch · autolog · search_runs · Kubernetes · MLOps

What you'll wire up in this MLflow lab

MLflow stops being a notebook toy and starts being infrastructure the moment a second engineer asks 'which run produced the model we shipped last quarter?' This lab wires the four load-bearing pieces — tracked runs, Model Registry, search_runs across experiments, and the production server spec (Postgres + S3 + Kubernetes + tagging convention + autolog config). You'll walk away with a real PyTorch run instrumented with log_params/log_metric, a registered model you promoted to Staging and reloaded via stable models:/<name>/<version> URI, a multi-run sweep queried with mlflow.search_runs into a pandas DataFrame, and a production-ready mlflow_server_cmd + Kubernetes Job YAML + tagging_convention + autologging_config that the team can actually deploy. About 35 minutes on a live NVIDIA GPU pod — MLflow, PyTorch, pandas, and a scratch tracking backend are preinstalled.

The substance lives in the Registry and tags. Run IDs are disposable — they change every retrain, so production serving code can't track them. The Registry gives you stable URIs like models:/chat-intent-classifier/Production that resolve to whichever version is currently promoted; when a new model wins the evaluation gate, transition_model_version_stage(..., stage='Production') is one call and serving jobs pick it up without a config change. Tags are the invisible load-bearing piece: without enforced team, owner, env, git_commit, and model_family tags, mlflow.search_runs(filter_string="tags.model_family = 'llm' and tags.team = '...'") degenerates into an ls over run IDs and nobody can find last quarter's classifier. Autolog covers the 90% case across flavors (PyTorch, sklearn, transformers, Lightning) — training loss, hyperparameters from the model constructor, framework version, model artifact — all without explicit calls. Domain metrics (business KPIs, calibration curves, safety-eval pass rates) still need log_metric because no framework's default set includes them.

The production spec is where engineers learn the layout most notebook demos skip. --backend-store-uri points at Postgres because metadata is read-heavy with transactional writes; --artifacts-destination points at S3 because models and input examples are large blobs better served from object storage. The file-URI backend is fine for a single notebook and falls over the moment a second engineer tries to list runs. The Kubernetes Job YAML injects MLFLOW_TRACKING_URI as an env var so the training pod POSTs runs over HTTP to your separately-deployed tracking server, plus nvidia.com/gpu for scheduling and an appropriate restartPolicy for Job semantics.
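As a sketch, the Step 4 server command might look like the string below, held as a Python variable the way the lab's mlflow_server_cmd expects. The Postgres DSN, bucket name, and port are hypothetical placeholders; only the flag layout matters:

```python
# Hypothetical DSN, bucket, and port: substitute your own infrastructure.
mlflow_server_cmd = (
    "mlflow server "
    "--backend-store-uri postgresql://mlflow:mlflow@postgres.internal:5432/mlflow "
    "--artifacts-destination s3://ml-artifacts/mlflow "
    "--host 0.0.0.0 --port 5000"
)
```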

Prereqs: PyTorch training-loop basics, Python context managers and pandas DataFrames, some exposure to experiment tracking. Preinstalled: MLflow, PyTorch, pandas, scratch tracking backend, JupyterLab. Grading checks real structure: run_info.status must be FINISHED, logged_metrics['train_loss'] must have ≥5 steps and decrease first→last, the registered model must reload and return a valid prediction shape, all_runs_df must contain ≥3 runs with best_run matching the true minimum, and the production spec must reference Postgres backend, S3 artifacts, a Kubernetes Job with GPU scheduling, and required tags like team and owner.

Frequently asked questions

Why use the MLflow Model Registry instead of just pointing at a run_id?

Because run_ids change every retrain and production code can't track them. The Registry gives you stable URIs like models:/chat-intent-classifier/Production that resolve to whichever version is currently promoted. When a new model wins the evaluation gate, you call transition_model_version_stage(..., stage='Production') and serving jobs pick it up without a config change. That's the atomic promotion the lab teaches in Step 2: run_ids stay in Slack threads and docs, URIs live in the serving config.

What's the point of mlflow.autolog() if I still need explicit log_metric?

Autolog captures the things every framework user wants — training/validation loss, hyperparameters from the model constructor, epoch summaries, framework version, model artifact — without a single explicit call. It's the 90% case. You still call log_metric for domain metrics (a business KPI, a calibration score, a safety-eval pass rate) because those aren't in any framework's default set. The lab has you author autologging_config in Step 4 so you enumerate which flavors to enable (pytorch, sklearn, transformers) and whether to log input examples.

Why --backend-store-uri Postgres + --artifacts-destination S3 in production?

The backend store holds metadata — runs, params, metrics, registry state — and it's read-heavy with transactional writes, which Postgres handles well. Artifacts (model binaries, input examples, plots) are large blobs better served from object storage. The file-URI backend you use in Step 1 is fine for a single notebook but falls over the moment a second engineer tries to list runs. The Step 4 mlflow_server_cmd forces you to write the real CLI with both URIs so you've seen the production layout.

What tags should I actually enforce?

At minimum: team (who owns it), owner (individual accountable), env (dev/staging/prod), git_commit (exact source version), and model_family (classifier, retriever, embedder, LLM fine-tune, etc.). Step 4 requires team and owner specifically; the reflection at the end asks which tag would let a new engineer six months later distinguish 'the best chat-intent classifier from last quarter' from every other classifier on the server. That's model_family, and without it every search is a text match against titles.

Can I use MLflow for training on Kubernetes without the Tracking Server running as a K8s service?

Yes — your training Job just needs MLFLOW_TRACKING_URI pointing at a reachable server (the one started by mlflow server ...), and it'll POST runs over HTTP. The Step 4 k8s_training_job YAML forces that injection as an env var, plus an nvidia.com/gpu resource request and a restartPolicy of Never or OnFailure, the only values valid for Jobs. The server itself typically runs as a separate Deployment behind a Service — outside this lab's scope, but a common pairing.
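A sketch of such a manifest, held as a Python string the way a k8s_training_job variable might expect it. The image, namespace, and tracking-server Service address are hypothetical:

```python
# Hypothetical image and Service DNS name: adapt to your cluster.
k8s_training_job = """\
apiVersion: batch/v1
kind: Job
metadata:
  name: train-intent-classifier
spec:
  template:
    spec:
      restartPolicy: Never              # Jobs allow only Never or OnFailure
      containers:
      - name: trainer
        image: registry.internal/train:latest
        env:
        - name: MLFLOW_TRACKING_URI     # training pod POSTs runs over HTTP
          value: http://mlflow.mlflow.svc:5000
        resources:
          limits:
            nvidia.com/gpu: 1           # schedule onto a GPU node
"""
```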

What does the grader validate on each step?

Step 1 requires run_info.status == 'FINISHED', logged_params contains learning_rate/batch_size/optimizer, logged_metrics['train_loss'] has ≥5 (step, value) tuples, and last loss < first loss. Step 2 validates model_artifact_path is a real MLflow URI, registered_model has name/latest_version, version_stage is one of the valid stage names, and reload_prediction_shape is a non-empty tuple. Step 3 enforces len(all_runs_df) >= 3 and best_run.final_loss matches the DataFrame minimum. Step 4 checks the server CLI flags, K8s Job primitives, required tags, and autologging_config keys.