MLflow Experiment Tracking: From Single Run to Team Workflow
Wire the four load-bearing pieces of MLflow into a real training loop — tracked runs with params and metrics, a registered model with stage transitions, a multi-run sweep + search, and a production spec (server, k8s Job, tags, autolog).
What you'll learn
1. Instrument a training run
2. Log the model & register it
3. Search & compare runs
4. Production: server, Kubernetes, tags, autolog
Prerequisites
- PyTorch training loop basics
- Python context managers and pandas DataFrames
- Concept of experiment tracking / reproducibility
Skills & technologies you'll practice
This intermediate-level GPU lab gives you real-world reps across MLflow run tracking, the Model Registry, multi-run search with pandas, and production deployment on Kubernetes.
What you'll wire up in this MLflow lab
MLflow stops being a notebook toy and starts being infrastructure the moment a second engineer asks 'which run produced the model we shipped last quarter?' This lab wires the four load-bearing pieces — tracked runs, Model Registry, search_runs across experiments, and the production server spec (Postgres + S3 + Kubernetes + tagging convention + autolog config). You'll walk away with a real PyTorch run instrumented with log_params/log_metric, a registered model you promoted to Staging and reloaded via stable models:/<name>/<version> URI, a multi-run sweep queried with mlflow.search_runs into a pandas DataFrame, and a production-ready mlflow_server_cmd + Kubernetes Job YAML + tagging_convention + autologging_config that the team can actually deploy. About 35 minutes on a live NVIDIA GPU pod — MLflow, PyTorch, pandas, and a scratch tracking backend are preinstalled.
The substance lives in the Registry and tags. Run IDs are disposable — they change every retrain, so production serving code can't track them. The Registry gives you stable URIs like models:/chat-intent-classifier/Production that resolve to whichever version is currently promoted; when a new model wins the evaluation gate, transition_model_version_stage(..., stage='Production') is one call and serving jobs pick it up without a config change. Tags are the invisible load-bearing piece: without enforced team, owner, env, git_commit, and model_family tags, mlflow.search_runs(filter_string="tags.model_family = 'llm' and tags.team = '...'") degenerates into an ls of run_ids and nobody can find last quarter's classifier. Autolog covers the 90% case across flavors (PyTorch, sklearn, transformers, Lightning) — training loss, hyperparameters from the model constructor, framework version, model artifact — without explicit calls. Domain metrics (business KPIs, calibration curves, safety-eval pass rates) still need log_metric because no framework's default set includes them.
The production spec is where engineers learn the layout most notebook demos skip. --backend-store-uri points at Postgres because metadata is read-heavy with transactional writes; --artifacts-destination points at S3 because models and input examples are large blobs better served from object storage. The file-URI backend is fine for a single notebook and falls over the moment a second engineer tries to list runs. The Kubernetes Job YAML injects MLFLOW_TRACKING_URI as an env var so the training pod POSTs runs over HTTP to your separately-deployed tracking server, plus nvidia.com/gpu for scheduling and an appropriate restartPolicy for Job semantics.
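The four Step 4 artifacts might be drafted as below. This is a sketch only: the hostnames, credentials, bucket name, container image, and the autolog config schema are assumptions, not the lab's graded values:

```python
# Tracking server CLI: Postgres for metadata, S3 for artifacts.
# Credentials and hostnames below are placeholders.
mlflow_server_cmd = (
    "mlflow server "
    "--backend-store-uri postgresql://mlflow:secret@postgres:5432/mlflow "
    "--artifacts-destination s3://ml-artifacts/mlflow "
    "--host 0.0.0.0 --port 5000"
)

# Kubernetes Job: tracking URI injected as an env var, GPU requested.
k8s_training_job = """\
apiVersion: batch/v1
kind: Job
metadata:
  name: train-chat-intent
spec:
  template:
    spec:
      restartPolicy: Never              # Job pods should not restart in place
      containers:
      - name: trainer
        image: registry.example.com/trainer:latest
        env:
        - name: MLFLOW_TRACKING_URI     # pod POSTs runs over HTTP
          value: http://mlflow-server:5000
        resources:
          limits:
            nvidia.com/gpu: 1           # schedule onto a GPU node
"""

# Enforced tag keys plus an illustrative autolog config.
tagging_convention = ["team", "owner", "env", "git_commit", "model_family"]
autologging_config = {"flavors": ["pytorch", "sklearn", "transformers"],
                      "log_input_examples": False}
```

The key design point is the split: the training pod knows only MLFLOW_TRACKING_URI, while the server alone holds the Postgres and S3 URIs, so retraining jobs never need storage credentials.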
Prereqs: PyTorch training-loop basics, Python context managers and pandas DataFrames, some exposure to experiment tracking. Preinstalled: MLflow, PyTorch, pandas, scratch tracking backend, JupyterLab. Grading checks real structure: run_info.status must be FINISHED, logged_metrics['train_loss'] must have ≥5 steps and decrease first→last, the registered model must reload and return a valid prediction shape, all_runs_df must contain ≥3 runs with best_run matching the true minimum, and the production spec must reference Postgres backend, S3 artifacts, a Kubernetes Job with GPU scheduling, and required tags like team and owner.
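The grading checks above amount to simple structural assertions. A pure-Python sketch of the Step 1 shapes, with made-up values, shows what the grader is looking at:

```python
# The shapes the grader inspects; the values here are invented.
run_info_status = "FINISHED"
logged_metrics = {"train_loss": [(0, 1.00), (1, 0.62), (2, 0.45),
                                 (3, 0.38), (4, 0.31)]}

steps = logged_metrics["train_loss"]
assert run_info_status == "FINISHED"
assert len(steps) >= 5             # at least five logged (step, value) tuples
assert steps[-1][1] < steps[0][1]  # loss must decrease first -> last
```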
Frequently asked questions
Why use the MLflow Model Registry instead of just pointing at a run_id?
Run IDs are disposable — they change every retrain, so serving code can't depend on them. The Registry gives you stable URIs like models:/chat-intent-classifier/Production that resolve to whichever version is currently promoted. When a new model wins the evaluation gate, you call transition_model_version_stage(..., stage='Production') and serving jobs pick it up without a config change. That's the atomic promotion the lab teaches in Step 2 — run_ids stay in Slack and docs, URIs live in the serving config.
What's the point of mlflow.autolog() if I still need explicit log_metric?
mlflow.autolog() covers the 90% case — training loss, hyperparameters, framework version, the model artifact — but you still need log_metric for domain metrics (a business KPI, a calibration score, a safety-eval pass rate) because those aren't in any framework's default set. The lab has you author autologging_config in Step 4 so you enumerate which flavors to enable (pytorch, sklearn, transformers) and whether to log input examples.
Why --backend-store-uri Postgres + --artifacts-destination S3 in production?
Because metadata is read-heavy with transactional writes (Postgres), while models and input examples are large blobs better served from object storage (S3). The file-URI backend falls over as soon as a second engineer tries to list runs. Step 4's mlflow_server_cmd forces you to write the real CLI with both URIs so you've seen the production layout.
What tags should I actually enforce?
At minimum: team (who owns it), owner (individual accountable), env (dev/staging/prod), git_commit (exact source version), and model_family (classifier, retriever, embedder, LLM fine-tune, etc.). Step 4 requires team and owner specifically; the reflection at the end asks which tag would let a new engineer six months later distinguish 'the best chat-intent classifier from last quarter' from every other classifier on the server. That's model_family, and without it every search is a text match against titles.
Can I use MLflow for training on Kubernetes without the Tracking Server running as a K8s service?
Yes. The training Job just needs MLFLOW_TRACKING_URI pointing at a reachable server (the one started by mlflow server ...), and it'll POST runs over HTTP. The Step 4 k8s_training_job YAML forces that injection as an env var, plus an nvidia.com/gpu resource request and an appropriate restartPolicy for Job semantics. The server itself typically runs as a separate Deployment behind a Service — outside this lab's scope, but a common pairing.
What does the grader validate on each step?
Step 1 checks run_info.status == 'FINISHED', that logged_params contains learning_rate/batch_size/optimizer, that logged_metrics['train_loss'] has ≥5 (step, value) tuples, and that the last loss is lower than the first. Step 2 validates that model_artifact_path is a real MLflow URI, registered_model has name/latest_version, version_stage is one of the valid stage names, and reload_prediction_shape is a non-empty tuple. Step 3 enforces len(all_runs_df) >= 3 and that best_run.final_loss matches the DataFrame minimum. Step 4 checks the server CLI flags, K8s Job primitives, required tags, and autologging_config keys.