MLflow Experiment Tracking: From Single Run to Team Workflow
GPU sandbox · jupyter
Beta

Wire the four load-bearing pieces of MLflow into a real training loop — tracked runs with params and metrics, a registered model with stage transitions, a multi-run sweep + search, and a production spec (server, k8s Job, tags, autolog).

35 min · 4 steps · 2 domains · Intermediate · NCP-AIO · NCA-GENL · NCP-ADS · NCA-GENM · NCP-AII

What you'll learn

  1. Instrument a training run
  2. Log the model & register it
  3. Search & compare runs
  4. Production: server, Kubernetes, tags, autolog

Prerequisites

  • PyTorch training loop basics
  • Python context managers and pandas DataFrames
  • Concept of experiment tracking / reproducibility

Exam domains covered

AI Infrastructure & Operations · Machine Learning Lifecycle

Skills & technologies you'll practice

This intermediate-level GPU lab gives you real-world reps across:

MLflow · Experiment Tracking · Model Registry · PyTorch · autolog · search_runs · Kubernetes · MLOps

What you'll wire up in this MLflow lab

MLflow stops being a notebook toy and starts being infrastructure the moment a second engineer asks 'which run produced the model we shipped last quarter?' This lab wires the four load-bearing pieces — tracked runs, Model Registry, search_runs across experiments, and the production server spec (Postgres + S3 + Kubernetes + tagging convention + autolog config). You'll walk away with a real PyTorch run instrumented with log_params/log_metric, a registered model you promoted to Staging and reloaded via stable models:/<name>/<version> URI, a multi-run sweep queried with mlflow.search_runs into a pandas DataFrame, and a production-ready mlflow_server_cmd + Kubernetes Job YAML + tagging_convention + autologging_config that the team can actually deploy. About 35 minutes on a live NVIDIA GPU pod — MLflow, PyTorch, pandas, and a scratch tracking backend are preinstalled.

The substance lives in the Registry and tags. Run IDs are disposable — they change every retrain, so production serving code can't track them. The Registry gives you stable URIs like models:/chat-intent-classifier/Production that resolve to whichever version is currently promoted; when a new model wins the evaluation gate, transition_model_version_stage(..., stage='Production') is one call and serving jobs pick it up without a config change. Tags are the invisible load-bearing piece: without enforced team, owner, env, git_commit, and model_family tags, mlflow.search_runs(filter_string="tags.model_family = 'llm' and tags.team = '...'") degenerates into an ls over run IDs and nobody can find last quarter's classifier. Autolog covers the 90% case across flavors (PyTorch, sklearn, transformers, Lightning) — training loss, hyperparameters from the model constructor, framework version, model artifact — all without explicit calls. Domain metrics (business KPIs, calibration curves, safety-eval pass rates) still need log_metric because no framework's default set includes them.

The production spec is where engineers learn the layout most notebook demos skip. --backend-store-uri points at Postgres because metadata is read-heavy with transactional writes; --artifacts-destination points at S3 because models and input examples are large blobs better served from object storage. The file-URI backend is fine for a single notebook and falls over the moment a second engineer tries to list runs. The Kubernetes Job YAML injects MLFLOW_TRACKING_URI as an env var so the training pod POSTs runs over HTTP to your separately-deployed tracking server, plus nvidia.com/gpu for scheduling and an appropriate restartPolicy for Job semantics.
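As a sketch, the Step 4 server command might look like the string below, held as a Python variable the way the lab's mlflow_server_cmd expects. The Postgres DSN, bucket name, and port are hypothetical placeholders; only the flag layout matters:

```python
# Hypothetical DSN, bucket, and port: substitute your own infrastructure.
mlflow_server_cmd = (
    "mlflow server "
    "--backend-store-uri postgresql://mlflow:mlflow@postgres.internal:5432/mlflow "
    "--artifacts-destination s3://ml-artifacts/mlflow "
    "--host 0.0.0.0 --port 5000"
)
```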

Prereqs: PyTorch training-loop basics, Python context managers and pandas DataFrames, some exposure to experiment tracking. Preinstalled: MLflow, PyTorch, pandas, scratch tracking backend, JupyterLab. Grading checks real structure: run_info.status must be FINISHED, logged_metrics['train_loss'] must have ≥5 steps and decrease first→last, the registered model must reload and return a valid prediction shape, all_runs_df must contain ≥3 runs with best_run matching the true minimum, and the production spec must reference Postgres backend, S3 artifacts, a Kubernetes Job with GPU scheduling, and required tags like team and owner.

Frequently asked questions

Why use the MLflow Model Registry instead of just pointing at a run_id?

Because run_ids change every retrain and production code can't track them. The Registry gives you stable URIs like models:/chat-intent-classifier/Production that resolve to whichever version is currently promoted. When a new model wins the evaluation gate, you call transition_model_version_stage(..., stage='Production') and serving jobs pick it up without a config change. That's the atomic promotion the lab teaches in Step 2: run_ids stay in Slack threads and docs, URIs live in the serving config.

What's the point of mlflow.autolog() if I still need explicit log_metric?

Autolog captures the things every framework user wants — training/validation loss, hyperparameters from the model constructor, epoch summaries, framework version, model artifact — without a single explicit call. It's the 90% case. You still call log_metric for domain metrics (a business KPI, a calibration score, a safety-eval pass rate) because those aren't in any framework's default set. The lab has you author autologging_config in Step 4 so you enumerate which flavors to enable (pytorch, sklearn, transformers) and whether to log input examples.

Why --backend-store-uri Postgres + --artifacts-destination S3 in production?

The backend store holds metadata — runs, params, metrics, registry state — and it's read-heavy with transactional writes, which Postgres handles well. Artifacts (model binaries, input examples, plots) are large blobs better served from object storage. The file-URI backend you use in Step 1 is fine for a single notebook but falls over the moment a second engineer tries to list runs. The Step 4 mlflow_server_cmd forces you to write the real CLI with both URIs so you've seen the production layout.

What tags should I actually enforce?

At minimum: team (who owns it), owner (individual accountable), env (dev/staging/prod), git_commit (exact source version), and model_family (classifier, retriever, embedder, LLM fine-tune, etc.). Step 4 requires team and owner specifically; the reflection at the end asks which tag would let a new engineer six months later distinguish 'the best chat-intent classifier from last quarter' from every other classifier on the server. That's model_family, and without it every search is a text match against titles.

Can I use MLflow for training on Kubernetes without the Tracking Server running as a K8s service?

Yes — your training Job just needs MLFLOW_TRACKING_URI pointing at a reachable server (the one started by mlflow server ...), and it'll POST runs over HTTP. The Step 4 k8s_training_job YAML forces that injection as an env var, plus an nvidia.com/gpu resource request and a restartPolicy of Never or OnFailure, the only values valid for Jobs. The server itself typically runs as a separate Deployment behind a Service — outside this lab's scope, but a common pairing.
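A sketch of such a manifest, held as a Python string the way a k8s_training_job variable might expect it. The image, namespace, and tracking-server Service address are hypothetical:

```python
# Hypothetical image and Service DNS name: adapt to your cluster.
k8s_training_job = """\
apiVersion: batch/v1
kind: Job
metadata:
  name: train-intent-classifier
spec:
  template:
    spec:
      restartPolicy: Never              # Jobs allow only Never or OnFailure
      containers:
      - name: trainer
        image: registry.internal/train:latest
        env:
        - name: MLFLOW_TRACKING_URI     # training pod POSTs runs over HTTP
          value: http://mlflow.mlflow.svc:5000
        resources:
          limits:
            nvidia.com/gpu: 1           # schedule onto a GPU node
"""
```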

What does the grader validate on each step?

Step 1 requires run_info.status == 'FINISHED', logged_params contains learning_rate/batch_size/optimizer, logged_metrics['train_loss'] has ≥5 (step, value) tuples, and last loss < first loss. Step 2 validates model_artifact_path is a real MLflow URI, registered_model has name/latest_version, version_stage is one of the valid stage names, and reload_prediction_shape is a non-empty tuple. Step 3 enforces len(all_runs_df) >= 3 and best_run.final_loss matches the DataFrame minimum. Step 4 checks the server CLI flags, K8s Job primitives, required tags, and autologging_config keys.