RLHF & DPO Alignment
GPU sandbox · jupyter
Beta

Run real Direct Preference Optimization on a small language model with TRL's DPOTrainer. Capture a baseline, build a preference dataset, train, and measurably shift the model's behavior in four steps.

55 min · 4 steps · 4 domains · Advanced

What you'll learn

  1. Load the base model and capture baseline outputs
  2. Build the preference dataset
  3. Train with DPOTrainer
  4. Before vs after: did alignment work?

Prerequisites

  • Fluent with Hugging Face transformers (tokenizer + causal LM generation)
  • Understanding of supervised fine-tuning loss
  • Basic grasp of what RLHF is trying to accomplish vs pretraining

Exam domains covered

Alignment · Prompt Engineering · LLM Fundamentals & Architecture · Experimentation

Skills & technologies you'll practice

This advanced-level GPU lab gives you real-world reps across:

DPO · RLHF · TRL · Preference Data · Alignment · GPT-2 · Hugging Face

What you'll build in this DPO alignment lab

DPO replaced PPO-based RLHF as the default alignment recipe across the open-weight ecosystem — Zephyr, Tulu, and most community instruction-tunes shipped on trl.DPOTrainer in the last two years, and Llama 3 Instruct's preference stage leaned on DPO as well, because it throws out the reward model entirely and still lands measurably aligned behavior. In about 55 minutes on a real NVIDIA GPU we provision, you'll run a real DPO pass on a causal LM, build a preference dataset from scratch, and produce a measurable before/after shift — not a fake "loss went down" signal but actual text that differs on held prompts. You'll walk away understanding why DPO is simpler than PPO (no rollouts, no value network, no reward model), why it's more fragile in a specific way (uniquely sensitive to preference-data quality), and how to read the trainer.state.log_history to tell the difference between a healthy run and alignment theater.

Technically you'll package {prompt, chosen, rejected} triplets into a datasets.Dataset, instantiate trl.DPOTrainer(model, args=DPOConfig(...), train_dataset, tokenizer), and let it handle the Bradley-Terry log-sigmoid loss over the β-scaled log-ratio of policy vs reference policy under the hood. The lab deliberately uses GPT-2 and eight preference pairs — not because that's production-realistic but because it keeps the full mechanics demonstration inside a 55-minute lab while still letting you see the alignment tax (the general-capability drift that follows narrow preference optimisation) on specific prompts. The production gap is explicit: real DPO runs use UltraFeedback (~64k pairs), HH-RLHF (~161k), or Nectar (~1M), layer multiple datasets, and frequently blend in SFT loss or iterate DPO rounds to manage capability regression. You'll also get the comparative map: KTO for thumbs-up/down data, IPO for small-dataset stability, ORPO for skipping the SFT-then-align split, PPO-RLHF for when reward-model expressivity matters more than implementation simplicity.
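A minimal sketch of that packaging step, assuming a TRL release that ships DPOConfig (keyword names such as tokenizer vs. processing_class vary across TRL versions); the prompt text and hyperparameter values are illustrative, not the lab's actual data:

```python
# One {prompt, chosen, rejected} triplet in the column format DPOTrainer expects.
# Text is illustrative -- the lab supplies its own eight pairs.
preference_pairs = [
    {
        "prompt": "Explain what a neural network is.",
        "chosen": "A neural network is a function built from layers of weighted "
                  "sums and nonlinearities, trained by gradient descent.",
        "rejected": "idk, some computer brain thing probably",
    },
    # ...seven more pairs in the same shape
]

def build_dpo_trainer(model, tokenizer, pairs):
    """Wrap the pairs in a datasets.Dataset and hand everything to DPOTrainer."""
    from datasets import Dataset
    from trl import DPOConfig, DPOTrainer

    config = DPOConfig(
        output_dir="dpo-out",
        beta=0.1,                      # temperature on the policy/reference log-ratio
        per_device_train_batch_size=4,
        num_train_epochs=3,
    )
    # ref_model=None: TRL snapshots the starting policy as the frozen reference.
    return DPOTrainer(
        model=model,
        ref_model=None,
        args=config,
        train_dataset=Dataset.from_list(pairs),
        tokenizer=tokenizer,
    )
```

Calling .train() on the returned trainer runs the optimisation, and trainer.state.log_history then holds the per-step losses the lab asks you to read.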

Prerequisites: fluency with Hugging Face causal-LM generation (.generate(), sampling params), a working sense of supervised fine-tuning loss, and a rough grasp of what RLHF is trying to accomplish versus pretraining. The sandbox has trl, transformers, datasets, GPT-2, and a compatible TRL release preinstalled. If you're Googling "DPO vs PPO RLHF", "DPOTrainer tutorial", "how to align a small LLM", "preference dataset format TRL", or "KTO vs DPO vs ORPO" — this is the hands-on answer, and the code ports directly to Llama 3 or Mistral 7B with LoRA/QLoRA for production-scale alignment.

Frequently asked questions

How is DPO mathematically different from PPO-based RLHF?

PPO-RLHF needs three moving parts: a reward model trained on preferences, a policy being optimised, and a reference policy with a KL penalty. DPO rewrites the optimisation so the reward model disappears: the Bradley-Terry likelihood of preferring chosen over rejected factors into a closed-form loss over log-ratios of the policy vs the reference, scaled by a temperature β. You still have a reference model (the ref_model argument to DPOTrainer), but you skip reward-model training entirely and use plain gradient descent — no rollouts, no value network, no importance sampling. The tradeoff is that DPO is less expressive when the reward landscape is complex but dramatically simpler and more stable for preference-shaping.
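That closed-form loss is small enough to write out numerically; a sketch of the per-pair computation, with made-up log-probabilities as inputs:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log sigmoid(m) == log(1 + e^{-m}); log1p keeps this accurate near m = 0
    return math.log1p(math.exp(-margin))

# At step 0 the policy equals the reference, both log-ratios are 0, and the
# loss sits at ln 2 ≈ 0.693 -- the first sanity check to look for in log_history.
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))  # ≈ 0.6931
```

As training pulls the chosen log-probability up relative to the reference (and the rejected one down), the margin grows and the loss falls below ln 2.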

Why does the check require only one changed output, not all of them?

Because with eight preference pairs and a handful of training steps, the alignment signal is sparse. Some prompts are far enough from the training distribution that neither the chosen nor rejected gradient meaningfully touches them — the model's output stays numerically close. At least one changing proves the DPO loss actually flowed through the backward pass and updated weights; that's the floor the check enforces. In a real alignment run you'd instead evaluate on held-out preference pairs and report win-rate against the reference policy as your metric, not a binary "did anything change".
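The check itself reduces to a string comparison over the held prompts; a sketch using the lab's baseline_outputs / aligned_outputs naming (the names come from the lab, the example strings are invented):

```python
def count_changed(baseline_outputs, aligned_outputs):
    """How many held prompts produced different text after the DPO pass."""
    return sum(before.strip() != after.strip()
               for before, after in zip(baseline_outputs, aligned_outputs))

# Illustrative generations: one prompt untouched, one visibly shifted.
baseline_outputs = ["The sky is blue.", "Paris is a city."]
aligned_outputs  = ["The sky is blue.", "Paris is the capital of France."]
assert count_changed(baseline_outputs, aligned_outputs) >= 1  # the check's floor
```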

Is eight preference pairs enough to 'align' a model?

No — it's enough to verify the pipeline runs end-to-end and produce measurable behavior change on your specific prompts. Production DPO uses thousands to millions of pairs: UltraFeedback is ~64K, HH-RLHF is ~161K, Nectar crosses a million. Anthropic, OpenAI, and Meta all layer multiple preference datasets, iterative DPO rounds, and frequently blend SFT data in. The lab's eight-pair setup is deliberately the smallest thing that still lets you see DPO's mechanics; the reflection asks you to reason about what breaks when you scale it up.

Why use GPT-2 instead of a bigger model?

Because this lab is about DPO mechanics, not about producing a useful aligned assistant. GPT-2 at 124M parameters trains fast enough to converge on eight preference pairs in under a minute, leaves plenty of VRAM for the reference model to sit alongside, and makes the before/after shift in baseline_outputs vs aligned_outputs easy to eyeball. The exact same DPOTrainer call scales up to Llama 3 or Mistral 7B — you'd just add LoRA or QLoRA (DPO-on-LoRA is standard for 7B-class alignment on a single 24 GB card).
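The scale-up path is mostly a peft.LoraConfig away. As a hedged illustration, here are typical adapter hyperparameters for a 7B-class DPO-on-LoRA run, expressed as plain kwargs (the specific values and target_modules are assumed defaults to tune per model, not settings from this lab):

```python
# Illustrative LoRA hyperparameters for a 7B-class DPO run -- tune per model.
lora_kwargs = dict(
    r=16,                     # adapter rank
    lora_alpha=32,            # adapter scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
# peft.LoraConfig(**lora_kwargs), passed as peft_config=... to DPOTrainer,
# trains only the adapter weights, so policy plus frozen reference fit on
# a single 24 GB card (4-bit base weights for the QLoRA variant).
```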

What's 'alignment tax' and does this lab exhibit it?

Alignment tax is the capability drop that often follows preference optimisation — the model gets better at the preferred behaviour and worse at general reasoning, factual recall, or code generation. It happens because the preference gradient pulls the policy toward a narrow distribution defined by your labelers, away from the richer pretraining distribution. You can catch a miniature version of it in Step 4 by picking one baseline prompt far from your preference pairs and comparing the text — often slightly less coherent after training. Mitigations in production: KL regularisation (what β controls), mixing in SFT loss, and iterative rounds that re-anchor to the reference.

When would I reach for KTO, IPO, or ORPO instead of DPO?

KTO (Kahneman-Tversky Optimization) takes per-sample thumbs-up / thumbs-down labels instead of paired preferences — useful when you have binary feedback from production traffic rather than ranked pairs. IPO (Identity Preference Optimization) replaces DPO's log-sigmoid with a squared loss to reduce overfitting on small datasets. ORPO (Odds Ratio Preference Optimization) folds SFT and preference optimisation into a single loss, skipping the reference model entirely. Pick by what your data looks like: pairs → DPO or IPO; thumbs → KTO; no SFT-then-align split available → ORPO. DPO remains the default because the library support is strongest and the hyperparameters are best understood.
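That decision tree is small enough to write down as code; a sketch mapping data shape to recipe (trainer class names as shipped in recent TRL releases, with loss_type='ipo' being DPOConfig's IPO switch — verify against your installed TRL version):

```python
def pick_alignment_recipe(data_format, has_sft_stage=True, small_dataset=False):
    """Map the feedback data you actually have to a preference-optimisation recipe."""
    if data_format == "thumbs":               # per-sample up/down labels
        return "trl.KTOTrainer"
    if data_format == "pairs" and not has_sft_stage:
        return "trl.ORPOTrainer"              # SFT + preference in one loss, no reference
    if data_format == "pairs" and small_dataset:
        return "trl.DPOTrainer with DPOConfig(loss_type='ipo')"  # squared loss
    return "trl.DPOTrainer"                   # default Bradley-Terry sigmoid loss
```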