RLHF & DPO Alignment
Run real Direct Preference Optimization on a small language model with TRL's DPOTrainer. Capture a baseline, build a preference dataset, train, and measurably shift the model's behavior in four steps.
What you'll learn
1. Load the base model and capture baseline outputs
2. Build the preference dataset
3. Train with DPOTrainer
4. Before vs after: did alignment work?
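Step 2's preference dataset reduces to {prompt, chosen, rejected} triplets. A minimal sketch of the format — the prompts and completions here are invented placeholders, not the lab's actual data:

```python
# The preference schema DPOTrainer expects: one dict per comparison,
# with exactly these three string keys. (Example rows are invented.)
preference_pairs = [
    {
        "prompt": "How do I reset my password?",
        "chosen": "Go to Settings > Security and click 'Reset password'.",
        "rejected": "Figure it out yourself.",
    },
    {
        "prompt": "Summarize photosynthesis in one sentence.",
        "chosen": "Plants convert light, water, and CO2 into glucose and oxygen.",
        "rejected": "It's a plant thing.",
    },
]

# In the lab you would wrap this with datasets.Dataset.from_list(preference_pairs)
# and pass the result as train_dataset.
assert all(set(p) == {"prompt", "chosen", "rejected"} for p in preference_pairs)
print(len(preference_pairs), "pairs ready")
```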
Prerequisites
- Fluent with Hugging Face transformers (tokenizer + causal LM generation)
- Understanding of supervised fine-tuning loss
- Basic grasp of what RLHF is trying to accomplish vs pretraining
Skills & technologies you'll practice
This advanced-level GPU lab gives you real-world reps across preference-dataset construction, DPO training with TRL, and before/after behavioral evaluation.
What you'll build in this DPO alignment lab
DPO replaced PPO-based RLHF as the default alignment recipe across the open-weight ecosystem — Zephyr, Tulu, Llama 3 Instruct's preference stages, and most community instruction-tunes all shipped on trl.DPOTrainer in the last two years, because it throws out the reward model entirely and still lands measurably aligned behavior. In about 55 minutes on a real NVIDIA GPU we provision, you'll run a real DPO pass on a causal LM, build a preference dataset from scratch, and produce a measurable before/after shift — not a fake "loss went down" signal but actual text that differs on held prompts. You'll walk away understanding why DPO is simpler than PPO (no rollouts, no value network, no reward model), why it's more fragile in a specific way (uniquely sensitive to preference-data quality), and how to read the trainer.state.log_history to tell the difference between a healthy run and alignment theater.
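A sketch of the kind of trainer.state.log_history read the lab teaches. The metric keys match what TRL's DPOTrainer logs (rewards/margins, rewards/accuracies), but the numbers below are invented to illustrate the shape of a healthy run:

```python
# trainer.state.log_history is a list of dicts, one per logging step.
# A healthy DPO run shows growing reward margins and a loss that falls
# below its ln(2) ≈ 0.693 starting point; flat margins with a wandering
# loss is the "alignment theater" signature. (Values here are invented.)
log_history = [
    {"step": 10, "loss": 0.69, "rewards/margins": 0.01, "rewards/accuracies": 0.50},
    {"step": 20, "loss": 0.52, "rewards/margins": 0.35, "rewards/accuracies": 0.75},
    {"step": 30, "loss": 0.31, "rewards/margins": 0.90, "rewards/accuracies": 1.00},
]

def looks_healthy(history):
    """True if margins are monotonically rising and loss ended below ln(2)."""
    margins = [h["rewards/margins"] for h in history if "rewards/margins" in h]
    losses = [h["loss"] for h in history if "loss" in h]
    return margins == sorted(margins) and losses[-1] < 0.693

print(looks_healthy(log_history))  # True for this run
```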
Technically you'll package {prompt, chosen, rejected} triplets into a datasets.Dataset, instantiate trl.DPOTrainer(model, args=DPOConfig(...), train_dataset=dataset, tokenizer=tokenizer), and let it handle the Bradley-Terry log-sigmoid loss over the β-scaled log-ratio of policy vs reference policy under the hood. The lab deliberately uses GPT-2 and eight preference pairs — not because that's production-realistic but because it collapses a 30-minute mechanics demonstration into a 55-minute lab where you can see the alignment tax (the general-capability drift that follows narrow preference optimization) on specific prompts. The production gap is explicit: real DPO runs use UltraFeedback (~64k pairs), HH-RLHF (~161k), or Nectar (~1M), layer multiple datasets, and frequently blend in SFT loss or iterate DPO rounds to manage capability regression. You'll also get the comparative map: KTO for thumbs-up/down data, IPO for small-dataset stability, ORPO for skipping the SFT-then-align split, PPO-RLHF for when reward-model expressivity matters more than implementation simplicity.
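The Bradley-Terry log-sigmoid loss named above is compact enough to write out directly. A plain-Python sketch with invented sequence log-probabilities, not TRL's internal implementation:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid of the beta-scaled difference of policy-vs-reference
    log-ratios for the chosen vs rejected completion."""
    logits = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Invented log-probs: the policy already prefers the chosen completion a bit
# more than the frozen reference does, so the loss sits below ln(2) ≈ 0.693.
loss = dpo_loss(logp_chosen=-12.0, logp_rejected=-15.0,
                ref_chosen=-13.0, ref_rejected=-14.0, beta=0.1)
print(round(loss, 4))  # prints 0.5981
```

At initialization the policy equals the reference, every log-ratio is zero, and the loss starts at exactly ln(2) — which is why a DPO run that never drops below ~0.693 has learned nothing.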
Prerequisites: fluency with Hugging Face causal-LM generation (.generate(), sampling params), a working sense of supervised fine-tuning loss, and a rough grasp of what RLHF is trying to accomplish versus pretraining. The sandbox has trl, transformers, datasets, GPT-2, and a compatible TRL release preinstalled. If you're Googling "DPO vs PPO RLHF", "DPOTrainer tutorial", "how to align a small LLM", "preference dataset format TRL", or "KTO vs DPO vs ORPO" — this is the hands-on answer, and the code ports directly to Llama 3 or Mistral 7B with LoRA/QLoRA for production-scale alignment.
Frequently asked questions
How is DPO mathematically different from PPO-based RLHF?
DPO shows that the Bradley-Terry probability of preferring chosen over rejected factors into a closed-form loss over log-ratios of the policy vs the reference, scaled by a temperature β. You still have a reference model (ref_model in DPOTrainer), but you skip reward-model training entirely and use plain gradient descent — no rollouts, no value network, no importance sampling. The tradeoff is that DPO is less expressive when the reward landscape is complex, but dramatically simpler and more stable for preference-shaping.
Why does the check require only one changed output, not all of them?
Is eight preference pairs enough to 'align' a model?
Why use GPT-2 instead of a bigger model?
GPT-2 is small enough to train and sample in seconds on the provided GPU, which keeps the comparison of baseline_outputs vs aligned_outputs easy to eyeball. The exact same DPOTrainer call scales up to Llama 3 or Mistral 7B — you'd just add LoRA or QLoRA (DPO-on-LoRA is standard for 7B-class alignment on a single 24 GB card).