RLHF & DPO Alignment
GPU sandbox · jupyter
Beta

Run real Direct Preference Optimization on a small language model with TRL's DPOTrainer. Capture a baseline, build a preference dataset, train, and measurably shift the model's behavior in four steps.

55 min · 4 steps · 4 domains · Advanced

What you'll learn

  1. Load the base model and capture baseline outputs
  2. Build the preference dataset
  3. Train with DPOTrainer
  4. Before vs after: did alignment work?

Prerequisites

  • Fluent with Hugging Face transformers (tokenizer + causal LM generation)
  • Understanding of supervised fine-tuning loss
  • Basic grasp of what RLHF is trying to accomplish vs pretraining

Exam domains covered

Alignment · Prompt Engineering · LLM Fundamentals & Architecture · Experimentation

Skills & technologies you'll practice

This advanced-level GPU lab gives you real-world reps across:

DPO · RLHF · TRL · Preference Data · Alignment · GPT-2 · Hugging Face

What you'll build in this DPO alignment lab

DPO replaced PPO-based RLHF as the default alignment recipe across the open-weight ecosystem — Zephyr, Tulu, and most community instruction-tunes shipped on trl.DPOTrainer in the last two years, and Llama 3 Instruct's preference stage leaned on DPO as well, because it throws out the reward model entirely and still lands measurably aligned behavior. In about 55 minutes on a real NVIDIA GPU we provision, you'll run a real DPO pass on a causal LM, build a preference dataset from scratch, and produce a measurable before/after shift — not a fake "loss went down" signal but actual text that differs on held prompts. You'll walk away understanding why DPO is simpler than PPO (no rollouts, no value network, no reward model), why it's more fragile in a specific way (uniquely sensitive to preference-data quality), and how to read the trainer.state.log_history to tell the difference between a healthy run and alignment theater.

Technically you'll package {prompt, chosen, rejected} triplets into a datasets.Dataset, instantiate trl.DPOTrainer(model, args=DPOConfig(...), train_dataset, tokenizer), and let it handle the Bradley-Terry log-sigmoid loss over the β-scaled log-ratio of policy vs reference policy under the hood. The lab deliberately uses GPT-2 and eight preference pairs — not because that's production-realistic but because it keeps the full mechanics demonstration inside a 55-minute lab while still letting you see the alignment tax (the general-capability drift that follows narrow preference optimisation) on specific prompts. The production gap is explicit: real DPO runs use UltraFeedback (~64k pairs), HH-RLHF (~161k), or Nectar (~1M), layer multiple datasets, and frequently blend in SFT loss or iterate DPO rounds to manage capability regression. You'll also get the comparative map: KTO for thumbs-up/down data, IPO for small-dataset stability, ORPO for skipping the SFT-then-align split, PPO-RLHF for when reward-model expressivity matters more than implementation simplicity.
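A minimal sketch of that packaging step, assuming a TRL release that ships DPOConfig (keyword names such as tokenizer vs. processing_class vary across TRL versions); the prompt text and hyperparameter values are illustrative, not the lab's actual data:

```python
# One {prompt, chosen, rejected} triplet in the column format DPOTrainer expects.
# Text is illustrative -- the lab supplies its own eight pairs.
preference_pairs = [
    {
        "prompt": "Explain what a neural network is.",
        "chosen": "A neural network is a function built from layers of weighted "
                  "sums and nonlinearities, trained by gradient descent.",
        "rejected": "idk, some computer brain thing probably",
    },
    # ...seven more pairs in the same shape
]

def build_dpo_trainer(model, tokenizer, pairs):
    """Wrap the pairs in a datasets.Dataset and hand everything to DPOTrainer."""
    from datasets import Dataset
    from trl import DPOConfig, DPOTrainer

    config = DPOConfig(
        output_dir="dpo-out",
        beta=0.1,                      # temperature on the policy/reference log-ratio
        per_device_train_batch_size=4,
        num_train_epochs=3,
    )
    # ref_model=None: TRL snapshots the starting policy as the frozen reference.
    return DPOTrainer(
        model=model,
        ref_model=None,
        args=config,
        train_dataset=Dataset.from_list(pairs),
        tokenizer=tokenizer,
    )
```

Calling .train() on the returned trainer runs the optimisation, and trainer.state.log_history then holds the per-step losses the lab asks you to read.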

Prerequisites: fluency with Hugging Face causal-LM generation (.generate(), sampling params), a working sense of supervised fine-tuning loss, and a rough grasp of what RLHF is trying to accomplish versus pretraining. The sandbox has trl, transformers, datasets, GPT-2, and a compatible TRL release preinstalled. If you're Googling "DPO vs PPO RLHF", "DPOTrainer tutorial", "how to align a small LLM", "preference dataset format TRL", or "KTO vs DPO vs ORPO" — this is the hands-on answer, and the code ports directly to Llama 3 or Mistral 7B with LoRA/QLoRA for production-scale alignment.

Frequently asked questions

How is DPO mathematically different from PPO-based RLHF?

PPO-RLHF needs three moving parts: a reward model trained on preferences, a policy being optimised, and a reference policy with a KL penalty. DPO rewrites the optimisation so the reward model disappears: the Bradley-Terry likelihood of preferring chosen over rejected factors into a closed-form loss over log-ratios of the policy vs the reference, scaled by a temperature β. You still have a reference model (the ref_model argument to DPOTrainer), but you skip reward-model training entirely and use plain gradient descent — no rollouts, no value network, no importance sampling. The tradeoff is that DPO is less expressive when the reward landscape is complex but dramatically simpler and more stable for preference-shaping.
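That closed-form loss is small enough to write out numerically; a sketch of the per-pair computation, with made-up log-probabilities as inputs:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log sigmoid(m) == log(1 + e^{-m}); log1p keeps this accurate near m = 0
    return math.log1p(math.exp(-margin))

# At step 0 the policy equals the reference, both log-ratios are 0, and the
# loss sits at ln 2 ≈ 0.693 -- the first sanity check to look for in log_history.
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))  # ≈ 0.6931
```

As training pulls the chosen log-probability up relative to the reference (and the rejected one down), the margin grows and the loss falls below ln 2.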

Why does the check require only one changed output, not all of them?

Because with eight preference pairs and a handful of training steps, the alignment signal is sparse. Some prompts are far enough from the training distribution that neither the chosen nor rejected gradient meaningfully touches them — the model's output stays numerically close. At least one changing proves the DPO loss actually flowed through the backward pass and updated weights; that's the floor the check enforces. In a real alignment run you'd instead evaluate on held-out preference pairs and report win-rate against the reference policy as your metric, not a binary "did anything change".
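The check itself reduces to a string comparison over the held prompts; a sketch using the lab's baseline_outputs / aligned_outputs naming (the names come from the lab, the example strings are invented):

```python
def count_changed(baseline_outputs, aligned_outputs):
    """How many held prompts produced different text after the DPO pass."""
    return sum(before.strip() != after.strip()
               for before, after in zip(baseline_outputs, aligned_outputs))

# Illustrative generations: one prompt untouched, one visibly shifted.
baseline_outputs = ["The sky is blue.", "Paris is a city."]
aligned_outputs  = ["The sky is blue.", "Paris is the capital of France."]
assert count_changed(baseline_outputs, aligned_outputs) >= 1  # the check's floor
```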

Is eight preference pairs enough to 'align' a model?

No — it's enough to verify the pipeline runs end-to-end and produce measurable behavior change on your specific prompts. Production DPO uses thousands to millions of pairs: UltraFeedback is ~64K, HH-RLHF is ~161K, Nectar crosses a million. Anthropic, OpenAI, and Meta all layer multiple preference datasets, iterative DPO rounds, and frequently blend SFT data in. The lab's eight-pair setup is deliberately the smallest thing that still lets you see DPO's mechanics; the reflection asks you to reason about what breaks when you scale it up.

Why use GPT-2 instead of a bigger model?

Because this lab is about DPO mechanics, not about producing a useful aligned assistant. GPT-2 at 124M parameters trains fast enough to converge on eight preference pairs in under a minute, leaves plenty of VRAM for the reference model to sit alongside, and makes the before/after shift in baseline_outputs vs aligned_outputs easy to eyeball. The exact same DPOTrainer call scales up to Llama 3 or Mistral 7B — you'd just add LoRA or QLoRA (DPO-on-LoRA is standard for 7B-class alignment on a single 24 GB card).
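The scale-up path is mostly a peft.LoraConfig away. As a hedged illustration, here are typical adapter hyperparameters for a 7B-class DPO-on-LoRA run, expressed as plain kwargs (the specific values and target_modules are assumed defaults to tune per model, not settings from this lab):

```python
# Illustrative LoRA hyperparameters for a 7B-class DPO run -- tune per model.
lora_kwargs = dict(
    r=16,                     # adapter rank
    lora_alpha=32,            # adapter scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
# peft.LoraConfig(**lora_kwargs), passed as peft_config=... to DPOTrainer,
# trains only the adapter weights, so policy plus frozen reference fit on
# a single 24 GB card (4-bit base weights for the QLoRA variant).
```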

What's 'alignment tax' and does this lab exhibit it?

Alignment tax is the capability drop that often follows preference optimisation — the model gets better at the preferred behaviour and worse at general reasoning, factual recall, or code generation. It happens because the preference gradient pulls the policy toward a narrow distribution defined by your labelers, away from the richer pretraining distribution. You can catch a miniature version of it in Step 4 by picking one baseline prompt far from your preference pairs and comparing the text — often slightly less coherent after training. Mitigations in production: KL regularisation (what β controls), mixing in SFT loss, and iterative rounds that re-anchor to the reference.

When would I reach for KTO, IPO, or ORPO instead of DPO?

KTO (Kahneman-Tversky Optimization) takes per-sample thumbs-up / thumbs-down labels instead of paired preferences — useful when you have binary feedback from production traffic rather than ranked pairs. IPO (Identity Preference Optimization) replaces DPO's log-sigmoid with a squared loss to reduce overfitting on small datasets. ORPO (Odds Ratio Preference Optimization) folds SFT and preference optimisation into a single loss, skipping the reference model entirely. Pick by what your data looks like: pairs → DPO or IPO; thumbs → KTO; no SFT-then-align split available → ORPO. DPO remains the default because the library support is strongest and the hyperparameters are best understood.
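That decision tree is small enough to write down as code; a sketch mapping data shape to recipe (trainer class names as shipped in recent TRL releases, with loss_type='ipo' being DPOConfig's IPO switch — verify against your installed TRL version):

```python
def pick_alignment_recipe(data_format, has_sft_stage=True, small_dataset=False):
    """Map the feedback data you actually have to a preference-optimisation recipe."""
    if data_format == "thumbs":               # per-sample up/down labels
        return "trl.KTOTrainer"
    if data_format == "pairs" and not has_sft_stage:
        return "trl.ORPOTrainer"              # SFT + preference in one loss, no reference
    if data_format == "pairs" and small_dataset:
        return "trl.DPOTrainer with DPOConfig(loss_type='ipo')"  # squared loss
    return "trl.DPOTrainer"                   # default Bradley-Terry sigmoid loss
```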