Continued Pre-Training: Adapt a Pretrained LM to a New Domain
Take GPT-2 and domain-adapt it to Python code in 150 steps, measuring both the gain on code and the cost in catastrophic forgetting on English. The exact recipe behind Code Llama, BloombergGPT, and every domain-specialized LLM of the last three years.
What you'll learn
1. Baseline: measure GPT-2 on code vs English
2. Prepare the CPT corpus
3. Continued pretraining run
4. Measure the trade-off: domain gain vs general forgetting
Prerequisites
- Completed basic LLM fine-tuning (Lab fine-tune-llm-lora or equivalent)
- Comfortable computing perplexity and running a PyTorch training loop
- Understanding of learning-rate schedules and catastrophic forgetting
What you'll build in this continued pre-training lab
Continued pre-training (DAPT) is the recipe behind every domain-specialised LLM of the last three years — Code Llama, BloombergGPT, Galactica, Med-PaLM, and the wave of vertical models that ship with "built on Llama 3, adapted on 200B domain tokens" in the spec sheet. It's also the technique most engineers get wrong first, because it looks like fine-tuning but behaves very differently. In roughly 45 minutes on a real NVIDIA GPU we provision, you'll take vanilla GPT-2, domain-adapt it to Python code with a 150-step CPT run, and measure both what you gained (code perplexity drops >30%) and what you lost (general-English perplexity rises — the empirical face of catastrophic forgetting). You'll walk away with a concrete tradeoff curve and defensible answers for four production choices: pure CPT, replay mixing, LoRA on frozen weights, or post-hoc model souping.
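The two measurements above (perplexity on each corpus, and the ratio that quantifies forgetting) can be sketched in a few lines. This is a minimal illustration in plain PyTorch, with toy logits standing in for GPT-2's output; the function names are illustrative, not the lab's grading code:

```python
import math

import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Perplexity = exp(mean token negative log-likelihood).

    logits: (T, V) next-token logits, targets: (T,) token ids.
    """
    nll = F.cross_entropy(logits, targets)  # mean NLL over the T positions
    return math.exp(nll.item())

def forgetting_ratio(adapted_general_ppl: float, baseline_general_ppl: float) -> float:
    """> 1.0 means the adapted model got worse on general English text."""
    return adapted_general_ppl / baseline_general_ppl

# Sanity check: uniform logits over a V-token vocab give perplexity exactly V.
V, T = 8, 16
uniform_logits = torch.zeros(T, V)
targets = torch.randint(0, V, (T,))
assert abs(perplexity(uniform_logits, targets) - V) < 1e-4

# The example from this lab: baseline 40 -> adapted 56 is a 1.4x forgetting ratio.
assert abs(forgetting_ratio(56.0, 40.0) - 1.4) < 1e-9
```

In the lab itself the logits come from GPT-2 evaluated over fixed-size windows of the code and wikitext streams; the arithmetic is exactly this.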
The substance is the CPT-specific hyperparameter geometry. You'll run AdamW at cpt_lr = 5e-5 — roughly 8% of GPT-2's 6e-4 pretraining peak, because full-blast LR on already-converged weights blows up the loss in the first few steps — with a cosine schedule, ~10-step warmup, grad clip 1.0, and block size 256. You'll learn why documents are concatenated with eos_token_id separators into one long 1-D token stream (windows crossing document boundaries otherwise teach the model that the end of one Python file continues into the next), why a 98/2 train/val split on the code corpus is enough when both splits are well above 50k tokens, and why baseline_code_ppl often starts close to baseline_general_ppl on GPT-2 — WebText in 2019 contained substantial Stack Overflow and GitHub content, so what CPT is really doing is sharpening a code-aware token predictor into a code-specialised one. The forgetting_ratio = adapted_general_ppl / baseline_general_ppl is the signal you'd track over the whole CPT schedule in production and stop before it crosses your retention floor.
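The concatenate-with-EOS-then-reshape trick described above can be sketched as follows. The toy documents and token ids are hypothetical, but the packing logic is the one the paragraph describes:

```python
import torch

def pack_documents(docs: list[list[int]], eos_token_id: int, block_size: int) -> torch.Tensor:
    """Concatenate tokenized docs into one 1-D stream with EOS separators,
    truncate to a multiple of block_size, and reshape into fixed-size blocks."""
    stream: list[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_token_id)  # learned "previous context ends here" signal
    n_blocks = len(stream) // block_size
    ids = torch.tensor(stream[: n_blocks * block_size], dtype=torch.long)
    # (n_blocks, block_size): any training window is now one constant-time row lookup
    return ids.reshape(-1, block_size)

# Hypothetical toy corpus: three tiny "documents", eos_token_id=0, block_size=4.
blocks = pack_documents([[1, 2, 3], [4, 5], [6, 7, 8, 9]], eos_token_id=0, block_size=4)
assert blocks.shape == (3, 4)
assert blocks.flatten().tolist() == [1, 2, 3, 0, 4, 5, 0, 6, 7, 8, 9, 0]
```

Note that windows can still span a document boundary; the point is that when they do, the EOS token in the middle gives the model an explicit reset signal instead of silently fusing two unrelated files.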
Prerequisites: comfort with computing perplexity in PyTorch, at least one basic fine-tune under your belt (the fine-tune-llm-lora lab is a fine warm-up), and knowing what a cosine schedule with warmup does. The sandbox preloads GPT-2, its tokenizer, codeparrot/codeparrot-clean, and wikitext-2-raw-v1, so all data loads locally with no streaming from the Hub. The reflection and rubric explicitly walk through the production decisions that sit behind this technique: domain-adaptive pretraining (DAPT) versus LoRA, how much catastrophic forgetting to tolerate, and how recipes like Code Llama's scale the same ideas.
Frequently asked questions
Why is cpt_lr = 5e-5 so much smaller than GPT-2's original pretraining LR of 6e-4?
Because the weights are already converged. At the original peak LR, the first few steps of CPT push the parameters far off the minimum they sit in and the loss blows up. Running at roughly 8% of the pretraining peak lets the model drift toward the code distribution without destroying what it already knows.

What is the forgetting_ratio actually measuring?
If baseline_general_ppl = 40 and adapted_general_ppl = 56, forgetting_ratio = 1.4 — the adapted model is 40% worse at predicting wikitext tokens than the base. This number is the empirical face of catastrophic forgetting: by pointing the gradient only at code-distributed tokens for 150 steps, you let the general-text capability drift. In a real domain-LLM program you'd track this ratio over the CPT schedule and stop (or add replay) before it crosses your retention floor.

Why concatenate documents with eos_token_id between them instead of keeping them separate?
The eos_token_id separator is a learned signal that tells the model "the previous context ends here" — without it, windows that cross document boundaries would teach the model that the end of one Python file and the start of the next are part of the same sequence. The .reshape(-1, block_size) pattern is the other half of the trick: after concatenation you can randomly sample windows in constant time.

Why does GPT-2's baseline code perplexity often start close to its baseline English perplexity?
WebText in 2019 contained substantial Stack Overflow and GitHub content, so baseline_code_ppl often lands within 2× of baseline_general_ppl, sometimes lower. This is why the Step 1 checker refuses to assert a direction — it would fail for reasons that aren't your bug. The real signal is the Step 4 delta: CPT on Python drops adapted_code_ppl by 30%+ because you've concentrated training on code and pushed the output distribution toward its sharper regions.

When would I use CPT versus a LoRA adapter?
CPT updates every weight, which buys the deepest domain shift but also the forgetting you measure in this lab. A LoRA adapter on frozen weights caps how far the model can drift, preserves general capability by construction, and can be switched off at inference. Reach for full CPT when you have a large domain corpus and the budget to mix in replay data; reach for LoRA when retention matters more than peak domain performance, or when you need several domains on one base model.

Why cosine schedule with warmup and not flat LR?
The ~10-step warmup (ramping linearly from zero up to cpt_lr) prevents the Adam second-moment estimates from exploding before they've seen enough gradients to be stable. Cosine decay afterward cools the LR smoothly to near-zero over the remaining steps, which matters because CPT runs are short — you don't have room for a long plateau, and the cosine tail gives the loss a cleaner final floor than a step decay. The 150-step budget here is deliberately tight; on a real production CPT run you'd use the same shape but stretched across 10,000+ steps.
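That warmup-then-cosine shape is a few lines with PyTorch's LambdaLR. A minimal sketch using the lab's stated hyperparameters (cpt_lr = 5e-5, ~10 warmup steps, 150 total); the single zero parameter is a stand-in for GPT-2's weights:

```python
import math

import torch

cpt_lr, warmup_steps, total_steps = 5e-5, 10, 150

def lr_lambda(step: int) -> float:
    """Multiplier on the base LR: linear warmup, then cosine decay to ~0."""
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

params = [torch.nn.Parameter(torch.zeros(1))]  # stand-in for the model's parameters
opt = torch.optim.AdamW(params, lr=cpt_lr)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)

lrs = []
for _ in range(total_steps):
    opt.step()                       # backward() and loss omitted in this sketch
    lrs.append(sched.get_last_lr()[0])
    sched.step()

assert lrs[0] == 0.0                 # warmup starts from zero
assert abs(max(lrs) - cpt_lr) < 1e-12  # peak hits cpt_lr at the end of warmup
assert lrs[-1] < 1e-6                # cosine tail has cooled LR to near zero
```

Stretching this to a production run only changes warmup_steps and total_steps; the shape, and the reasons for it, stay the same.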