Continued Pre-Training: Adapt a Pretrained LM to a New Domain
GPU sandbox · jupyter
Beta


Take GPT-2 and domain-adapt it to Python code in 150 steps, measuring both the gain on code and the cost in catastrophic forgetting on English. The exact recipe behind Code Llama, BloombergGPT, and every domain-specialized LLM of the last three years.

45 min · 4 steps · 2 domains · Advanced · ncp-genl · ncp-ads · nca-genl

What you'll learn

  1. Baseline: measure GPT-2 on code vs English
  2. Prepare the CPT corpus
  3. Continued pretraining run
  4. Measure the trade-off: domain gain vs general forgetting

Prerequisites

  • Completed basic LLM fine-tuning (Lab fine-tune-llm-lora or equivalent)
  • Comfortable computing perplexity and running a PyTorch training loop
  • Understanding of learning-rate schedules and catastrophic forgetting

Exam domains covered

LLM Fine-Tuning & Adaptation · GPU Acceleration & Distributed Training

Skills & technologies you'll practice

This advanced-level GPU lab gives you real-world reps across:

Continued Pretraining · DAPT · GPT-2 · Catastrophic Forgetting · HuggingFace · Perplexity · Learning Rate · Domain Adaptation

What you'll build in this continued pre-training lab

Continued pre-training (DAPT) is the recipe behind every domain-specialised LLM of the last three years — Code Llama, BloombergGPT, Galactica, Med-PaLM, and the wave of vertical models that ship with "built on Llama 3, adapted on 200B domain tokens" in the spec sheet. It's also the technique most engineers get wrong first, because it looks like fine-tuning but behaves very differently. In roughly 45 minutes on a real NVIDIA GPU we provision, you'll take vanilla GPT-2, domain-adapt it to Python code with a 150-step CPT run, and measure both what you gained (code perplexity drops >30%) and what you lost (general-English perplexity rises — the empirical face of catastrophic forgetting). You'll walk away with a concrete tradeoff curve and defensible answers for four production choices: pure CPT, replay mixing, LoRA on frozen weights, or post-hoc model souping.

The substance is the CPT-specific hyperparameter geometry. You'll run AdamW at cpt_lr = 5e-5 — roughly 8% of GPT-2's 6e-4 pretraining peak, because full-blast LR on already-converged weights blows up the loss in the first few steps — with a cosine schedule, ~10-step warmup, grad clip 1.0, and block size 256. You'll learn why documents are concatenated with eos_token_id separators into one long 1-D token stream (windows crossing document boundaries otherwise teach the model that the end of one Python file continues into the next), why a 98/2 train/val split on the code corpus is enough when both splits are well above 50k tokens, and why baseline_code_ppl often starts close to baseline_general_ppl on GPT-2 — WebText in 2019 contained substantial Stack Overflow and GitHub content, so what CPT is really doing is sharpening a code-aware token predictor into a code-specialised one. The forgetting_ratio = adapted_general_ppl / baseline_general_ppl is the signal you'd track over the whole CPT schedule in production and stop before it crosses your retention floor.
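That hyperparameter geometry can be sketched as a single training step on a toy model. This is an illustration only: the tiny embedding-plus-linear model, vocab of 100, and block of 16 are stand-ins for GPT-2's 50257-token vocab and block size 256, but the AdamW LR of 5e-5, the shifted next-token targets, and the grad clip at 1.0 are the lab's actual settings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, block = 100, 16  # toy stand-ins; the lab uses vocab 50257, block 256
model = nn.Sequential(nn.Embedding(vocab, 32), nn.Linear(32, vocab))
opt = torch.optim.AdamW(model.parameters(), lr=5e-5)  # ~8% of GPT-2's 6e-4 peak

tokens = torch.randint(0, vocab, (4, block + 1))   # a batch of contiguous windows
inputs, targets = tokens[:, :-1], tokens[:, 1:]    # shifted next-token targets

logits = model(inputs)                             # (4, block, vocab)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
loss.backward()
grad_norm = nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # grad clip 1.0
opt.step()
opt.zero_grad()
print(loss.item())
```

At initialisation the loss sits near ln(vocab), and the clipped step at 5e-5 nudges it down gently rather than bouncing it, which is exactly the behaviour the conservative CPT learning rate is buying.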

Prerequisites: comfort with computing perplexity in PyTorch, at least one basic fine-tune under your belt (the fine-tune-llm-lora lab is a fine warm-up), and knowing what a cosine schedule with warmup does. The sandbox preloads GPT-2, the tokenizer, codeparrot/codeparrot-clean, and wikitext-2-raw-v1, so no network downloads are needed. If you arrived here searching for "continued pretraining LLM tutorial", "domain-adaptive pretraining DAPT", "catastrophic forgetting LLM", "Code Llama training recipe", or "CPT vs LoRA for domain adaptation", the reflection and rubric explicitly walk through the production decision that sits behind each of those searches.

Frequently asked questions

Why is cpt_lr = 5e-5 so much smaller than GPT-2's original pretraining LR of 6e-4?

Because you're not starting from randomly-initialised weights — you're nudging a trained distribution. Full-blast pretraining LR on an already-converged model would blow up the loss in the first few steps, bouncing weights through sharp regions of the loss surface that the original training carefully cooled past. CPT convention is 1-10% of the original peak LR, with the exact fraction tuned to how aggressive a shift you want. 5e-5 lands in the middle of that band — aggressive enough to see a >30% code perplexity drop in 150 steps, gentle enough that the loss curve stays monotonic rather than spiking.

What is the forgetting_ratio actually measuring?

The relative degradation of the model's general-English next-token loss after CPT. If baseline_general_ppl = 40 and adapted_general_ppl = 56, forgetting_ratio = 1.4 — the adapted model is 40% worse at predicting wikitext tokens than the base. This number is the empirical face of catastrophic forgetting: by pointing the gradient only at code-distributed tokens for 150 steps, you let the general-text capability drift. In a real domain-LLM program you'd track this ratio over the CPT schedule and stop (or add replay) before it crosses your retention floor.
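In code the metric is a single division; using the illustrative numbers above (not measured values):

```python
def forgetting_ratio(baseline_general_ppl, adapted_general_ppl):
    """Relative general-English degradation after CPT (1.0 = no forgetting)."""
    return adapted_general_ppl / baseline_general_ppl

ratio = forgetting_ratio(40.0, 56.0)  # the FAQ's illustrative ppl values
print(ratio)  # → 1.4: the adapted model is 40% worse at predicting wikitext
```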

Why concatenate documents with eos_token_id between them instead of keeping them separate?

Because causal language modelling trains on contiguous 256-token windows with shifted targets, and the cleanest way to stream variable-length documents into that loop is one long 1-D token stream. The eos_token_id separator is a learned signal that tells the model "the previous context ends here" — without it, windows that cross document boundaries would teach the model that the end of one Python file and the start of the next are part of the same sequence. The .reshape(-1, block_size) pattern is the other half of the trick: after concatenation you can randomly sample windows in constant time.
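A minimal sketch of that packing step with toy token lists (GPT-2's real eos id is 50256; block_size is shrunk to 8 for readability, versus 256 in the lab):

```python
eos_token_id = 50256  # GPT-2's <|endoftext|> id
block_size = 8        # demo-sized; the lab uses 256

docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]  # three toy "tokenized documents"

# Concatenate into one long 1-D stream with eos separators between documents.
stream = []
for doc in docs:
    stream.extend(doc)
    stream.append(eos_token_id)

# Drop the ragged tail, then chop into fixed windows — the .reshape(-1, block_size) trick.
n = len(stream) // block_size * block_size
blocks = [stream[i:i + block_size] for i in range(0, n, block_size)]
print(blocks)  # → [[5, 6, 7, 50256, 8, 9, 50256, 10]]
```

Every window has the same length, so a training loop can sample any block index in constant time, and the eos tokens inside a window mark where one document ends and the next begins.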

Why does GPT-2's baseline code perplexity often start close to its baseline English perplexity?

Because GPT-2's training corpus (WebText — Reddit-outbound pages scraped in 2019) contained substantial code from Stack Overflow, GitHub gists linked in discussions, and tutorial sites. That means baseline_code_ppl often lands within 2× of baseline_general_ppl, sometimes lower. This is why the Step 1 checker refuses to assert a direction — it would fail for reasons that aren't your bug. The real signal is the Step 4 delta: CPT on Python drops adapted_code_ppl by 30%+ because you've concentrated training on code and pushed the output distribution toward its sharper regions.

When would I use CPT versus a LoRA adapter?

CPT when you want the base weights themselves to shift — Code Llama, BloombergGPT, medical-specialised LLMs all went this route. It's more expressive (no rank bottleneck), but it's heavier (updating every parameter) and causes forgetting you have to actively manage via replay or adapter averaging. LoRA when you want adaptation without touching the base — cheaper, lets you ship multiple domain adapters against one frozen model, and caps forgetting mechanically because the base weights literally can't change. In practice frontier teams often do CPT first on a large domain corpus, then LoRA on top for per-customer fine-tunes.
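The cost asymmetry is easy to make concrete. For a single d × d weight matrix, CPT updates all d² entries, while a rank-r LoRA adapter trains only two thin factors. The numbers below are illustrative (GPT-2-sized hidden dim, a commonly used rank), not settings from this lab:

```python
def full_update_params(d):
    """Parameters touched when CPT updates one d x d weight matrix."""
    return d * d

def lora_params(d, r):
    """A rank-r LoRA adapter adds two thin matrices: d x r and r x d."""
    return 2 * d * r

d, r = 768, 8  # illustrative: GPT-2's hidden dim, a common LoRA rank
print(full_update_params(d), lora_params(d, r))  # → 589824 12288
```

That ~48× gap per matrix is why LoRA is the cheap option, and the rank r is exactly the expressiveness bottleneck CPT avoids.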

Why cosine schedule with warmup and not flat LR?

Warmup (the first ~10 steps linearly ramping from 0 to cpt_lr) prevents the Adam second-moment estimates from exploding before they've seen enough gradients to be stable. Cosine decay afterward cools the LR smoothly to near-zero over the remaining steps, which matters because CPT runs are short — you don't have room for a long plateau, and the cosine tail gives the loss a cleaner final floor than a step decay. The 150-step budget here is deliberately tight; on a real production CPT run you'd use the same shape but stretched across 10,000+ steps.
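The whole shape fits in a few lines of plain Python (the lab itself would typically use a library scheduler such as transformers' cosine-with-warmup helper; this standalone sketch just makes the geometry inspectable):

```python
import math

def cpt_lr_schedule(step, peak_lr=5e-5, warmup=10, total=150):
    """Linear warmup to peak_lr, then cosine decay to ~0 over the remaining steps."""
    if step < warmup:
        return peak_lr * step / warmup            # ramp protects Adam's moment estimates
    progress = (step - warmup) / (total - warmup)  # 0 → 1 across the decay phase
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

print(cpt_lr_schedule(0), cpt_lr_schedule(10), cpt_lr_schedule(150))
# step 0 → 0, step 10 → the 5e-5 peak, step 150 → the near-zero cosine tail
```

Stretching the same function to a production run only means changing `total` to 10,000+; the warmup fraction and the cosine tail keep their roles.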