Continued Pre-Training: Adapt a Pretrained LM to a New Domain
GPU sandbox · jupyter
Beta


Take GPT-2 and domain-adapt it to Python code in 150 steps, measuring both the gain on code and the cost in catastrophic forgetting on English. The exact recipe behind Code Llama, BloombergGPT, and every domain-specialized LLM of the last three years.

45 min · 4 steps · 2 domains · Advanced · ncp-genl · ncp-ads · nca-genl

What you'll learn

  1. Baseline: measure GPT-2 on code vs English
  2. Prepare the CPT corpus
  3. Continued pretraining run
  4. Measure the trade-off: domain gain vs general forgetting

Prerequisites

  • Completed basic LLM fine-tuning (Lab fine-tune-llm-lora or equivalent)
  • Comfortable computing perplexity and running a PyTorch training loop
  • Understanding of learning-rate schedules and catastrophic forgetting

Exam domains covered

LLM Fine-Tuning & Adaptation · GPU Acceleration & Distributed Training

Skills & technologies you'll practice

This advanced-level GPU lab gives you real-world reps across:

Continued Pretraining · DAPT · GPT-2 · Catastrophic Forgetting · HuggingFace · Perplexity · Learning Rate · Domain Adaptation

What you'll build in this continued pre-training lab

Continued pre-training (DAPT) is the recipe behind every domain-specialised LLM of the last three years — Code Llama, BloombergGPT, Galactica, Med-PaLM, and the wave of vertical models that ship with "built on Llama 3, adapted on 200B domain tokens" in the spec sheet. It's also the technique most engineers get wrong first, because it looks like fine-tuning but behaves very differently. In roughly 45 minutes on a real NVIDIA GPU we provision, you'll take vanilla GPT-2, domain-adapt it to Python code with a 150-step CPT run, and measure both what you gained (code perplexity drops >30%) and what you lost (general-English perplexity rises — the empirical face of catastrophic forgetting). You'll walk away with a concrete tradeoff curve and defensible answers for four production choices: pure CPT, replay mixing, LoRA on frozen weights, or post-hoc model souping.

The substance is the CPT-specific hyperparameter geometry. You'll run AdamW at cpt_lr = 5e-5 — roughly 8% of GPT-2's 6e-4 pretraining peak, because full-blast LR on already-converged weights blows up the loss in the first few steps — with a cosine schedule, ~10-step warmup, grad clip 1.0, and block size 256. You'll learn why documents are concatenated with eos_token_id separators into one long 1-D token stream (windows crossing document boundaries otherwise teach the model that the end of one Python file continues into the next), why a 98/2 train/val split on the code corpus is enough when both splits are well above 50k tokens, and why baseline_code_ppl often starts close to baseline_general_ppl on GPT-2 — WebText in 2019 contained substantial Stack Overflow and GitHub content, so what CPT is really doing is sharpening a code-aware token predictor into a code-specialised one. The forgetting_ratio = adapted_general_ppl / baseline_general_ppl is the signal you'd track over the whole CPT schedule in production and stop before it crosses your retention floor.
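That hyperparameter geometry can be sketched as a single training step on a toy model. This is an illustration only: the tiny embedding-plus-linear model, vocab of 100, and block of 16 are stand-ins for GPT-2's 50257-token vocab and block size 256, but the AdamW LR of 5e-5, the shifted next-token targets, and the grad clip at 1.0 are the lab's actual settings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, block = 100, 16  # toy stand-ins; the lab uses vocab 50257, block 256
model = nn.Sequential(nn.Embedding(vocab, 32), nn.Linear(32, vocab))
opt = torch.optim.AdamW(model.parameters(), lr=5e-5)  # ~8% of GPT-2's 6e-4 peak

tokens = torch.randint(0, vocab, (4, block + 1))   # a batch of contiguous windows
inputs, targets = tokens[:, :-1], tokens[:, 1:]    # shifted next-token targets

logits = model(inputs)                             # (4, block, vocab)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
loss.backward()
grad_norm = nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # grad clip 1.0
opt.step()
opt.zero_grad()
print(loss.item())
```

At initialisation the loss sits near ln(vocab), and the clipped step at 5e-5 nudges it down gently rather than bouncing it, which is exactly the behaviour the conservative CPT learning rate is buying.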

Prerequisites: comfort with computing perplexity in PyTorch, at least one basic fine-tune under your belt (the fine-tune-llm-lora lab is a fine warm-up), and knowing what a cosine schedule with warmup does. The sandbox preloads GPT-2, the tokenizer, codeparrot/codeparrot-clean, and wikitext-2-raw-v1, so no network downloads are needed. If you arrived here searching for "continued pretraining LLM tutorial", "domain-adaptive pretraining DAPT", "catastrophic forgetting LLM", "Code Llama training recipe", or "CPT vs LoRA for domain adaptation", the reflection and rubric explicitly walk through the production decision that sits behind each of those searches.

Frequently asked questions

Why is cpt_lr = 5e-5 so much smaller than GPT-2's original pretraining LR of 6e-4?

Because you're not starting from randomly-initialised weights — you're nudging a trained distribution. Full-blast pretraining LR on an already-converged model would blow up the loss in the first few steps, bouncing weights through sharp regions of the loss surface that the original training carefully cooled past. CPT convention is 1-10% of the original peak LR, with the exact fraction tuned to how aggressive a shift you want. 5e-5 lands in the middle of that band — aggressive enough to see a >30% code perplexity drop in 150 steps, gentle enough that the loss curve stays monotonic rather than spiking.

What is the forgetting_ratio actually measuring?

The relative degradation of the model's general-English next-token loss after CPT. If baseline_general_ppl = 40 and adapted_general_ppl = 56, forgetting_ratio = 1.4 — the adapted model is 40% worse at predicting wikitext tokens than the base. This number is the empirical face of catastrophic forgetting: by pointing the gradient only at code-distributed tokens for 150 steps, you let the general-text capability drift. In a real domain-LLM program you'd track this ratio over the CPT schedule and stop (or add replay) before it crosses your retention floor.
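In code the metric is a single division; using the illustrative numbers above (not measured values):

```python
def forgetting_ratio(baseline_general_ppl, adapted_general_ppl):
    """Relative general-English degradation after CPT (1.0 = no forgetting)."""
    return adapted_general_ppl / baseline_general_ppl

ratio = forgetting_ratio(40.0, 56.0)  # the FAQ's illustrative ppl values
print(ratio)  # → 1.4: the adapted model is 40% worse at predicting wikitext
```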

Why concatenate documents with eos_token_id between them instead of keeping them separate?

Because causal language modelling trains on contiguous 256-token windows with shifted targets, and the cleanest way to stream variable-length documents into that loop is one long 1-D token stream. The eos_token_id separator is a learned signal that tells the model "the previous context ends here" — without it, windows that cross document boundaries would teach the model that the end of one Python file and the start of the next are part of the same sequence. The .reshape(-1, block_size) pattern is the other half of the trick: after concatenation you can randomly sample windows in constant time.
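A minimal sketch of that packing step with toy token lists (GPT-2's real eos id is 50256; block_size is shrunk to 8 for readability, versus 256 in the lab):

```python
eos_token_id = 50256  # GPT-2's <|endoftext|> id
block_size = 8        # demo-sized; the lab uses 256

docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]  # three toy "tokenized documents"

# Concatenate into one long 1-D stream with eos separators between documents.
stream = []
for doc in docs:
    stream.extend(doc)
    stream.append(eos_token_id)

# Drop the ragged tail, then chop into fixed windows — the .reshape(-1, block_size) trick.
n = len(stream) // block_size * block_size
blocks = [stream[i:i + block_size] for i in range(0, n, block_size)]
print(blocks)  # → [[5, 6, 7, 50256, 8, 9, 50256, 10]]
```

Every window has the same length, so a training loop can sample any block index in constant time, and the eos tokens inside a window mark where one document ends and the next begins.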

Why does GPT-2's baseline code perplexity often start close to its baseline English perplexity?

Because GPT-2's training corpus (WebText — Reddit-outbound pages scraped in 2019) contained substantial code from Stack Overflow, GitHub gists linked in discussions, and tutorial sites. That means baseline_code_ppl often lands within 2× of baseline_general_ppl, sometimes lower. This is why the Step 1 checker refuses to assert a direction — it would fail for reasons that aren't your bug. The real signal is the Step 4 delta: CPT on Python drops adapted_code_ppl by 30%+ because you've concentrated training on code and pushed the output distribution toward its sharper regions.

When would I use CPT versus a LoRA adapter?

CPT when you want the base weights themselves to shift — Code Llama, BloombergGPT, medical-specialised LLMs all went this route. It's more expressive (no rank bottleneck), but it's heavier (updating every parameter) and causes forgetting you have to actively manage via replay or adapter averaging. LoRA when you want adaptation without touching the base — cheaper, lets you ship multiple domain adapters against one frozen model, and caps forgetting mechanically because the base weights literally can't change. In practice frontier teams often do CPT first on a large domain corpus, then LoRA on top for per-customer fine-tunes.
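The cost asymmetry is easy to make concrete. For a single d × d weight matrix, CPT updates all d² entries, while a rank-r LoRA adapter trains only two thin factors. The numbers below are illustrative (GPT-2-sized hidden dim, a commonly used rank), not settings from this lab:

```python
def full_update_params(d):
    """Parameters touched when CPT updates one d x d weight matrix."""
    return d * d

def lora_params(d, r):
    """A rank-r LoRA adapter adds two thin matrices: d x r and r x d."""
    return 2 * d * r

d, r = 768, 8  # illustrative: GPT-2's hidden dim, a common LoRA rank
print(full_update_params(d), lora_params(d, r))  # → 589824 12288
```

That ~48× gap per matrix is why LoRA is the cheap option, and the rank r is exactly the expressiveness bottleneck CPT avoids.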

Why cosine schedule with warmup and not flat LR?

Warmup (the first ~10 steps linearly ramping from 0 to cpt_lr) prevents the Adam second-moment estimates from exploding before they've seen enough gradients to be stable. Cosine decay afterward cools the LR smoothly to near-zero over the remaining steps, which matters because CPT runs are short — you don't have room for a long plateau, and the cosine tail gives the loss a cleaner final floor than a step decay. The 150-step budget here is deliberately tight; on a real production CPT run you'd use the same shape but stretched across 10,000+ steps.
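The whole shape fits in a few lines of plain Python (the lab itself would typically use a library scheduler such as transformers' cosine-with-warmup helper; this standalone sketch just makes the geometry inspectable):

```python
import math

def cpt_lr_schedule(step, peak_lr=5e-5, warmup=10, total=150):
    """Linear warmup to peak_lr, then cosine decay to ~0 over the remaining steps."""
    if step < warmup:
        return peak_lr * step / warmup            # ramp protects Adam's moment estimates
    progress = (step - warmup) / (total - warmup)  # 0 → 1 across the decay phase
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

print(cpt_lr_schedule(0), cpt_lr_schedule(10), cpt_lr_schedule(150))
# step 0 → 0, step 10 → the 5e-5 peak, step 150 → the near-zero cosine tail
```

Stretching the same function to a production run only means changing `total` to 10,000+; the warmup fraction and the cosine tail keep their roles.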