Train a Small Language Model from Scratch
GPU sandbox · Jupyter
Beta

Train a real GPT-style language model from zero on TinyStories: tokenize, wire up the optimizer and LR schedule, run the training loop with validation perplexity, and generate coherent text from your own weights. End-to-end pretraining in minutes on one GPU.

55 min · 4 steps · 3 domains · Advanced · ncp-genl · ncp-ads · nca-genl

What you'll learn

  1. Data pipeline
  2. Model + optimizer + LR schedule
  3. Training loop with validation + perplexity
  4. Generate + checkpoint

Prerequisites

  • Strong PyTorch fundamentals (nn.Module, autograd, training loops)
  • Familiarity with transformer architecture
  • Understanding of tokenization and cross-entropy loss

Exam domains covered

LLM Integration and Development · GPU Acceleration & Distributed Training · Experimentation

Skills & technologies you'll practice

This advanced-level GPU lab gives you real-world reps across:

Pretraining · Small Language Model · TinyStories · AdamW · LR Schedule · Perplexity · Checkpointing · Text Generation

What you'll build in this from-scratch SLM pretraining lab

Pretraining an LLM from random weights is the exercise that collapses the mystery of how frontier models actually come into existence. In 55 minutes you'll train a real GPT-style language model on TinyStories until it produces coherent English, starting from pure noise: the same phases, failure modes, and knobs as an 8B or 70B run, just compressed onto a single GPU in minutes instead of spanning a multi-node cluster for weeks. You walk away with three things: a working model whose weights you saved, reloaded, and regenerated from; a visceral feel for why a warmup-then-cosine LR schedule is non-negotiable (AdamW's β2=0.999 second-moment estimate needs hundreds of steps to stabilize, so slamming full LR at step 0 makes the loss diverge); and the scaling intuition for what specifically breaks when you extrapolate the same code to 7B: memory (AdamW state alone is 8 bytes/param), numerics (FP32 is too slow, FP16 loses precision at scale), and systems (a week-long run will crash, so resumable checkpointing matters more than the training loop).

The technical substance is the full pretraining anatomy at micro-scale:

  • Data: load TinyStories, tokenize with the GPT-2 BPE tokenizer, and concatenate the whole corpus into one 1-D token stream with eos_token_id marking document boundaries. Variable-length per-document batches waste attention compute on padding; flat-stream random-window sampling is dramatically simpler.
  • Model: a 2-20M parameter GPT (embed_dim ~128-256, 4-6 layers, 4-8 heads) is the sweet spot: small enough for a 100-step loop to finish in minutes, large enough to actually model TinyStories.
  • Optimizer: AdamW with weight decay plus get_cosine_schedule_with_warmup or a custom LambdaLR, the exact curve GPT-3 and Llama pretraining use, just compressed in wall time.
  • Training loop: sample random (batch, seq_len) windows, compute cross-entropy on the shifted targets, run loss.backward() → optimizer.step() → scheduler.step(), and every N steps freeze-evaluate on the held-out set, reporting perplexity as math.exp(val_loss), directly interpretable as the average effective branching factor per token.
  • Checkpoint discipline: torch.save(model.state_dict(), path) is necessary but not sufficient. The round-trip test (load into a freshly constructed model and verify generations still match) catches the mismatched config, wrong vocab size, and missing tied-embedding wiring that silently corrupt real pretraining runs.
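The loop can be sketched end-to-end with stand-in pieces. Here a toy two-layer model and a synthetic, fully predictable token stream replace the real GPT and TinyStories, and the helper name sample_batch is illustrative, not part of the lab's API:

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, seq_len, batch_size = 100, 16, 8
# Synthetic stand-in for the tokenized corpus: a flat 1-D token stream
# with a deterministic next-token pattern so the loss visibly drops.
train_tokens = torch.arange(5000) % vocab_size

# Toy stand-in for the GPT (the lab's model is a real multi-layer transformer)
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, 32),
    torch.nn.Linear(32, vocab_size),
)
opt = torch.optim.AdamW(model.parameters(), lr=3e-3, weight_decay=0.1)
warmup, total = 10, 100
sched = torch.optim.lr_scheduler.LambdaLR(
    opt,
    lambda s: min((s + 1) / warmup,                    # linear warmup...
                  0.5 * (1 + math.cos(math.pi * s / total))),  # ...then cosine
)

def sample_batch(tokens):
    # Random windows from the flat stream: inputs and shifted targets
    ix = torch.randint(0, len(tokens) - seq_len - 1, (batch_size,))
    rows = torch.stack([tokens[i : i + seq_len + 1] for i in ix])
    return rows[:, :-1], rows[:, 1:]

first_loss = None
for step in range(total):
    x, y = sample_batch(train_tokens)
    logits = model(x)                                  # (batch, seq, vocab)
    loss = F.cross_entropy(logits.reshape(-1, vocab_size), y.reshape(-1))
    if first_loss is None:
        first_loss = loss.item()
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()

final_loss = loss.item()
perplexity = math.exp(final_loss)   # effective branching factor per token
```

The same skeleton carries over unchanged to the real lab; only the model, the corpus, and the periodic held-out evaluation are more substantial.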

Prerequisites are strong PyTorch fundamentals (nn.Module, autograd, training loops), familiarity with transformer architecture (if you haven't built one by hand, the transformer-from-scratch lab is the direct prerequisite), and a working grip on tokenization plus cross-entropy loss. The sandbox is a real NVIDIA GPU pod we provision per session; TinyStories, the GPT-2 tokenizer, and PyTorch are pre-staged. Checks enforce kernel-resident state throughout:

  • train_tokens must be a 1-D tensor of ≥100k tokens with all IDs in vocab;
  • the model must land 0.5M-50M params on CUDA with an optimizer and LR schedule present;
  • training must log ≥20 steps with a >20% loss drop between the first-5 and last-5 averages, plus val_losses[-1] < val_losses[0];
  • the final sampled story_after must be ≥10 words at ≥60% alphabetic ratio, with checkpoint_ok set to True after a real save-reload-regenerate round-trip.

Frequently asked questions

Why train a small LM from scratch when Llama 3 exists?

Because the point is to understand the machinery, not to compete with frontier models. A 10M-parameter model on TinyStories is the smallest laboratory that still exhibits real pretraining dynamics: loss decreasing under a warmup-then-cosine schedule, validation perplexity bottoming out, checkpointing round-tripping cleanly, generation moving from token salad to recognisable English. Everything you learn here — sharded optimizer state, mixed precision, gradient accumulation, resumable checkpoints — scales structurally identically to an 8B run. You also get something no pretrained model gives you: full control over the vocab, architecture, and data, which matters when you want to prototype a new attention variant or a new data-mixing strategy.

Why a warmup-then-cosine LR schedule instead of flat or step decay?

Warmup exists because AdamW's running estimates of the first and second gradient moments need a few hundred steps to stabilize; slamming the full LR at step 1 produces wildly mis-scaled updates and the loss spikes. Linearly ramping the LR from 0 to peak over ~5-10% of training lets those estimates settle. Cosine decay afterward cools the LR smoothly to near zero over the remaining steps, which empirically reaches a lower final loss than step decay. The exact shape is borrowed straight from GPT-3 / Chinchilla / Llama pretraining: the same curve, just compressed in wall time.
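The whole shape fits in a few lines of plain Python; the warmup_steps and total_steps values below are illustrative, not the lab's exact settings:

```python
import math

def lr_factor(step, warmup_steps=50, total_steps=1000):
    """Multiplier on the peak LR: linear warmup, then cosine decay to ~0."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps               # ramp ~0 -> 1
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 -> 0

peak_lr = 3e-4
schedule = [peak_lr * lr_factor(s) for s in range(1000)]
```

Wrapped in torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor), this produces essentially the same curve as get_cosine_schedule_with_warmup.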

What does the checkpoint_ok round-trip test actually catch?

A surprising amount. It catches missing or mismatched config when you reconstruct the model (different n_layers, different n_heads, different embed_dim → shape mismatches on load). It catches tied-embedding bugs where the output projection was sharing weights with the input embedding and the save/load didn't preserve the tying. It catches dtype mismatches when you accidentally save an fp16 state dict and reload into an fp32 shell. It catches tokenizer drift when you saved the model but forgot to pin the tokenizer version. In production pretraining where a single run spans weeks and multiple node failures, this discipline is the difference between resumable and catastrophic.
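The round-trip discipline itself is short. build_model here is a hypothetical stand-in for reconstructing the lab's GPT from its saved config:

```python
import torch

torch.manual_seed(0)

def build_model(vocab=64, dim=16):
    # Stand-in constructor: in the real lab this rebuilds the GPT from the
    # exact config (n_layers, n_heads, embed_dim) that was saved with it.
    return torch.nn.Sequential(
        torch.nn.Embedding(vocab, dim),
        torch.nn.Linear(dim, vocab),
    )

model = build_model()
torch.save(model.state_dict(), "ckpt_roundtrip.pt")

# Rebuild from the same config, reload, and compare outputs element-for-element
reloaded = build_model()
reloaded.load_state_dict(torch.load("ckpt_roundtrip.pt"))

x = torch.randint(0, 64, (1, 8))
model.eval()
reloaded.eval()
with torch.no_grad():
    checkpoint_ok = torch.equal(model(x), reloaded(x))
```

A fuller version regenerates text from both models and diffs the strings, which is what also catches tokenizer drift.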

Why concatenate all documents into one long token stream instead of per-document batches?

Because variable-length document batches waste a lot of attention computation on padding, and batched random-window sampling from a flat token stream is dramatically simpler. You still mark document boundaries by inserting eos_token_id between them, which teaches the model that sequences terminate — without that token, windows that span document boundaries would incorrectly train the model that the end of one story continues into the next. train_tokens.reshape(-1, seq_len) or random-offset windowing are both common after that concatenation.
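In miniature, with hypothetical token lists standing in for tokenized TinyStories documents (the real GPT-2 eos_token_id is 50256; 0 is used here only to keep the toy vocab small):

```python
docs = [[5, 9, 2], [7, 7], [3, 1, 4, 1]]   # three "tokenized documents"
eos_token_id = 0

stream = []
for doc in docs:
    stream.extend(doc)
    stream.append(eos_token_id)             # mark the document boundary

print(stream)   # [5, 9, 2, 0, 7, 7, 0, 3, 1, 4, 1, 0]

# Non-overlapping windows after concatenation (random-offset windows work too)
seq_len = 4
windows = [stream[i : i + seq_len]
           for i in range(0, len(stream) - seq_len + 1, seq_len)]
```

No padding anywhere: every position in every window is a real training token.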

Why target 2-20M parameters specifically?

Because it's the sweet spot for this sandbox: small enough that a 100-step loop finishes in a couple of minutes on one GPU, large enough that TinyStories produces recognisable English rather than Markov-chain noise, and large enough that the perplexity delta across training (say from 40 to 8) is readable against validation variance. Below ~2M you don't have enough capacity to model TinyStories; above ~50M you start starving for training tokens relative to capacity, and Chinchilla-scaling pressure shows up as the loss curve flattening early.
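A rough back-of-the-envelope shows where that range comes from. Bias and LayerNorm terms are ignored, and the dimensions are illustrative picks from the ranges stated above, not the lab's fixed config:

```python
def gpt_params(vocab=50257, d=192, n_layers=5, tied_embeddings=True):
    embed = vocab * d                  # token embedding (doubles as LM head if tied)
    attn = 4 * d * d                   # q, k, v, and output projections
    mlp = 2 * d * (4 * d)              # up- and down-projection, 4x expansion
    total = embed + n_layers * (attn + mlp)
    if not tied_embeddings:
        total += vocab * d             # separate LM head
    return total

print(f"{gpt_params():,}")             # ~11.9M with the GPT-2 vocab
```

Note that the GPT-2 vocab's embedding table dominates at this scale; untying the head adds another ~9.6M parameters, which is one reason small models tie embeddings.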

What would break first if I tried to scale this exact code to 7B parameters?

Memory, in three places. AdamW optimizer state alone is 8 bytes per parameter (two FP32 accumulators), so 7B × 8 = 56 GB just for the optimizer — won't fit on a single consumer GPU. FP32 activations for a single training step are also 30+ GB. Even before those, the weights themselves at FP32 are 28 GB. The fixes are compound: shard the optimizer state with ZeRO / FSDP across multiple GPUs, run the model in bf16 to halve weight and activation memory, enable gradient checkpointing to trade compute for activation memory, and accumulate gradients over multiple micro-batches. The Step 4 reflection walks through memory / numerical / systems failure modes in detail.
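The arithmetic behind those numbers, in decimal GB to match the figures above:

```python
GB = 1e9
params = 7e9                                # 7B-parameter model

weights_fp32 = params * 4 / GB              # 4 bytes per FP32 weight
grads_fp32   = params * 4 / GB              # gradients mirror the weights
adamw_state  = params * 8 / GB              # two FP32 moment buffers per param

total = weights_fp32 + grads_fp32 + adamw_state
print(f"weights {weights_fp32:.0f} GB + grads {grads_fp32:.0f} GB "
      f"+ optimizer {adamw_state:.0f} GB = {total:.0f} GB before activations")
# weights 28 GB + grads 28 GB + optimizer 56 GB = 112 GB before activations
```

That 112 GB floor, before a single activation tensor is allocated, is why ZeRO/FSDP sharding and bf16 are the first levers at 7B scale.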