Train a Small Language Model from Scratch
Train a real GPT-style language model from zero on TinyStories: tokenize, wire up the optimizer and LR schedule, run the training loop with validation perplexity, and generate coherent text from your own weights. End-to-end pretraining in minutes on one GPU.
What you'll learn
1. Data pipeline
2. Model + optimizer + LR schedule
3. Training loop with validation + perplexity
4. Generate + checkpoint
Prerequisites
- Strong PyTorch fundamentals (nn.Module, autograd, training loops)
- Familiarity with transformer architecture
- Understanding of tokenization and cross-entropy loss
Skills & technologies you'll practice
This advanced-level GPU lab gives you real-world reps across data pipelines, transformer pretraining, AdamW and LR scheduling, validation perplexity, and checkpoint discipline.
What you'll build in this from-scratch SLM pretraining lab
Pretraining an LLM from random weights is the exercise that collapses the mystery of how frontier models actually come into existence. In 55 minutes you'll train a real GPT-style language model on TinyStories until it produces coherent English, starting from pure noise: the same phases, failure modes, and knobs as an 8B or 70B run, just compressed onto a single GPU in minutes instead of spanning a multi-node cluster for weeks. You walk away with three things. First, a working model whose weights you saved, reloaded, and regenerated from. Second, a visceral feel for why a warmup-then-cosine LR schedule is non-negotiable: AdamW's β2=0.999 second-moment estimate needs hundreds of steps to stabilize, so applying full LR at step 0 makes the loss diverge. Third, the scaling intuition for what specifically breaks when you extrapolate the same code to 7B: memory (AdamW optimizer state alone is 8 bytes per parameter), numerics (FP32 is too slow, FP16 loses precision at scale), and systems (a week-long run will crash, so resumable checkpointing matters more than the training loop itself).
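The warmup-then-cosine schedule described above can be sketched with a custom `LambdaLR`. The step counts and peak LR below are illustrative placeholders, and the `Linear` module stands in for the real GPT:

```python
import math
import torch

# Placeholder model and hyperparameters, just to attach the scheduler to.
model = torch.nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

warmup_steps, total_steps = 20, 100

def lr_lambda(step: int) -> float:
    # Linear warmup from 0 to the peak LR, then cosine decay back to 0.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```

Calling `scheduler.step()` once per optimizer step traces the ramp from zero (protecting AdamW's second-moment estimate early on) and then the cosine decay toward zero.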
The technical substance is the full pretraining anatomy at micro-scale:
- Data: load TinyStories, tokenize with the GPT-2 BPE tokenizer, and concatenate the whole corpus into one 1-D token stream with eos_token_id marking document boundaries. Variable-length per-document batches waste attention compute on padding; flat-stream random-window sampling is dramatically simpler.
- Model: a 2-20M parameter GPT (embed_dim ~128-256, 4-6 layers, 4-8 heads) is the sweet spot: small enough for a 100-step loop to finish in minutes, large enough to actually model TinyStories.
- Optimizer: AdamW with weight decay plus get_cosine_schedule_with_warmup or a custom LambdaLR, the same curve shape GPT-3 and Llama pretraining use, just compressed in wall time.
- Training loop: sample random (batch, seq_len) windows, take cross-entropy on the shifted targets, run loss.backward() → optimizer.step() → scheduler.step(), and every N steps freeze-evaluate on the held-out set, reporting perplexity as math.exp(val_loss), which reads directly as the average effective branching factor per token.
- Checkpoint discipline: torch.save(model.state_dict(), path) is necessary but not sufficient. The round-trip test (load into a freshly constructed model and verify generations still match) catches the mismatched config, wrong vocab size, and missing tied-embedding wiring that silently corrupt real pretraining runs.
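The loop can be sketched in a few lines. Everything here is a toy stand-in (random token IDs, an Embedding + Linear pair instead of the GPT, illustrative sizes) to show the shape of the window sampling, shifted-target cross-entropy, and perplexity computation:

```python
import math
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch_size = 100, 32, 8
train_tokens = torch.randint(0, vocab_size, (10_000,))  # flat 1-D token stream

# Placeholder for the GPT: any module mapping token IDs to logits works here.
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, 64),
    torch.nn.Linear(64, vocab_size),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

def sample_batch(tokens: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    # Random windows of seq_len + 1 tokens: inputs are [:-1], targets are [1:].
    starts = torch.randint(0, len(tokens) - seq_len - 1, (batch_size,))
    windows = torch.stack([tokens[s : s + seq_len + 1] for s in starts])
    return windows[:, :-1], windows[:, 1:]

for step in range(10):
    x, y = sample_batch(train_tokens)
    logits = model(x)  # (batch, seq_len, vocab_size)
    loss = F.cross_entropy(logits.reshape(-1, vocab_size), y.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # scheduler.step() goes here once an LR schedule is attached

perplexity = math.exp(loss.item())  # effective branching factor per token
```

On random tokens the perplexity hovers near the vocabulary size; on real TinyStories it drops steadily as the model learns.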
Prerequisites are strong PyTorch fundamentals (nn.Module, autograd, training loops), familiarity with transformer architecture (if you haven't built one by hand, the transformer-from-scratch lab is the direct prerequisite), and a working grip on tokenization plus cross-entropy loss. The sandbox is a real NVIDIA GPU pod we provision per session, with TinyStories, the GPT-2 tokenizer, and PyTorch pre-staged. Checks enforce kernel-resident state throughout:
- train_tokens must be a 1-D tensor of ≥100k tokens with all IDs in vocab;
- the model must land at 0.5M-50M params on CUDA, with an optimizer and LR schedule present;
- training must log ≥20 steps with a >20% loss drop between the first-5 and last-5 averages, plus val_losses[-1] < val_losses[0];
- the final sampled story_after must be ≥10 words at ≥60% alphabetic ratio, with checkpoint_ok set to True after a real save-reload-regenerate round-trip.
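The save-reload-regenerate round-trip can be sketched as follows. The tiny Embedding + Linear model and the `build_model` helper are hypothetical stand-ins for the lab's GPT and its config-driven constructor:

```python
import os
import tempfile
import torch

def build_model() -> torch.nn.Module:
    # Stand-in for rebuilding the GPT from its saved config; each call
    # gives fresh random weights, so equality below proves the load worked.
    return torch.nn.Sequential(
        torch.nn.Embedding(100, 32),
        torch.nn.Linear(32, 100),
    )

model = build_model()
path = os.path.join(tempfile.mkdtemp(), "ckpt.pt")
torch.save(model.state_dict(), path)

# Round trip: a freshly constructed model must accept the state dict
# (load_state_dict raises on any shape or key mismatch) and then produce
# identical outputs on the same input.
reloaded = build_model()
reloaded.load_state_dict(torch.load(path))
model.eval(); reloaded.eval()

x = torch.randint(0, 100, (1, 8))
with torch.no_grad():
    checkpoint_ok = torch.equal(model(x), reloaded(x))
```

In the real lab the comparison is on generated text rather than raw logits, but the principle is the same: the checkpoint only counts if a clean reconstruction reproduces the behavior.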
Frequently asked questions
Why train a small LM from scratch when Llama 3 exists?
Why a warmup-then-cosine LR schedule instead of flat or step decay?
What does the checkpoint_ok round-trip test actually catch?
It catches config mismatches between save and load (different n_layers, different n_heads, different embed_dim → shape mismatches on load). It catches tied-embedding bugs where the output projection was sharing weights with the input embedding and the save/load didn't preserve the tying. It catches dtype mismatches when you accidentally save an fp16 state dict and reload into an fp32 shell. It catches tokenizer drift when you saved the model but forgot to pin the tokenizer version. In production pretraining, where a single run spans weeks and multiple node failures, this discipline is the difference between a resumable run and a catastrophic one.
Why concatenate all documents into one long token stream instead of per-document batches?
Per-document batches waste attention compute on padding; a single flat stream keeps every training window full of real tokens. You concatenate the documents with eos_token_id between them, which teaches the model that sequences terminate; without that token, windows spanning document boundaries would incorrectly train the model that the end of one story continues into the next. After concatenation, train_tokens.reshape(-1, seq_len) and random-offset windowing are both common.
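A minimal sketch of that concatenation, assuming a Hugging Face-style tokenizer exposing .encode() and .eos_token_id. The StubTokenizer below (fake IDs, GPT-2's eos ID of 50256) just stands in for the real tokenizer:

```python
import torch

class StubTokenizer:
    # Hypothetical stand-in for the GPT-2 BPE tokenizer.
    eos_token_id = 50256
    def encode(self, text: str) -> list[int]:
        return [ord(c) % 1000 for c in text]  # fake IDs, just for shape

tokenizer = StubTokenizer()
docs = ["Once upon a time there was a dog.", "The little dog ran home."]

ids: list[int] = []
for doc in docs:
    ids.extend(tokenizer.encode(doc))
    ids.append(tokenizer.eos_token_id)  # mark each document boundary

# One flat 1-D stream, ready for reshape(-1, seq_len) or random windowing.
train_tokens = torch.tensor(ids, dtype=torch.long)
```

Every document ends at an eos token, so any window the sampler draws either stays inside one story or sees an explicit boundary marker between two.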