Train a Small Language Model from Scratch
GPU sandbox · Jupyter
Beta

Train a real GPT-style language model from zero on TinyStories: tokenize, wire up the optimizer and LR schedule, run the training loop with validation perplexity, and generate coherent text from your own weights. End-to-end pretraining in minutes on one GPU.

55 min · 4 steps · 3 domains · Advanced · ncp-genl · ncp-ads · nca-genl

What you'll learn

  1. Data pipeline
  2. Model + optimizer + LR schedule
  3. Training loop with validation + perplexity
  4. Generate + checkpoint

Prerequisites

  • Strong PyTorch fundamentals (nn.Module, autograd, training loops)
  • Familiarity with transformer architecture
  • Understanding of tokenization and cross-entropy loss

Exam domains covered

LLM Integration and Development · GPU Acceleration & Distributed Training · Experimentation

Skills & technologies you'll practice

This advanced-level GPU lab gives you real-world reps across:

Pretraining · Small Language Model · TinyStories · AdamW · LR Schedule · Perplexity · Checkpointing · Text Generation

What you'll build in this from-scratch SLM pretraining lab

Pretraining an LLM from random weights is the exercise that collapses the mystery of how frontier models actually come into existence. In 55 minutes you'll train a real GPT-style language model on TinyStories until it produces coherent English, starting from pure noise: the same phases, failure modes, and knobs as an 8B or 70B run, just compressed onto a single GPU in minutes instead of spanning a multi-node cluster for weeks. You walk away with three things: a working model whose weights you saved, reloaded, and regenerated from; a visceral feel for why a warmup-then-cosine LR schedule is non-negotiable (AdamW's β2=0.999 second-moment estimate needs hundreds of steps to stabilize, so slamming full LR at step 0 makes the loss diverge); and the scaling intuition for what specifically breaks when you extrapolate the same code to 7B: memory (AdamW state alone is 8 bytes/param), numerics (FP32 is too slow, FP16 loses precision at scale), and systems (a week-long run will crash, so resumable checkpointing matters more than the training loop).

The technical substance is the full pretraining anatomy at micro-scale:

  • Data: load TinyStories, tokenize with the GPT-2 BPE tokenizer, and concatenate the whole corpus into one 1-D token stream with eos_token_id marking document boundaries. Variable-length per-document batches waste attention compute on padding; flat-stream random-window sampling is dramatically simpler.
  • Model: a 2-20M parameter GPT (embed_dim ~128-256, 4-6 layers, 4-8 heads) is the sweet spot: small enough for a 100-step loop to finish in minutes, large enough to actually model TinyStories.
  • Optimizer: AdamW with weight decay plus get_cosine_schedule_with_warmup or a custom LambdaLR, the exact curve GPT-3 and Llama pretraining use, just compressed in wall time.
  • Training loop: sample random (batch, seq_len) windows, compute cross-entropy on the shifted targets, run loss.backward() → optimizer.step() → scheduler.step(), and every N steps freeze-evaluate on the held-out set, reporting perplexity as math.exp(val_loss), directly interpretable as the average effective branching factor per token.
  • Checkpoint discipline: torch.save(model.state_dict(), path) is necessary but not sufficient. The round-trip test (load into a freshly constructed model and verify generations still match) catches the mismatched config, wrong vocab size, and missing tied-embedding wiring that silently corrupt real pretraining runs.
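The loop can be sketched end-to-end with stand-in pieces. Here a toy two-layer model and a synthetic, fully predictable token stream replace the real GPT and TinyStories, and the helper name sample_batch is illustrative, not part of the lab's API:

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, seq_len, batch_size = 100, 16, 8
# Synthetic stand-in for the tokenized corpus: a flat 1-D token stream
# with a deterministic next-token pattern so the loss visibly drops.
train_tokens = torch.arange(5000) % vocab_size

# Toy stand-in for the GPT (the lab's model is a real multi-layer transformer)
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, 32),
    torch.nn.Linear(32, vocab_size),
)
opt = torch.optim.AdamW(model.parameters(), lr=3e-3, weight_decay=0.1)
warmup, total = 10, 100
sched = torch.optim.lr_scheduler.LambdaLR(
    opt,
    lambda s: min((s + 1) / warmup,                    # linear warmup...
                  0.5 * (1 + math.cos(math.pi * s / total))),  # ...then cosine
)

def sample_batch(tokens):
    # Random windows from the flat stream: inputs and shifted targets
    ix = torch.randint(0, len(tokens) - seq_len - 1, (batch_size,))
    rows = torch.stack([tokens[i : i + seq_len + 1] for i in ix])
    return rows[:, :-1], rows[:, 1:]

first_loss = None
for step in range(total):
    x, y = sample_batch(train_tokens)
    logits = model(x)                                  # (batch, seq, vocab)
    loss = F.cross_entropy(logits.reshape(-1, vocab_size), y.reshape(-1))
    if first_loss is None:
        first_loss = loss.item()
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()

final_loss = loss.item()
perplexity = math.exp(final_loss)   # effective branching factor per token
```

The same skeleton carries over unchanged to the real lab; only the model, the corpus, and the periodic held-out evaluation are more substantial.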

Prerequisites are strong PyTorch fundamentals (nn.Module, autograd, training loops), familiarity with transformer architecture (if you haven't built one by hand, the transformer-from-scratch lab is the direct prerequisite), and a working grip on tokenization plus cross-entropy loss. The sandbox is a real NVIDIA GPU pod we provision per session; TinyStories, the GPT-2 tokenizer, and PyTorch are pre-staged. Checks enforce kernel-resident state throughout:

  • train_tokens must be a 1-D tensor of ≥100k tokens with all IDs in vocab;
  • the model must land 0.5M-50M params on CUDA with an optimizer and LR schedule present;
  • training must log ≥20 steps with a >20% loss drop between the first-5 and last-5 averages, plus val_losses[-1] < val_losses[0];
  • the final sampled story_after must be ≥10 words at ≥60% alphabetic ratio, with checkpoint_ok set to True after a real save-reload-regenerate round-trip.

Frequently asked questions

Why train a small LM from scratch when Llama 3 exists?

Because the point is to understand the machinery, not to compete with frontier models. A 10M-parameter model on TinyStories is the smallest laboratory that still exhibits real pretraining dynamics: loss decreasing under a warmup-then-cosine schedule, validation perplexity bottoming out, checkpointing round-tripping cleanly, generation moving from token salad to recognisable English. Everything you learn here — sharded optimizer state, mixed precision, gradient accumulation, resumable checkpoints — scales structurally identically to an 8B run. You also get something no pretrained model gives you: full control over the vocab, architecture, and data, which matters when you want to prototype a new attention variant or a new data-mixing strategy.

Why a warmup-then-cosine LR schedule instead of flat or step decay?

Warmup exists because AdamW's running estimates of the first and second gradient moments need a few hundred steps to stabilize; slamming the full LR at step 1 produces wildly mis-scaled updates and the loss spikes. Linearly ramping the LR from 0 to peak over ~5-10% of training lets those estimates settle. Cosine decay afterward cools the LR smoothly to near zero over the remaining steps, which empirically reaches a lower final loss than step decay. The exact shape is borrowed straight from GPT-3 / Chinchilla / Llama pretraining: the same curve, just compressed in wall time.
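The whole shape fits in a few lines of plain Python; the warmup_steps and total_steps values below are illustrative, not the lab's exact settings:

```python
import math

def lr_factor(step, warmup_steps=50, total_steps=1000):
    """Multiplier on the peak LR: linear warmup, then cosine decay to ~0."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps               # ramp ~0 -> 1
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 -> 0

peak_lr = 3e-4
schedule = [peak_lr * lr_factor(s) for s in range(1000)]
```

Wrapped in torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor), this produces essentially the same curve as get_cosine_schedule_with_warmup.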

What does the checkpoint_ok round-trip test actually catch?

A surprising amount. It catches missing or mismatched config when you reconstruct the model (different n_layers, different n_heads, different embed_dim → shape mismatches on load). It catches tied-embedding bugs where the output projection was sharing weights with the input embedding and the save/load didn't preserve the tying. It catches dtype mismatches when you accidentally save an fp16 state dict and reload into an fp32 shell. It catches tokenizer drift when you saved the model but forgot to pin the tokenizer version. In production pretraining where a single run spans weeks and multiple node failures, this discipline is the difference between resumable and catastrophic.
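The round-trip discipline itself is short. build_model here is a hypothetical stand-in for reconstructing the lab's GPT from its saved config:

```python
import torch

torch.manual_seed(0)

def build_model(vocab=64, dim=16):
    # Stand-in constructor: in the real lab this rebuilds the GPT from the
    # exact config (n_layers, n_heads, embed_dim) that was saved with it.
    return torch.nn.Sequential(
        torch.nn.Embedding(vocab, dim),
        torch.nn.Linear(dim, vocab),
    )

model = build_model()
torch.save(model.state_dict(), "ckpt_roundtrip.pt")

# Rebuild from the same config, reload, and compare outputs element-for-element
reloaded = build_model()
reloaded.load_state_dict(torch.load("ckpt_roundtrip.pt"))

x = torch.randint(0, 64, (1, 8))
model.eval()
reloaded.eval()
with torch.no_grad():
    checkpoint_ok = torch.equal(model(x), reloaded(x))
```

A fuller version regenerates text from both models and diffs the strings, which is what also catches tokenizer drift.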

Why concatenate all documents into one long token stream instead of per-document batches?

Because variable-length document batches waste a lot of attention computation on padding, and batched random-window sampling from a flat token stream is dramatically simpler. You still mark document boundaries by inserting eos_token_id between them, which teaches the model that sequences terminate — without that token, windows that span document boundaries would incorrectly train the model that the end of one story continues into the next. train_tokens.reshape(-1, seq_len) or random-offset windowing are both common after that concatenation.
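In miniature, with hypothetical token lists standing in for tokenized TinyStories documents (the real GPT-2 eos_token_id is 50256; 0 is used here only to keep the toy vocab small):

```python
docs = [[5, 9, 2], [7, 7], [3, 1, 4, 1]]   # three "tokenized documents"
eos_token_id = 0

stream = []
for doc in docs:
    stream.extend(doc)
    stream.append(eos_token_id)             # mark the document boundary

print(stream)   # [5, 9, 2, 0, 7, 7, 0, 3, 1, 4, 1, 0]

# Non-overlapping windows after concatenation (random-offset windows work too)
seq_len = 4
windows = [stream[i : i + seq_len]
           for i in range(0, len(stream) - seq_len + 1, seq_len)]
```

No padding anywhere: every position in every window is a real training token.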

Why target 2-20M parameters specifically?

Because it's the sweet spot for this sandbox: small enough that a 100-step loop finishes in a couple of minutes on one GPU, large enough that TinyStories produces recognisable English rather than Markov-chain noise, and large enough that the perplexity delta across training (say from 40 to 8) is readable against validation variance. Below ~2M you don't have enough capacity to model TinyStories; above ~50M you start starving for training tokens relative to capacity, and Chinchilla-scaling pressure shows up as the loss curve flattening early.
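A rough back-of-the-envelope shows where that range comes from. Bias and LayerNorm terms are ignored, and the dimensions are illustrative picks from the ranges stated above, not the lab's fixed config:

```python
def gpt_params(vocab=50257, d=192, n_layers=5, tied_embeddings=True):
    embed = vocab * d                  # token embedding (doubles as LM head if tied)
    attn = 4 * d * d                   # q, k, v, and output projections
    mlp = 2 * d * (4 * d)              # up- and down-projection, 4x expansion
    total = embed + n_layers * (attn + mlp)
    if not tied_embeddings:
        total += vocab * d             # separate LM head
    return total

print(f"{gpt_params():,}")             # ~11.9M with the GPT-2 vocab
```

Note that the GPT-2 vocab's embedding table dominates at this scale; untying the head adds another ~9.6M parameters, which is one reason small models tie embeddings.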

What would break first if I tried to scale this exact code to 7B parameters?

Memory, in three places. AdamW optimizer state alone is 8 bytes per parameter (two FP32 accumulators), so 7B × 8 = 56 GB just for the optimizer — won't fit on a single consumer GPU. FP32 activations for a single training step are also 30+ GB. Even before those, the weights themselves at FP32 are 28 GB. The fixes are compound: shard the optimizer state with ZeRO / FSDP across multiple GPUs, run the model in bf16 to halve weight and activation memory, enable gradient checkpointing to trade compute for activation memory, and accumulate gradients over multiple micro-batches. The Step 4 reflection walks through memory / numerical / systems failure modes in detail.
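The arithmetic behind those numbers, in decimal GB to match the figures above:

```python
GB = 1e9
params = 7e9                                # 7B-parameter model

weights_fp32 = params * 4 / GB              # 4 bytes per FP32 weight
grads_fp32   = params * 4 / GB              # gradients mirror the weights
adamw_state  = params * 8 / GB              # two FP32 moment buffers per param

total = weights_fp32 + grads_fp32 + adamw_state
print(f"weights {weights_fp32:.0f} GB + grads {grads_fp32:.0f} GB "
      f"+ optimizer {adamw_state:.0f} GB = {total:.0f} GB before activations")
# weights 28 GB + grads 28 GB + optimizer 56 GB = 112 GB before activations
```

That 112 GB floor, before a single activation tensor is allocated, is why ZeRO/FSDP sharding and bf16 are the first levers at 7B scale.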