Data Preparation for LLM Training

Build a real pretraining/instruction data pipeline: load a raw corpus, apply quality filters, deduplicate, train a BPE tokenizer, and batch-validate on GPU. This is the unglamorous work that actually decides how good your model will be.

45 min · 5 steps · 3 domains · Intermediate · NCP-GENL · NCA-GENL · NCP-ADS

What you'll learn

  1. Load and explore the raw corpus
  2. Quality filter
  3. Exact-hash deduplication
  4. Train a BPE tokenizer
  5. Format for instruction tuning and validate on GPU

Prerequisites

  • Comfortable with Python lists, dicts, and basic string manipulation
  • Familiarity with Hugging Face datasets and tokenizers
  • Basic understanding of PyTorch tensors and GPU tensors

Exam domains covered

Data Analysis and Visualization · LLM Fundamentals & Architecture · Experimentation

Skills & technologies you'll practice

This intermediate-level GPU lab gives you real-world reps across:

Data Curation · Tokenizer · BPE · Deduplication · Quality Filtering · Instruction Tuning · Hugging Face

What you'll build in this LLM data-prep lab

Data prep is the unglamorous work that actually determines how good an LLM turns out: the Llama 3 paper spends more pages on filtering, dedup, and source mixing than on architecture, and DBRX's post-mortem attributed most of its quality gain to data-pipeline improvements rather than model changes. In about 45 minutes on a real NVIDIA GPU we provision, you'll build a working pretraining/instruction data pipeline end to end (load, filter, dedup, train a BPE tokenizer, batch-validate on GPU). You'll come out with concrete answers to the questions that actually matter in production: why exact-hash dedup is only a floor, why the tokenizer is the one decision you can never walk back, and which stage in your pipeline is silently dropping the samples you most wanted to keep.

Technically, you'll stream a 1,000+ sample Hugging Face corpus (wikitext, c4, or similar), apply a quality filter on minimum length and alphanumeric ratio (with a deliberately inverted inequality you have to fix, the exact mistake that has shipped to production at real shops), normalize and md5-hash for exact dedup while keeping the original text untouched, then train a real tokenizers.Tokenizer(models.BPE()) with a ByteLevel pre-tokenizer and BpeTrainer(vocab_size=8000) via train_from_iterator. The final step batches prompt/response pairs on CUDA with padding and attention masks, then runs a forward pass through GPT-2 to confirm the pipeline produces tensors a real model can consume without crashing. The pointed lesson is scale: exact-hash dedup leaves roughly 30-50% of effective duplicates in place on Common Crawl-type data, which is why production pipelines layer MinHash + LSH (shingle near-dupes), SimHash, or embedding-based semantic clustering on top. You'll also see why ByteLevel beats Whitespace or SentencePiece BPE for modern LLMs (every byte is representable, so OOV is impossible at inference), and why vocab size is fixed for the life of the model.
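The tokenizer step above can be sketched with the `tokenizers` API the lab names. The tiny in-memory corpus here is illustrative only; the lab trains on the streamed Hugging Face corpus instead:

```python
# Minimal sketch of the tokenizer-training step, assuming the `tokenizers`
# library; the repeated-sentence corpus is a stand-in for the real one.
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

corpus = ["the quick brown fox jumps over the lazy dog"] * 100

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

# vocab_size=8000 matches the lab; real LLMs use 32k-128k.
trainer = trainers.BpeTrainer(vocab_size=8000, special_tokens=["<|endoftext|>"])
tokenizer.train_from_iterator(iter(corpus), trainer=trainer)

ids = tokenizer.encode("the quick brown fox").ids
decoded = tokenizer.decode(ids)
```

With a corpus this small the learned vocab never reaches the 8,000 ceiling; the trainer simply runs out of pairs to merge, which is itself a useful thing to observe.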

Prerequisites: comfort with Python string handling, Hugging Face datasets and tokenizers, and basic PyTorch tensors. The sandbox has the HF libraries, the BPE trainer, and GPT-2 weights preinstalled. The grader is strict where it matters: it re-hashes the dedup output to prove all survivors are unique, confirms tokenization actually compresses versus a whitespace split, and enforces GPU-resident tensors with a realistic vocab dimension before you move on.

Frequently asked questions

Why not just use dataset.filter() and dataset.map() everywhere?

You can, and at production scale you should. The lab implements filter and dedup by hand so you see the rules — 100-char minimum, alphanumeric ratio, normalization before hashing — rather than burying them in a helper. Once you've written the plain-Python version, swapping to Arrow-backed datasets operations is a trivial speed optimization. Getting the rules wrong, on the other hand, silently drops the wrong samples and is invisible until your loss curve misbehaves two days into training.
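As a concrete reference point, the filter rules described above (100-char minimum, alphanumeric ratio) can be sketched in plain Python. The function names and the 0.6 ratio threshold are illustrative assumptions, not the lab's exact values:

```python
# Sketch of the lab's quality-filter rules; MIN_ALNUM_RATIO is an assumed
# threshold, and the deliberate bug in the lab inverts one comparison
# (e.g. `len(text) <= MIN_LEN`), silently keeping junk and dropping good text.
MIN_LEN = 100
MIN_ALNUM_RATIO = 0.6

def alnum_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are alphanumeric."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalnum() for c in chars) / len(chars)

def passes_quality_filter(text: str) -> bool:
    # Correct direction: keep long-enough, mostly-alphanumeric text.
    return len(text) >= MIN_LEN and alnum_ratio(text) >= MIN_ALNUM_RATIO
```

Writing it this way makes the failure mode visible: flip either `>=` to `<=` and the filter still runs without error, it just curates the corpus backwards.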

Why is exact-hash deduplication 'only a floor'?

Because it catches byte-identical strings and nothing else. At web scale, near-duplicates dominate: boilerplate headers and footers, templated product descriptions, mirrored articles with one-word edits, paraphrased Wikipedia. Production pipelines layer on MinHash + LSH (for shingle-based near-dupe detection), SimHash, or semantic dedup via embedding clustering. Exact-hash is still worth running first because it's cheap and removes the obvious case, but on its own it leaves roughly 30-50% of effective duplicates in place on Common Crawl-type corpora.
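The exact-hash floor itself is a few lines of standard-library Python. This sketch follows the pattern the lab describes (normalize only for hashing, keep the original text untouched); the helper names are illustrative:

```python
# Exact-hash dedup: normalize a copy for hashing, keep the original text.
import hashlib
import re

def normalize_for_hash(text: str) -> str:
    # Lowercase and collapse whitespace so trivial variants collide.
    return re.sub(r"\s+", " ", text.strip().lower())

def exact_dedup(docs: list[str]) -> list[str]:
    seen, unique = set(), []
    for doc in docs:
        h = hashlib.md5(normalize_for_hash(doc).encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(doc)  # first occurrence survives, original casing intact
    return unique

docs = ["Hello  World", "hello world", "a near-duplicate with one edit"]
survivors = exact_dedup(docs)  # the first two collapse after normalization
```

Note what it misses: "a near-duplicate with one edit" would also survive next to an almost-identical sibling, which is exactly the gap MinHash + LSH exists to close.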

Why a ByteLevel pre-tokenizer for BPE?

ByteLevel is what GPT-2 pioneered and what most modern open LLMs still use. It maps every byte to a printable Unicode character before BPE runs, which guarantees the tokenizer never sees an 'unknown' symbol at inference — any input is representable, because every input byte has a mapping. Alternatives (Whitespace pre-tokenizer, SentencePiece BPE, Unigram) handle OOV differently and change downstream behavior. ByteLevel gives you coverage + reasonable compression + compatibility with the GPT-2/Llama tokenizer family, which is why it's the default.
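The "every byte is representable" claim is easy to verify directly: ByteLevel's base alphabet has exactly 256 symbols, one per possible byte value, so the base vocabulary alone covers any input before a single merge is learned. A minimal check, assuming the `tokenizers` library:

```python
# ByteLevel maps each of the 256 possible byte values to a printable
# character, so no input string can fall outside the vocabulary.
from tokenizers import pre_tokenizers

alphabet = pre_tokenizers.ByteLevel.alphabet()
# 256 distinct symbols: the guarantee that OOV is impossible at inference.
```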

Why does tokenizer vocab size matter so much?

Because it's fixed for the life of the model — every later decision (embedding table size, token budget, maximum context cost, per-step compute) is downstream of it. Too small: the tokenizer splits common words into many pieces, inflating your sequence lengths and shrinking effective context. Too large: the embedding and output head balloon, wasting parameters on rare tokens. The 8,000 setting in this lab is small on purpose (fast to train, easy to inspect); real LLMs use 32k-128k depending on language coverage.
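The parameter cost is worth putting numbers on. Embedding table (and an untied output head) each scale as vocab_size × d_model; using GPT-2 small's width of 768 purely for scale:

```python
# Back-of-envelope cost of the vocab-size decision: embedding parameters
# scale linearly with vocab size, and an untied output head doubles it.
def embedding_params(vocab_size: int, d_model: int) -> int:
    return vocab_size * d_model

d_model = 768  # GPT-2 small hidden width, used here only for scale
small = embedding_params(8_000, d_model)    # this lab's vocab: 6,144,000
large = embedding_params(128_000, d_model)  # LLM-scale vocab: 98,304,000
# A 16x jump in embedding parameters, locked in for the life of the model.
```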

Why feed the instruction batch through GPT-2 at the end?

To prove the pipeline actually produces tensors a real model can consume without crashing. Step 5 checks that input_ids and attention_mask have matching shapes, live on CUDA, and that the forward pass returns logits with the right batch/seq/vocab dimensions. This is the smallest-possible end-to-end sanity check: a pipeline that processes ten million samples but fails to produce a valid batch tensor is worse than useless, and catching it at sample size 4 is cheap.
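The shape checks in that sanity pass can be sketched without downloading GPT-2: a randomly initialised embedding-plus-head stand-in (an assumption of this sketch, not the lab's model) exercises the same padding, attention-mask, and logits-dimension assertions, with a CPU fallback so it runs anywhere:

```python
# Shape-level sketch of the step-5 check, with a stand-in for GPT-2.
import torch

VOCAB = 50257  # GPT-2's vocab size, so the logits dim is realistic
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Variable-length token-id sequences, as the tokenizer would emit them.
seqs = [[10, 42, 7], [5, 99], [1, 2, 3, 4]]
max_len = max(len(s) for s in seqs)

input_ids = torch.zeros(len(seqs), max_len, dtype=torch.long)
attention_mask = torch.zeros(len(seqs), max_len, dtype=torch.long)
for i, s in enumerate(seqs):
    input_ids[i, : len(s)] = torch.tensor(s)
    attention_mask[i, : len(s)] = 1  # 1 = real token, 0 = padding

input_ids = input_ids.to(device)
attention_mask = attention_mask.to(device)

# Stand-in "model": embedding + linear head, enough to validate shapes.
model = torch.nn.Sequential(
    torch.nn.Embedding(VOCAB, 64), torch.nn.Linear(64, VOCAB)
).to(device)
logits = model(input_ids)

assert input_ids.shape == attention_mask.shape
assert logits.shape == (len(seqs), max_len, VOCAB)
```

Swapping the stand-in for real GPT-2 changes only the model line; the assertions, which are the actual point of step 5, stay identical.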

What's genuinely missing from this pipeline compared to Llama 3 / DBRX scale?

Language ID (keep English-only or build language-aware mixes), near-duplicate detection (MinHash/LSH), toxicity and PII scrubbing, source-weighted mixing (code, web, books at tuned ratios), per-source quality classifiers (the Llama 3 paper trained dedicated filters), and proper shuffling across shards. The reflection step pushes you to name which of these you'd build first and why — and the honest answer is usually 'near-duplicate dedup' because it's the biggest silent-quality win at scale.