Transformer Architecture Deep Dive


Build every piece of a decoder-only transformer by hand — scaled dot-product attention, multi-head attention, the full block with residuals and LayerNorm, then assemble a tiny GPT and train it. No shortcuts, no pre-built attention modules.

50 min · 4 steps · 3 domains · Advanced · NCP-GENL · NCA-GENL · NCP-ADS · NCA-GENM

What you'll learn

  1. Scaled dot-product attention
  2. Multi-Head Attention
  3. The Transformer Block
  4. Build a tiny GPT, train it, generate

Prerequisites

  • Strong PyTorch fundamentals and nn.Module
  • Linear algebra: matrix multiplication, softmax
  • Comfortable with attention intuition (Q, K, V)

Exam domains covered

LLM Integration and Development · GPU Acceleration & Distributed Training · Experimentation

Skills & technologies you'll practice

This advanced-level GPU lab gives you real-world reps across:

Transformer · Attention · Multi-Head Attention · Causal Mask · GPT · PyTorch · Residual Connections · LayerNorm

What you'll build in this transformer-from-scratch lab

Every frontier LLM in 2026 (Llama 3, Mistral, Qwen, GPT-4) is a decoder-only transformer plus a short list of named upgrades. If you can't build the vanilla version from scratch, those upgrades (RMSNorm, RoPE, SwiGLU, grouped-query attention) stay incantations. In 50 minutes you'll write every piece by hand, with no nn.MultiheadAttention or nn.TransformerEncoder shortcuts, train the result on a short repeating pattern until loss drops by 30%+ and it produces coherent output, and walk away with a concrete mental model of why each modern architectural choice exists: why pre-norm replaced post-norm starting with GPT-2, why sqrt(d_k) matters for gradient flow, why the causal mask goes before softmax rather than after, and why nn.Linear(D, 3*D) fuses the Q/K/V projections into one BLAS call.

The technical substance is the four pieces plus the geometry traps that catch everyone the first time:

  1. Scaled dot-product attention: Q @ K.transpose(-2, -1) / sqrt(d_k), then masked_fill(..., float('-inf')) above the diagonal, softmax(dim=-1), then multiply by V. The checker asserts triu(attention_weights, diagonal=1).abs().max() < 1e-6 to catch mask leakage, and row sums must equal 1.0 to catch a misplaced softmax.
  2. Multi-head attention reshapes Q/K/V to (B, n_heads, T, d_head) with d_head = D / n_heads and runs your step-1 attention per head in parallel.
  3. The GPT-style block uses pre-norm residuals (x = x + MHA(LN(x)), then x = x + MLP(LN(x))) with the MLP as Linear(D, 4D) → GELU → Linear(4D, D). The checker forces a backward pass to confirm gradient flows through the residual stream.
  4. The final tiny GPT stacks token + positional embeddings → N blocks → LayerNorm → Linear(D, vocab_size) and trains with AdamW + cross-entropy.

Once you've written it, the Llama-style swaps are surgical: RMSNorm drops LayerNorm's mean-centering to save a reduction per token, RoPE rotates Q and K in 2D pairs so position becomes relative and extrapolates past training length, SwiGLU adds a gated activation branch that learns richer features per parameter, and grouped-query attention shares K/V projections across query-head clusters to shrink the KV cache during inference.
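The step-1 recipe can be sketched end to end. This is a minimal standalone version whose shapes and tolerances mirror the checks described above; the function name `causal_attention` is illustrative, not the lab's scaffolding:

```python
import math
import torch

def causal_attention(q, k, v):
    # q, k, v: (B, T, d_k). Scale scores by sqrt(d_k), then mask above
    # the diagonal BEFORE softmax so future positions get exactly 0 weight.
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)            # (B, T, T)
    mask = torch.triu(torch.ones(scores.shape[-2:], dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float('-inf'))
    weights = torch.softmax(scores, dim=-1)                      # rows sum to 1
    return weights @ v, weights

B, T, D = 2, 8, 16
q, k, v = (torch.randn(B, T, D) for _ in range(3))
out, w = causal_attention(q, k, v)

# Upper triangle is exactly zero; every row is a proper distribution.
assert torch.triu(w, diagonal=1).abs().max() < 1e-6
assert torch.allclose(w.sum(dim=-1), torch.ones(B, T))
```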

Prerequisites are strong PyTorch fundamentals (nn.Module, autograd, broadcasting), linear-algebra comfort (matmul, softmax), and an intuition for Q/K/V. You don't need prior transformer-implementation experience; the hints walk through the exact tensor shapes. The sandbox is a real NVIDIA GPU pod we provision per session, with PyTorch preinstalled. Grading is mechanical and strict throughout:

  • Step 1: output shape (B=2, T=8, D=16), per-row attention sums equal to 1.0, and upper-triangular attention weights strictly zero.
  • Step 2: a MultiHeadAttention class and an mha instance must exist with n_heads > 1 and D % n_heads == 0.
  • Step 3: the full block must pass a smoke test of shape preservation plus a successful backward pass.
  • Step 4: training must log ≥20 steps with >30% loss reduction between the first-5 and last-5 averages, plus a non-empty sampled string.

Frequently asked questions

Why divide by sqrt(d_k) in scaled dot-product attention?

Because without it, the variance of the Q @ K.T logits grows linearly with d_k: each entry is a sum of d_k products of roughly unit-variance values, so typical magnitudes grow as sqrt(d_k). Large pre-softmax scores push softmax into saturation, where gradients effectively vanish for all but the largest entry; attention collapses to "pick one token, zero out everything else" and backprop can't learn. Dividing by sqrt(d_k) keeps the variance of the pre-softmax logits stable across head sizes, which keeps the softmax in its well-behaved region and the gradient flowing.
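The variance argument is easy to verify numerically. A quick sketch, assuming unit-variance Q and K (the 10,000-sample size is arbitrary):

```python
import math
import torch

torch.manual_seed(0)
d_k = 512
q = torch.randn(10_000, d_k)   # unit-variance query rows
k = torch.randn(10_000, d_k)   # unit-variance key rows
raw = (q * k).sum(-1)          # one dot-product logit per row
scaled = raw / math.sqrt(d_k)

# Raw logits have variance ~d_k; dividing by sqrt(d_k) restores variance ~1.
print(raw.var().item())        # ~512
print(scaled.var().item())     # ~1.0
```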

Why pre-norm (x + MHA(LN(x))) instead of post-norm LN(x + MHA(x))?

Because pre-norm has materially better gradient flow at depth. The original Transformer and GPT-1 used post-norm and needed careful warmup plus gradient clipping to train stably; GPT-2 moved LayerNorm to the input of each sub-block (pre-norm), and every major modern decoder-only architecture (GPT-3, Llama, Mistral, Qwen, Gemma) kept it. With pre-norm, the residual path is a clean identity stream that each block adds an update to, so gradients can flow through untouched from the final layer back to the embeddings. Post-norm interleaves normalization with the residual addition, which subtly attenuates gradient magnitudes per layer and gets worse the deeper you stack.
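A minimal sketch of the pre-norm wiring, with the attention sub-layer stubbed by a single linear so only the residual-stream topology is visible (`PreNormBlock` is an illustrative name, not the lab's scaffolding):

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    # Pre-norm residuals: x = x + attn(LN(x)); x = x + mlp(LN(x)).
    # The residual path never passes through a norm, so it stays an
    # identity stream from embeddings to logits.
    def __init__(self, d):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.attn = nn.Linear(d, d)  # stand-in for multi-head attention
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x

x = torch.randn(2, 8, 16, requires_grad=True)
y = PreNormBlock(16)(x)
y.sum().backward()                  # gradient reaches the input untouched
assert x.grad is not None
```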

Why apply the causal mask with -inf instead of just zeroing out the upper triangle after softmax?

Because softmax is non-linear: zeroing post-softmax leaves the remaining entries summing to less than 1, so each row is no longer a proper probability distribution, and the pre-softmax logits at masked positions still receive gradient through the softmax denominator during backprop. masked_fill(mask, float('-inf')) applied BEFORE the softmax sends those logits to a value that exponentiates to exactly 0, which softmax's denominator safely ignores, and the masked positions contribute no gradient because their softmax output is identically zero. The Step 1 checker enforces both invariants, triu(attention_weights, diagonal=1).abs().max() < 1e-6 and per-row sums equal to 1.0; masking post-softmax still zeroes the upper triangle but fails the row-sum check.
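The difference between the two orderings is easy to see numerically. A small sketch:

```python
import torch

torch.manual_seed(0)
scores = torch.randn(4, 4)
mask = torch.triu(torch.ones(4, 4, dtype=torch.bool), diagonal=1)

# Correct: -inf before softmax -> exact zeros, rows still sum to 1.
pre = torch.softmax(scores.masked_fill(mask, float('-inf')), dim=-1)

# Wrong: zeroing after softmax -> zeros, but rows no longer sum to 1.
post = torch.softmax(scores, dim=-1).masked_fill(mask, 0.0)

assert torch.allclose(pre.sum(-1), torch.ones(4))        # proper distributions
assert not torch.allclose(post.sum(-1), torch.ones(4))   # broken distributions
```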

Why is the MLP's inner dimension 4D?

Mostly empirical tradition carried from "Attention is All You Need" — 4× expansion gives the MLP enough capacity to do meaningful non-linear transformation per block without making the parameter count of the feedforward path dominate the model. Modern architectures revisit this: SwiGLU uses roughly 8D/3 with a gate, Llama 3 uses ~14336 for a 4096-dim model (also around 3.5D to offset the extra SwiGLU parameters), etc. The exact multiplier is a tuning knob, but 4× remains the default for vanilla GELU-based blocks.
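A back-of-envelope parameter count (biases ignored) shows how the 4× multiplier splits each block's budget:

```python
# Per-block parameter count at a Llama-scale width, biases ignored.
D = 4096
attn_params = 3 * D * D + D * D      # fused QKV projection + output projection
mlp_params = D * 4 * D + 4 * D * D   # up-projection and down-projection

# The 4x MLP holds twice the attention parameters, i.e. two thirds
# of each block's weights.
print(mlp_params / attn_params)      # 2.0
```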

Why use a single nn.Linear(D, 3*D) for Q/K/V instead of three separate linears?

Performance. A single linear projection fuses three matrix multiplications into one BLAS call, which is noticeably faster on GPU than three separate matmuls: the extra kernel-launch overhead disappears and the hardware gets one larger, better-utilized GEMM. The output is then sliced or chunked into Q/K/V along the last dimension. Functionally it's identical to three separate linears with the same weights; it's purely a compute-efficiency refactor, and every serious production implementation does it.
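A sketch of the equivalence, assuming no bias so the slices match three separate projections exactly:

```python
import torch
import torch.nn as nn

D = 16
fused = nn.Linear(D, 3 * D, bias=False)  # one GEMM for all three projections
x = torch.randn(2, 8, D)
q, k, v = fused(x).chunk(3, dim=-1)      # slice the fused output into Q/K/V

# Equivalent to three separate linears built from the corresponding weight rows.
wq, wk, wv = fused.weight.chunk(3, dim=0)
assert torch.allclose(q, x @ wq.T, atol=1e-5)
assert torch.allclose(k, x @ wk.T, atol=1e-5)
assert torch.allclose(v, x @ wv.T, atol=1e-5)
```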

What's the smallest architectural change that turns this from GPT-style into Llama-style?

Four surgical swaps, in order of how much they change: replace nn.LayerNorm with RMSNorm (drop mean-centering and bias, keep the scaling parameter), swap the learned positional embedding in the embedding lookup for RoPE rotation applied to Q and K inside multi-head attention, replace Linear(D, 4D) → GELU → Linear(4D, D) with SwiGLU (Linear(D, 2 * hidden) → split into gate and up, silu(gate) * up → Linear(hidden, D)), and switch vanilla MHA to grouped-query attention by projecting fewer K/V heads than Q heads and broadcasting. Each change is 10-30 lines of code against the lab's scaffolding, and each has a cleanly measurable win on the scaling-law / inference-cost axis it targets.
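The first of those swaps fits in a few lines. A minimal RMSNorm sketch (the eps value is illustrative):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    # LayerNorm minus the mean-centering and bias: rescale each token
    # vector by its root-mean-square, then apply a learned gain.
    def __init__(self, d, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d))
        self.eps = eps

    def forward(self, x):
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.weight

x = torch.randn(2, 8, 16)
y = RMSNorm(16)(x)
assert y.shape == x.shape   # drop-in replacement for nn.LayerNorm(16)
```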