Stable Diffusion + LoRA
GPU sandbox · jupyter
Beta

Load Stable Diffusion, attach LoRA adapters to the U-Net's attention layers, run a tiny overfit training loop, and generate with the adapted weights to prove that a few million trainable parameters actually move pixels.

45 min · 4 steps · 2 domains · Intermediate · NCA-GENM · NCP-GENL

What you'll learn

  1. Load SD, inspect components, generate baseline
  2. Attach LoRA adapters to the U-Net
  3. Tiny LoRA training loop (mechanics, not quality)
  4. Generate with the LoRA-adapted U-Net

Prerequisites

  • Comfortable with PyTorch and Hugging Face diffusers
  • Basic understanding of diffusion models (U-Net, VAE, text encoder)
  • Familiarity with PEFT / LoRA concepts at a conceptual level

Exam domains covered

Fine-Tuning & Customization · GPU Acceleration & Distributed Training

Skills & technologies you'll practice

This intermediate-level GPU lab gives you real-world reps across:

Stable Diffusion · LoRA · PEFT · U-Net · Diffusers · Fine-tuning · Text-to-Image · Adapters

What you'll build in this Stable Diffusion LoRA lab

Training a Stable Diffusion LoRA is the fastest way an engineer can actually ship a custom image model in 2026: the entire Civitai ecosystem, every branded style adapter, and every character LoRA you've downloaded off Hugging Face is this exact recipe. In roughly 45 minutes on a real NVIDIA GPU pod we provision, you'll walk away with rank-4 LoRA adapters injected into an SD 1.5 U-Net, a mental model of why style lives in cross-attention (text → image) while identity lives in self-attention, concrete numbers for how many parameters you're actually training (~1-2M of the ~860M U-Net), and a pixel-level diff that proves your adapters moved real pixels, not just that the loss dropped.

Technically, the lab targets the to_q, to_k, to_v, and to_out projections inside the U-Net's down-, mid-, and up-block attention modules using PEFT (LoraConfig + get_peft_model), runs a deliberately tiny overfit loop to make the mechanics legible in ~30 seconds of GPU time, then regenerates with the same prompt and seed so any pixel difference is causally attributable to the adapter weights. The deeper lesson is how to validate LoRA training: a mean-absolute-pixel-diff above 0.5 is a floor. It proves the adapter is wired, the optimizer is stepping, and the pipeline is loading adapted weights at generation time, but it is NOT a quality signal. Real style validation needs held-out prompts, CLIP score, and FID against a reference corpus; skipping that is exactly the trap engineers fall into when they ship a LoRA that looked great on the training prompt and mode-collapses everywhere else. You'll also see how to export the adapter (peft_model.save_pretrained) as a few-MB file that hot-swaps at generation time.
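
The mechanic PEFT wires into each of those projections can be sketched without any framework. This is a minimal, framework-free illustration (nested lists instead of tensors; `matvec` and `lora_forward` are hypothetical helper names, not PEFT APIs): the frozen weight W gets a low-rank update scaled by alpha/r, and because B is zero-initialized the adapter is a no-op until training moves it.

```python
import random

def matvec(W, x):
    """y = W @ x for a nested-list matrix W and vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def lora_forward(W, A, B, x, alpha=4, r=4):
    """LoRA-adapted projection: y = W x + (alpha / r) * B (A x).
    W stays frozen; only A (r x d_in) and B (d_out x r) would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d, r = 8, 4
W = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(d)]
A = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(r)]  # Gaussian init
B = [[0.0] * r for _ in range(d)]                                 # zero init
x = [random.gauss(0, 1) for _ in range(d)]

# With B all zeros, the adapted layer reproduces the base layer exactly.
assert lora_forward(W, A, B, x) == matvec(W, x)
```

This zero-init property is why attaching adapters never degrades the base model on step 0: the first generation after wiring is bit-identical to the baseline.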

Prerequisites: PyTorch fluency, rough familiarity with diffusers anatomy (U-Net, VAE, text encoder), and a conceptual grasp of what PEFT adapters do. The sandbox ships Stable Diffusion 1.5 weights, diffusers, peft, and accelerate preinstalled, so there's no download, no CUDA version pinning, and no pip install cascade. If you're Googling "train LoRA on Stable Diffusion", "PEFT diffusion fine-tuning", "LoRA rank vs alpha for diffusion", or "why isn't my LoRA changing the output", this is the lab: those answers are embedded in the grader and the reflection.

Frequently asked questions

Why rank=4 and why only attention layers?

Because SD 1.5's attention modules are where style and composition are shaped, and rank-4 low-rank updates are enough to bias those modules without destroying the base knowledge. Rank-4 on to_q, to_k, to_v, and to_out across the U-Net lands around 1-2M trainable parameters, which is under 0.3% of the ~860M total. Going higher (rank 16-32) is standard for character/identity adapters where you need more capacity; going lower (rank 1-2) is common for subtle style LoRAs. Putting LoRA on the full U-Net, including the conv layers, is an option but costs more parameters for a smaller marginal gain on style tasks.
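
The parameter count is simple arithmetic you can check yourself. The sketch below assumes an illustrative SD 1.5-style layout (block counts and hidden sizes are approximations, not an exact architecture dump): each adapted projection adds an A matrix of shape r × d_in and a B matrix of shape d_out × r.

```python
# Back-of-envelope trainable-parameter count for rank-4 LoRA on the
# attention projections of an SD 1.5-style U-Net. Block counts and
# hidden sizes below are illustrative assumptions.

R = 4            # LoRA rank
TEXT_DIM = 768   # CLIP text-embedding width feeding cross-attention k/v

def proj_params(d_in, d_out, r=R):
    """A LoRA pair adds A (r x d_in) plus B (d_out x r) parameters."""
    return r * d_in + d_out * r

def block_params(d, text_dim=TEXT_DIM):
    """One transformer block = self-attention + cross-attention,
    each with to_q, to_k, to_v, to_out projections."""
    self_attn = 4 * proj_params(d, d)
    cross_attn = (proj_params(d, d)               # to_q
                  + 2 * proj_params(text_dim, d)  # to_k, to_v take text embeddings
                  + proj_params(d, d))            # to_out
    return self_attn + cross_attn

# (hidden size, assumed number of attention blocks at that size)
layout = [(320, 5), (640, 5), (1280, 6)]
total = sum(n * block_params(d) for d, n in layout)

print(f"trainable LoRA params ≈ {total:,}")          # ≈ 797,184
print(f"fraction of 860M U-Net ≈ {total / 860e6:.3%}")
```

Under these assumptions the total lands just under a million parameters, a tiny fraction of a percent of the base U-Net, which is why doubling or quadrupling the rank is cheap.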

Why does the check require mean_abs_diff > 0.5 instead of asserting the image looks 'better'?

Because automatic "is this image better" is an unsolved problem, but "did the adapters actually modify the diffusion trajectory" is a trivially checkable signal. A mean absolute pixel diff of 0 — with the same prompt and same seed — means the adapters are either not wired, not active during generation, or being trained at learning_rate=0. Above 0.5 on a 0-255 scale means the optimizer moved weights, the modified attention was used, and the diffusion trajectory diverged. That's the floor; whether the output is actually good is what the Step 4 reflection is for.
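
The check itself is a one-liner worth internalizing. This sketch uses stand-in nested lists of 0-255 ints instead of real renders (`mean_abs_diff` is a hypothetical helper mirroring what a grader would compute, not a lab API):

```python
# Same prompt + same seed, so any pixel movement is attributable to the
# adapter. A diff of 0 means the adapter never touched the trajectory.

def mean_abs_diff(img_a, img_b):
    flat_a = [p for row in img_a for p in row]
    flat_b = [p for row in img_b for p in row]
    return sum(abs(a - b) for a, b in zip(flat_a, flat_b)) / len(flat_a)

baseline     = [[10, 10], [10, 10]]
adapted_noop = [[10, 10], [10, 10]]   # adapter not wired / lr = 0
adapted_real = [[12,  9], [11, 10]]   # weights actually moved

assert mean_abs_diff(baseline, adapted_noop) == 0.0   # failure mode
assert mean_abs_diff(baseline, adapted_real) > 0.5    # floor is cleared
```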

Why overfit on a single image? That's usually a bug.

In a production style-LoRA training job it absolutely is. Here it's deliberately the fastest way to see the LoRA mechanism move pixels in ~30 seconds of GPU time. A single-image overfit guarantees the gradient is large and consistent, which means even rank-4 adapters drift visibly within 10 steps. The Step 3 reflection and the rubric pointedly note that validating on the training prompt is a trap — it'll always look right because the adapter memorised it — which is the core methodological lesson.
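
Why a single sample drifts so fast can be seen in a toy loop. This is a deliberately minimal analogy, not the lab's training code: one "training image", one trainable scalar standing in for a LoRA weight, and plain gradient descent.

```python
# With one example the gradient is large and consistent every step, so
# even a tiny zero-initialized parameter drifts visibly within 10 steps.

x, y = 2.0, 6.0   # the one "training image" and its target
w = 0.0           # stand-in for a zero-initialized LoRA weight
lr = 0.05

losses = []
for _ in range(10):
    pred = w * x
    loss = (pred - y) ** 2
    grad = 2 * (pred - y) * x   # d(loss)/dw for the squared error
    w -= lr * grad
    losses.append(loss)

assert losses[-1] < losses[0]   # loss collapses on the memorized sample
```

The same dynamic means the training prompt will always render "correctly" after a few steps, which is exactly why it cannot serve as validation.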

What's the difference between a style LoRA and a character / identity LoRA?

Style LoRAs push the whole output distribution toward a look — cyberpunk neon, watercolor, vintage film — and can often be trained at rank 2-8 on cross-attention layers alone. Character / identity LoRAs need to encode specific facial geometry and clothing, which is higher-frequency information, so they typically want rank 16-64, training on self-attention modules too, and often add regularization images to prevent the whole model from collapsing onto that one identity. Dreambooth variants add a class-preservation loss on top of LoRA to keep the broader subject class intact.

Why use the same seed for baseline and adapted generation?

Because Stable Diffusion is deterministic given the initial noise, the scheduler, and the text embeddings. Fixing the seed freezes everything except the U-Net's weights, so any pixel difference between baseline_image and adapted_image is causally attributable to the LoRA adapters. Use a different seed and you can't distinguish "the LoRA changed the trajectory" from "we sampled a different starting noise" — the check would become noise-dominated.
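
The same logic holds for any seeded generator, which makes it easy to demonstrate without a GPU. In this sketch, `generate` is a hypothetical stand-in for the pipeline: seeded noise followed by a weight-dependent transform playing the role of the U-Net.

```python
import random

# Fixing the seed reproduces the exact "initial noise", so two runs can
# differ only if something else -- here, the weights -- changed.

def generate(seed, weight):
    rng = random.Random(seed)                 # frozen initial noise
    noise = [rng.gauss(0, 1) for _ in range(4)]
    return [weight * n for n in noise]        # stand-in for denoising

baseline     = generate(seed=42, weight=1.0)
same_weights = generate(seed=42, weight=1.0)
adapted      = generate(seed=42, weight=1.1)  # only the "U-Net" changed
other_seed   = generate(seed=7,  weight=1.0)

assert baseline == same_weights   # same seed + same weights: identical output
assert baseline != adapted        # difference attributable to the weights
assert baseline != other_seed     # different seed: the comparison is noise
```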

Can I save this LoRA and share it like the ones on Civitai or Hugging Face?

Yes — that's half of why the format exists. After training, peft_model.save_pretrained('./my-lora') writes only the adapter weights plus an adapter_config.json, typically a few MB. Anyone with the matching SD 1.5 base can load them via PeftModel.from_pretrained(...) or the diffusers pipe.load_lora_weights(...) helper. Most community LoRAs are a handful of megabytes and can be hot-swapped at generation time, which is why the ecosystem moved to LoRAs over full Dreambooth finetunes.
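
The "few MB" claim is just parameter count times bytes per parameter. The figures below are assumptions for illustration (an ~800k-parameter rank-4 attention-only adapter, fp16 storage), not measured file sizes:

```python
# Why a LoRA ships as a few megabytes while the base model is gigabytes:
# only the adapter weights are saved, not the 860M-parameter U-Net.

adapter_params = 800_000                 # assumed rank-4 attention-only count
bytes_fp16 = adapter_params * 2          # fp16 = 2 bytes per parameter
print(f"adapter ≈ {bytes_fp16 / 1e6:.1f} MB")        # ≈ 1.6 MB

base_unet_bytes = 860_000_000 * 2
print(f"base U-Net ≈ {base_unet_bytes / 1e9:.1f} GB")  # ≈ 1.7 GB
```

Three orders of magnitude smaller is what makes hot-swapping adapters at generation time practical, and why sharing sites host thousands of them against a handful of base checkpoints.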