Diffusion Model Training Process

Overview

Diffusion models are generative models that learn to reverse a gradual noising process. Training involves two processes: a fixed forward process that adds Gaussian noise to data over \( T \) timesteps until the data becomes pure noise, and a learned reverse process that denoises step by step to recover the original data.

The key insight: instead of learning to generate data from scratch, the model learns to predict and remove noise, a much simpler and more stable objective.

Architecture Components

  • Noise schedule \( \beta_1, \ldots, \beta_T \): Controls how much noise is added at each step. Typically a linear or cosine schedule with \( 0 < \beta_t \ll 1 \).
  • Denoising network \( \epsilon_\theta(x_t, t) \): Usually a U-Net conditioned on the timestep \( t \). Predicts the noise \( \epsilon \) that was added to \( x_0 \) to produce \( x_t \).
  • Timestep embedding: The timestep \( t \) is embedded (e.g. sinusoidally) and injected into the U-Net at each resolution level via adaptive group norm or cross-attention.
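
For concreteness, a minimal sinusoidal-embedding sketch in PyTorch (a standalone illustration; the function name is hypothetical, and the per-block projection layers that inject the embedding into the U-Net are omitted):

import math
import torch

def timestep_embedding(t, dim):
    # t: integer tensor of shape (batch,); returns embeddings of shape (batch, dim).
    # Frequencies follow the Transformer convention: 10000^(-i / half) for i = 0..half-1.
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t[:, None].float() * freqs[None, :]                     # (batch, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)   # (batch, dim)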

The Forward Process (Fixed)

Define \( \alpha_t = 1 - \beta_t \) and \( \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s \). The forward process adds noise at each step:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\, \sqrt{\alpha_t}\, x_{t-1},\, \beta_t I\right)$$

The key property is that we can sample \( x_t \) directly from \( x_0 \) in closed form:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\, \sqrt{\bar{\alpha}_t}\, x_0,\, (1 - \bar{\alpha}_t) I\right)$$

which means:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

This closed-form expression is what makes training efficient — we can jump to any timestep in one shot without stepping through all \( T \) steps.
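
As a quick sanity check of this property (a toy, standalone sketch using standard-normal "data", not a reference implementation), stepping through the forward process \( t \) times should produce the same marginal statistics as the one-shot formula:

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1 - betas, dim=0)

x0 = torch.randn(100_000)        # toy 1-D "dataset" with Var(x0) = 1
t = 500

# Sequential: apply q(x_s | x_{s-1}) for s = 1..t
x = x0.clone()
for s in range(t):
    x = torch.sqrt(1 - betas[s]) * x + torch.sqrt(betas[s]) * torch.randn_like(x)

# One-shot: q(x_t | x_0); the 0-indexed buffer stores ᾱ_t at index t-1
x_direct = torch.sqrt(alpha_bar[t - 1]) * x0 \
         + torch.sqrt(1 - alpha_bar[t - 1]) * torch.randn_like(x0)

print(x.std().item(), x_direct.std().item())   # both ≈ 1, since ᾱ_t·1 + (1 - ᾱ_t) = 1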

Training Objective

The full variational lower bound on \( \log p_\theta(x_0) \) simplifies (Ho et al., 2020) to a weighted sum of per-timestep denoising terms. In practice, the unweighted simplified loss is used:

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\!\left[\left\|\epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\; t\right)\right\|^2\right]$$

The model is trained to predict the noise \( \epsilon \) that was added to \( x_0 \) to produce the noisy sample \( x_t \). This is a standard mean-squared error loss.

Training Process

Step 1 — Sample a clean data point

Draw a sample from the training set:

$$x_0 \sim q(x_0)$$

Step 2 — Sample a random timestep

Draw a timestep uniformly:

$$t \sim \mathcal{U}(\{1, \ldots, T\})$$

Step 3 — Sample noise and create noisy input

Draw standard Gaussian noise and corrupt \( x_0 \) using the closed-form forward process:

$$\epsilon \sim \mathcal{N}(0, I)$$

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$

Step 4 — Forward pass through denoising network

Pass \( x_t \) and the timestep \( t \) through the U-Net:

$$\hat{\epsilon} = \epsilon_\theta(x_t, t)$$

The network receives both the noisy image and the timestep so it knows how much noise to expect.

Step 5 — Compute loss

$$\mathcal{L} = \|\epsilon - \hat{\epsilon}\|^2$$

Step 6 — Backpropagate and update

Standard gradient descent on the denoising network parameters \( \theta \), with learning rate \( \eta \) (written \( \eta \) to avoid a clash with the schedule's \( \alpha_t \)):

$$\theta \leftarrow \theta - \eta\, \nabla_\theta \mathcal{L}$$

Training Loop Summary

import torch
import torch.nn.functional as F

# Precompute schedule (linear; e.g. beta_start=1e-4, beta_end=0.02, T=1000)
betas = torch.linspace(beta_start, beta_end, T)
alpha_bar = torch.cumprod(1 - betas, dim=0)   # ᾱ, shape (T,)

for step in range(num_training_steps):
    x0 = sample_batch(dataset)                # clean images, shape (B, C, H, W)

    # 1. Random timestep per sample (0-indexed: alpha_bar[t] holds ᾱ_{t+1})
    t = torch.randint(0, T, (x0.shape[0],))

    # 2. Sample noise
    eps = torch.randn_like(x0)

    # 3. Create noisy input (closed-form); reshape ᾱ for broadcasting over C, H, W
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps

    # 4. Predict noise
    eps_hat = unet(x_t, t)

    # 5. Loss and update
    loss = F.mse_loss(eps_hat, eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Reverse Process (Sampling / Inference)

After training, generate new samples by starting from pure noise \( x_T \sim \mathcal{N}(0, I) \) and iteratively denoising:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z$$

where \( z \sim \mathcal{N}(0,I) \) for \( t > 1 \) and \( z = 0 \) for \( t = 1 \), and \( \sigma_t^2 = \beta_t \) (or a learned variance).

Sampling requires \( T \) sequential forward passes — typically 1000 steps for DDPM, or as few as 10–50 with accelerated samplers (DDIM, DPM-Solver).
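
A minimal ancestral-sampling sketch of this update, assuming the trained `unet` and the `betas` / `alpha_bar` buffers from the training loop above, with \( \sigma_t^2 = \beta_t \):

import torch

@torch.no_grad()
def sample(unet, betas, alpha_bar, shape):
    x = torch.randn(shape)                     # x_T ~ N(0, I)
    T = len(betas)
    for t in reversed(range(T)):               # 0-indexed, matching the training loop
        eps_hat = unet(x, torch.full((shape[0],), t))
        alpha_t = 1 - betas[t]
        # Posterior mean: subtract the predicted-noise term, then rescale
        mean = (x - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps_hat) / torch.sqrt(alpha_t)
        if t > 0:
            x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)   # σ_t² = β_t
        else:
            x = mean                           # z = 0 at the final step
    return x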

Noise Schedules

  • Linear schedule (Ho et al., 2020): \( \beta_t \) increases linearly from \( 10^{-4} \) to \( 0.02 \). Works well for high-resolution natural images, but destroys information too quickly at low resolutions: the final forward steps are already near-pure noise and contribute little to sample quality.
  • Cosine schedule (Nichol & Dhariwal, 2021): \( \bar{\alpha}_t = f(t)/f(0) \) with \( f(t) = \cos^2\!\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right) \). Noise is added more slowly near \( t=0 \) and \( t=T \), improving training stability and sample quality.
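
A sketch of the cosine schedule as constructed in Nichol & Dhariwal, deriving each \( \beta_t \) from the ratio of consecutive \( \bar{\alpha} \) values (the 0.999 clamp follows the paper; the helper name is hypothetical):

import math
import torch

def cosine_schedule(T, s=0.008):
    # f(t) = cos²(((t/T + s) / (1 + s)) · π/2), with ᾱ_t = f(t) / f(0)
    t = torch.arange(T + 1)
    f = torch.cos(((t / T + s) / (1 + s)) * math.pi / 2) ** 2
    alpha_bar = f / f[0]
    # β_t = 1 - ᾱ_t / ᾱ_{t-1}, clamped to avoid a singular final step
    return (1 - alpha_bar[1:] / alpha_bar[:-1]).clamp(max=0.999)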

Conditional Generation

To condition generation on a label \( y \) (e.g. class label, text embedding):

  • Classifier guidance: Train a noisy classifier \( p_\phi(y \mid x_t) \) separately. At sampling, add \( \nabla_{x_t} \log p_\phi(y \mid x_t) \) to the score — no retraining of the diffusion model needed.
  • Classifier-free guidance (Ho & Salimans, 2021): Train a single network jointly as conditional \( \epsilon_\theta(x_t, t, y) \) and unconditional \( \epsilon_\theta(x_t, t, \varnothing) \) by randomly dropping the condition. At sampling, interpolate:
    $$\hat{\epsilon} = \epsilon_\theta(x_t, t, \varnothing) + w \cdot \left(\epsilon_\theta(x_t, t, y) - \epsilon_\theta(x_t, t, \varnothing)\right)$$
    where \( w \) is the guidance scale: \( w = 1 \) recovers the plain conditional model, and \( w > 1 \) extrapolates past it. Higher \( w \) increases adherence to the condition at the cost of diversity. This is the approach used in Stable Diffusion, DALL·E 2, and Imagen.
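
A minimal sketch of the guided prediction at a single sampling step (the null token `y_null` and the three-argument `unet(x, t, y)` signature are assumptions about the model interface):

def guided_eps(unet, x_t, t, y, y_null, w):
    # Extrapolate from the unconditional prediction toward the conditional one
    eps_uncond = unet(x_t, t, y_null)   # ε_θ(x_t, t, ∅)
    eps_cond = unet(x_t, t, y)          # ε_θ(x_t, t, y)
    return eps_uncond + w * (eps_cond - eps_uncond)

In practice the two forward passes are usually batched into a single call by stacking the conditional and unconditional inputs.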

Key Insights

  • Simple loss, powerful model: The denoising MSE objective is far more stable to optimize than GAN adversarial training, and it sidesteps the posterior collapse that afflicts VAEs.
  • Random timestep sampling: Each training step uses a different noise level, so the model learns to denoise across the entire spectrum in one training run.
  • Closed-form noising is the efficiency trick: \( q(x_t \mid x_0) \) in closed form means no sequential forward steps during training.
  • Slow sampling is the main cost: Generating one sample requires hundreds of sequential network passes. Accelerated samplers (DDIM, DPM-Solver) reduce this significantly.
  • Latent diffusion (LDM): Stable Diffusion runs the diffusion process in a compressed latent space (via a pretrained VAE encoder/decoder), drastically reducing compute while maintaining quality; see the sketch after this list.
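
A sketch of how the training step changes for latent diffusion (the `vae.encode` method and `scale` constant are assumptions about the autoencoder interface; everything else is the training loop above, applied to latents):

with torch.no_grad():
    z0 = vae.encode(x0) * scale        # frozen, pretrained VAE encoder
eps = torch.randn_like(z0)
ab = alpha_bar[t].view(-1, 1, 1, 1)
z_t = ab.sqrt() * z0 + (1 - ab).sqrt() * eps
loss = F.mse_loss(unet(z_t, t), eps)   # same ε-prediction loss, on latents

At sampling time, the generated latent is decoded back to pixel space with the VAE decoder.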

Common Issues & Solutions

Problem: Training instability or NaN losses
Solution: Gradient clipping (see the sketch after this list); use a cosine schedule instead of linear; ensure \( \bar{\alpha}_T \approx 0 \)
Problem: Blurry or low-quality samples
Solution: Increase \( T \); use cosine schedule; add more U-Net capacity (channels, attention layers)
Problem: Slow sampling
Solution: Use DDIM (deterministic, 50 steps) or DPM-Solver (as few as 10–20 steps) at inference — no retraining required
Problem: Generated samples ignore or only weakly follow the conditioning
Solution: Increase guidance scale \( w \); ensure conditioning dropout rate during training is ~10–20%
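
For the gradient-clipping fix, a short PyTorch sketch (the `max_norm=1.0` value is a common default, not a universal rule):

loss.backward()
torch.nn.utils.clip_grad_norm_(unet.parameters(), max_norm=1.0)   # clip before stepping
optimizer.step()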

References

  • Ho et al. (2020): "Denoising Diffusion Probabilistic Models" (DDPM)
  • Song et al. (2021): "Score-Based Generative Modeling through Stochastic Differential Equations"
  • Nichol & Dhariwal (2021): "Improved Denoising Diffusion Probabilistic Models"
  • Ho & Salimans (2021): "Classifier-Free Diffusion Guidance"
  • Song et al. (2021): "Denoising Diffusion Implicit Models" (DDIM)
  • Rombach et al. (2022): "High-Resolution Image Synthesis with Latent Diffusion Models" (Stable Diffusion)