Diffusion Model Training Process

Overview

Diffusion models are generative models that learn to reverse a gradual noising process. Training involves two processes: a fixed forward process that adds Gaussian noise to data over \( T \) timesteps until the data becomes pure noise, and a learned reverse process that denoises step by step to recover the original data.

The key insight: instead of learning to generate data from scratch, the model learns to predict and remove noise, a much simpler and more stable objective.

Architecture Components

  • Noise schedule \( \beta_1, \ldots, \beta_T \): Controls how much noise is added at each step. Typically a linear or cosine schedule with \( 0 < \beta_t \ll 1 \).
  • Denoising network \( \epsilon_\theta(x_t, t) \): Usually a U-Net conditioned on the timestep \( t \). Predicts the noise \( \epsilon \) that was added to \( x_0 \) to produce \( x_t \).
  • Timestep embedding: The timestep \( t \) is embedded (e.g. sinusoidally) and injected into the U-Net at each resolution level via adaptive group norm or cross-attention.
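
For concreteness, a minimal sinusoidal-embedding sketch in PyTorch (a standalone illustration; the function name is hypothetical, and the per-block projection layers that inject the embedding into the U-Net are omitted):

import math
import torch

def timestep_embedding(t, dim):
    # t: integer tensor of shape (batch,); returns embeddings of shape (batch, dim).
    # Frequencies follow the Transformer convention: 10000^(-i / half) for i = 0..half-1.
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t[:, None].float() * freqs[None, :]                     # (batch, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)   # (batch, dim)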

The Forward Process (Fixed)

Define \( \alpha_t = 1 - \beta_t \) and \( \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s \). The forward process adds noise at each step:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\, \sqrt{\alpha_t}\, x_{t-1},\, \beta_t I\right)$$

The key property is that we can sample \( x_t \) directly from \( x_0 \) in closed form:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\, \sqrt{\bar{\alpha}_t}\, x_0,\, (1 - \bar{\alpha}_t) I\right)$$

which means:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

This closed-form expression is what makes training efficient — we can jump to any timestep in one shot without stepping through all \( T \) steps.
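
As a quick sanity check of this property (a toy, standalone sketch using standard-normal "data", not a reference implementation), stepping through the forward process \( t \) times should produce the same marginal statistics as the one-shot formula:

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1 - betas, dim=0)

x0 = torch.randn(100_000)        # toy 1-D "dataset" with Var(x0) = 1
t = 500

# Sequential: apply q(x_s | x_{s-1}) for s = 1..t
x = x0.clone()
for s in range(t):
    x = torch.sqrt(1 - betas[s]) * x + torch.sqrt(betas[s]) * torch.randn_like(x)

# One-shot: q(x_t | x_0); the 0-indexed buffer stores ᾱ_t at index t-1
x_direct = torch.sqrt(alpha_bar[t - 1]) * x0 \
         + torch.sqrt(1 - alpha_bar[t - 1]) * torch.randn_like(x0)

print(x.std().item(), x_direct.std().item())   # both ≈ 1, since ᾱ_t·1 + (1 - ᾱ_t) = 1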

Training Objective

The full variational lower bound on \( \log p_\theta(x_0) \) simplifies (Ho et al., 2020) to a weighted sum of per-timestep denoising terms. In practice, the unweighted simplified loss is used:

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\!\left[\left\|\epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\; t\right)\right\|^2\right]$$

The model is trained to predict the noise \( \epsilon \) that was added to \( x_0 \) to produce the noisy sample \( x_t \). This is a standard mean-squared error loss.

Training Process

Step 1 — Sample a clean data point

Draw a sample from the training set:

$$x_0 \sim q(x_0)$$

Step 2 — Sample a random timestep

Draw a timestep uniformly:

$$t \sim \mathcal{U}(\{1, \ldots, T\})$$

Step 3 — Sample noise and create noisy input

Draw standard Gaussian noise and corrupt \( x_0 \) using the closed-form forward process:

$$\epsilon \sim \mathcal{N}(0, I)$$

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$

Step 4 — Forward pass through denoising network

Pass \( x_t \) and the timestep \( t \) through the U-Net:

$$\hat{\epsilon} = \epsilon_\theta(x_t, t)$$

The network receives both the noisy image and the timestep so it knows how much noise to expect.

Step 5 — Compute loss

$$\mathcal{L} = \|\epsilon - \hat{\epsilon}\|^2$$

Step 6 — Backpropagate and update

Standard gradient descent on the denoising network parameters \( \theta \), with learning rate \( \eta \) (written \( \eta \) to avoid a clash with the schedule's \( \alpha_t \)):

$$\theta \leftarrow \theta - \eta\, \nabla_\theta \mathcal{L}$$

Training Loop Summary

import torch
import torch.nn.functional as F

# Precompute schedule (linear; e.g. beta_start=1e-4, beta_end=0.02, T=1000)
betas = torch.linspace(beta_start, beta_end, T)
alpha_bar = torch.cumprod(1 - betas, dim=0)   # ᾱ, shape (T,)

for step in range(num_training_steps):
    x0 = sample_batch(dataset)                # clean images, shape (B, C, H, W)

    # 1. Random timestep per sample (0-indexed: alpha_bar[t] holds ᾱ_{t+1})
    t = torch.randint(0, T, (x0.shape[0],))

    # 2. Sample noise
    eps = torch.randn_like(x0)

    # 3. Create noisy input (closed-form); reshape ᾱ for broadcasting over C, H, W
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps

    # 4. Predict noise
    eps_hat = unet(x_t, t)

    # 5. Loss and update
    loss = F.mse_loss(eps_hat, eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Reverse Process (Sampling / Inference)

After training, generate new samples by starting from pure noise \( x_T \sim \mathcal{N}(0, I) \) and iteratively denoising:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z$$

where \( z \sim \mathcal{N}(0,I) \) for \( t > 1 \) and \( z = 0 \) for \( t = 1 \), and \( \sigma_t^2 = \beta_t \) (or a learned variance).

Sampling requires \( T \) sequential forward passes — typically 1000 steps for DDPM, or as few as 10–50 with accelerated samplers (DDIM, DPM-Solver).
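
A minimal ancestral-sampling sketch of this update, assuming the trained `unet` and the `betas` / `alpha_bar` buffers from the training loop above, with \( \sigma_t^2 = \beta_t \):

import torch

@torch.no_grad()
def sample(unet, betas, alpha_bar, shape):
    x = torch.randn(shape)                     # x_T ~ N(0, I)
    T = len(betas)
    for t in reversed(range(T)):               # 0-indexed, matching the training loop
        eps_hat = unet(x, torch.full((shape[0],), t))
        alpha_t = 1 - betas[t]
        # Posterior mean: subtract the predicted-noise term, then rescale
        mean = (x - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps_hat) / torch.sqrt(alpha_t)
        if t > 0:
            x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)   # σ_t² = β_t
        else:
            x = mean                           # z = 0 at the final step
    return x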

Noise Schedules

  • Linear schedule (Ho et al., 2020): \( \beta_t \) increases linearly from \( 10^{-4} \) to \( 0.02 \). Works well for high-resolution natural images, but destroys information too quickly at low resolutions: the final forward steps are already near-pure noise and contribute little to sample quality.
  • Cosine schedule (Nichol & Dhariwal, 2021): \( \bar{\alpha}_t = f(t)/f(0) \) with \( f(t) = \cos^2\!\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right) \). Noise is added more slowly near \( t=0 \) and \( t=T \), improving training stability and sample quality.
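
A sketch of the cosine schedule as constructed in Nichol & Dhariwal, deriving each \( \beta_t \) from the ratio of consecutive \( \bar{\alpha} \) values (the 0.999 clamp follows the paper; the helper name is hypothetical):

import math
import torch

def cosine_schedule(T, s=0.008):
    # f(t) = cos²(((t/T + s) / (1 + s)) · π/2), with ᾱ_t = f(t) / f(0)
    t = torch.arange(T + 1)
    f = torch.cos(((t / T + s) / (1 + s)) * math.pi / 2) ** 2
    alpha_bar = f / f[0]
    # β_t = 1 - ᾱ_t / ᾱ_{t-1}, clamped to avoid a singular final step
    return (1 - alpha_bar[1:] / alpha_bar[:-1]).clamp(max=0.999)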

Conditional Generation

To condition generation on a label \( y \) (e.g. class label, text embedding):

  • Classifier guidance: Train a noisy classifier \( p_\phi(y \mid x_t) \) separately. At sampling, add \( \nabla_{x_t} \log p_\phi(y \mid x_t) \) to the score — no retraining of the diffusion model needed.
  • Classifier-free guidance (Ho & Salimans, 2021): Train a single network jointly as conditional \( \epsilon_\theta(x_t, t, y) \) and unconditional \( \epsilon_\theta(x_t, t, \varnothing) \) by randomly dropping the condition. At sampling, interpolate:
    $$\hat{\epsilon} = \epsilon_\theta(x_t, t, \varnothing) + w \cdot \left(\epsilon_\theta(x_t, t, y) - \epsilon_\theta(x_t, t, \varnothing)\right)$$
    where \( w \) is the guidance scale: \( w = 1 \) recovers the plain conditional model, and \( w > 1 \) extrapolates past it. Higher \( w \) increases adherence to the condition at the cost of diversity. This is the approach used in Stable Diffusion, DALL·E 2, and Imagen.
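
A minimal sketch of the guided prediction at a single sampling step (the null token `y_null` and the three-argument `unet(x, t, y)` signature are assumptions about the model interface):

def guided_eps(unet, x_t, t, y, y_null, w):
    # Extrapolate from the unconditional prediction toward the conditional one
    eps_uncond = unet(x_t, t, y_null)   # ε_θ(x_t, t, ∅)
    eps_cond = unet(x_t, t, y)          # ε_θ(x_t, t, y)
    return eps_uncond + w * (eps_cond - eps_uncond)

In practice the two forward passes are usually batched into a single call by stacking the conditional and unconditional inputs.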

Key Insights

  • Simple loss, powerful model: The denoising MSE objective is far more stable to optimize than GAN adversarial training, and it sidesteps the posterior collapse that afflicts VAEs.
  • Random timestep sampling: Each training step uses a different noise level, so the model learns to denoise across the entire spectrum in one training run.
  • Closed-form noising is the efficiency trick: \( q(x_t \mid x_0) \) in closed form means no sequential forward steps during training.
  • Slow sampling is the main cost: Generating one sample requires hundreds of sequential network passes. Accelerated samplers (DDIM, DPM-Solver) reduce this significantly.
  • Latent diffusion (LDM): Stable Diffusion runs the diffusion process in a compressed latent space (via a pretrained VAE encoder/decoder), drastically reducing compute while maintaining quality; see the sketch after this list.
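
A sketch of how the training step changes for latent diffusion (the `vae.encode` method and `scale` constant are assumptions about the autoencoder interface; everything else is the training loop above, applied to latents):

with torch.no_grad():
    z0 = vae.encode(x0) * scale        # frozen, pretrained VAE encoder
eps = torch.randn_like(z0)
ab = alpha_bar[t].view(-1, 1, 1, 1)
z_t = ab.sqrt() * z0 + (1 - ab).sqrt() * eps
loss = F.mse_loss(unet(z_t, t), eps)   # same ε-prediction loss, on latents

At sampling time, the generated latent is decoded back to pixel space with the VAE decoder.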

Common Issues & Solutions

Problem: Training instability or NaN losses
Solution: Gradient clipping (see the sketch after this list); use a cosine schedule instead of linear; ensure \( \bar{\alpha}_T \approx 0 \)
Problem: Blurry or low-quality samples
Solution: Increase \( T \); use cosine schedule; add more U-Net capacity (channels, attention layers)
Problem: Slow sampling
Solution: Use DDIM (deterministic, 50 steps) or DPM-Solver (as few as 10–20 steps) at inference — no retraining required
Problem: Generated samples ignore or only weakly follow the conditioning
Solution: Increase guidance scale \( w \); ensure conditioning dropout rate during training is ~10–20%
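
For the gradient-clipping fix, a short PyTorch sketch (the `max_norm=1.0` value is a common default, not a universal rule):

loss.backward()
torch.nn.utils.clip_grad_norm_(unet.parameters(), max_norm=1.0)   # clip before stepping
optimizer.step()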

References

  • Ho et al. (2020): "Denoising Diffusion Probabilistic Models" (DDPM)
  • Song et al. (2021): "Score-Based Generative Modeling through Stochastic Differential Equations"
  • Nichol & Dhariwal (2021): "Improved Denoising Diffusion Probabilistic Models"
  • Ho & Salimans (2021): "Classifier-Free Diffusion Guidance"
  • Song et al. (2021): "Denoising Diffusion Implicit Models" (DDIM)
  • Rombach et al. (2022): "High-Resolution Image Synthesis with Latent Diffusion Models" (Stable Diffusion)