Authors: Jonathan Ho, Ajay Jain, Pieter Abbeel
Year: 2020
Source: arXiv:2006.11239
This paper shows that a neural network trained to gradually remove noise from images – reversing a process that slowly corrupts clean images into pure static – can generate remarkably realistic new images from scratch.
By 2020, several families of generative models could produce images: GANs (see Generative Adversarial Nets) could generate sharp images but suffered from training instability and mode collapse. VAEs (see Auto-Encoding Variational Bayes) were stable to train but produced blurry outputs. Autoregressive models generated high-quality samples but were painfully slow because they produced images one pixel at a time.
A class of models called diffusion probabilistic models had been proposed in 2015 by Sohl-Dickstein et al., drawing on ideas from nonequilibrium thermodynamics (the physics of systems not in steady state). The idea was appealing: define a simple process that gradually destroys data by adding noise, then learn to reverse that process. But nobody had shown these models could actually produce images competitive with GANs or other state-of-the-art approaches. The original diffusion models produced poor-quality samples and had no clear path to improvement.
The core challenge was threefold. First, the model must learn to reverse 1000 steps of noise addition – each step slightly different – which means learning a complex, time-dependent denoising function. Second, training such a model required optimizing a variational bound (a tractable approximation to the true objective) that had many terms, and it was unclear which parameterization of the model would work best. Third, even if training succeeded, sampling required running the full 1000-step reverse chain, making generation slow compared to single-shot approaches like GANs.
Think of this paper’s method like restoring a photograph that has been progressively damaged. Imagine you take a crisp photograph and make 1000 photocopies in sequence, where each copy introduces a tiny amount of static. By copy 1000, the image is pure static – no trace of the original remains. Now imagine you train someone to look at any intermediate copy and guess what the previous (slightly less noisy) copy looked like. If that person gets good enough at this one-step cleanup, you can hand them pure static and ask them to clean it up 1000 times in a row, and they will produce a photorealistic image that never existed before.
The paper’s main technical insight is a specific way to parameterize what the neural network predicts at each denoising step. Rather than having the network directly predict the cleaned-up image or the mean of the previous step’s distribution, the authors show it is better to have the network predict the noise itself – the random static that was added. This noise-prediction parameterization (\(\epsilon\)-prediction) turns out to be mathematically equivalent to a technique called denoising score matching combined with Langevin dynamics (a physics-inspired sampling method). This equivalence is not just a theoretical curiosity: it leads to a dramatically simplified training objective where you simply minimize the difference between the actual noise that was added and the noise the network predicts.
The second key contribution is the simplified training objective \(L_\text{simple}\), which drops the per-timestep weighting that the full variational bound prescribes. This means the network focuses more on the harder denoising tasks (removing large amounts of noise) and less on the easy ones (removing tiny amounts of noise from nearly clean images). Counterintuitively, this produces better samples even though it makes the log-likelihood slightly worse.
The diffusion model operates in two phases: a forward process that destroys data and a reverse process that creates it.
The forward process takes a clean image \(x_0\) and adds Gaussian noise in \(T = 1000\) small steps. At each step \(t\), the image \(x_t\) is produced by scaling down the previous image slightly and adding a small amount of noise. The variance schedule \(\beta_1, \ldots, \beta_T\) controls how much noise is added at each step, increasing linearly from \(\beta_1 = 10^{-4}\) to \(\beta_T = 0.02\). After all 1000 steps, the result \(x_T\) is nearly indistinguishable from pure Gaussian noise.
Figure 7: When multiple images are generated from the same intermediate noisy latent \(x_t\), they share high-level attributes but differ in details. At \(t=1000\) (leftmost), the shared latent is nearly pure noise, so generated images differ completely. At \(t=250\) (rightmost), the latent preserves most structure, so generated images differ only in fine details. The bottom-right quadrant of each group shows the noisy intermediate \(x_t\); the other three quadrants show different samples generated from it. This demonstrates that the reverse process progressively resolves large-scale structure first and fine details last.
A crucial property is that you do not need to actually run all \(t\) steps to get \(x_t\). Thanks to the math of Gaussian distributions, you can jump directly from \(x_0\) to any \(x_t\) in one step: \(x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon\), where \(\epsilon\) is fresh noise and \(\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)\). This shortcut makes training efficient because you can pick a random timestep, generate the noisy version instantly, and train the network on that single step.
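The shortcut above is easy to state in code. The following is a minimal NumPy sketch, assuming the paper's linear schedule (\(\beta_1 = 10^{-4}\) to \(\beta_T = 0.02\), \(T = 1000\)); the name `q_sample` is illustrative, not from the paper's released code.

```python
import numpy as np

# Linear variance schedule from the paper.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)          # \bar{alpha}_t for t = 1..T

def q_sample(x0, t, eps):
    """Jump from x0 directly to x_t: sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps."""
    a = alpha_bar[t - 1]                # t is 1-indexed, as in the paper
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((32, 32, 3))   # stand-in for a normalized image
eps = rng.standard_normal(x0.shape)
x500 = q_sample(x0, 500, eps)           # one call instead of 500 sequential steps
```

A single array lookup and two multiplies replace hundreds of sequential noising steps, which is what makes minibatch training over random timesteps practical.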
The reverse process learns to go backward: given a noisy image \(x_t\), produce a slightly less noisy image \(x_{t-1}\). Each reverse step is modeled as a Gaussian distribution whose mean depends on the neural network’s output. The authors choose to have the network \(\epsilon_\theta(x_t, t)\) predict the noise component \(\epsilon\) that was added. From this prediction, the reverse step computes:
\[x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right) + \sigma_t z\]
where \(z\) is fresh random noise (set to zero at the final step \(t=1\)) and \(\sigma_t\) is a fixed variance.
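The single reverse step can be sketched directly from the equation above. This assumes the fixed variance choice \(\sigma_t^2 = \beta_t\) (one of the paper's two options); `eps_model` is a stand-in argument for the U-Net, and `p_sample` is an illustrative name.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def p_sample(eps_model, x_t, t, rng):
    """One reverse step x_t -> x_{t-1} (t is 1-indexed; noise zeroed at t == 1)."""
    a_t, abar_t = alphas[t - 1], alpha_bar[t - 1]
    # Subtract the scaled noise prediction, then rescale (the equation above).
    mean = (x_t - (1.0 - a_t) / np.sqrt(1.0 - abar_t) * eps_model(x_t, t)) / np.sqrt(a_t)
    sigma_t = np.sqrt(betas[t - 1])     # fixed choice sigma_t^2 = beta_t
    z = rng.standard_normal(x_t.shape) if t > 1 else 0.0
    return mean + sigma_t * z
```

At \(t = 1\) the added noise is zeroed out, so the final step is deterministic given the network's prediction.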
Figure 6: Progressive generation on CIFAR10, showing predicted clean images \(\hat{x}_0\) at various points during the 1000-step reverse process (left to right). The model first resolves large-scale structure like overall color and shape, then progressively adds finer details. Each row shows a different sample being generated from pure noise.
The neural network architecture is a U-Net (a convolutional network shaped like the letter U, where the image is first downsampled through several resolution levels, then upsampled back to full resolution with skip connections bridging corresponding levels). The network uses group normalization, self-attention at 16x16 resolution, and Transformer-style sinusoidal position embeddings (see Attention Is All You Need) to tell the network which timestep \(t\) it is denoising. The same network handles all 1000 timesteps, with \(t\) provided as a conditioning input.
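The timestep conditioning is worth seeing concretely. Below is a sketch of a Transformer-style sinusoidal embedding of \(t\); the embedding dimension 128 is illustrative (the paper's models use a dimension tied to the U-Net's channel count).

```python
import numpy as np

def timestep_embedding(t, dim=128):
    """Map an integer timestep to a vector of sines and cosines at
    geometrically spaced frequencies, as in Transformer position encodings."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

emb = timestep_embedding(500)   # distinct, smooth code for each of the 1000 steps
```

Because nearby timesteps get nearby embeddings, one network can smoothly share denoising behavior across all 1000 noise levels.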
Training is remarkably simple: sample a clean image from the dataset, pick a random timestep \(t\), generate the corresponding noisy version using the closed-form shortcut, and minimize the squared error between the true noise \(\epsilon\) and the network’s prediction \(\epsilon_\theta\). Concretely, for a training image of a horse from CIFAR10 (a 32x32x3 image), if we pick \(t = 500\) and sample noise \(\epsilon \sim \mathcal{N}(0, I)\), we compute \(x_{500} = \sqrt{\bar{\alpha}_{500}}\, x_0 + \sqrt{1 - \bar{\alpha}_{500}}\, \epsilon\), feed \(x_{500}\) and \(t = 500\) to the network, and update its weights to make \(\epsilon_\theta(x_{500}, 500)\) closer to \(\epsilon\).
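The worked example above corresponds to one iteration of the paper's Algorithm 1, sketched here in NumPy. `eps_theta` is a placeholder for the U-Net, and the gradient update itself is omitted (in practice it would come from an autodiff framework).

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def training_loss(eps_theta, x0, rng):
    """One sample of L_simple: noise a clean image at a random t and
    score the network's noise prediction with squared error."""
    t = int(rng.integers(1, T + 1))                  # uniform random timestep
    eps = rng.standard_normal(x0.shape)              # the noise to be predicted
    x_t = np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1 - alpha_bar[t - 1]) * eps
    return np.mean((eps - eps_theta(x_t, t)) ** 2)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((32, 32, 3))                # stand-in for a CIFAR10 image
loss = training_loss(lambda x, t: np.zeros_like(x), x0, rng)  # dummy predictor
```

A predictor that always outputs zero scores roughly 1.0 (the variance of the true noise), which gives a useful baseline when debugging a real training run.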
Sampling (generating a new image) starts from pure noise \(x_T \sim \mathcal{N}(0, I)\) and runs the reverse process for all \(T\) steps, producing \(x_{T-1}, x_{T-2}, \ldots, x_0\). The final \(x_0\) is the generated image.
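The full sampling loop (the paper's Algorithm 2) is just the reverse step applied \(T\) times. A minimal sketch, again with \(\sigma_t^2 = \beta_t\) and `eps_model` as an untrained stand-in for the U-Net:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def sample(eps_model, shape, rng):
    x = rng.standard_normal(shape)                   # x_T ~ N(0, I): pure noise
    for t in range(T, 0, -1):                        # run the chain backward
        a, abar = alphas[t - 1], alpha_bar[t - 1]
        mean = (x - (1 - a) / np.sqrt(1 - abar) * eps_model(x, t)) / np.sqrt(a)
        z = rng.standard_normal(shape) if t > 1 else 0.0
        x = mean + np.sqrt(betas[t - 1]) * z
    return x                                         # the generated x_0

rng = np.random.default_rng(0)
img = sample(lambda x, t: np.zeros_like(x), (8, 8, 1), rng)
```

With a trained network in place of the dummy `eps_model`, `img` would be a generated image; the 1000 sequential network evaluations are exactly the sampling cost the paper acknowledges as its main practical drawback.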
\[q(x_t | x_{t-1}) = \mathcal{N}(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t I)\]
In plain language: each forward step takes the current image, shrinks it by a tiny amount, and adds a small amount of random Gaussian noise. The shrinking prevents the total variance from growing without bound. After enough steps, the cumulative effect destroys all information in the original image.
This matters because the forward process is the foundation everything else is built on. By choosing it to be a simple Gaussian perturbation, the authors ensure that all the math stays tractable – the posterior distributions needed for training have closed-form solutions.
\[q(x_t | x_0) = \mathcal{N}(x_t;\, \sqrt{\bar{\alpha}_t}\, x_0,\, (1 - \bar{\alpha}_t) I)\]
In plain language: instead of running the forward process step by step 500 times to get \(x_{500}\), you can jump there directly. The original image is just scaled down by \(\sqrt{\bar{\alpha}_t}\) and combined with noise scaled by \(\sqrt{1 - \bar{\alpha}_t}\). For example, with \(T = 1000\) and the linear schedule from this paper, at \(t = 500\) roughly \(\bar{\alpha}_{500} \approx 0.08\), meaning about 92% of the variance comes from noise.
This matters because it makes training efficient. Without this shortcut, computing the loss for timestep \(t\) would require running \(t\) sequential forward steps, making training \(O(T)\) per sample instead of \(O(1)\).
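A quick numerical check of this claim, using the paper's linear schedule: the cumulative signal scaling after \(t\) sequential steps equals \(\sqrt{\bar{\alpha}_t}\), and at \(t = 500\) only about 8% of the variance is signal.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)     # closed-form cumulative product

# Sequential view: each forward step scales the signal by sqrt(1 - beta_t).
scale = 1.0
for b in betas[:500]:
    scale *= np.sqrt(1.0 - b)

# scale**2 and alpha_bar[499] agree (~0.08): the one-step jump is exact.
print(scale**2, alpha_bar[499])
```

The 500-iteration loop and the single `cumprod` lookup give the same number, which is precisely why training cost per sample is \(O(1)\) rather than \(O(T)\).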
\[\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right)\]
In plain language: the network estimates the noise \(\epsilon\) that was added. This equation converts that noise estimate into a prediction of where \(x_{t-1}\) should be centered. It subtracts a scaled version of the predicted noise from \(x_t\) and rescales, effectively removing one step’s worth of corruption.
This matters because the authors tried three parameterizations – predicting \(x_0\) directly, predicting the posterior mean \(\tilde{\mu}_t\), and predicting \(\epsilon\) – and the noise prediction worked best. The reason is connected to denoising score matching: predicting \(\epsilon\) is equivalent to estimating the gradient of the log probability density (the “score”) of the noisy data distribution, which has deep theoretical roots.
\[L_\text{simple}(\theta) = \mathbb{E}_{t,\, x_0,\, \epsilon} \left[ \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\, t\right) \right\|^2 \right]\]
In plain language: pick a random image from the dataset, pick a random noise level, add noise, and train the network to predict what noise was added. The loss is just the squared difference between the true noise and the predicted noise, averaged over all choices.
This matters because the full variational bound (the theoretically justified objective) includes a per-timestep weighting factor \(\frac{\beta_t^2}{2 \sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)}\) that upweights small-\(t\) terms. Dropping this weighting and treating all timesteps equally (the “simple” objective) produces better image quality at the cost of slightly worse log-likelihoods. The simplified loss down-weights the easy denoising tasks (small \(t\), little noise) so the network focuses on the harder ones.
\[L_{t-1} = \mathbb{E}_{x_0, \epsilon} \left[ \frac{\beta_t^2}{2 \sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)} \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\, t\right) \right\|^2 \right]\]
In plain language: this is the “full” version of the loss for one timestep. The weighting factor comes from the KL divergence between the true reverse posterior and the learned reverse step. The simplified objective (\(L_\text{simple}\)) removes this weight, replacing it with uniform weighting across all timesteps.
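Computing the weighting factor across \(t\) (with the fixed choice \(\sigma_t^2 = \beta_t\)) makes the contrast with \(L_\text{simple}\) concrete: the full bound weights the smallest timesteps roughly 50 times more heavily than the largest.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)
sigma2 = betas                     # fixed variance choice sigma_t^2 = beta_t

# Per-timestep weight from the full variational bound (equation above).
weight = betas**2 / (2 * sigma2 * alphas * (1 - alpha_bar))

# weight[0] (t=1) is ~0.5; weight[-1] (t=T) is ~0.01: small t is upweighted ~50x.
print(weight[0], weight[-1])
```

\(L_\text{simple}\) replaces this curve with a constant, which is exactly the sense in which it down-weights the easy, low-noise denoising tasks.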
This matters because it shows the connection between diffusion models and denoising score matching. The expression \(\|\epsilon - \epsilon_\theta(\cdot)\|^2\) is the same core computation used in score-matching methods (NCSN), but here it arises naturally from the variational bound of a latent variable model. This unifies two previously separate lines of research.
The paper evaluates on CIFAR10 (32x32 color images, 10 object categories) and LSUN (256x256 images of bedrooms, churches, and cats), using two standard metrics: FID (Frechet Inception Distance, where lower is better – it measures how similar generated images are to real ones in a feature space) and IS (Inception Score, where higher is better – it measures both quality and diversity of generated images).
| Model | IS | FID | Type |
|---|---|---|---|
| StyleGAN2 + ADA | 9.74 | 3.26 | Unconditional |
| DDPM (this paper, \(L_\text{simple}\)) | 9.46 | 3.17 | Unconditional |
| NCSN | 8.87 | 25.32 | Unconditional |
| SNGAN-DDLS | 9.09 | 15.42 | Unconditional |
| BigGAN | 9.22 | 14.73 | Conditional |
On CIFAR10, the DDPM achieves an FID of 3.17, which was state-of-the-art for unconditional models and competitive even with class-conditional models (which have access to label information the unconditional model does not). The Inception Score of 9.46 is also strong, trailing only StyleGAN2 + ADA.
On LSUN 256x256, the model achieves FID scores of 4.90 (bedrooms, large model), 7.89 (churches), and 19.75 (cats), which are competitive with ProgressiveGAN but behind StyleGAN and StyleGAN2. The cat dataset proved most difficult, likely because cats have more fine-grained structural variation than bedrooms or churches.
Figure 1: Four faces generated by the DDPM on CelebA-HQ at 256x256 resolution. These images were not cherry-picked. The model captures diverse skin tones, facial expressions, hair styles, and lighting conditions, demonstrating that the iterative denoising approach produces photorealistic results competitive with GAN-based methods.
An important finding from the ablation study: the \(\epsilon\)-prediction parameterization with the simplified objective dramatically outperforms all other combinations (FID 3.17 vs. 13.22 for the next best). Predicting \(\tilde{\mu}\) directly works only with the full variational bound, and learning the reverse process variance leads to training instability.
The paper also finds that diffusion models trade off log-likelihood for sample quality. Training on the full variational bound gives better lossless codelengths (3.70 bits/dim) but worse FID (13.51), while the simplified objective gives worse codelengths (3.75 bits/dim) but much better FID (3.17). The authors show that more than half of the model’s lossless codelength is spent encoding imperceptible image details – the model is an excellent lossy compressor but a mediocre lossless one.
This paper is arguably the single most influential work in the diffusion model lineage. While Sohl-Dickstein et al. (2015) introduced diffusion probabilistic models and Song and Ermon (2019) connected score matching to iterative refinement, Ho et al. demonstrated for the first time that diffusion models could compete with GANs on image quality. The key recipe – \(\epsilon\)-prediction, simplified loss, U-Net architecture, linear noise schedule – became the standard starting point for nearly all subsequent diffusion work.
The impact was enormous. DDPM directly spawned a wave of follow-up work that transformed generative AI. Nichol and Dhariwal (2021) introduced improved DDPMs with learned variance schedules and faster sampling. Song et al. (2020) developed DDIM, showing that deterministic sampling could reduce the number of steps from 1000 to as few as 50. Dhariwal and Nichol (2021) showed diffusion models could beat GANs with classifier guidance. Rombach et al. (2022) combined diffusion with VAE latent spaces to create Latent Diffusion Models, which became the foundation for Stable Diffusion, one of the most widely used image generation systems.
Beyond images, the DDPM framework has been adapted to audio generation (WaveGrad, DiffWave), video synthesis, 3D shape generation, molecular design, and protein structure prediction. The core insight – that iteratively denoising is a powerful generative paradigm – has proven remarkably general. By 2023, diffusion-based systems (DALL-E 2, Midjourney, Stable Diffusion, Imagen) had become the dominant approach for text-to-image generation, largely displacing GANs for this task. The simplicity and stability of DDPM’s training procedure – just predict the noise – was a decisive advantage over the notoriously finicky adversarial training of GANs.
To fully understand this paper, a reader should be comfortable with: