Authors: Jonathan Ho, Ajay Jain, Pieter Abbeel
Year: 2020
Source: arXiv:2006.11239
This paper shows that a neural network trained to gradually remove noise from images – reversing a process that slowly corrupts clean images into pure static – can generate remarkably realistic new images from scratch.
By 2020, several families of generative models could produce images: GANs (see Generative Adversarial Nets) could generate sharp images but suffered from training instability and mode collapse. VAEs (see Auto-Encoding Variational Bayes) were stable to train but produced blurry outputs. Autoregressive models generated high-quality samples but were painfully slow because they produced images one pixel at a time.
A class of models called diffusion probabilistic models had been proposed in 2015 by Sohl-Dickstein et al., drawing on ideas from nonequilibrium thermodynamics (the physics of systems not in steady state). The idea was appealing: define a simple process that gradually destroys data by adding noise, then learn to reverse that process. But nobody had shown these models could actually produce images competitive with GANs or other state-of-the-art approaches. The original diffusion models produced poor-quality samples and had no clear path to improvement.
The core challenge was threefold. First, the model must learn to reverse 1000 steps of noise addition – each step slightly different – which means learning a complex, time-dependent denoising function. Second, training such a model required optimizing a variational bound (a tractable approximation to the true objective) that had many terms, and it was unclear which parameterization of the model would work best. Third, even if training succeeded, sampling required running the full 1000-step reverse chain, making generation slow compared to single-shot approaches like GANs.
Think of this paper’s method like restoring a photograph that has been progressively damaged. Imagine you take a crisp photograph and make 1000 photocopies in sequence, where each copy introduces a tiny amount of static. By copy 1000, the image is pure static – no trace of the original remains. Now imagine you train someone to look at any intermediate copy and guess what the previous (slightly less noisy) copy looked like. If that person gets good enough at this one-step cleanup, you can hand them pure static and ask them to clean it up 1000 times in a row, and they will produce a photorealistic image that never existed before.
The paper’s main technical insight is a specific way to parameterize what the neural network predicts at each denoising step. Rather than having the network directly predict the cleaned-up image or the mean of the previous step’s distribution, the authors show it is better to have the network predict the noise itself – the random static that was added. This noise-prediction parameterization (\(\epsilon\)-prediction) turns out to be mathematically equivalent to a technique called denoising score matching combined with Langevin dynamics (a physics-inspired sampling method). This equivalence is not just a theoretical curiosity: it leads to a dramatically simplified training objective where you simply minimize the difference between the actual noise that was added and the noise the network predicts.
The second key contribution is the simplified training objective \(L_\text{simple}\), which drops the per-timestep weighting that the full variational bound prescribes. This means the network focuses more on the harder denoising tasks (removing large amounts of noise) and less on the easy ones (removing tiny amounts of noise from nearly clean images). Counterintuitively, this produces better samples even though it makes the log-likelihood slightly worse.
The diffusion model operates in two phases: a forward process that destroys data and a reverse process that creates it.
The forward process takes a clean image \(x_0\) and adds Gaussian noise in \(T = 1000\) small steps. At each step \(t\), the image \(x_t\) is produced by scaling down the previous image slightly and adding a small amount of noise. The variance schedule \(\beta_1, \ldots, \beta_T\) controls how much noise is added at each step, increasing linearly from \(\beta_1 = 10^{-4}\) to \(\beta_T = 0.02\). After all 1000 steps, the result \(x_T\) is nearly indistinguishable from pure Gaussian noise.
Figure 7: When multiple images are generated from the same intermediate noisy latent \(x_t\), they share high-level attributes but differ in details. At \(t=1000\) (leftmost), the shared latent is nearly pure noise, so generated images differ completely. At \(t=250\) (rightmost), the latent preserves most structure, so generated images differ only in fine details. The bottom-right quadrant of each group shows the noisy intermediate \(x_t\); the other three quadrants show different samples generated from it. This demonstrates that the reverse process progressively resolves large-scale structure first and fine details last.
A crucial property is that you do not need to actually run all \(t\) steps to get \(x_t\). Thanks to the math of Gaussian distributions, you can jump directly from \(x_0\) to any \(x_t\) in one step: \(x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon\), where \(\epsilon\) is fresh noise and \(\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)\). This shortcut makes training efficient because you can pick a random timestep, generate the noisy version instantly, and train the network on that single step.
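The shortcut above is easy to state in code. The following is a minimal NumPy sketch, assuming the paper's linear schedule (\(\beta_1 = 10^{-4}\) to \(\beta_T = 0.02\), \(T = 1000\)); the name `q_sample` is illustrative, not from the paper's released code.

```python
import numpy as np

# Linear variance schedule from the paper.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)          # \bar{alpha}_t for t = 1..T

def q_sample(x0, t, eps):
    """Jump from x0 directly to x_t: sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps."""
    a = alpha_bar[t - 1]                # t is 1-indexed, as in the paper
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((32, 32, 3))   # stand-in for a normalized image
eps = rng.standard_normal(x0.shape)
x500 = q_sample(x0, 500, eps)           # one call instead of 500 sequential steps
```

A single array lookup and two multiplies replace hundreds of sequential noising steps, which is what makes minibatch training over random timesteps practical.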
The reverse process learns to go backward: given a noisy image \(x_t\), produce a slightly less noisy image \(x_{t-1}\). Each reverse step is modeled as a Gaussian distribution whose mean depends on the neural network’s output. The authors choose to have the network \(\epsilon_\theta(x_t, t)\) predict the noise component \(\epsilon\) that was added. From this prediction, the reverse step computes:
\[x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right) + \sigma_t z\]
where \(z\) is fresh random noise (set to zero at the final step \(t=1\)) and \(\sigma_t\) is a fixed variance.
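The single reverse step can be sketched directly from the equation above. This assumes the fixed variance choice \(\sigma_t^2 = \beta_t\) (one of the paper's two options); `eps_model` is a stand-in argument for the U-Net, and `p_sample` is an illustrative name.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def p_sample(eps_model, x_t, t, rng):
    """One reverse step x_t -> x_{t-1} (t is 1-indexed; noise zeroed at t == 1)."""
    a_t, abar_t = alphas[t - 1], alpha_bar[t - 1]
    # Subtract the scaled noise prediction, then rescale (the equation above).
    mean = (x_t - (1.0 - a_t) / np.sqrt(1.0 - abar_t) * eps_model(x_t, t)) / np.sqrt(a_t)
    sigma_t = np.sqrt(betas[t - 1])     # fixed choice sigma_t^2 = beta_t
    z = rng.standard_normal(x_t.shape) if t > 1 else 0.0
    return mean + sigma_t * z
```

At \(t = 1\) the added noise is zeroed out, so the final step is deterministic given the network's prediction.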
Figure 6: Progressive generation on CIFAR10, showing predicted clean images \(\hat{x}_0\) at various points during the 1000-step reverse process (left to right). The model first resolves large-scale structure like overall color and shape, then progressively adds finer details. Each row shows a different sample being generated from pure noise.
The neural network architecture is a U-Net (a convolutional network shaped like the letter U, where the image is first downsampled through several resolution levels, then upsampled back to full resolution with skip connections bridging corresponding levels). The network uses group normalization, self-attention at 16x16 resolution, and Transformer-style sinusoidal position embeddings (see Attention Is All You Need) to tell the network which timestep \(t\) it is denoising. The same network handles all 1000 timesteps, with \(t\) provided as a conditioning input.
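The timestep conditioning is worth seeing concretely. Below is a sketch of a Transformer-style sinusoidal embedding of \(t\); the embedding dimension 128 is illustrative (the paper's models use a dimension tied to the U-Net's channel count).

```python
import numpy as np

def timestep_embedding(t, dim=128):
    """Map an integer timestep to a vector of sines and cosines at
    geometrically spaced frequencies, as in Transformer position encodings."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

emb = timestep_embedding(500)   # distinct, smooth code for each of the 1000 steps
```

Because nearby timesteps get nearby embeddings, one network can smoothly share denoising behavior across all 1000 noise levels.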
Training is remarkably simple: sample a clean image from the dataset, pick a random timestep \(t\), generate the corresponding noisy version using the closed-form shortcut, and minimize the squared error between the true noise \(\epsilon\) and the network’s prediction \(\epsilon_\theta\). Concretely, for a training image of a horse from CIFAR10 (a 32x32x3 image), if we pick \(t = 500\) and sample noise \(\epsilon \sim \mathcal{N}(0, I)\), we compute \(x_{500} = \sqrt{\bar{\alpha}_{500}}\, x_0 + \sqrt{1 - \bar{\alpha}_{500}}\, \epsilon\), feed \(x_{500}\) and \(t = 500\) to the network, and update its weights to make \(\epsilon_\theta(x_{500}, 500)\) closer to \(\epsilon\).
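The worked example above corresponds to one iteration of the paper's Algorithm 1, sketched here in NumPy. `eps_theta` is a placeholder for the U-Net, and the gradient update itself is omitted (in practice it would come from an autodiff framework).

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def training_loss(eps_theta, x0, rng):
    """One sample of L_simple: noise a clean image at a random t and
    score the network's noise prediction with squared error."""
    t = int(rng.integers(1, T + 1))                  # uniform random timestep
    eps = rng.standard_normal(x0.shape)              # the noise to be predicted
    x_t = np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1 - alpha_bar[t - 1]) * eps
    return np.mean((eps - eps_theta(x_t, t)) ** 2)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((32, 32, 3))                # stand-in for a CIFAR10 image
loss = training_loss(lambda x, t: np.zeros_like(x), x0, rng)  # dummy predictor
```

A predictor that always outputs zero scores roughly 1.0 (the variance of the true noise), which gives a useful baseline when debugging a real training run.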
Sampling (generating a new image) starts from pure noise \(x_T \sim \mathcal{N}(0, I)\) and runs the reverse process for all \(T\) steps, producing \(x_{T-1}, x_{T-2}, \ldots, x_0\). The final \(x_0\) is the generated image.
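The full sampling loop (the paper's Algorithm 2) is just the reverse step applied \(T\) times. A minimal sketch, again with \(\sigma_t^2 = \beta_t\) and `eps_model` as an untrained stand-in for the U-Net:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def sample(eps_model, shape, rng):
    x = rng.standard_normal(shape)                   # x_T ~ N(0, I): pure noise
    for t in range(T, 0, -1):                        # run the chain backward
        a, abar = alphas[t - 1], alpha_bar[t - 1]
        mean = (x - (1 - a) / np.sqrt(1 - abar) * eps_model(x, t)) / np.sqrt(a)
        z = rng.standard_normal(shape) if t > 1 else 0.0
        x = mean + np.sqrt(betas[t - 1]) * z
    return x                                         # the generated x_0

rng = np.random.default_rng(0)
img = sample(lambda x, t: np.zeros_like(x), (8, 8, 1), rng)
```

With a trained network in place of the dummy `eps_model`, `img` would be a generated image; the 1000 sequential network evaluations are exactly the sampling cost the paper acknowledges as its main practical drawback.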
\[q(x_t | x_{t-1}) = \mathcal{N}(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t I)\]
In plain language: each forward step takes the current image, shrinks it by a tiny amount, and adds a small amount of random Gaussian noise. The shrinking prevents the total variance from growing without bound. After enough steps, the cumulative effect destroys all information in the original image.
This matters because the forward process is the foundation everything else is built on. By choosing it to be a simple Gaussian perturbation, the authors ensure that all the math stays tractable – the posterior distributions needed for training have closed-form solutions.
\[q(x_t | x_0) = \mathcal{N}(x_t;\, \sqrt{\bar{\alpha}_t}\, x_0,\, (1 - \bar{\alpha}_t) I)\]
In plain language: instead of running the forward process step by step 500 times to get \(x_{500}\), you can jump there directly. The original image is just scaled down by \(\sqrt{\bar{\alpha}_t}\) and combined with noise scaled by \(\sqrt{1 - \bar{\alpha}_t}\). For example, with \(T = 1000\) and the linear schedule from this paper, at \(t = 500\) roughly \(\bar{\alpha}_{500} \approx 0.08\), meaning about 92% of the variance comes from noise.
This matters because it makes training efficient. Without this shortcut, computing the loss for timestep \(t\) would require running \(t\) sequential forward steps, making training \(O(T)\) per sample instead of \(O(1)\).
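A quick numerical check of this claim, using the paper's linear schedule: the cumulative signal scaling after \(t\) sequential steps equals \(\sqrt{\bar{\alpha}_t}\), and at \(t = 500\) only about 8% of the variance is signal.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)     # closed-form cumulative product

# Sequential view: each forward step scales the signal by sqrt(1 - beta_t).
scale = 1.0
for b in betas[:500]:
    scale *= np.sqrt(1.0 - b)

# scale**2 and alpha_bar[499] agree (~0.08): the one-step jump is exact.
print(scale**2, alpha_bar[499])
```

The 500-iteration loop and the single `cumprod` lookup give the same number, which is precisely why training cost per sample is \(O(1)\) rather than \(O(T)\).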
\[\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right)\]
In plain language: the network estimates the noise \(\epsilon\) that was added. This equation converts that noise estimate into a prediction of where \(x_{t-1}\) should be centered. It subtracts a scaled version of the predicted noise from \(x_t\) and rescales, effectively removing one step’s worth of corruption.
This matters because the authors tried three parameterizations – predicting \(x_0\) directly, predicting the posterior mean \(\tilde{\mu}_t\), and predicting \(\epsilon\) – and the noise prediction worked best. The reason is connected to denoising score matching: predicting \(\epsilon\) is equivalent to estimating the gradient of the log probability density (the “score”) of the noisy data distribution, which has deep theoretical roots.
\[L_\text{simple}(\theta) = \mathbb{E}_{t,\, x_0,\, \epsilon} \left[ \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\, t\right) \right\|^2 \right]\]
In plain language: pick a random image from the dataset, pick a random noise level, add noise, and train the network to predict what noise was added. The loss is just the squared difference between the true noise and the predicted noise, averaged over all choices.
This matters because the full variational bound (the theoretically justified objective) includes a per-timestep weighting factor \(\frac{\beta_t^2}{2 \sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)}\) that upweights small-\(t\) terms. Dropping this weighting and treating all timesteps equally (the “simple” objective) produces better image quality at the cost of slightly worse log-likelihoods. The simplified loss down-weights the easy denoising tasks (small \(t\), little noise) so the network focuses on the harder ones.
\[L_{t-1} = \mathbb{E}_{x_0, \epsilon} \left[ \frac{\beta_t^2}{2 \sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)} \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\, t\right) \right\|^2 \right]\]
In plain language: this is the “full” version of the loss for one timestep. The weighting factor comes from the KL divergence between the true reverse posterior and the learned reverse step. The simplified objective (\(L_\text{simple}\)) removes this weight, replacing it with uniform weighting across all timesteps.
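Computing the weighting factor across \(t\) (with the fixed choice \(\sigma_t^2 = \beta_t\)) makes the contrast with \(L_\text{simple}\) concrete: the full bound weights the smallest timesteps roughly 50 times more heavily than the largest.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)
sigma2 = betas                     # fixed variance choice sigma_t^2 = beta_t

# Per-timestep weight from the full variational bound (equation above).
weight = betas**2 / (2 * sigma2 * alphas * (1 - alpha_bar))

# weight[0] (t=1) is ~0.5; weight[-1] (t=T) is ~0.01: small t is upweighted ~50x.
print(weight[0], weight[-1])
```

\(L_\text{simple}\) replaces this curve with a constant, which is exactly the sense in which it down-weights the easy, low-noise denoising tasks.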
This matters because it shows the connection between diffusion models and denoising score matching. The expression \(\|\epsilon - \epsilon_\theta(\cdot)\|^2\) is the same core computation used in score-matching methods (NCSN), but here it arises naturally from the variational bound of a latent variable model. This unifies two previously separate lines of research.
The paper evaluates on CIFAR10 (32x32 color images, 10 object categories) and LSUN (256x256 images of bedrooms, churches, and cats), using two standard metrics: FID (Frechet Inception Distance, where lower is better – it measures how similar generated images are to real ones in a feature space) and IS (Inception Score, where higher is better – it measures both quality and diversity of generated images).
| Model | IS | FID | Type |
|---|---|---|---|
| StyleGAN2 + ADA | 9.74 | 3.26 | Unconditional |
| DDPM (this paper, \(L_\text{simple}\)) | 9.46 | 3.17 | Unconditional |
| NCSN | 8.87 | 25.32 | Unconditional |
| SNGAN-DDLS | 9.09 | 15.42 | Unconditional |
| BigGAN | 9.22 | 14.73 | Conditional |
On CIFAR10, the DDPM achieves an FID of 3.17, which was state-of-the-art for unconditional models and competitive even with class-conditional models (which have access to label information the unconditional model does not). The Inception Score of 9.46 is also strong, trailing only StyleGAN2 + ADA.
On LSUN 256x256, the model achieves FID scores of 4.90 (bedrooms, large model), 7.89 (churches), and 19.75 (cats), which are competitive with ProgressiveGAN but behind StyleGAN and StyleGAN2. The cat dataset proved most difficult, likely because cats have more fine-grained structural variation than bedrooms or churches.
Figure 1: Four faces generated by the DDPM on CelebA-HQ at 256x256 resolution. These images were not cherry-picked. The model captures diverse skin tones, facial expressions, hair styles, and lighting conditions, demonstrating that the iterative denoising approach produces photorealistic results competitive with GAN-based methods.
An important finding from the ablation study: the \(\epsilon\)-prediction parameterization with the simplified objective dramatically outperforms all other combinations (FID 3.17 vs. 13.22 for the next best). Predicting \(\tilde{\mu}\) directly works only with the full variational bound, and learning the reverse process variance leads to training instability.
The paper also finds that diffusion models trade off log-likelihood for sample quality. Training on the full variational bound gives better lossless codelengths (3.70 bits/dim) but worse FID (13.51), while the simplified objective gives worse codelengths (3.75 bits/dim) but much better FID (3.17). The authors show that more than half of the model’s lossless codelength is spent encoding imperceptible image details – the model is an excellent lossy compressor but a mediocre lossless one.
This paper is arguably the single most influential work in the diffusion model lineage. While Sohl-Dickstein et al. (2015) introduced diffusion probabilistic models and Song and Ermon (2019) connected score matching to iterative refinement, Ho et al. demonstrated for the first time that diffusion models could compete with GANs on image quality. The key recipe – \(\epsilon\)-prediction, simplified loss, U-Net architecture, linear noise schedule – became the standard starting point for nearly all subsequent diffusion work.
The impact was enormous. DDPM directly spawned a wave of follow-up work that transformed generative AI. Nichol and Dhariwal (2021) introduced improved DDPMs with learned variance schedules and faster sampling. Song et al. (2020) developed DDIM, showing that deterministic sampling could reduce the number of steps from 1000 to as few as 50. Dhariwal and Nichol (2021) showed diffusion models could beat GANs with classifier guidance. Rombach et al. (2022) combined diffusion with VAE latent spaces to create Latent Diffusion Models, which became the foundation for Stable Diffusion, one of the most widely used image generation systems.
Beyond images, the DDPM framework has been adapted to audio generation (WaveGrad, DiffWave), video synthesis, 3D shape generation, molecular design, and protein structure prediction. The core insight – that iteratively denoising is a powerful generative paradigm – has proven remarkably general. By 2023, diffusion-based systems (DALL-E 2, Midjourney, Stable Diffusion, Imagen) had become the dominant approach for text-to-image generation, largely displacing GANs for this task. The simplicity and stability of DDPM’s training procedure – just predict the noise – was a decisive advantage over the notoriously finicky adversarial training of GANs.
To fully understand this paper, a reader should be comfortable with: