course

Why would you deliberately ruin an image by adding static? Because if you can learn to undo that damage – to look at a noisy image and predict what the clean version looked like – you have learned something deep about the structure of images. And if you can do that starting from pure static, you can create images that never existed. The entire diffusion framework rests on one specific type of noise: the Gaussian distribution.

Explanation

A Gaussian distribution (also called a normal distribution) is a bell-shaped probability distribution defined by two parameters: a mean \(\mu\) (the center) and a variance \(\sigma^2\) (how spread out it is). We write this as \(\mathcal{N}(\mu, \sigma^2)\). When \(\mu = 0\) and \(\sigma^2 = 1\), we call it the standard normal distribution.

For images, noise means adding a random value to each pixel independently. If we have a grayscale pixel with value 0.5 and we add noise sampled from \(\mathcal{N}(0, 0.01)\), the result is a new pixel value centered at 0.5 with a small random perturbation. With variance 0.01, the noise is barely visible. With variance 1.0, the noise overwhelms the signal.

Gaussians have a crucial property that makes the entire diffusion framework work: the sum of two independent Gaussian random variables is itself Gaussian. If \(X \sim \mathcal{N}(\mu_1, \sigma_1^2)\) and \(Y \sim \mathcal{N}(\mu_2, \sigma_2^2)\) are independent, then:

This means that if you add noise to a signal, then add more noise, the result is the same as if you had added a single larger dose of noise. This composability is what enables the closed-form shortcut we will derive in Lesson 3.

For images with multiple pixels, we use a multivariate Gaussian \(\mathcal{N}(\mu, \sigma^2 I)\), where \(I\) is the identity matrix (meaning noise is added independently to each pixel with the same variance). The notation \(\epsilon \sim \mathcal{N}(0, I)\) means we sample a noise vector where every element is drawn independently from a standard normal.

Worked Example

Let a tiny 1x2 “image” have pixel values \(x = [0.8, -0.3]\). We add Gaussian noise with variance \(\sigma^2 = 0.1\):

The noisy image is a sample from \(\mathcal{N}(x, 0.1 \cdot I) = \mathcal{N}([0.8, -0.3], 0.1 \cdot I)\).

Now add a second round of noise with variance \(0.2\). Sample \(\epsilon_2 = [-0.55, 0.38]\), scale by \(\sqrt{0.2} \approx 0.447\): \([-0.246, 0.170]\). Result: \([0.933 - 0.246, -0.524 + 0.170] = [0.687, -0.354]\).

By the Gaussian addition property, this two-step result has the same distribution as adding noise with total variance \(0.1 + 0.2 = 0.3\) in a single step. The intermediate step does not matter – only the total accumulated noise.

Exercises

Recall: What property of Gaussian distributions makes it possible to skip intermediate noise-addition steps and jump directly to any noise level?

Apply: A pixel has value 0.6. You add noise \(\epsilon_1 \sim \mathcal{N}(0, 0.05)\), then add noise \(\epsilon_2 \sim \mathcal{N}(0, 0.15)\). What single Gaussian distribution describes the result? Write it as \(\mathcal{N}(\mu, \sigma^2)\).

Extend: If you keep adding small amounts of noise indefinitely, what distribution does the image converge to? Why does the original image content get destroyed?

Lesson 2: The Forward Process – Destroying Images Step by Step

Imagine you have a photograph and you make 1000 photocopies in sequence, where each copy introduces a tiny amount of static. By copy 1000, the image is pure static. The forward process in DDPM formalizes this as a Markov chain – a sequence where each step depends only on the previous one.

Explanation

The forward process takes a clean image \(x_0\) from the training set and produces a sequence of increasingly noisy versions \(x_1, x_2, \ldots, x_T\) over \(T = 1000\) steps. At each step \(t\), the transition is:

\[q(x_t | x_{t-1}) = \mathcal{N}(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t I)\]

In plain language: take the previous image, shrink it slightly (multiply by \(\sqrt{1-\beta_t}\)), and add Gaussian noise with variance \(\beta_t\). The shrinking is important – without it, the total variance would grow without bound. By shrinking the signal slightly at each step, the process keeps the total variance controlled and converges to a standard normal distribution \(\mathcal{N}(0, I)\).

The variance schedule \(\beta_1, \ldots, \beta_T\) controls how quickly the image is destroyed. The paper uses a linear schedule from \(\beta_1 = 10^{-4}\) to \(\beta_T = 0.02\). Early steps add almost no noise (the image looks nearly unchanged). Later steps add more. By step 1000, the result is indistinguishable from pure random noise.

This is a Markov chain because each step depends only on the immediately previous state, not on any earlier history.

Worked Example

Let a 1-pixel “image” have value \(x_0 = 0.9\). Use \(\beta_1 = 0.0001\) and \(\beta_2 = 0.0002\).

The image barely changed. With \(\beta_1 = 0.0001\), the noise standard deviation is only 0.01.

After two steps, \(x_0 = 0.9\) has become \(x_2 = 0.890\). Still recognizable. After 1000 steps with increasing \(\beta_t\), the signal is completely buried.

Exercises

Recall: Why does the forward process multiply \(x_{t-1}\) by \(\sqrt{1-\beta_t}\) before adding noise? What would happen without this scaling?

Apply: Given \(x_0 = 0.5\), \(\beta_1 = 0.001\), and a noise sample \(\epsilon = -0.8\), compute \(x_1\).

Extend: The paper uses \(T = 1000\) steps with tiny \(\beta_t\) values. Would \(T = 10\) steps with larger \(\beta_t\) achieve the same result? What might go wrong?

Lesson 3: The Closed-Form Shortcut – Jumping to Any Noise Level

Computing \(x_{500}\) by running 500 sequential noise-addition steps would be absurdly wasteful during training. A cornerstone of the DDPM approach is that Gaussians compose: you can jump from \(x_0\) directly to \(x_t\) for any \(t\) in a single computation. This shortcut is what makes training practical.

Explanation

Because each forward step is Gaussian, and Gaussians compose (as we learned in Lesson 1), the distribution of \(x_t\) given \(x_0\) has a closed form:

\[q(x_t | x_0) = \mathcal{N}(x_t;\, \sqrt{\bar{\alpha}_t}\, x_0,\, (1 - \bar{\alpha}_t) I)\]

\[x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, I)\]

Think of this as a weighted mix between the clean image (\(x_0\)) and pure noise (\(\epsilon\)). Early in the process (\(t\) small), \(\bar{\alpha}_t \approx 1\), so the clean image dominates. Late in the process (\(t\) large), \(\bar{\alpha}_t \approx 0\), so noise dominates.

Progressive denoising on CIFAR10 (a dataset of 32x32 color images across 10 categories like cars, birds, and horses): the model’s estimated clean image \(\hat{x}_0\) at various points during the 1000-step reverse process (left to right). Large-scale structure like color and shape emerges first; fine details appear last. Each row is a different sample being generated from pure noise.

Worked Example

Let \(x_0 = 0.9\) (a 1-pixel image). Use a 3-step schedule with \(\beta_1 = 0.1\), \(\beta_2 = 0.2\), \(\beta_3 = 0.3\).

\(t\)	\(\beta_t\)	\(\alpha_t = 1 - \beta_t\)	\(\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s\)
1	0.1	0.9	0.9
2	0.2	0.8	\(0.9 \times 0.8 = 0.72\)
3	0.3	0.7	\(0.72 \times 0.7 = 0.504\)

\[x_3 = \sqrt{0.504} \times 0.9 + \sqrt{1 - 0.504} \times 0.65 = 0.710 \times 0.9 + 0.704 \times 0.65 = 0.639 + 0.458 = 1.097\]

At \(t=3\), roughly half the variance comes from the original signal and half from noise. At \(t = 1000\) with the paper’s schedule, \(\bar{\alpha}_{1000} \approx 0.00006\), meaning 99.99% of the variance is noise – the original image is effectively erased.

This shortcut is critical for training: to compute the loss at a random timestep \(t\), we generate \(x_t\) in \(O(1)\) time rather than running \(t\) sequential steps.

Exercises

Recall: What does \(\bar{\alpha}_t\) represent intuitively? What happens to its value as \(t\) increases from 1 to \(T\)?

Apply: Given \(\beta_1 = 0.01\), \(\beta_2 = 0.02\), \(\beta_3 = 0.03\), compute \(\bar{\alpha}_3\). Then compute \(x_3\) for \(x_0 = 0.7\) and \(\epsilon = -0.5\).

Extend: The paper chooses \(\beta_t\) to increase linearly from \(10^{-4}\) to \(0.02\). How would the behavior change if \(\beta_t\) were constant at \(0.01\) for all 1000 steps? Would \(\bar{\alpha}_{1000}\) be larger or smaller than with the linear schedule?

Lesson 4: The Reverse Process – Learning to Denoise

The forward process is easy: just add noise. The hard part is going backward. Given a noisy image \(x_t\), can we recover the slightly less noisy \(x_{t-1}\)? If we can, we chain 1000 such steps starting from pure noise and get a clean image. This is the generative direction – and the reason diffusion models exist.

Explanation

The reverse process learns to undo the forward process one step at a time. Each reverse step is modeled as a Gaussian:

\[p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1};\, \mu_\theta(x_t, t),\, \sigma_t^2 I)\]

The neural network takes two inputs: the current noisy image \(x_t\) and the timestep \(t\) (an integer telling the network how noisy the input is). It outputs a prediction that is used to compute \(\mu_\theta\).

Why is a Gaussian the right choice for the reverse step? When the forward steps are small (small \(\beta_t\)), statistical theory guarantees that the true reverse transition is also approximately Gaussian. Since this paper uses 1000 tiny steps, the Gaussian assumption is well justified.

Now, here is a key insight. During training, we know the original image \(x_0\). If we condition the forward process posterior on \(x_0\), we get a tractable (computationally feasible to calculate in closed form) target distribution:

\[q(x_{t-1} | x_t, x_0) = \mathcal{N}(x_{t-1};\, \tilde{\mu}_t(x_t, x_0),\, \tilde{\beta}_t I)\]

\[\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1- \bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t\]

In plain language: if we know both the noisy image \(x_t\) and the original \(x_0\), we can compute exactly where \(x_{t-1}\) should land. The posterior mean is a blend of the original image (pulling toward the truth) and the current noisy image (staying close to where we are). Training the model means making \(\mu_\theta(x_t, t)\) match \(\tilde{\mu}_t(x_t, x_0)\) as closely as possible.

Worked Example

Let \(x_0 = 0.9\), \(t = 2\), and use \(\beta_1 = 0.1\), \(\beta_2 = 0.2\). We computed \(\alpha_1 = 0.9\), \(\alpha_2 = 0.8\), \(\bar{\alpha}_1 = 0.9\), \(\bar{\alpha}_2 = 0.72\).

Suppose the noisy image at step 2 is \(x_2 = 1.1\) (we know this because we ran the forward process).

\[\tilde{\mu}_2 = \frac{\sqrt{0.9} \times 0.2}{1 - 0.72} \times 0.9 + \frac{\sqrt{0.8} \times (1 - 0.9)}{1 - 0.72} \times 1.1\]

\[= \frac{0.949 \times 0.2}{0.28} \times 0.9 + \frac{0.894 \times 0.1}{0.28} \times 1.1\]

\[= \frac{0.1897}{0.28} \times 0.9 + \frac{0.0894}{0.28} \times 1.1 = 0.678 \times 0.9 + 0.319 \times 1.1\]

The posterior variance: \(\tilde{\beta}_2 = \frac{1 - 0.9}{1 - 0.72} \times 0.2 = \frac{0.1}{0.28} \times 0.2 = 0.0714\).

So \(q(x_1 | x_2, x_0) = \mathcal{N}(0.961, 0.0714)\). The reverse step lands between the original value (0.9) and the noisy value (1.1), closer to the original – which is exactly what denoising should do.

Exercises

Recall: Why can we compute the exact posterior \(q(x_{t-1} | x_t, x_0)\) during training but not during generation?

Apply: Given \(x_0 = 0.5\), \(x_2 = 0.7\), \(\bar{\alpha}_1 = 0.95\), \(\bar{\alpha}_2 = 0.85\), \(\alpha_2 = 0.85/0.95 \approx 0.895\), and \(\beta_2 = 1 - 0.895 = 0.105\), compute \(\tilde{\mu}_2\) and \(\tilde{\beta}_2\).

Extend: The posterior mean \(\tilde{\mu}_t\) is a weighted sum of \(x_0\) and \(x_t\). At early timesteps (\(t\) small, little noise), which term dominates? At late timesteps (\(t\) large, lots of noise)? Why does this make intuitive sense?

Lesson 5: The Variational Bound – Training with KL Divergence

We now have a forward process that destroys data and a reverse process that tries to reconstruct it. How do we train the reverse process? The answer comes from variational inference – the same framework used in VAEs (see Auto-Encoding Variational Bayes course). We optimize a bound on the negative log-likelihood that decomposes into manageable per-timestep terms.

Explanation

The goal of a generative model is to maximize the likelihood \(p_\theta(x_0)\) – the probability that the model assigns to real training images. Since computing this exactly is intractable (it requires integrating over all possible noise trajectories), we optimize a variational bound:

\[L = \mathbb{E}_q \bigg[ \underbrace{D_\text{KL}(q(x_T|x_0) \| p(x_T))}_{L_T} + \sum_{t > 1} \underbrace{D_\text{KL}(q(x_{t-1}|x_t,x_0) \| p_\theta(x_{t-1}|x_t))}_{L_{t-1}} \underbrace{- \log p_\theta(x_0|x_1)}_{L_0} \bigg]\]

The KL divergence between two Gaussians \(\mathcal{N}(\mu_1, \sigma_1^2)\) and \(\mathcal{N}(\mu_2, \sigma_2^2)\) is:

\[D_\text{KL} = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}\]

When \(\sigma_1 = \sigma_2\) (same variance, which holds here because the paper fixes \(\sigma_t\)), this simplifies to:

\[D_\text{KL} = \frac{1}{2\sigma_t^2} \|\tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t)\|^2\]

This is just the squared distance between the true posterior mean and the model’s predicted mean, scaled by the variance. Training reduces to making \(\mu_\theta\) match \(\tilde{\mu}_t\) at every timestep.

Worked Example

Consider a single timestep \(t = 5\) for a 1-pixel image. Suppose \(\sigma_5^2 = 0.008\) (the fixed reverse variance), \(\tilde{\mu}_5 = 0.72\) (the true posterior mean), and \(\mu_\theta = 0.68\) (the model’s prediction).

\[L_4 = \frac{1}{2 \times 0.008} |0.72 - 0.68|^2 = \frac{1}{0.016} \times 0.0016 = 0.1\]

The loss dropped from 0.1 to 0.00625 – a 16x improvement from getting 0.03 closer to the true mean. This shows how the quadratic penalty amplifies even small errors.

Exercises

Recall: What are the three types of terms in the variational bound \(L\)? Which one has no learnable parameters?

Apply: Given \(\sigma_t^2 = 0.01\), \(\tilde{\mu}_t = 0.55\), and \(\mu_\theta = 0.60\), compute the KL term \(L_{t-1}\) for this single timestep.

Extend: The variational bound has \(T - 1\) KL terms (one per timestep). If all terms were weighted equally, would the model spend equal effort on easy denoising (small \(t\), barely noisy images) and hard denoising (large \(t\), very noisy images)? What happens to the weighting factor \(\frac{1}{2\sigma_t^2}\) as \(t\) varies?

Lesson 6: The Noise Prediction Parameterization

We know the model needs to predict the posterior mean \(\tilde{\mu}_t\). But what should the neural network actually output? It could predict the clean image \(x_0\), the mean \(\tilde{\mu}_t\) directly, or something else entirely. The paper’s key discovery is that predicting the noise \(\epsilon\) that was added works dramatically better than the alternatives. This is the central technical insight of DDPM.

Explanation

\[x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}} (x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon)\]

If we substitute this into the posterior mean formula from Lesson 4, after some algebra we get:

\[\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon \right)\]

This is a remarkable simplification. The posterior mean depends on \(x_t\) (which we have) and \(\epsilon\) (the noise that was added, which we do not know at generation time). So the authors propose: let the neural network predict \(\epsilon\). Define \(\epsilon_\theta(x_t, t)\) as the network’s noise prediction, and parameterize:

\[\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right)\]

\[x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) + \sigma_t z\]

where \(z \sim \mathcal{N}(0, I)\) if \(t > 1\), and \(z = 0\) if \(t = 1\) (no noise at the final step).

Why does noise prediction work better than directly predicting \(x_0\) or \(\tilde{\mu}_t\)? The paper shows that predicting \(\epsilon\) is mathematically equivalent to estimating the score function \(\nabla_x \log p(x)\) – the gradient of the log probability density of the noisy data. This connects DDPM to a separate line of research called denoising score matching. The score function points “uphill” toward higher probability regions, so a network that estimates it well is learning the structure of the data distribution at every noise level.

Samples generated from the same intermediate noisy image \(x_t\) at different noise levels. At \(t=1000\) (left), the shared latent is nearly pure noise, so samples differ completely. At \(t=250\) (right), most structure is preserved and samples differ only in fine details. This demonstrates the coarse-to-fine generation process.

Worked Example

Let \(x_0 = 0.9\), and suppose at \(t = 2\) with \(\alpha_2 = 0.8\), \(\bar{\alpha}_2 = 0.72\), \(\beta_2 = 0.2\), the noise was \(\epsilon = 0.65\).

The noisy image: \(x_2 = \sqrt{0.72} \times 0.9 + \sqrt{0.28} \times 0.65 = 0.849 \times 0.9 + 0.529 \times 0.65 = 0.764 + 0.344 = 1.108\).

Now the neural network sees \(x_2 = 1.108\) and \(t = 2\). Suppose it predicts \(\epsilon_\theta = 0.60\) (close to the true \(\epsilon = 0.65\), but not perfect).

\[\mu_\theta = \frac{1}{\sqrt{0.8}} \left(1.108 - \frac{0.2}{\sqrt{0.28}} \times 0.60\right) = \frac{1}{0.894} \left(1.108 - \frac{0.2}{0.529} \times 0.60\right)\]

\[= 1.118 \times (1.108 - 0.378 \times 0.60) = 1.118 \times (1.108 - 0.227) = 1.118 \times 0.881 = 0.985\]

\[\mu_\theta = 1.118 \times (1.108 - 0.378 \times 0.65) = 1.118 \times (1.108 - 0.246) = 1.118 \times 0.862 = 0.964\]

The true posterior mean (from Lesson 4) was \(\tilde{\mu}_2 = 0.961\), so perfect noise prediction gives \(\mu_\theta = 0.964\) – very close. The small discrepancy comes from rounding.

Exercises

Recall: The paper tried three parameterizations: predicting \(x_0\) directly, predicting \(\tilde{\mu}_t\), and predicting \(\epsilon\). Which performed best, and what theoretical connection explains its advantage?

Apply: Given \(x_t = 0.5\), \(t = 3\), \(\alpha_3 = 0.7\), \(\bar{\alpha}_3 = 0.504\), \(\beta_3 = 0.3\), and the network predicts \(\epsilon_\theta = 0.3\), compute \(\mu_\theta(x_t, 3)\).

Extend: At very early timesteps (\(t\) small), \(x_t\) is barely noisy. Is predicting \(\epsilon\) easy or hard at these timesteps? What about at very late timesteps (\(t\) close to \(T\)) where \(x_t\) is nearly pure noise?

Lesson 7: The Simplified Training Objective

The full variational bound from Lesson 5 includes a per-timestep weighting factor that upweights easy denoising tasks (small \(t\)) and downweights hard ones (large \(t\)). The paper’s second key contribution is showing that dropping this weighting entirely – treating all timesteps equally – produces dramatically better image quality, even though it slightly hurts the log-likelihood.

Explanation

With the \(\epsilon\)-prediction parameterization, the per-timestep loss term from the variational bound becomes:

\[L_{t-1} = \mathbb{E}_{x_0, \epsilon} \left[ \frac{\beta_t^2}{2 \sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)} \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\, t\right) \right\|^2 \right]\]

The weighting factor \(\frac{\beta_t^2}{2 \sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)}\) is large for small \(t\) (where \(\beta_t\) is tiny and \(\bar{\alpha}_t\) is close to 1) and smaller for large \(t\). This means the variational bound cares most about the easy denoising steps – removing tiny amounts of noise from nearly clean images.

\[L_\text{simple}(\theta) = \mathbb{E}_{t,\, x_0,\, \epsilon} \left[ \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\, t\right) \right\|^2 \right]\]

In plain language: pick a random image, pick a random noise level, add noise, and train the network to predict what noise was added. The loss is the mean squared error between the true noise and the predicted noise, averaged uniformly across all timesteps.

The results were striking. On CIFAR10, \(L_\text{simple}\) achieves an FID (Frechet Inception Distance, a measure of how similar generated images are to real ones – lower is better) of 3.17, while the full variational bound gets FID 13.51. The simplified loss wins by a factor of 4 in sample quality, because it forces the network to focus on hard denoising tasks rather than wasting capacity on easy ones.

Worked Example

Let \(x_0 = [0.8, -0.3]\) (a 2-pixel image), \(T = 3\), and suppose we randomly sample \(t = 2\) with \(\bar{\alpha}_2 = 0.72\). We sample \(\epsilon = [0.42, -0.71]\).

\[x_2 = \sqrt{0.72} \times [0.8, -0.3] + \sqrt{0.28} \times [0.42, -0.71]\] \[= 0.849 \times [0.8, -0.3] + 0.529 \times [0.42, -0.71]\] \[= [0.679, -0.255] + [0.222, -0.376] = [0.901, -0.631]\]

Step 2: Feed \(x_2 = [0.901, -0.631]\) and \(t = 2\) to the network. Suppose it predicts \(\epsilon_\theta = [0.35, -0.60]\).

\[L = \|[0.42, -0.71] - [0.35, -0.60]\|^2 = \|[0.07, -0.11]\|^2 = 0.07^2 + 0.11^2 = 0.0049 + 0.0121 = 0.017\]

This is the MSE for one training step. Over many samples and timesteps, the network learns to predict the noise accurately at every noise level.

Exercises

Recall: What is the key difference between \(L_\text{simple}\) and the full variational bound? Which produces better images, and which produces better log-likelihoods?

Apply: Given \(x_0 = [0.5, 0.5]\), \(t = 1\), \(\bar{\alpha}_1 = 0.9\), and \(\epsilon = [1.0, -1.0]\), compute \(x_1\) and the loss if the network predicts \(\epsilon_\theta = [0.9, -0.8]\).

Extend: The simplified objective treats all timesteps equally. Suppose you found that your model generates images with correct large-scale structure but poor fine details. Which timesteps (\(t\) small or \(t\) large) might you want to upweight? Why?

Lesson 8: Putting It All Together – The Complete DDPM

We now have every piece needed: a forward process that destroys images, a closed-form shortcut for training, a noise-prediction network, and a simplified loss. This lesson assembles them into the complete DDPM – including the U-Net architecture, the sampling algorithm, and the key results.

Explanation

The Architecture. The neural network \(\epsilon_\theta(x_t, t)\) is a U-Net – a convolutional network (a neural network that slides small learned filters across the image to detect local patterns like edges and textures) shaped like the letter U. The image first passes through a series of downsampling blocks (reducing resolution: 32x32 -> 16x16 -> 8x8), then through upsampling blocks (increasing resolution back to 32x32). Skip connections (direct wiring between matching encoder and decoder layers) bridge corresponding resolution levels, allowing fine details to flow from the encoder to the decoder without passing through the bottleneck.

The network needs to know which timestep it is denoising (removing a tiny amount of noise at \(t=1\) is very different from removing massive noise at \(t=900\)). The paper encodes \(t\) using Transformer-style sinusoidal position embeddings (see Attention Is All You Need) – the same technique transformers use to encode token position, repurposed to encode noise level. The network also includes self-attention layers at 16x16 resolution, group normalization (a technique that normalizes activations within groups of channels to stabilize training), and uses the same weights for all 1000 timesteps.

Note that the sampling formula uses \(\frac{1 - \alpha_t}{\sqrt{1-\bar{\alpha}_t}}\) which equals \(\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\) since \(\beta_t = 1 - \alpha_t\).

Four faces generated by DDPM on CelebA-HQ at 256x256 resolution (not cherry-picked). The iterative denoising approach produces photorealistic results competitive with GANs, but with stable training and no mode collapse (a GAN failure where the generator produces only a narrow subset of possible outputs).

Results. On unconditional CIFAR10 (32x32 images, 10 categories), DDPM achieves:

FID (Frechet Inception Distance) measures how similar generated and real images are (lower is better). Inception Score measures quality and diversity (higher is better). DDPM achieves the best FID of any unconditional model, competitive even with models that use class labels (which DDPM does not).

The main limitation is sampling speed: generating one image requires 1000 sequential neural network evaluations. For 256x256 images, sampling a batch of 128 takes about 300 seconds, compared to a single forward pass for GANs.

Worked Example

Model	Inception Score	FID
StyleGAN2 + ADA	9.74	3.26
DDPM (\(L_\text{simple}\))	9.46	3.17
NCSN	8.87	25.32

Let us trace the complete sampling loop for a 1-pixel image with \(T = 3\) steps.

Given: \(\alpha_1 = 0.9\), \(\alpha_2 = 0.8\), \(\alpha_3 = 0.7\), so \(\bar{\alpha}_1 = 0.9\), \(\bar{\alpha}_2 = 0.72\), \(\bar{\alpha}_3 = 0.504\).

Set \(\sigma_t = \sqrt{\beta_t}\), so \(\sigma_3 = \sqrt{0.3} = 0.548\), \(\sigma_2 = \sqrt{0.2} = 0.447\), \(\sigma_1 = \sqrt{0.1} = 0.316\).

Step \(t=3\): Network predicts \(\epsilon_\theta(1.5, 3) = 0.9\). Sample \(z = 0.3\).

\[x_2 = \frac{1}{\sqrt{0.7}} \left(1.5 - \frac{0.3}{\sqrt{0.496}} \times 0.9\right) + 0.548 \times 0.3\]

\[= 1.195 \times (1.5 - 0.426 \times 0.9) + 0.164 = 1.195 \times 1.117 + 0.164 = 1.335 + 0.164 = 1.499\]

Step \(t=2\): Network predicts \(\epsilon_\theta(1.499, 2) = 1.1\). Sample \(z = -0.5\).

\[x_1 = \frac{1}{\sqrt{0.8}} \left(1.499 - \frac{0.2}{\sqrt{0.28}} \times 1.1\right) + 0.447 \times (-0.5)\]

\[= 1.118 \times (1.499 - 0.378 \times 1.1) - 0.224 = 1.118 \times 1.083 - 0.224 = 1.211 - 0.224 = 0.987\]

Step \(t=1\): Network predicts \(\epsilon_\theta(0.987, 1) = 0.2\). Set \(z = 0\) (final step).

\[x_0 = \frac{1}{\sqrt{0.9}} \left(0.987 - \frac{0.1}{\sqrt{0.1}} \times 0.2\right) = 1.054 \times (0.987 - 0.316 \times 0.2) = 1.054 \times 0.924 = 0.974\]

Our generated “image” is \(x_0 = 0.974\). Starting from noise \(x_3 = 1.5\), the network’s denoising predictions guided us to a plausible data point.

Exercises

Recall: Why does the sampling algorithm set \(z = 0\) at the final step (\(t = 1\)) but sample \(z \sim \mathcal{N}(0, I)\) at all other steps?

Apply: Given \(x_3 = -0.8\), \(\alpha_3 = 0.7\), \(\bar{\alpha}_3 = 0.504\), \(\sigma_3 = 0.548\), \(\epsilon_\theta = -0.4\), and \(z = 0.6\), compute \(x_2\).

Extend: DDPM requires 1000 network evaluations to generate one image. DDIM (a later method) showed that deterministic sampling with larger step sizes can reduce this to 50 steps. What trade-off would you expect between the number of sampling steps and image quality?

Comprehension Questions

Hands-On Project

Goal

Build a 1D diffusion model from scratch with numpy that learns to generate samples from a mixture of Gaussians.

Specification

The model should learn to generate samples clustered around \(-2\) and \(+2\), matching the bimodal data distribution.

Starter Code

import numpy as np

# --- Data ---
def sample_data(n):
    """Sample from mixture of two Gaussians: 0.5*N(-2,0.09) + 0.5*N(2,0.09)"""
    components = np.random.choice([-2.0, 2.0], size=n)
    return components + np.random.randn(n) * 0.3

# --- Noise schedule ---
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # cumulative product

# --- Simple MLP ---
def relu(x):
    return np.maximum(0, x)

def init_mlp(layer_sizes, seed=42):
    """Initialize MLP weights with Xavier initialization."""
    rng = np.random.RandomState(seed)
    params = []
    for i in range(len(layer_sizes) - 1):
        fan_in, fan_out = layer_sizes[i], layer_sizes[i + 1]
        scale = np.sqrt(2.0 / (fan_in + fan_out))
        W = rng.randn(fan_in, fan_out) * scale
        b = np.zeros(fan_out)
        params.append((W, b))
    return params

def forward_mlp(params, x):
    """Forward pass through MLP. ReLU on hidden layers, linear output."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = relu(x)
    return x

def backward_mlp(params, x, grad_output):
    """Backward pass. Returns gradients for each (W, b) pair."""
    # Forward pass storing activations
    activations = [x]
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = relu(x)
        activations.append(x)

    grads = []
    g = grad_output
    for i in reversed(range(len(params))):
        W, b = params[i]
        pre_act = activations[i] @ W + b
        if i < len(params) - 1:
            g = g * (pre_act > 0).astype(float)  # ReLU gradient
        grad_W = activations[i].T @ g / len(grad_output)
        grad_b = g.mean(axis=0)
        g = g @ W.T
        grads.append((grad_W, grad_b))

    return list(reversed(grads))

def timestep_embedding(t, dim=32):
    """Sinusoidal timestep embedding (same idea as Transformer position encoding)."""
    half = dim // 2
    freqs = np.exp(-np.log(1000.0) * np.arange(half) / half)
    args = t[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=-1)

# --- TODO: Initialize the network ---
# Input: x_t (1 dim) concatenated with timestep embedding (32 dim) = 33 dims
# Hidden layers: 128, 128
# Output: predicted noise (1 dim)
# params = init_mlp([33, 128, 128, 1])

# --- TODO: Implement the forward diffusion shortcut ---
def q_sample(x_0, t, noise):
    """
    Compute x_t from x_0 using the closed-form formula.
    x_0: clean data, shape (batch,)
    t: timestep indices, shape (batch,), integers in [0, T-1]
    noise: epsilon, shape (batch,), sampled from N(0,1)
    Returns: x_t, shape (batch,)
    """
    pass  # TODO

# --- TODO: Implement noise prediction ---
def predict_noise(params, x_t, t):
    """
    Predict epsilon given noisy data x_t and timestep t.
    Concatenate x_t with timestep embedding and pass through MLP.
    x_t: shape (batch,)
    t: shape (batch,), integer timestep indices
    Returns: predicted noise, shape (batch,)
    """
    pass  # TODO

# --- TODO: Implement one training step ---
def train_step(params, x_0, lr=1e-3):
    """
    One step of L_simple training.
    1. Sample random timesteps
    2. Sample noise
    3. Compute x_t using q_sample
    4. Predict noise
    5. Compute MSE loss
    6. Backpropagate and update params
    Returns: updated params, loss value
    """
    pass  # TODO

# --- TODO: Implement the sampling loop ---
def sample(params, n_samples):
    """
    Generate samples by running the full reverse process.
    1. Start from x_T ~ N(0, 1)
    2. For t = T-1, ..., 0:
       - Predict noise
       - Compute x_{t-1} using the reverse step formula
       - Add noise z if t > 0
    Returns: generated samples, shape (n_samples,)
    """
    pass  # TODO

# --- Training loop ---
# params = init_mlp([33, 128, 128, 1])
# for step in range(5000):
#     x_0 = sample_data(256)
#     params, loss = train_step(params, x_0)
#     if step % 500 == 0:
#         print(f"Step {step}, Loss: {loss:.4f}")
#
# --- Generate and verify ---
# samples = sample(params, 1000)
# print(f"Sample mean: {samples.mean():.2f} (expect ~0.0)")
# print(f"Sample std:  {samples.std():.2f} (expect ~2.1)")
# print(f"Fraction near -2: {((samples > -3) & (samples < -1)).mean():.2f} (expect ~0.50)")
# print(f"Fraction near +2: {((samples > 1) & (samples < 3)).mean():.2f} (expect ~0.50)")

Expected Output

The generated samples should cluster around \(-2\) and \(+2\), matching the bimodal training distribution. The loss values are approximate – exact numbers depend on random seeds, but the loss should decrease steadily and the generated samples should clearly show two modes.

Denoising Diffusion Probabilistic Models: Course

Learning Objectives

Prerequisites

Lesson 1: Gaussian Noise – The Building Block of Diffusion

Explanation

Worked Example

Exercises

Lesson 2: The Forward Process – Destroying Images Step by Step

Explanation

Worked Example

Exercises

Lesson 3: The Closed-Form Shortcut – Jumping to Any Noise Level

Explanation

Worked Example

Exercises

Lesson 4: The Reverse Process – Learning to Denoise

Explanation

Worked Example

Exercises

Lesson 5: The Variational Bound – Training with KL Divergence

Explanation

Worked Example

Exercises

Lesson 6: The Noise Prediction Parameterization

Explanation

Worked Example

Exercises

Lesson 7: The Simplified Training Objective

Explanation

Worked Example

Exercises

Lesson 8: Putting It All Together – The Complete DDPM

Explanation

Worked Example

Exercises

Comprehension Questions

Hands-On Project

Goal

Specification

Starter Code

Expected Output

Further Reading