Why do different handwritten 3s look different? Some are round, some have flat tops, some lean left, some lean right. Behind the pixels there are hidden factors – stroke angle, pen pressure, writing speed – that caused each particular image. If we could discover and control these hidden factors, we could both understand existing images and create new ones.
A latent variable is a quantity that isn’t directly observed but influences the data you see. Think of a weather station that only records temperature and humidity. The actual weather pattern (cold front, warm front, stationary) is a latent variable – you never measure it directly, but it determines the temperature and humidity readings.
In generative modeling, we assume data \(x\) (an image, a sentence, a data point) was produced by some hidden code \(z\) through a two-step process: first sample a code \(z\) from a prior distribution \(p(z)\), then sample the data \(x\) from a likelihood \(p_\theta(x|z)\).
The prior says “what codes are plausible before seeing any data.” The simplest choice is a standard Gaussian:
\[p(z) = \mathcal{N}(0, I)\]
The likelihood \(p_\theta(x|z)\) says “given a code \(z\), how is data generated?” This is a neural network parameterized by \(\theta\) – the decoder. It takes a code and produces data.
The joint probability of a code-data pair is:
\[p_\theta(x, z) = p(z) \cdot p_\theta(x|z)\]
The probability of the data alone (marginalizing out the hidden code) is:
\[p_\theta(x) = \int p(z) \cdot p_\theta(x|z) \, dz\]
This integral sums up contributions from every possible code \(z\). It asks: “considering all possible hidden causes, how likely is this data?” This quantity is called the marginal likelihood or evidence.
Suppose we have a 1D latent variable \(z\) and 1D data \(x\). The prior is \(p(z) = \mathcal{N}(0, 1)\) and the likelihood is \(p_\theta(x|z) = \mathcal{N}(z + 2, 0.5^2)\) – the data is the code shifted by 2, plus some noise.
Generating a data point: suppose we sample \(z = 0.7\) from the prior, then sample \(x = 2.9\) from \(\mathcal{N}(0.7 + 2, 0.5^2) = \mathcal{N}(2.7, 0.25)\).
The joint probability at this specific \((x, z)\) pair:
\[p(z = 0.7) = \frac{1}{\sqrt{2\pi}} e^{-0.7^2 / 2} = \frac{1}{2.507} e^{-0.245} = 0.312\]
\[p_\theta(x = 2.9 \mid z = 0.7) = \frac{1}{\sqrt{2\pi \cdot 0.25}} e^{-(2.9 - 2.7)^2 / (2 \cdot 0.25)} = \frac{1}{1.253} e^{-0.08} = 0.736\]
\[p_\theta(x = 2.9, z = 0.7) = 0.312 \times 0.736 = 0.230\]
The marginal likelihood \(p_\theta(x = 2.9)\) would require integrating this over all possible \(z\) values – easy in this toy case (both are Gaussian, so the integral is a Gaussian convolution), but intractable (impossible to compute exactly in a reasonable amount of time) when the decoder is a neural network.
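Because everything in this toy case is Gaussian, each quantity above can be checked numerically. A short numpy sketch (the `normal_pdf` helper is ours, not from the paper) verifies the joint probability and estimates the marginal likelihood both in closed form and by brute-force Monte Carlo:

```python
import numpy as np

def normal_pdf(v, mean, std):
    """Density of N(mean, std^2) evaluated at v."""
    return np.exp(-0.5 * ((v - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Prior p(z) = N(0, 1) and likelihood p(x|z) = N(z + 2, 0.5^2)
p_z = normal_pdf(0.7, 0.0, 1.0)               # prior density at z = 0.7
p_x_given_z = normal_pdf(2.9, 0.7 + 2, 0.5)   # likelihood of x = 2.9 given z = 0.7
p_joint = p_z * p_x_given_z                   # joint density p(x = 2.9, z = 0.7)

# In this toy case the marginal is a Gaussian convolution:
# x = z + 2 + noise, so x ~ N(2, 1 + 0.25)
p_marginal_exact = normal_pdf(2.9, 2.0, np.sqrt(1.25))

# Monte Carlo estimate of the same integral: average p(x|z) over z ~ p(z)
rng = np.random.default_rng(0)
z_samples = rng.standard_normal(200_000)
p_marginal_mc = normal_pdf(2.9, z_samples + 2, 0.5).mean()
```

The Monte Carlo estimate agrees with the closed form to a few decimal places here; it is exactly this kind of averaging that stops working once the decoder is a neural network and the latent space is high-dimensional.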
Recall: What is the difference between a prior \(p(z)\) and a likelihood \(p_\theta(x|z)\)? Which one involves the decoder neural network?
Apply: Given prior \(p(z) = \mathcal{N}(0, 1)\) and likelihood \(p_\theta(x|z) = \mathcal{N}(3z, 1)\), compute the joint probability \(p_\theta(x = 1.5, z = 0.5)\). Show each step.
Extend: Why do we choose \(p(z) = \mathcal{N}(0, I)\) as the prior instead of, say, a uniform distribution over \([-1, 1]^J\)? What property of the Gaussian makes it mathematically convenient for the VAE framework? (Hint: think about what happens when you compute KL divergence between two Gaussians.)
We now have a generative model: sample \(z\), then sample \(x\) given \(z\). But what about the reverse direction? Given a data point \(x\), what code \(z\) produced it? This reverse question turns out to be the central difficulty the VAE paper solves.
Imagine you hear a song on the radio and want to know the original sheet music (the hidden score that generated the sound). Many different scores could produce similar sounds – a chord can be voiced in multiple ways, different instruments can play the same notes. Finding the most likely score given the recording is inference: computing \(p_\theta(z|x)\).
Bayes’ theorem gives the answer:
\[p_\theta(z|x) = \frac{p_\theta(x|z) \cdot p(z)}{p_\theta(x)}\]
The numerator is easy to compute – just multiply the decoder output by the prior. The problem is the denominator:
\[p_\theta(x) = \int p(z) \cdot p_\theta(x|z) \, dz\]
When the decoder is a neural network (a complex nonlinear function), this integral has no closed-form solution. You would need to evaluate the decoder at every possible \(z\) and sum up the results. In a 20-dimensional latent space, “every possible \(z\)” is a 20-dimensional continuous space – the integral is intractable.
Without \(p_\theta(x)\), we cannot compute the posterior \(p_\theta(z|x)\). Without the posterior, we cannot infer the hidden code that explains a given data point, and standard training procedures that rely on the posterior (such as the EM algorithm) break down.
The existing solutions each had serious limitations: mean-field variational inference requires closed-form expectations that neural-network likelihoods do not admit, and MCMC sampling is far too expensive to run for every data point.
Let’s make the intractability concrete. Suppose we have a 2D latent space (\(J = 2\)) and want to compute \(p_\theta(x)\) for a single image \(x\).
Attempt at numerical integration with a grid: if we discretize each dimension into 100 bins, we need \(100^2 = 10{,}000\) evaluations of \(p_\theta(x|z) \cdot p(z)\). With \(J = 20\) (a typical latent dimension), we’d need \(100^{20} = 10^{40}\) evaluations. At one nanosecond per evaluation, that takes \(10^{40} \times 10^{-9} = 10^{31}\) seconds – about \(3 \times 10^{23}\) years. The universe is \(1.4 \times 10^{10}\) years old.
Even Monte Carlo sampling (random evaluation instead of a grid) struggles. If most \(z\) values give near-zero \(p_\theta(x|z)\) (the decoder produces garbage for most random codes), then random sampling wastes almost every evaluation. You need to sample from regions where \(z\) actually explains \(x\) – but those are exactly the regions defined by the posterior \(p_\theta(z|x)\), which is what you’re trying to compute.
This circular dependency – needing the posterior to efficiently estimate the evidence, but needing the evidence to compute the posterior – is the core problem.
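The back-of-the-envelope arithmetic for the grid attempt is easy to reproduce (assuming, as in the text, 100 bins per dimension and one nanosecond per evaluation):

```python
GRID_BINS = 100
SECONDS_PER_EVAL = 1e-9
SECONDS_PER_YEAR = 3.15e7

for J in (2, 20):
    evaluations = GRID_BINS ** J
    years = evaluations * SECONDS_PER_EVAL / SECONDS_PER_YEAR
    print(f"J = {J:2d}: {float(evaluations):.0e} evaluations, {years:.1e} years")
```

At \(J = 2\) the grid finishes in a fraction of a second; at \(J = 20\) it needs on the order of \(10^{23}\) years.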
Recall: Why can’t we compute \(p_\theta(z|x)\) directly when the decoder is a neural network? Which specific quantity in Bayes’ theorem is intractable?
Apply: For a 1D latent space, approximate \(p_\theta(x = 2.5)\) by evaluating the integrand at 5 grid points: \(z \in \{-2, -1, 0, 1, 2\}\) with spacing \(\Delta z = 1\). Use \(p(z) = \mathcal{N}(0, 1)\) and \(p_\theta(x|z) = \mathcal{N}(z + 2, 0.5^2)\). Sum up \(p(z_i) \cdot p_\theta(x|z_i) \cdot \Delta z\) for each grid point. How accurate is this crude approximation?
Extend: The EM algorithm iterates between computing the posterior \(p_\theta(z|x)\) (E-step) and updating \(\theta\) to maximize expected log-likelihood (M-step). Explain why this algorithm cannot be used when the posterior is intractable. What would fail in the E-step?
Since we can’t compute the posterior exactly, we’ll approximate it. The idea: learn a second neural network that takes data \(x\) and outputs a distribution over \(z\) that’s close to the true posterior. Then optimize a quantity we can actually compute.
Instead of computing the true posterior \(p_\theta(z|x)\), we train an encoder network \(q_\phi(z|x)\) to approximate it. Think of it as hiring a fast but imperfect assistant: the true posterior is a perfect but impossibly slow oracle, while \(q_\phi\) is a quick neural network that gives a good-enough answer.
The encoder \(q_\phi(z|x)\) is a neural network with parameters \(\phi\) that takes a data point \(x\) and outputs a Gaussian distribution. How do we measure how good this approximation is? Using the KL divergence:
\[D_{KL}(q_\phi(z|x) \| p_\theta(z|x)) = \int q_\phi(z|x) \log \frac{q_\phi(z|x)}{p_\theta(z|x)} \, dz\]
We can’t compute this KL divergence directly (it involves the true posterior, which is intractable). But there’s a clever algebraic trick. Start with the log-evidence:
\[\log p_\theta(x) = D_{KL}(q_\phi(z|x) \| p_\theta(z|x)) + \mathcal{L}(\theta, \phi; x)\]
Since the KL divergence is always non-negative:
\[\log p_\theta(x) \geq \mathcal{L}(\theta, \phi; x)\]
The ELBO is a lower bound on the log-evidence. Maximizing the ELBO simultaneously pushes the log-evidence \(\log p_\theta(x)\) up (a better generative model) and pushes the KL gap \(D_{KL}(q_\phi(z|x) \| p_\theta(z|x))\) down (a better posterior approximation).
The ELBO decomposes into two interpretable terms:
\[\mathcal{L}(\theta, \phi; x) = \underbrace{-D_{KL}(q_\phi(z|x) \| p(z))}_{\text{regularization}} + \underbrace{\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]}_{\text{reconstruction}}\]
The tension between these two terms shapes the learned representation. The reconstruction term wants maximally informative codes (preserve every detail). The regularization term wants codes close to the prior (a featureless blob). The balance produces codes that capture the important structure in the data while discarding noise.
Let’s trace the ELBO decomposition with concrete numbers.
Suppose for a data point \(x\), the encoder outputs \(q_\phi(z|x) = \mathcal{N}(0.8, 0.6^2)\), i.e. \(\mu = 0.8\) and \(\sigma = 0.6\).
KL divergence (regularization term) between \(\mathcal{N}(\mu, \sigma^2)\) and \(\mathcal{N}(0, 1)\):
\[D_{KL} = -\frac{1}{2}(1 + \log \sigma^2 - \mu^2 - \sigma^2)\]
\[= -\frac{1}{2}(1 + \log(0.36) - 0.64 - 0.36) = -\frac{1}{2}(1 + (-1.022) - 0.64 - 0.36) = -\frac{1}{2}(-1.022) = 0.511\]
So \(-D_{KL} = -0.511\).
Reconstruction term: suppose we sample \(z = 0.8 + 0.6 \times 0.3 = 0.98\) (using noise \(\epsilon = 0.3\)), pass it through the decoder, and get \(\log p_\theta(x|z) = -4.2\).
ELBO: \(\mathcal{L} = -0.511 + (-4.2) = -4.711\)
Sanity check: the bound \(\log p_\theta(x) \geq \mathcal{L}\) tells us the true log-evidence must be at least \(-4.711\). Watch the signs here: \(-4.711\) is a larger number than \(-5.0\) on the number line, so a hypothetical model with \(\log p_\theta(x) = -5.0\) would violate the bound – any numbers consistent with this ELBO must satisfy \(\log p_\theta(x) \geq -4.711\).
This illustrates an important point: you typically don’t know the true log-evidence. The ELBO gives you a guaranteed floor. The tighter the ELBO (closer to the true value), the better your approximate posterior is.
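A short numpy sketch of the worked numbers, using the closed-form KL between a Gaussian and the standard normal (the decoder's \(\log p_\theta(x|z) = -4.2\) is taken as given, as in the example):

```python
import numpy as np

def kl_to_standard_normal(mu, var):
    """Closed-form KL( N(mu, var) || N(0, 1) ) for a scalar Gaussian."""
    return -0.5 * (1 + np.log(var) - mu**2 - var)

kl = kl_to_standard_normal(0.8, 0.6**2)   # regularization term, ~0.511
log_px_given_z = -4.2                     # reconstruction term (from the example)
elbo = -kl + log_px_given_z               # ~ -4.711
```

The ELBO of about \(-4.711\) is a guaranteed floor: the true log-evidence must be at least this large.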
Recall: The ELBO has two terms. Which term pushes codes toward the prior, and which term pushes codes to be informative about the data? What happens if you only optimize one term?
Apply: Given encoder output \(q_\phi(z|x) = \mathcal{N}(1.5, 0.3^2)\) and prior \(p(z) = \mathcal{N}(0, 1)\), compute the KL divergence using the formula \(D_{KL} = -\frac{1}{2}(1 + \log \sigma^2 - \mu^2 - \sigma^2)\). Is this encoder output close to or far from the prior?
Extend: If the regularization term dominates (very strong prior pressure), what happens to the latent codes? What does this imply for the decoder’s ability to reconstruct the input? This failure mode is called posterior collapse – describe what it looks like in practice.
We know we need a neural network \(q_\phi(z|x)\) to approximate the posterior. In this lesson we build it and understand how it turns the intractable inference problem into a single forward pass.
In traditional variational inference, you run an optimization procedure per data point to find the best approximate posterior. This is like hiring a private detective for each case – thorough but expensive. The VAE replaces this with a single neural network that handles all data points at once. This approach is called amortized inference – the cost of inference is amortized across the entire dataset.
The encoder takes a data point \(x\) (a vector of \(D\) numbers, e.g., a flattened 28x28 MNIST image = 784 numbers) and outputs two vectors of \(J\) numbers each: a mean \(\mu\) and a log-variance \(\log \sigma^2\), which together define a Gaussian distribution over the latent code.
Why \(\log \sigma^2\) instead of \(\sigma\) directly? Because variance must be positive, and \(\log \sigma^2\) can be any real number – no constraint needed on the network output. We recover \(\sigma\) as \(\sigma = \exp(\frac{1}{2} \log \sigma^2)\).
The architecture in the paper is a single-hidden-layer MLP (multilayer perceptron):
\[h = \tanh(W_1 x + b_1)\] \[\mu = W_2 h + b_2\] \[\log \sigma^2 = W_3 h + b_3\]
The approximate posterior for data point \(x\) is then:
\[q_\phi(z|x) = \mathcal{N}(\mu(x), \text{diag}(\sigma^2(x)))\]
Each latent dimension is independent (diagonal covariance). This is a simplification – the true posterior may have correlated dimensions – but it makes computation tractable.
Let’s build a tiny encoder: input dimension \(D = 4\), hidden dimension = 3, latent dimension \(J = 2\).
Weights (small for hand computation):
\[W_1 = \begin{bmatrix} 0.5 & -0.3 & 0.1 & 0.4 \\ -0.2 & 0.6 & -0.4 & 0.1 \\ 0.3 & 0.1 & 0.5 & -0.2 \end{bmatrix}, \quad b_1 = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}\]
\[W_2 = \begin{bmatrix} 0.4 & -0.3 & 0.2 \\ -0.1 & 0.5 & 0.3 \end{bmatrix}, \quad b_2 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}\]
\[W_3 = \begin{bmatrix} 0.2 & 0.1 & -0.3 \\ 0.1 & -0.2 & 0.4 \end{bmatrix}, \quad b_3 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}\]
Input: \(x = [1.0, 0.5, -0.5, 0.8]\)
Hidden layer:
\[W_1 x = \begin{bmatrix} 0.5(1.0) + (-0.3)(0.5) + 0.1(-0.5) + 0.4(0.8) \\ -0.2(1.0) + 0.6(0.5) + (-0.4)(-0.5) + 0.1(0.8) \\ 0.3(1.0) + 0.1(0.5) + 0.5(-0.5) + (-0.2)(0.8) \end{bmatrix} = \begin{bmatrix} 0.62 \\ 0.38 \\ -0.06 \end{bmatrix}\]
\[h = \tanh\left(\begin{bmatrix} 0.62 \\ 0.38 \\ -0.06 \end{bmatrix}\right) = \begin{bmatrix} 0.551 \\ 0.363 \\ -0.060 \end{bmatrix}\]
Mean output:
\[\mu = W_2 h = \begin{bmatrix} 0.4(0.551) + (-0.3)(0.363) + 0.2(-0.060) \\ -0.1(0.551) + 0.5(0.363) + 0.3(-0.060) \end{bmatrix} = \begin{bmatrix} 0.100 \\ 0.108 \end{bmatrix}\]
Log-variance output:
\[\log \sigma^2 = W_3 h = \begin{bmatrix} 0.2(0.551) + 0.1(0.363) + (-0.3)(-0.060) \\ 0.1(0.551) + (-0.2)(0.363) + 0.4(-0.060) \end{bmatrix} = \begin{bmatrix} 0.164 \\ -0.041 \end{bmatrix}\]
Standard deviation:
\[\sigma = \exp\left(\frac{1}{2} \begin{bmatrix} 0.164 \\ -0.041 \end{bmatrix}\right) = \begin{bmatrix} \exp(0.082) \\ \exp(-0.021) \end{bmatrix} = \begin{bmatrix} 1.086 \\ 0.980 \end{bmatrix}\]
So for input \(x = [1.0, 0.5, -0.5, 0.8]\), the encoder says: “the latent code is approximately \(\mathcal{N}([0.100, 0.108], \text{diag}([1.086^2, 0.980^2]))\).”
Both \(\sigma\) values are close to 1.0 and both \(\mu\) values are close to 0 – the encoder hasn’t learned much yet (it’s outputting near-prior distributions). After training, we’d expect \(\sigma\) values to shrink (more confident encoding) and \(\mu\) values to separate for different data points.
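The whole forward pass is three matrix products plus elementwise nonlinearities, so the hand computation is easy to check in numpy:

```python
import numpy as np

W1 = np.array([[ 0.5, -0.3,  0.1,  0.4],
               [-0.2,  0.6, -0.4,  0.1],
               [ 0.3,  0.1,  0.5, -0.2]])
W2 = np.array([[ 0.4, -0.3,  0.2],
               [-0.1,  0.5,  0.3]])
W3 = np.array([[ 0.2,  0.1, -0.3],
               [ 0.1, -0.2,  0.4]])
x = np.array([1.0, 0.5, -0.5, 0.8])

h = np.tanh(W1 @ x)           # hidden layer (all biases are zero)
mu = W2 @ h                   # mean head
logvar = W3 @ h               # log-variance head
sigma = np.exp(0.5 * logvar)  # recover the standard deviation
```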
Recall: Why does the encoder output \(\log \sigma^2\) instead of \(\sigma\) directly? What constraint does this avoid?
Apply: Given \(\log \sigma^2 = [-0.5, 1.2]\), compute \(\sigma\) for each dimension. Which latent dimension has more uncertainty?
Extend: The encoder uses a shared hidden layer \(h\) for both \(\mu\) and \(\log \sigma^2\). What would change if they had completely separate networks (no shared computation)? What advantage does sharing provide?
The decoder is the generative half of the VAE: given a latent code \(z\), it produces (a distribution over) data \(x\). The choice of output distribution determines the reconstruction loss.
Think of the decoder as an artist who receives a brief description (the code \(z\)) and must recreate a painting. The description might say “face, smiling, left-facing” – the artist fills in all the details.
The decoder is a neural network \(p_\theta(x|z)\) that takes a latent code \(z\) (a vector of \(J\) numbers) and produces parameters for a distribution over \(x\). The architecture mirrors the encoder:
\[h = \tanh(W_4 z + b_4)\] \[\text{output} = f(W_5 h + b_5)\]
The activation function \(f\) and the output distribution depend on the data type:
Binary data (e.g., binarized MNIST where each pixel is 0 or 1): the decoder outputs probabilities via sigmoid, and each pixel is modeled as an independent Bernoulli (a distribution over two outcomes, 0 or 1, parameterized by a single probability):
\[p_\theta(x|z) = \prod_{d=1}^{D} y_d^{x_d} (1 - y_d)^{1 - x_d}\]
where \(y = \sigma(W_5 h + b_5)\) is the vector of pixel probabilities. The reconstruction loss is binary cross-entropy:
\[\log p_\theta(x|z) = \sum_{d=1}^{D} \left[ x_d \log y_d + (1 - x_d) \log(1 - y_d) \right]\]
Continuous data (e.g., Frey Face dataset): the decoder outputs a mean (and optionally variance), and each dimension is modeled as an independent Gaussian:
\[\log p_\theta(x|z) = -\frac{D}{2}\log(2\pi) - \frac{1}{2}\sum_{d=1}^{D} \left[\log \sigma_d^2 + \frac{(x_d - \mu_d)^2}{\sigma_d^2}\right]\]
When the variance is fixed at 1, this simplifies to the mean squared error (up to a constant):
\[\log p_\theta(x|z) \propto -\frac{1}{2} \|x - \mu\|^2\]
This is why VAEs produce blurry images: minimizing squared error is equivalent to predicting the average of all plausible reconstructions, not a single sharp one.
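The averaging effect can be seen with a toy sketch (the two "images" below are made up for illustration): if the same code must reconstruct two different targets under squared error, the optimal single output is their pixelwise average – a blur of both.

```python
import numpy as np

# Two slightly different targets that one decoder output must explain
x1 = np.array([1.0, 0.0, 1.0, 0.0])
x2 = np.array([0.0, 1.0, 1.0, 0.0])

def avg_sq_error(mu):
    """Squared error averaged over both targets."""
    return 0.5 * (np.sum((x1 - mu) ** 2) + np.sum((x2 - mu) ** 2))

mu_avg = (x1 + x2) / 2          # the pixelwise average: [0.5, 0.5, 1.0, 0.0]
loss_avg = avg_sq_error(mu_avg)
loss_sharp = avg_sq_error(x1)   # committing to one sharp image does worse
```

The average beats either sharp reconstruction, and any perturbation away from it only increases the loss – which is exactly why fixed-variance Gaussian decoders blur.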
Let’s trace the decoder with a Bernoulli output for binary data.
Setup: latent dimension \(J = 2\), hidden dimension = 3, output dimension \(D = 4\).
\[W_4 = \begin{bmatrix} 0.3 & -0.5 \\ 0.7 & 0.2 \\ -0.1 & 0.6 \end{bmatrix}, \quad b_4 = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}\]
\[W_5 = \begin{bmatrix} 0.4 & 0.2 & -0.3 \\ -0.1 & 0.5 & 0.3 \\ 0.3 & -0.2 & 0.6 \\ 0.1 & 0.4 & -0.1 \end{bmatrix}, \quad b_5 = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}\]
Latent code: \(z = [0.5, -0.3]\)
Hidden layer:
\[W_4 z = \begin{bmatrix} 0.3(0.5) + (-0.5)(-0.3) \\ 0.7(0.5) + 0.2(-0.3) \\ -0.1(0.5) + 0.6(-0.3) \end{bmatrix} = \begin{bmatrix} 0.30 \\ 0.29 \\ -0.23 \end{bmatrix}\]
\[h = \tanh\left(\begin{bmatrix} 0.30 \\ 0.29 \\ -0.23 \end{bmatrix}\right) = \begin{bmatrix} 0.291 \\ 0.282 \\ -0.225 \end{bmatrix}\]
Output probabilities:
\[W_5 h = \begin{bmatrix} 0.4(0.291) + 0.2(0.282) + (-0.3)(-0.225) \\ -0.1(0.291) + 0.5(0.282) + 0.3(-0.225) \\ 0.3(0.291) + (-0.2)(0.282) + 0.6(-0.225) \\ 0.1(0.291) + 0.4(0.282) + (-0.1)(-0.225) \end{bmatrix} = \begin{bmatrix} 0.240 \\ 0.044 \\ -0.104 \\ 0.164 \end{bmatrix}\]
\[y = \sigma\left(\begin{bmatrix} 0.240 \\ 0.044 \\ -0.104 \\ 0.164 \end{bmatrix}\right) = \begin{bmatrix} 0.560 \\ 0.511 \\ 0.474 \\ 0.541 \end{bmatrix}\]
Reconstruction loss: suppose the true data is \(x = [1, 0, 1, 1]\).
Since \(x = [1, 0, 1, 1]\), each pixel with \(x_d = 1\) contributes \(\log y_d\) and the one pixel with \(x_d = 0\) contributes \(\log(1 - y_d)\):
\[\log p_\theta(x|z) = \log(0.560) + \log(1 - 0.511) + \log(0.474) + \log(0.541)\]
\[= -0.580 + (-0.715) + (-0.747) + (-0.614) = -2.656\]
The reconstruction log-likelihood is \(-2.656\). A perfect decoder would give \(\log p = 0\) (probabilities of 1.0 for each correct pixel). This decoder is far from perfect – it’s essentially guessing near 0.5 for everything.
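The decoder trace checks out in a few lines of numpy (same weights and code as above, biases zero):

```python
import numpy as np

W4 = np.array([[ 0.3, -0.5],
               [ 0.7,  0.2],
               [-0.1,  0.6]])
W5 = np.array([[ 0.4,  0.2, -0.3],
               [-0.1,  0.5,  0.3],
               [ 0.3, -0.2,  0.6],
               [ 0.1,  0.4, -0.1]])
z = np.array([0.5, -0.3])
x = np.array([1.0, 0.0, 1.0, 1.0])

h = np.tanh(W4 @ z)
y = 1.0 / (1.0 + np.exp(-(W5 @ h)))                      # sigmoid -> pixel probabilities
log_p = np.sum(x * np.log(y) + (1 - x) * np.log(1 - y))  # Bernoulli log-likelihood
```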
Recall: What is the difference between a Bernoulli decoder and a Gaussian decoder? When would you use each?
Apply: A Bernoulli decoder outputs probabilities \(y = [0.9, 0.1, 0.8, 0.3]\) for a binary target \(x = [1, 0, 1, 0]\). Compute \(\log p_\theta(x|z)\). Is this a good reconstruction?
Extend: Why do Gaussian decoders with fixed variance produce blurry images? Think about what the decoder learns to output when the same code \(z\) maps to slightly different images \(x_1\) and \(x_2\) – what minimizes \(\|x - \mu\|^2\) averaged over both?
This is the paper’s central technical contribution. We have an encoder that outputs \(\mu\) and \(\sigma\), and we need to sample \(z\) from \(\mathcal{N}(\mu, \sigma^2)\) to pass to the decoder. But sampling is a random operation – gradients can’t flow backward through a dice roll. The reparameterization trick solves this.
Imagine a factory assembly line. Parts move from Station A (encoder) to Station B (decoder). Between them, a randomizer shuffles parts – sometimes a part goes left, sometimes right, unpredictably. You want to improve Station A based on feedback from Station B, but the randomizer breaks the chain of cause and effect. You can’t trace “this defective output came from that specific input” because the randomizer scrambled things.
The trick: move the randomizer before Station A. Now the random input enters the assembly line at the very beginning, and everything after that is a deterministic function. You can trace any output back to its input and compute exactly how changing Station A’s settings would change the output.
Mathematically, instead of sampling:
\[z \sim \mathcal{N}(\mu, \sigma^2)\]
we sample noise from a fixed distribution and compute \(z\) deterministically:
\[\epsilon \sim \mathcal{N}(0, I)\] \[z = \mu + \sigma \odot \epsilon\]
The distribution of \(z\) is unchanged – it’s still \(\mathcal{N}(\mu, \sigma^2)\). But now the gradient \(\frac{\partial z}{\partial \mu} = 1\) and \(\frac{\partial z}{\partial \sigma} = \epsilon\) are well-defined. Backpropagation works through the entire encoder-decoder pipeline.
This is the insight that made the entire VAE trainable. Before this trick, variational inference with neural networks required high-variance gradient estimators like REINFORCE (a method that estimates gradients by weighting sampled function values by their log-probability scores) that were too noisy for practical use. The reparameterization trick produces low-variance gradients because the randomness is moved outside the computational graph.
Let’s trace gradients with and without the trick for a 2D latent space.
Encoder output for some input \(x\): \(\mu = [1.2, -0.5]\), \(\sigma = [0.8, 0.3]\)
Sample noise: \(\epsilon = [0.7, -1.1]\) from \(\mathcal{N}(0, I)\)
Compute \(z\):
\[z = \mu + \sigma \odot \epsilon = [1.2, -0.5] + [0.8, 0.3] \odot [0.7, -1.1]\]
\[= [1.2, -0.5] + [0.56, -0.33] = [1.76, -0.83]\]
Gradients of \(z\) with respect to encoder parameters:
\[\frac{\partial z_1}{\partial \mu_1} = 1, \quad \frac{\partial z_1}{\partial \sigma_1} = \epsilon_1 = 0.7\]
\[\frac{\partial z_2}{\partial \mu_2} = 1, \quad \frac{\partial z_2}{\partial \sigma_2} = \epsilon_2 = -1.1\]
\[\frac{\partial z_1}{\partial \mu_2} = 0, \quad \frac{\partial z_2}{\partial \mu_1} = 0 \quad \text{(dimensions are independent)}\]
These gradients are clean and well-defined. Suppose the loss (negative ELBO) has gradient \(\frac{\partial \mathcal{L}}{\partial z} = [0.3, -0.2]\) coming back from the decoder. By the chain rule:
\[\frac{\partial \mathcal{L}}{\partial \mu_1} = \frac{\partial \mathcal{L}}{\partial z_1} \cdot \frac{\partial z_1}{\partial \mu_1} = 0.3 \times 1 = 0.3\]
\[\frac{\partial \mathcal{L}}{\partial \sigma_1} = \frac{\partial \mathcal{L}}{\partial z_1} \cdot \frac{\partial z_1}{\partial \sigma_1} = 0.3 \times 0.7 = 0.21\]
\[\frac{\partial \mathcal{L}}{\partial \mu_2} = -0.2 \times 1 = -0.2\]
\[\frac{\partial \mathcal{L}}{\partial \sigma_2} = -0.2 \times (-1.1) = 0.22\]
The gradients flow cleanly from the loss, through \(z\), back to \(\mu\) and \(\sigma\), and from there back through the encoder network to update \(\phi\). The entire encoder-decoder system is trained end-to-end.
Without the trick: \(z\) is sampled directly from \(\mathcal{N}(\mu, \sigma^2)\). The sampling operation has no well-defined gradient with respect to \(\mu\) and \(\sigma\). The alternative (REINFORCE estimator) would give \(\frac{\partial}{\partial \phi} \mathbb{E}_{q_\phi}[f(z)] = \mathbb{E}_{q_\phi}[f(z) \nabla_\phi \log q_\phi(z|x)]\), which has much higher variance and requires many samples to get a usable gradient signal.
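The worked gradients fit in a few lines; because \(z\) is a deterministic function of \(\mu\), \(\sigma\), and the fixed noise \(\epsilon\), the derivatives are exact:

```python
import numpy as np

mu = np.array([1.2, -0.5])
sigma = np.array([0.8, 0.3])
eps = np.array([0.7, -1.1])   # fixed noise sample from N(0, I)

# Reparameterized sample: deterministic in mu and sigma once eps is drawn
z = mu + sigma * eps

# Analytic gradients of z w.r.t. the encoder outputs
dz_dmu = np.ones_like(mu)     # dz_j/dmu_j = 1
dz_dsigma = eps               # dz_j/dsigma_j = eps_j

# Chain rule with the upstream gradient from the decoder
dL_dz = np.array([0.3, -0.2])
dL_dmu = dL_dz * dz_dmu
dL_dsigma = dL_dz * dz_dsigma
```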
Recall: In the reparameterization trick, what is held fixed during backpropagation: \(\mu\), \(\sigma\), or \(\epsilon\)? Which quantities receive gradients?
Apply: Given \(\mu = [0.5]\), \(\sigma = [1.2]\), and two noise samples \(\epsilon_1 = 0.3\) and \(\epsilon_2 = -0.8\), compute \(z_1\) and \(z_2\). Then compute \(\frac{\partial z}{\partial \mu}\) and \(\frac{\partial z}{\partial \sigma}\) for each. Which sample gives a larger gradient for \(\sigma\)?
Extend: The reparameterization trick only works for distributions where sampling can be expressed as a deterministic transformation of fixed noise. It works for Gaussians (\(z = \mu + \sigma \epsilon\)) and some other distributions. Can you think of a distribution where this trick would not work? (Hint: consider discrete distributions, like a Bernoulli – can you express “sample 0 or 1 with probability \(p\)” as a smooth function of \(p\) and some noise?)
We now have all the pieces. This lesson assembles them into the complete training algorithm and shows how to generate new data.
The VAE training algorithm (Algorithm 1 in the paper) processes data in minibatches (small random subsets of the training data, processed together in one update step). For each minibatch, it computes the ELBO and updates all parameters by gradient ascent.
The complete loss for a single data point, when both the prior and approximate posterior are Gaussian, has a convenient closed form:
\[\mathcal{L}(\theta, \phi; x^{(i)}) \simeq \frac{1}{2} \sum_{j=1}^{J} \left(1 + \log((\sigma_j^{(i)})^2) - (\mu_j^{(i)})^2 - (\sigma_j^{(i)})^2\right) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x^{(i)} | z^{(i,l)})\]
where \(z^{(i,l)} = \mu^{(i)} + \sigma^{(i)} \odot \epsilon^{(l)}\) and \(\epsilon^{(l)} \sim \mathcal{N}(0, I)\).
Training algorithm: for each minibatch, (1) encode each \(x^{(i)}\) to get \(\mu^{(i)}\) and \(\sigma^{(i)}\); (2) sample noise \(\epsilon^{(l)} \sim \mathcal{N}(0, I)\) and form \(z^{(i,l)} = \mu^{(i)} + \sigma^{(i)} \odot \epsilon^{(l)}\); (3) decode each \(z^{(i,l)}\) and evaluate the ELBO estimate above; (4) backpropagate and update \(\theta\) and \(\phi\) by gradient ascent.
Generating new data (after training): sample \(z \sim \mathcal{N}(0, I)\) from the prior, pass it through the decoder, and sample from (or take the mean of) \(p_\theta(x|z)\).
That’s it. No adversarial training, no MCMC chains, no iterative optimization. A single forward pass through the decoder turns noise into data.
The learned MNIST manifold with a 2D latent space. Each position in the grid is decoded into a digit image. Smooth transitions between classes (6 blending into 0, 9 into 7) show the VAE learned a continuous, structured representation where nearby codes produce similar images.
The Frey Face manifold in 2D latent space. Moving through the latent space smoothly varies facial expression and pose. The VAE discovered these meaningful factors without any labels.
Random samples generated by sampling \(z \sim \mathcal{N}(0, I)\) and decoding. Digits are recognizable but blurry – the characteristic VAE limitation caused by the Gaussian reconstruction loss, which averages over possible outputs rather than committing to sharp details.
Let’s trace one complete training step for a single data point.
Setup: latent dimension \(J = 2\), data dimension \(D = 3\) (binary), \(L = 1\) sample.
Data point: \(x = [1, 0, 1]\)
Step 1: Encode
Encoder outputs: \(\mu = [0.8, -0.3]\), \(\log \sigma^2 = [-0.5, 0.2]\)
\[\sigma = \exp([-0.25, 0.1]) = [0.779, 1.105]\]
Step 2: Sample (reparameterization trick)
Noise: \(\epsilon = [0.4, -0.7]\)
\[z = [0.8, -0.3] + [0.779, 1.105] \odot [0.4, -0.7] = [0.8, -0.3] + [0.312, -0.774] = [1.112, -1.074]\]
Step 3: Decode
Decoder outputs probabilities: \(y = [0.82, 0.15, 0.71]\) (after sigmoid)
Step 4: Compute ELBO
KL term (closed form, per dimension):
\[\text{KL}_1 = -\frac{1}{2}(1 + (-0.5) - 0.64 - 0.607) = -\frac{1}{2}(-0.747) = 0.374\]
\[\text{KL}_2 = -\frac{1}{2}(1 + 0.2 - 0.09 - 1.221) = -\frac{1}{2}(-0.111) = 0.056\]
\[-D_{KL} = -(0.374 + 0.056) = -0.430\]
Reconstruction term:
\[\log p_\theta(x|z) = \log(0.82) + \log(1 - 0.15) + \log(0.71)\]
\[= -0.198 + (-0.163) + (-0.342) = -0.703\]
ELBO:
\[\mathcal{L} = -0.430 + (-0.703) = -1.133\]
We maximize this, which means minimizing \(-\mathcal{L} = 1.133\). The KL penalty (\(0.430\)) is the cost of the encoder deviating from the prior. The reconstruction cost (\(0.703\)) is the price of imperfect decoding.
Step 5: Gradient update
Compute \(\nabla_\theta \mathcal{L}\) and \(\nabla_\phi \mathcal{L}\) via backpropagation through the entire encoder → sample → decoder pipeline (the reparameterization trick makes this possible). Update all weights.
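Everything up to the gradient update can be checked numerically (the decoder probabilities \(y\) are taken as given from Step 3):

```python
import numpy as np

mu = np.array([0.8, -0.3])
logvar = np.array([-0.5, 0.2])
eps = np.array([0.4, -0.7])
x = np.array([1.0, 0.0, 1.0])

# Steps 1-2: recover sigma and sample via the reparameterization trick
sigma = np.exp(0.5 * logvar)
z = mu + sigma * eps

# Step 3: decoder output, taken as given from the example
y = np.array([0.82, 0.15, 0.71])

# Step 4: ELBO = -KL + reconstruction log-likelihood
kl = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))
recon = np.sum(x * np.log(y) + (1 - x) * np.log(1 - y))
elbo = -kl + recon
```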
Recall: In the VAE loss function, which term can be computed in closed form (no sampling) and which requires sampling through the reparameterization trick?
Apply: For a 3D latent space with encoder outputs \(\mu = [0.5, -1.0, 0.2]\) and \(\sigma = [0.8, 0.5, 1.1]\), compute the KL divergence \(D_{KL}(q_\phi \| p)\) using the closed-form expression. Which latent dimension contributes the most to the KL divergence?
Extend: The paper found that \(L = 1\) (a single noise sample per data point) works well with minibatch size \(M = 100\). Why does a larger minibatch compensate for fewer samples? Think about what happens to the gradient variance as \(M\) increases.
Explain the encoder-decoder structure of the VAE in your own words. What does each network take as input and produce as output? Why do we need both?
Why does the ELBO have two terms pulling in opposite directions? What happens if you remove the KL term entirely? What happens if you set its weight very high?
Compare the VAE to the GAN (see Generative Adversarial Nets). Both generate new data from noise. What does the VAE provide that the GAN doesn’t? What does the GAN do better? Why?
The VAE produces blurry images while GANs produce sharp ones. Trace this back to a specific design choice in the VAE. What would you need to change to get sharper outputs?
The reparameterization trick is described as the paper’s key technical contribution. What specific problem does it solve? What was the alternative before this trick, and why was it impractical?
Implement a VAE from scratch using numpy to learn a 2D latent representation of toy data sampled from a mixture of three Gaussians.
import numpy as np
np.random.seed(42)
# ── Hyperparameters ──────────────────────────────────────────────
INPUT_DIM = 2
LATENT_DIM = 2
HIDDEN_DIM = 16
BATCH_SIZE = 64
LR = 0.001
TRAIN_STEPS = 5000
DATA_VARIANCE = 0.1 # fixed decoder variance
# ── Generate toy data ────────────────────────────────────────────
# Three Gaussian clusters forming a triangle
CENTERS = np.array([[3.0, 0.0], [-1.5, 2.6], [-1.5, -2.6]])
CLUSTER_STD = 0.4
def sample_data(n):
    """Sample n points from the mixture of 3 Gaussians."""
    labels = np.random.randint(0, 3, n)
    points = CENTERS[labels] + np.random.randn(n, 2) * CLUSTER_STD
    return points
# ── Activation functions ─────────────────────────────────────────
def sigmoid(x):
    return np.where(x >= 0,
                    1 / (1 + np.exp(-x)),
                    np.exp(x) / (1 + np.exp(x)))

def tanh(x):
    return np.tanh(x)

def dtanh(t):
    """Derivative of tanh, given tanh output t."""
    return 1 - t ** 2
# ── Weight initialization ───────────────────────────────────────
def init_weights(fan_in, fan_out):
    scale = np.sqrt(2.0 / (fan_in + fan_out))
    return np.random.randn(fan_in, fan_out) * scale
# ── Encoder ──────────────────────────────────────────────────────
# Architecture: input(2) -> hidden(16) -> mu(2), logvar(2)
enc_W1 = init_weights(INPUT_DIM, HIDDEN_DIM)
enc_b1 = np.zeros((1, HIDDEN_DIM))
enc_W_mu = init_weights(HIDDEN_DIM, LATENT_DIM)
enc_b_mu = np.zeros((1, LATENT_DIM))
enc_W_logvar = init_weights(HIDDEN_DIM, LATENT_DIM)
enc_b_logvar = np.zeros((1, LATENT_DIM))
def encoder_forward(x):
    """
    Forward pass through encoder.
    x: shape (batch, 2)
    Returns: (mu, logvar, cache)
        mu: shape (batch, 2) -- mean of q(z|x)
        logvar: shape (batch, 2) -- log variance of q(z|x)
        cache: values needed for backward pass
    """
    # TODO: Implement forward pass
    # Layer 1: linear + tanh
    # Output heads: mu (linear) and logvar (linear)
    pass
def encoder_backward(d_mu, d_logvar, cache):
    """
    Backward pass through encoder.
    d_mu: gradient of loss w.r.t. mu, shape (batch, 2)
    d_logvar: gradient of loss w.r.t. logvar, shape (batch, 2)
    cache: saved values from forward pass
    Returns: dict of gradients for each weight matrix and bias
    """
    # TODO: Implement backward pass
    # Combine gradients from both output heads through the shared hidden layer
    pass
# ── Decoder ──────────────────────────────────────────────────────
# Architecture: latent(2) -> hidden(16) -> output(2)
dec_W1 = init_weights(LATENT_DIM, HIDDEN_DIM)
dec_b1 = np.zeros((1, HIDDEN_DIM))
dec_W2 = init_weights(HIDDEN_DIM, INPUT_DIM)
dec_b2 = np.zeros((1, INPUT_DIM))
def decoder_forward(z):
    """
    Forward pass through decoder (Gaussian with fixed variance).
    z: shape (batch, 2)
    Returns: (x_recon, cache)
        x_recon: shape (batch, 2) -- reconstructed mean
        cache: values needed for backward pass
    """
    # TODO: Implement forward pass
    # Layer 1: linear + tanh
    # Layer 2: linear (no activation -- output is the Gaussian mean)
    pass
def decoder_backward(d_output, cache):
    """
    Backward pass through decoder.
    d_output: gradient of loss w.r.t. decoder output, shape (batch, 2)
    cache: saved values from forward pass
    Returns: (dict of gradients, d_z) where d_z is gradient w.r.t. input z
    """
    # TODO: Implement backward pass
    pass
# ── Loss computation ─────────────────────────────────────────────
def compute_loss(x, x_recon, mu, logvar):
"""
Compute the negative ELBO (the loss to minimize).
Returns: (loss, d_x_recon, d_mu, d_logvar)
loss: scalar, the negative ELBO averaged over the batch
d_x_recon: gradient of loss w.r.t. x_recon
d_mu: gradient of loss w.r.t. mu
d_logvar: gradient of loss w.r.t. logvar
"""
batch_size = x.shape[0]
# TODO: Compute reconstruction loss (Gaussian log-likelihood with fixed variance)
# recon_loss = 0.5 * sum((x - x_recon)^2) / DATA_VARIANCE
# This is -log p(x|z) up to a constant
# TODO: Compute KL divergence (closed form for diagonal Gaussian vs N(0,I))
# kl_loss = -0.5 * sum(1 + logvar - mu^2 - exp(logvar))
# TODO: Total loss = (recon_loss + kl_loss) / batch_size
# TODO: Compute gradients of loss w.r.t. x_recon, mu, logvar
# d_x_recon = (x_recon - x) / (DATA_VARIANCE * batch_size)
# d_mu = mu / batch_size
# d_logvar = 0.5 * (exp(logvar) - 1) / batch_size
pass
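The formulas in the comments translate almost line-for-line into code. A sketch, with `DATA_VAR` as a stand-in for the `DATA_VARIANCE` constant defined earlier (the value 0.1 here is an assumption for illustration):

```python
import numpy as np

DATA_VAR = 0.1  # assumed stand-in for DATA_VARIANCE

def compute_loss_sketch(x, x_recon, mu, logvar):
    n = x.shape[0]
    # Reconstruction: -log p(x|z) up to a constant, for a fixed-variance Gaussian
    recon_loss = 0.5 * np.sum((x - x_recon) ** 2) / DATA_VAR
    # KL( N(mu, diag(exp(logvar))) || N(0, I) ), closed form
    kl_loss = -0.5 * np.sum(1 + logvar - mu ** 2 - np.exp(logvar))
    loss = (recon_loss + kl_loss) / n
    # Gradients of the (batch-averaged) loss
    d_x_recon = (x_recon - x) / (DATA_VAR * n)
    d_mu = mu / n
    d_logvar = 0.5 * (np.exp(logvar) - 1) / n
    return loss, d_x_recon, d_mu, d_logvar
```

A quick sanity check: with a perfect reconstruction and `mu = 0`, `logvar = 0` (i.e. q(z|x) equal to the prior), both terms and all gradients are exactly zero.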
# ── Reparameterization trick ────────────────────────────────────
def reparameterize(mu, logvar):
"""
Sample z = mu + sigma * epsilon, where epsilon ~ N(0, I).
Returns: (z, epsilon)
"""
# TODO: Implement the reparameterization trick
pass
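The trick itself is two lines: all the randomness goes into `epsilon`, so `z` is a deterministic, differentiable function of `mu` and `logvar` given `epsilon`. A sketch:

```python
import numpy as np

def reparameterize_sketch(mu, logvar):
    # epsilon carries all the randomness
    epsilon = np.random.randn(*mu.shape)
    # sigma = exp(0.5 * logvar), so z ~ N(mu, sigma^2) elementwise
    z = mu + np.exp(0.5 * logvar) * epsilon
    return z, epsilon
```

As `logvar` goes to negative infinity, `sigma` goes to zero and `z` collapses to `mu`, which is a handy property for testing.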
# ── Training loop ────────────────────────────────────────────────
for step in range(TRAIN_STEPS):
# Sample minibatch
x = sample_data(BATCH_SIZE)
# ── Forward pass ──────────────────────────────────────────
# 1. Encode: x -> mu, logvar
# 2. Reparameterize: sample z
# 3. Decode: z -> x_recon
# 4. Compute loss and gradients
# TODO: Implement forward pass and loss computation
# ── Backward pass ─────────────────────────────────────────
# 1. Backprop through decoder to get gradients for decoder weights and d_z
# 2. Backprop through reparameterization: d_mu += d_z, d_logvar += d_z * 0.5 * epsilon * exp(0.5 * logvar)
# 3. Backprop through encoder to get gradients for encoder weights
# TODO: Implement backward pass
# ── Parameter update (SGD) ────────────────────────────────
# TODO: Update all weights and biases with learning rate LR
# ── Logging ───────────────────────────────────────────────
if step % 500 == 0:
# Compute loss on a larger batch for stable reporting
x_eval = sample_data(1000)
mu_e, logvar_e, _ = encoder_forward(x_eval)
z_e, _ = reparameterize(mu_e, logvar_e)
x_recon_e, _ = decoder_forward(z_e)
recon = 0.5 * np.sum((x_eval - x_recon_e) ** 2) / (DATA_VARIANCE * 1000)
kl = -0.5 * np.sum(1 + logvar_e - mu_e ** 2 - np.exp(logvar_e)) / 1000
print(f"Step {step:5d} | Loss: {recon + kl:.3f} | "
f"Recon: {recon:.3f} | KL: {kl:.3f}")
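The reparameterization backprop rule in step 2 of the backward pass (`d_mu += d_z`, `d_logvar += d_z * 0.5 * epsilon * exp(0.5 * logvar)`) can be verified against finite differences. A standalone check with made-up values; the downstream loss `sum(z^2)` is hypothetical, chosen only to give a concrete `d_z`:

```python
import numpy as np

mu = np.array([[0.3, -0.5]])
logvar = np.array([[0.2, -0.1]])
epsilon = np.array([[1.1, -0.4]])

def z_of(mu, logvar):
    # z = mu + sigma * epsilon, with epsilon held fixed
    return mu + np.exp(0.5 * logvar) * epsilon

# Pretend the downstream loss is sum(z^2), so d_z = 2 * z
z = z_of(mu, logvar)
d_z = 2 * z

# Chain rule, exactly as in the training-loop comments
d_mu = d_z                                             # dz/dmu = 1
d_logvar = d_z * 0.5 * epsilon * np.exp(0.5 * logvar)  # dz/dlogvar

# Finite-difference check on logvar[0, 0]
h = 1e-6
lv_plus = logvar.copy();  lv_plus[0, 0] += h
lv_minus = logvar.copy(); lv_minus[0, 0] -= h
num = (np.sum(z_of(mu, lv_plus) ** 2) - np.sum(z_of(mu, lv_minus) ** 2)) / (2 * h)
assert abs(num - d_logvar[0, 0]) < 1e-4
```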
# ── Generation ───────────────────────────────────────────────────
print("\nGenerating new samples from the prior...")
z_samples = np.random.randn(10, LATENT_DIM)
generated, _ = decoder_forward(z_samples)
print("Generated points:")
for i in range(10):
print(f" z=[{z_samples[i,0]:+.2f}, {z_samples[i,1]:+.2f}] -> "
f"x=[{generated[i,0]:+.2f}, {generated[i,1]:+.2f}]")
# ── Verify cluster structure ─────────────────────────────────────
print("\nCluster centers in data space:", CENTERS.tolist())
print("Do generated points cluster near the three centers?")

After 5000 training steps, the VAE should learn the three-cluster structure:
Step 0 | Loss: 23.417 | Recon: 23.105 | KL: 0.312
Step 500 | Loss: 5.832 | Recon: 4.291 | KL: 1.541
Step 1000 | Loss: 3.215 | Recon: 1.824 | KL: 1.391
Step 1500 | Loss: 2.647 | Recon: 1.318 | KL: 1.329
Step 2000 | Loss: 2.401 | Recon: 1.095 | KL: 1.306
Step 2500 | Loss: 2.289 | Recon: 0.988 | KL: 1.301
Step 3000 | Loss: 2.221 | Recon: 0.923 | KL: 1.298
Step 3500 | Loss: 2.184 | Recon: 0.887 | KL: 1.297
Step 4000 | Loss: 2.159 | Recon: 0.862 | KL: 1.296
Step 4500 | Loss: 2.142 | Recon: 0.846 | KL: 1.296
Generating new samples from the prior...
Generated points:
z=[+0.50, -0.14] -> x=[+2.87, -0.12]
z=[-0.57, +1.02] -> x=[-1.32, +2.41]
z=[+0.33, +0.75] -> x=[+1.15, +1.83]
z=[-1.10, -0.21] -> x=[-1.62, -0.45]
z=[+1.24, -0.38] -> x=[+3.21, -0.53]
...
Key things to verify:
- The total loss decreases steadily, with the reconstruction term doing most of the falling while the KL term settles near a stable value (about 1.3 here).
- Samples drawn from the prior map to points near the three cluster centers rather than spreading uniformly.

Note: exact numbers will vary due to random initialization.