Why do different handwritten 3s look different? Some are round, some have flat tops, some lean left, some lean right. Behind the pixels there are hidden factors – stroke angle, pen pressure, writing speed – that caused each particular image. If we could discover and control these hidden factors, we could both understand existing images and create new ones.
A latent variable is a quantity that isn’t directly observed but influences the data you see. Think of a weather station that only records temperature and humidity. The actual weather pattern (cold front, warm front, stationary) is a latent variable – you never measure it directly, but it determines the temperature and humidity readings.
In generative modeling, we assume data \(x\) (an image, a sentence, a data point) was produced by some hidden code \(z\) through a two-step process: first sample a code \(z\) from a prior distribution \(p(z)\), then sample the data \(x\) from a likelihood \(p_\theta(x|z)\).
The prior says “what codes are plausible before seeing any data.” The simplest choice is a standard Gaussian:
\[p(z) = \mathcal{N}(0, I)\]
The likelihood \(p_\theta(x|z)\) says “given a code \(z\), how is data generated?” This is a neural network parameterized by \(\theta\) – the decoder. It takes a code and produces data.
The joint probability of a code-data pair is:
\[p_\theta(x, z) = p(z) \cdot p_\theta(x|z)\]
The probability of the data alone (marginalizing out the hidden code) is:
\[p_\theta(x) = \int p(z) \cdot p_\theta(x|z) \, dz\]
This integral sums up contributions from every possible code \(z\). It asks: “considering all possible hidden causes, how likely is this data?” This quantity is called the marginal likelihood or evidence.
Suppose we have a 1D latent variable \(z\) and 1D data \(x\). The prior is \(p(z) = \mathcal{N}(0, 1)\) and the likelihood is \(p_\theta(x|z) = \mathcal{N}(z + 2, 0.5^2)\) – the data is the code shifted by 2, plus some noise.
Generating a data point: suppose we sample \(z = 0.7\) from the prior, then sample \(x = 2.9\) from \(\mathcal{N}(0.7 + 2, 0.5^2) = \mathcal{N}(2.7, 0.25)\).
The joint probability at this specific \((x, z)\) pair:
\[p(z = 0.7) = \frac{1}{\sqrt{2\pi}} e^{-0.7^2 / 2} = \frac{1}{2.507} e^{-0.245} = 0.312\]
\[p_\theta(x = 2.9 \mid z = 0.7) = \frac{1}{\sqrt{2\pi \cdot 0.25}} e^{-(2.9 - 2.7)^2 / (2 \cdot 0.25)} = \frac{1}{1.253} e^{-0.08} = 0.736\]
\[p_\theta(x = 2.9, z = 0.7) = 0.312 \times 0.736 = 0.230\]
The marginal likelihood \(p_\theta(x = 2.9)\) would require integrating this over all possible \(z\) values – easy in this toy case (both are Gaussian, so the integral is a Gaussian convolution), but intractable (impossible to compute exactly in a reasonable amount of time) when the decoder is a neural network.
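Because everything in this toy case is Gaussian, each quantity above can be checked numerically. A short numpy sketch (the `normal_pdf` helper is ours, not from the paper) verifies the joint probability and estimates the marginal likelihood both in closed form and by brute-force Monte Carlo:

```python
import numpy as np

def normal_pdf(v, mean, std):
    """Density of N(mean, std^2) evaluated at v."""
    return np.exp(-0.5 * ((v - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Prior p(z) = N(0, 1) and likelihood p(x|z) = N(z + 2, 0.5^2)
p_z = normal_pdf(0.7, 0.0, 1.0)               # prior density at z = 0.7
p_x_given_z = normal_pdf(2.9, 0.7 + 2, 0.5)   # likelihood of x = 2.9 given z = 0.7
p_joint = p_z * p_x_given_z                   # joint density p(x = 2.9, z = 0.7)

# In this toy case the marginal is a Gaussian convolution:
# x = z + 2 + noise, so x ~ N(2, 1 + 0.25)
p_marginal_exact = normal_pdf(2.9, 2.0, np.sqrt(1.25))

# Monte Carlo estimate of the same integral: average p(x|z) over z ~ p(z)
rng = np.random.default_rng(0)
z_samples = rng.standard_normal(200_000)
p_marginal_mc = normal_pdf(2.9, z_samples + 2, 0.5).mean()
```

The Monte Carlo estimate agrees with the closed form to a few decimal places here; it is exactly this kind of averaging that stops working once the decoder is a neural network and the latent space is high-dimensional.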
Recall: What is the difference between a prior \(p(z)\) and a likelihood \(p_\theta(x|z)\)? Which one involves the decoder neural network?
Apply: Given prior \(p(z) = \mathcal{N}(0, 1)\) and likelihood \(p_\theta(x|z) = \mathcal{N}(3z, 1)\), compute the joint probability \(p_\theta(x = 1.5, z = 0.5)\). Show each step.
Extend: Why do we choose \(p(z) = \mathcal{N}(0, I)\) as the prior instead of, say, a uniform distribution over \([-1, 1]^J\)? What property of the Gaussian makes it mathematically convenient for the VAE framework? (Hint: think about what happens when you compute KL divergence between two Gaussians.)
We now have a generative model: sample \(z\), then sample \(x\) given \(z\). But what about the reverse direction? Given a data point \(x\), what code \(z\) produced it? This reverse question turns out to be the central difficulty the VAE paper solves.
Imagine you hear a song on the radio and want to know the original sheet music (the hidden score that generated the sound). Many different scores could produce similar sounds – a chord can be voiced in multiple ways, different instruments can play the same notes. Finding the most likely score given the recording is inference: computing \(p_\theta(z|x)\).
Bayes’ theorem gives the answer:
\[p_\theta(z|x) = \frac{p_\theta(x|z) \cdot p(z)}{p_\theta(x)}\]
The numerator is easy to compute – just multiply the decoder output by the prior. The problem is the denominator:
\[p_\theta(x) = \int p(z) \cdot p_\theta(x|z) \, dz\]
When the decoder is a neural network (a complex nonlinear function), this integral has no closed-form solution. You would need to evaluate the decoder at every possible \(z\) and sum up the results. In a 20-dimensional latent space, “every possible \(z\)” is a 20-dimensional continuous space – the integral is intractable.
Without \(p_\theta(x)\), we cannot compute the posterior \(p_\theta(z|x)\). Without the posterior, we cannot infer the hidden code that explains a given data point, and standard training procedures that rely on the posterior (such as the EM algorithm) break down.
The existing solutions each had serious limitations: mean-field variational inference requires closed-form expectations that neural-network likelihoods do not admit, and MCMC sampling is far too expensive to run for every data point.
Let’s make the intractability concrete. Suppose we have a 2D latent space (\(J = 2\)) and want to compute \(p_\theta(x)\) for a single image \(x\).
Attempt at numerical integration with a grid: if we discretize each dimension into 100 bins, we need \(100^2 = 10{,}000\) evaluations of \(p_\theta(x|z) \cdot p(z)\). With \(J = 20\) (a typical latent dimension), we’d need \(100^{20} = 10^{40}\) evaluations. At one nanosecond per evaluation, that takes \(10^{40} \times 10^{-9} = 10^{31}\) seconds – about \(3 \times 10^{23}\) years. The universe is \(1.4 \times 10^{10}\) years old.
Even Monte Carlo sampling (random evaluation instead of a grid) struggles. If most \(z\) values give near-zero \(p_\theta(x|z)\) (the decoder produces garbage for most random codes), then random sampling wastes almost every evaluation. You need to sample from regions where \(z\) actually explains \(x\) – but those are exactly the regions defined by the posterior \(p_\theta(z|x)\), which is what you’re trying to compute.
This circular dependency – needing the posterior to efficiently estimate the evidence, but needing the evidence to compute the posterior – is the core problem.
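The back-of-the-envelope arithmetic for the grid attempt is easy to reproduce (assuming, as in the text, 100 bins per dimension and one nanosecond per evaluation):

```python
GRID_BINS = 100
SECONDS_PER_EVAL = 1e-9
SECONDS_PER_YEAR = 3.15e7

for J in (2, 20):
    evaluations = GRID_BINS ** J
    years = evaluations * SECONDS_PER_EVAL / SECONDS_PER_YEAR
    print(f"J = {J:2d}: {float(evaluations):.0e} evaluations, {years:.1e} years")
```

At \(J = 2\) the grid finishes in a fraction of a second; at \(J = 20\) it needs on the order of \(10^{23}\) years.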
Recall: Why can’t we compute \(p_\theta(z|x)\) directly when the decoder is a neural network? Which specific quantity in Bayes’ theorem is intractable?
Apply: For a 1D latent space, approximate \(p_\theta(x = 2.5)\) by evaluating the integrand at 5 grid points: \(z \in \{-2, -1, 0, 1, 2\}\) with spacing \(\Delta z = 1\). Use \(p(z) = \mathcal{N}(0, 1)\) and \(p_\theta(x|z) = \mathcal{N}(z + 2, 0.5^2)\). Sum up \(p(z_i) \cdot p_\theta(x|z_i) \cdot \Delta z\) for each grid point. How accurate is this crude approximation?
Extend: The EM algorithm iterates between computing the posterior \(p_\theta(z|x)\) (E-step) and updating \(\theta\) to maximize expected log-likelihood (M-step). Explain why this algorithm cannot be used when the posterior is intractable. What would fail in the E-step?
Since we can’t compute the posterior exactly, we’ll approximate it. The idea: learn a second neural network that takes data \(x\) and outputs a distribution over \(z\) that’s close to the true posterior. Then optimize a quantity we can actually compute.
Instead of computing the true posterior \(p_\theta(z|x)\), we train an encoder network \(q_\phi(z|x)\) to approximate it. Think of it as hiring a fast but imperfect assistant: the true posterior is a perfect but impossibly slow oracle, while \(q_\phi\) is a quick neural network that gives a good-enough answer.
The encoder \(q_\phi(z|x)\) is a neural network with parameters \(\phi\) that takes a data point \(x\) and outputs a Gaussian distribution. How do we measure how good this approximation is? Using the KL divergence:
\[D_{KL}(q_\phi(z|x) \| p_\theta(z|x)) = \int q_\phi(z|x) \log \frac{q_\phi(z|x)}{p_\theta(z|x)} \, dz\]
We can’t compute this KL divergence directly (it involves the true posterior, which is intractable). But there’s a clever algebraic trick. Start with the log-evidence:
\[\log p_\theta(x) = D_{KL}(q_\phi(z|x) \| p_\theta(z|x)) + \mathcal{L}(\theta, \phi; x)\]
Since the KL divergence is always non-negative:
\[\log p_\theta(x) \geq \mathcal{L}(\theta, \phi; x)\]
The ELBO is a lower bound on the log-evidence. Maximizing the ELBO simultaneously pushes the log-evidence \(\log p_\theta(x)\) up (a better generative model) and pushes the KL gap \(D_{KL}(q_\phi(z|x) \| p_\theta(z|x))\) down (a better posterior approximation).
The ELBO decomposes into two interpretable terms:
\[\mathcal{L}(\theta, \phi; x) = \underbrace{-D_{KL}(q_\phi(z|x) \| p(z))}_{\text{regularization}} + \underbrace{\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]}_{\text{reconstruction}}\]
The tension between these two terms shapes the learned representation. The reconstruction term wants maximally informative codes (preserve every detail). The regularization term wants codes close to the prior (a featureless blob). The balance produces codes that capture the important structure in the data while discarding noise.
Let’s trace the ELBO decomposition with concrete numbers.
Suppose for a data point \(x\), the encoder outputs \(q_\phi(z|x) = \mathcal{N}(0.8, 0.6^2)\), i.e. \(\mu = 0.8\) and \(\sigma = 0.6\).
KL divergence (regularization term) between \(\mathcal{N}(\mu, \sigma^2)\) and \(\mathcal{N}(0, 1)\):
\[D_{KL} = -\frac{1}{2}(1 + \log \sigma^2 - \mu^2 - \sigma^2)\]
\[= -\frac{1}{2}(1 + \log(0.36) - 0.64 - 0.36) = -\frac{1}{2}(1 + (-1.022) - 0.64 - 0.36) = -\frac{1}{2}(-1.022) = 0.511\]
So \(-D_{KL} = -0.511\).
Reconstruction term: suppose we sample \(z = 0.8 + 0.6 \times 0.3 = 0.98\) (using noise \(\epsilon = 0.3\)), pass it through the decoder, and get \(\log p_\theta(x|z) = -4.2\).
ELBO: \(\mathcal{L} = -0.511 + (-4.2) = -4.711\)
Sanity check: the bound \(\log p_\theta(x) \geq \mathcal{L}\) tells us the true log-evidence must be at least \(-4.711\). Watch the signs here: \(-4.711\) is a larger number than \(-5.0\) on the number line, so a hypothetical model with \(\log p_\theta(x) = -5.0\) would violate the bound – any numbers consistent with this ELBO must satisfy \(\log p_\theta(x) \geq -4.711\).
This illustrates an important point: you typically don’t know the true log-evidence. The ELBO gives you a guaranteed floor. The tighter the ELBO (closer to the true value), the better your approximate posterior is.
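A short numpy sketch of the worked numbers, using the closed-form KL between a Gaussian and the standard normal (the decoder's \(\log p_\theta(x|z) = -4.2\) is taken as given, as in the example):

```python
import numpy as np

def kl_to_standard_normal(mu, var):
    """Closed-form KL( N(mu, var) || N(0, 1) ) for a scalar Gaussian."""
    return -0.5 * (1 + np.log(var) - mu**2 - var)

kl = kl_to_standard_normal(0.8, 0.6**2)   # regularization term, ~0.511
log_px_given_z = -4.2                     # reconstruction term (from the example)
elbo = -kl + log_px_given_z               # ~ -4.711
```

The ELBO of about \(-4.711\) is a guaranteed floor: the true log-evidence must be at least this large.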
Recall: The ELBO has two terms. Which term pushes codes toward the prior, and which term pushes codes to be informative about the data? What happens if you only optimize one term?
Apply: Given encoder output \(q_\phi(z|x) = \mathcal{N}(1.5, 0.3^2)\) and prior \(p(z) = \mathcal{N}(0, 1)\), compute the KL divergence using the formula \(D_{KL} = -\frac{1}{2}(1 + \log \sigma^2 - \mu^2 - \sigma^2)\). Is this encoder output close to or far from the prior?
Extend: If the regularization term dominates (very strong prior pressure), what happens to the latent codes? What does this imply for the decoder’s ability to reconstruct the input? This failure mode is called posterior collapse – describe what it looks like in practice.
We know we need a neural network \(q_\phi(z|x)\) to approximate the posterior. In this lesson we build it and understand how it turns the intractable inference problem into a single forward pass.
In traditional variational inference, you run an optimization procedure per data point to find the best approximate posterior. This is like hiring a private detective for each case – thorough but expensive. The VAE replaces this with a single neural network that handles all data points at once. This approach is called amortized inference – the cost of inference is amortized across the entire dataset.
The encoder takes a data point \(x\) (a vector of \(D\) numbers, e.g., a flattened 28x28 MNIST image = 784 numbers) and outputs two vectors of \(J\) numbers each: a mean \(\mu\) and a log-variance \(\log \sigma^2\), which together define a Gaussian distribution over the latent code.
Why \(\log \sigma^2\) instead of \(\sigma\) directly? Because variance must be positive, and \(\log \sigma^2\) can be any real number – no constraint needed on the network output. We recover \(\sigma\) as \(\sigma = \exp(\frac{1}{2} \log \sigma^2)\).
The architecture in the paper is a single-hidden-layer MLP (multilayer perceptron):
\[h = \tanh(W_1 x + b_1)\] \[\mu = W_2 h + b_2\] \[\log \sigma^2 = W_3 h + b_3\]
The approximate posterior for data point \(x\) is then:
\[q_\phi(z|x) = \mathcal{N}(\mu(x), \text{diag}(\sigma^2(x)))\]
Each latent dimension is independent (diagonal covariance). This is a simplification – the true posterior may have correlated dimensions – but it makes computation tractable.
Let’s build a tiny encoder: input dimension \(D = 4\), hidden dimension = 3, latent dimension \(J = 2\).
Weights (small for hand computation):
\[W_1 = \begin{bmatrix} 0.5 & -0.3 & 0.1 & 0.4 \\ -0.2 & 0.6 & -0.4 & 0.1 \\ 0.3 & 0.1 & 0.5 & -0.2 \end{bmatrix}, \quad b_1 = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}\]
\[W_2 = \begin{bmatrix} 0.4 & -0.3 & 0.2 \\ -0.1 & 0.5 & 0.3 \end{bmatrix}, \quad b_2 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}\]
\[W_3 = \begin{bmatrix} 0.2 & 0.1 & -0.3 \\ 0.1 & -0.2 & 0.4 \end{bmatrix}, \quad b_3 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}\]
Input: \(x = [1.0, 0.5, -0.5, 0.8]\)
Hidden layer:
\[W_1 x = \begin{bmatrix} 0.5(1.0) + (-0.3)(0.5) + 0.1(-0.5) + 0.4(0.8) \\ -0.2(1.0) + 0.6(0.5) + (-0.4)(-0.5) + 0.1(0.8) \\ 0.3(1.0) + 0.1(0.5) + 0.5(-0.5) + (-0.2)(0.8) \end{bmatrix} = \begin{bmatrix} 0.62 \\ 0.38 \\ -0.06 \end{bmatrix}\]
\[h = \tanh\left(\begin{bmatrix} 0.62 \\ 0.38 \\ -0.06 \end{bmatrix}\right) = \begin{bmatrix} 0.551 \\ 0.363 \\ -0.060 \end{bmatrix}\]
Mean output:
\[\mu = W_2 h = \begin{bmatrix} 0.4(0.551) + (-0.3)(0.363) + 0.2(-0.060) \\ -0.1(0.551) + 0.5(0.363) + 0.3(-0.060) \end{bmatrix} = \begin{bmatrix} 0.100 \\ 0.108 \end{bmatrix}\]
Log-variance output:
\[\log \sigma^2 = W_3 h = \begin{bmatrix} 0.2(0.551) + 0.1(0.363) + (-0.3)(-0.060) \\ 0.1(0.551) + (-0.2)(0.363) + 0.4(-0.060) \end{bmatrix} = \begin{bmatrix} 0.164 \\ -0.041 \end{bmatrix}\]
Standard deviation:
\[\sigma = \exp\left(\frac{1}{2} \begin{bmatrix} 0.164 \\ -0.041 \end{bmatrix}\right) = \begin{bmatrix} \exp(0.082) \\ \exp(-0.021) \end{bmatrix} = \begin{bmatrix} 1.086 \\ 0.980 \end{bmatrix}\]
So for input \(x = [1.0, 0.5, -0.5, 0.8]\), the encoder says: “the latent code is approximately \(\mathcal{N}([0.100, 0.108], \text{diag}([1.086^2, 0.980^2]))\).”
Both \(\sigma\) values are close to 1.0 and both \(\mu\) values are close to 0 – the encoder hasn’t learned much yet (it’s outputting near-prior distributions). After training, we’d expect \(\sigma\) values to shrink (more confident encoding) and \(\mu\) values to separate for different data points.
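The whole forward pass is three matrix products plus elementwise nonlinearities, so the hand computation is easy to check in numpy:

```python
import numpy as np

W1 = np.array([[ 0.5, -0.3,  0.1,  0.4],
               [-0.2,  0.6, -0.4,  0.1],
               [ 0.3,  0.1,  0.5, -0.2]])
W2 = np.array([[ 0.4, -0.3,  0.2],
               [-0.1,  0.5,  0.3]])
W3 = np.array([[ 0.2,  0.1, -0.3],
               [ 0.1, -0.2,  0.4]])
x = np.array([1.0, 0.5, -0.5, 0.8])

h = np.tanh(W1 @ x)           # hidden layer (all biases are zero)
mu = W2 @ h                   # mean head
logvar = W3 @ h               # log-variance head
sigma = np.exp(0.5 * logvar)  # recover the standard deviation
```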
Recall: Why does the encoder output \(\log \sigma^2\) instead of \(\sigma\) directly? What constraint does this avoid?
Apply: Given \(\log \sigma^2 = [-0.5, 1.2]\), compute \(\sigma\) for each dimension. Which latent dimension has more uncertainty?
Extend: The encoder uses a shared hidden layer \(h\) for both \(\mu\) and \(\log \sigma^2\). What would change if they had completely separate networks (no shared computation)? What advantage does sharing provide?
The decoder is the generative half of the VAE: given a latent code \(z\), it produces (a distribution over) data \(x\). The choice of output distribution determines the reconstruction loss.
Think of the decoder as an artist who receives a brief description (the code \(z\)) and must recreate a painting. The description might say “face, smiling, left-facing” – the artist fills in all the details.
The decoder is a neural network \(p_\theta(x|z)\) that takes a latent code \(z\) (a vector of \(J\) numbers) and produces parameters for a distribution over \(x\). The architecture mirrors the encoder:
\[h = \tanh(W_4 z + b_4)\] \[\text{output} = f(W_5 h + b_5)\]
The activation function \(f\) and the output distribution depend on the data type:
Binary data (e.g., binarized MNIST where each pixel is 0 or 1): the decoder outputs probabilities via sigmoid, and each pixel is modeled as an independent Bernoulli (a distribution over two outcomes, 0 or 1, parameterized by a single probability):
\[p_\theta(x|z) = \prod_{d=1}^{D} y_d^{x_d} (1 - y_d)^{1 - x_d}\]
where \(y = \sigma(W_5 h + b_5)\) is the vector of pixel probabilities. The reconstruction loss is binary cross-entropy:
\[\log p_\theta(x|z) = \sum_{d=1}^{D} \left[ x_d \log y_d + (1 - x_d) \log(1 - y_d) \right]\]
Continuous data (e.g., Frey Face dataset): the decoder outputs a mean (and optionally variance), and each dimension is modeled as an independent Gaussian:
\[\log p_\theta(x|z) = -\frac{D}{2}\log(2\pi) - \frac{1}{2}\sum_{d=1}^{D} \left[\log \sigma_d^2 + \frac{(x_d - \mu_d)^2}{\sigma_d^2}\right]\]
When the variance is fixed at 1, this simplifies to the mean squared error (up to a constant):
\[\log p_\theta(x|z) \propto -\frac{1}{2} \|x - \mu\|^2\]
This is why VAEs produce blurry images: minimizing squared error is equivalent to predicting the average of all plausible reconstructions, not a single sharp one.
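The averaging effect can be seen with a toy sketch (the two "images" below are made up for illustration): if the same code must reconstruct two different targets under squared error, the optimal single output is their pixelwise average – a blur of both.

```python
import numpy as np

# Two slightly different targets that one decoder output must explain
x1 = np.array([1.0, 0.0, 1.0, 0.0])
x2 = np.array([0.0, 1.0, 1.0, 0.0])

def avg_sq_error(mu):
    """Squared error averaged over both targets."""
    return 0.5 * (np.sum((x1 - mu) ** 2) + np.sum((x2 - mu) ** 2))

mu_avg = (x1 + x2) / 2          # the pixelwise average: [0.5, 0.5, 1.0, 0.0]
loss_avg = avg_sq_error(mu_avg)
loss_sharp = avg_sq_error(x1)   # committing to one sharp image does worse
```

The average beats either sharp reconstruction, and any perturbation away from it only increases the loss – which is exactly why fixed-variance Gaussian decoders blur.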
Let’s trace the decoder with a Bernoulli output for binary data.
Setup: latent dimension \(J = 2\), hidden dimension = 3, output dimension \(D = 4\).
\[W_4 = \begin{bmatrix} 0.3 & -0.5 \\ 0.7 & 0.2 \\ -0.1 & 0.6 \end{bmatrix}, \quad b_4 = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}\]
\[W_5 = \begin{bmatrix} 0.4 & 0.2 & -0.3 \\ -0.1 & 0.5 & 0.3 \\ 0.3 & -0.2 & 0.6 \\ 0.1 & 0.4 & -0.1 \end{bmatrix}, \quad b_5 = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}\]
Latent code: \(z = [0.5, -0.3]\)
Hidden layer:
\[W_4 z = \begin{bmatrix} 0.3(0.5) + (-0.5)(-0.3) \\ 0.7(0.5) + 0.2(-0.3) \\ -0.1(0.5) + 0.6(-0.3) \end{bmatrix} = \begin{bmatrix} 0.30 \\ 0.29 \\ -0.23 \end{bmatrix}\]
\[h = \tanh\left(\begin{bmatrix} 0.30 \\ 0.29 \\ -0.23 \end{bmatrix}\right) = \begin{bmatrix} 0.291 \\ 0.282 \\ -0.225 \end{bmatrix}\]
Output probabilities:
\[W_5 h = \begin{bmatrix} 0.4(0.291) + 0.2(0.282) + (-0.3)(-0.225) \\ -0.1(0.291) + 0.5(0.282) + 0.3(-0.225) \\ 0.3(0.291) + (-0.2)(0.282) + 0.6(-0.225) \\ 0.1(0.291) + 0.4(0.282) + (-0.1)(-0.225) \end{bmatrix} = \begin{bmatrix} 0.240 \\ 0.044 \\ -0.104 \\ 0.164 \end{bmatrix}\]
\[y = \sigma\left(\begin{bmatrix} 0.240 \\ 0.044 \\ -0.104 \\ 0.164 \end{bmatrix}\right) = \begin{bmatrix} 0.560 \\ 0.511 \\ 0.474 \\ 0.541 \end{bmatrix}\]
Reconstruction loss: suppose the true data is \(x = [1, 0, 1, 1]\).
Since \(x = [1, 0, 1, 1]\), each pixel with \(x_d = 1\) contributes \(\log y_d\) and the one pixel with \(x_d = 0\) contributes \(\log(1 - y_d)\):
\[\log p_\theta(x|z) = \log(0.560) + \log(1 - 0.511) + \log(0.474) + \log(0.541)\]
\[= -0.580 + (-0.715) + (-0.747) + (-0.614) = -2.656\]
The reconstruction log-likelihood is \(-2.656\). A perfect decoder would give \(\log p = 0\) (probabilities of 1.0 for each correct pixel). This decoder is far from perfect – it’s essentially guessing near 0.5 for everything.
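The decoder trace checks out in a few lines of numpy (same weights and code as above, biases zero):

```python
import numpy as np

W4 = np.array([[ 0.3, -0.5],
               [ 0.7,  0.2],
               [-0.1,  0.6]])
W5 = np.array([[ 0.4,  0.2, -0.3],
               [-0.1,  0.5,  0.3],
               [ 0.3, -0.2,  0.6],
               [ 0.1,  0.4, -0.1]])
z = np.array([0.5, -0.3])
x = np.array([1.0, 0.0, 1.0, 1.0])

h = np.tanh(W4 @ z)
y = 1.0 / (1.0 + np.exp(-(W5 @ h)))                      # sigmoid -> pixel probabilities
log_p = np.sum(x * np.log(y) + (1 - x) * np.log(1 - y))  # Bernoulli log-likelihood
```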
Recall: What is the difference between a Bernoulli decoder and a Gaussian decoder? When would you use each?
Apply: A Bernoulli decoder outputs probabilities \(y = [0.9, 0.1, 0.8, 0.3]\) for a binary target \(x = [1, 0, 1, 0]\). Compute \(\log p_\theta(x|z)\). Is this a good reconstruction?
Extend: Why do Gaussian decoders with fixed variance produce blurry images? Think about what the decoder learns to output when the same code \(z\) maps to slightly different images \(x_1\) and \(x_2\) – what minimizes \(\|x - \mu\|^2\) averaged over both?
This is the paper’s central technical contribution. We have an encoder that outputs \(\mu\) and \(\sigma\), and we need to sample \(z\) from \(\mathcal{N}(\mu, \sigma^2)\) to pass to the decoder. But sampling is a random operation – gradients can’t flow backward through a dice roll. The reparameterization trick solves this.
Imagine a factory assembly line. Parts move from Station A (encoder) to Station B (decoder). Between them, a randomizer shuffles parts – sometimes a part goes left, sometimes right, unpredictably. You want to improve Station A based on feedback from Station B, but the randomizer breaks the chain of cause and effect. You can’t trace “this defective output came from that specific input” because the randomizer scrambled things.
The trick: move the randomizer before Station A. Now the random input enters the assembly line at the very beginning, and everything after that is a deterministic function. You can trace any output back to its input and compute exactly how changing Station A’s settings would change the output.
Mathematically, instead of sampling:
\[z \sim \mathcal{N}(\mu, \sigma^2)\]
we sample noise from a fixed distribution and compute \(z\) deterministically:
\[\epsilon \sim \mathcal{N}(0, I)\] \[z = \mu + \sigma \odot \epsilon\]
The distribution of \(z\) is unchanged – it’s still \(\mathcal{N}(\mu, \sigma^2)\). But now the gradient \(\frac{\partial z}{\partial \mu} = 1\) and \(\frac{\partial z}{\partial \sigma} = \epsilon\) are well-defined. Backpropagation works through the entire encoder-decoder pipeline.
This is the insight that made the entire VAE trainable. Before this trick, variational inference with neural networks required high-variance gradient estimators like REINFORCE (a method that estimates gradients by weighting sampled function values by their log-probability scores) that were too noisy for practical use. The reparameterization trick produces low-variance gradients because the randomness is moved outside the computational graph.
Let’s trace gradients with and without the trick for a 2D latent space.
Encoder output for some input \(x\): \(\mu = [1.2, -0.5]\), \(\sigma = [0.8, 0.3]\)
Sample noise: \(\epsilon = [0.7, -1.1]\) from \(\mathcal{N}(0, I)\)
Compute \(z\):
\[z = \mu + \sigma \odot \epsilon = [1.2, -0.5] + [0.8, 0.3] \odot [0.7, -1.1]\]
\[= [1.2, -0.5] + [0.56, -0.33] = [1.76, -0.83]\]
Gradients of \(z\) with respect to encoder parameters:
\[\frac{\partial z_1}{\partial \mu_1} = 1, \quad \frac{\partial z_1}{\partial \sigma_1} = \epsilon_1 = 0.7\]
\[\frac{\partial z_2}{\partial \mu_2} = 1, \quad \frac{\partial z_2}{\partial \sigma_2} = \epsilon_2 = -1.1\]
\[\frac{\partial z_1}{\partial \mu_2} = 0, \quad \frac{\partial z_2}{\partial \mu_1} = 0 \quad \text{(dimensions are independent)}\]
These gradients are clean and well-defined. Suppose the loss (negative ELBO) has gradient \(\frac{\partial \mathcal{L}}{\partial z} = [0.3, -0.2]\) coming back from the decoder. By the chain rule:
\[\frac{\partial \mathcal{L}}{\partial \mu_1} = \frac{\partial \mathcal{L}}{\partial z_1} \cdot \frac{\partial z_1}{\partial \mu_1} = 0.3 \times 1 = 0.3\]
\[\frac{\partial \mathcal{L}}{\partial \sigma_1} = \frac{\partial \mathcal{L}}{\partial z_1} \cdot \frac{\partial z_1}{\partial \sigma_1} = 0.3 \times 0.7 = 0.21\]
\[\frac{\partial \mathcal{L}}{\partial \mu_2} = -0.2 \times 1 = -0.2\]
\[\frac{\partial \mathcal{L}}{\partial \sigma_2} = -0.2 \times (-1.1) = 0.22\]
The gradients flow cleanly from the loss, through \(z\), back to \(\mu\) and \(\sigma\), and from there back through the encoder network to update \(\phi\). The entire encoder-decoder system is trained end-to-end.
Without the trick: \(z\) is sampled directly from \(\mathcal{N}(\mu, \sigma^2)\). The sampling operation has no well-defined gradient with respect to \(\mu\) and \(\sigma\). The alternative (REINFORCE estimator) would give \(\frac{\partial}{\partial \phi} \mathbb{E}_{q_\phi}[f(z)] = \mathbb{E}_{q_\phi}[f(z) \nabla_\phi \log q_\phi(z|x)]\), which has much higher variance and requires many samples to get a usable gradient signal.
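The worked gradients fit in a few lines; because \(z\) is a deterministic function of \(\mu\), \(\sigma\), and the fixed noise \(\epsilon\), the derivatives are exact:

```python
import numpy as np

mu = np.array([1.2, -0.5])
sigma = np.array([0.8, 0.3])
eps = np.array([0.7, -1.1])   # fixed noise sample from N(0, I)

# Reparameterized sample: deterministic in mu and sigma once eps is drawn
z = mu + sigma * eps

# Analytic gradients of z w.r.t. the encoder outputs
dz_dmu = np.ones_like(mu)     # dz_j/dmu_j = 1
dz_dsigma = eps               # dz_j/dsigma_j = eps_j

# Chain rule with the upstream gradient from the decoder
dL_dz = np.array([0.3, -0.2])
dL_dmu = dL_dz * dz_dmu
dL_dsigma = dL_dz * dz_dsigma
```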
Recall: In the reparameterization trick, what is held fixed during backpropagation: \(\mu\), \(\sigma\), or \(\epsilon\)? Which quantities receive gradients?
Apply: Given \(\mu = [0.5]\), \(\sigma = [1.2]\), and two noise samples \(\epsilon_1 = 0.3\) and \(\epsilon_2 = -0.8\), compute \(z_1\) and \(z_2\). Then compute \(\frac{\partial z}{\partial \mu}\) and \(\frac{\partial z}{\partial \sigma}\) for each. Which sample gives a larger gradient for \(\sigma\)?
Extend: The reparameterization trick only works for distributions where sampling can be expressed as a deterministic transformation of fixed noise. It works for Gaussians (\(z = \mu + \sigma \epsilon\)) and some other distributions. Can you think of a distribution where this trick would not work? (Hint: consider discrete distributions, like a Bernoulli – can you express “sample 0 or 1 with probability \(p\)” as a smooth function of \(p\) and some noise?)
We now have all the pieces. This lesson assembles them into the complete training algorithm and shows how to generate new data.
The VAE training algorithm (Algorithm 1 in the paper) processes data in minibatches (small random subsets of the training data, processed together in one update step). For each minibatch, it computes the ELBO and updates all parameters by gradient ascent.
The complete loss for a single data point, when both the prior and approximate posterior are Gaussian, has a convenient closed form:
\[\mathcal{L}(\theta, \phi; x^{(i)}) \simeq \frac{1}{2} \sum_{j=1}^{J} \left(1 + \log((\sigma_j^{(i)})^2) - (\mu_j^{(i)})^2 - (\sigma_j^{(i)})^2\right) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x^{(i)} | z^{(i,l)})\]
where \(z^{(i,l)} = \mu^{(i)} + \sigma^{(i)} \odot \epsilon^{(l)}\) and \(\epsilon^{(l)} \sim \mathcal{N}(0, I)\).
Training algorithm: for each minibatch, (1) encode each \(x^{(i)}\) to get \(\mu^{(i)}\) and \(\sigma^{(i)}\); (2) sample noise \(\epsilon^{(l)} \sim \mathcal{N}(0, I)\) and form \(z^{(i,l)} = \mu^{(i)} + \sigma^{(i)} \odot \epsilon^{(l)}\); (3) decode each \(z^{(i,l)}\) and evaluate the ELBO estimate above; (4) backpropagate and update \(\theta\) and \(\phi\) by gradient ascent.
Generating new data (after training): sample \(z \sim \mathcal{N}(0, I)\) from the prior, pass it through the decoder, and sample from (or take the mean of) \(p_\theta(x|z)\).
That’s it. No adversarial training, no MCMC chains, no iterative optimization. A single forward pass through the decoder turns noise into data.
The learned MNIST manifold with a 2D latent space. Each position in the grid is decoded into a digit image. Smooth transitions between classes (6 blending into 0, 9 into 7) show the VAE learned a continuous, structured representation where nearby codes produce similar images.
The Frey Face manifold in 2D latent space. Moving through the latent space smoothly varies facial expression and pose. The VAE discovered these meaningful factors without any labels.
Random samples generated by sampling \(z \sim \mathcal{N}(0, I)\) and decoding. Digits are recognizable but blurry – the characteristic VAE limitation caused by the Gaussian reconstruction loss, which averages over possible outputs rather than committing to sharp details.
Let’s trace one complete training step for a single data point.
Setup: latent dimension \(J = 2\), data dimension \(D = 3\) (binary), \(L = 1\) sample.
Data point: \(x = [1, 0, 1]\)
Step 1: Encode
Encoder outputs: \(\mu = [0.8, -0.3]\), \(\log \sigma^2 = [-0.5, 0.2]\)
\[\sigma = \exp([-0.25, 0.1]) = [0.779, 1.105]\]
Step 2: Sample (reparameterization trick)
Noise: \(\epsilon = [0.4, -0.7]\)
\[z = [0.8, -0.3] + [0.779, 1.105] \odot [0.4, -0.7] = [0.8, -0.3] + [0.312, -0.774] = [1.112, -1.074]\]
Step 3: Decode
Decoder outputs probabilities: \(y = [0.82, 0.15, 0.71]\) (after sigmoid)
Step 4: Compute ELBO
KL term (closed form, per dimension):
\[\text{KL}_1 = -\frac{1}{2}(1 + (-0.5) - 0.64 - 0.607) = -\frac{1}{2}(-0.747) = 0.374\]
\[\text{KL}_2 = -\frac{1}{2}(1 + 0.2 - 0.09 - 1.221) = -\frac{1}{2}(-0.111) = 0.056\]
\[-D_{KL} = -(0.374 + 0.056) = -0.430\]
Reconstruction term:
\[\log p_\theta(x|z) = \log(0.82) + \log(1 - 0.15) + \log(0.71)\]
\[= -0.198 + (-0.163) + (-0.342) = -0.703\]
ELBO:
\[\mathcal{L} = -0.430 + (-0.703) = -1.133\]
We maximize this, which means minimizing \(-\mathcal{L} = 1.133\). The KL penalty (\(0.430\)) is the cost of the encoder deviating from the prior. The reconstruction cost (\(0.703\)) is the price of imperfect decoding.
Step 5: Gradient update
Compute \(\nabla_\theta \mathcal{L}\) and \(\nabla_\phi \mathcal{L}\) via backpropagation through the entire encoder → sample → decoder pipeline (the reparameterization trick makes this possible). Update all weights.
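Everything up to the gradient update can be checked numerically (the decoder probabilities \(y\) are taken as given from Step 3):

```python
import numpy as np

mu = np.array([0.8, -0.3])
logvar = np.array([-0.5, 0.2])
eps = np.array([0.4, -0.7])
x = np.array([1.0, 0.0, 1.0])

# Steps 1-2: recover sigma and sample via the reparameterization trick
sigma = np.exp(0.5 * logvar)
z = mu + sigma * eps

# Step 3: decoder output, taken as given from the example
y = np.array([0.82, 0.15, 0.71])

# Step 4: ELBO = -KL + reconstruction log-likelihood
kl = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))
recon = np.sum(x * np.log(y) + (1 - x) * np.log(1 - y))
elbo = -kl + recon
```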
Recall: In the VAE loss function, which term can be computed in closed form (no sampling) and which requires sampling through the reparameterization trick?
Apply: For a 3D latent space with encoder outputs \(\mu = [0.5, -1.0, 0.2]\) and \(\sigma = [0.8, 0.5, 1.1]\), compute the KL divergence \(D_{KL}(q_\phi \| p)\) using the closed-form expression. Which latent dimension contributes the most to the KL divergence?
Extend: The paper found that \(L = 1\) (a single noise sample per data point) works well with minibatch size \(M = 100\). Why does a larger minibatch compensate for fewer samples? Think about what happens to the gradient variance as \(M\) increases.
Explain the encoder-decoder structure of the VAE in your own words. What does each network take as input and produce as output? Why do we need both?
Why does the ELBO have two terms pulling in opposite directions? What happens if you remove the KL term entirely? What happens if you set its weight very high?
Compare the VAE to the GAN (see Generative Adversarial Nets). Both generate new data from noise. What does the VAE provide that the GAN doesn’t? What does the GAN do better? Why?
The VAE produces blurry images while GANs produce sharp ones. Trace this back to a specific design choice in the VAE. What would you need to change to get sharper outputs?
The reparameterization trick is described as the paper’s key technical contribution. What specific problem does it solve? What was the alternative before this trick, and why was it impractical?
Implement a VAE from scratch using numpy to learn a 2D latent representation of toy data sampled from a mixture of three Gaussians.
import numpy as np
np.random.seed(42)
# ── Hyperparameters ──────────────────────────────────────────────
INPUT_DIM = 2
LATENT_DIM = 2
HIDDEN_DIM = 16
BATCH_SIZE = 64
LR = 0.001
TRAIN_STEPS = 5000
DATA_VARIANCE = 0.1 # fixed decoder variance
# ── Generate toy data ────────────────────────────────────────────
# Three Gaussian clusters forming a triangle
CENTERS = np.array([[3.0, 0.0], [-1.5, 2.6], [-1.5, -2.6]])
CLUSTER_STD = 0.4
def sample_data(n):
    """Sample n points from the mixture of 3 Gaussians."""
    labels = np.random.randint(0, 3, n)
    points = CENTERS[labels] + np.random.randn(n, 2) * CLUSTER_STD
    return points
# ── Activation functions ─────────────────────────────────────────
def sigmoid(x):
    return np.where(x >= 0,
                    1 / (1 + np.exp(-x)),
                    np.exp(x) / (1 + np.exp(x)))

def tanh(x):
    return np.tanh(x)

def dtanh(t):
    """Derivative of tanh, given tanh output t."""
    return 1 - t ** 2
# ── Weight initialization ───────────────────────────────────────
def init_weights(fan_in, fan_out):
    scale = np.sqrt(2.0 / (fan_in + fan_out))
    return np.random.randn(fan_in, fan_out) * scale
# ── Encoder ──────────────────────────────────────────────────────
# Architecture: input(2) -> hidden(16) -> mu(2), logvar(2)
enc_W1 = init_weights(INPUT_DIM, HIDDEN_DIM)
enc_b1 = np.zeros((1, HIDDEN_DIM))
enc_W_mu = init_weights(HIDDEN_DIM, LATENT_DIM)
enc_b_mu = np.zeros((1, LATENT_DIM))
enc_W_logvar = init_weights(HIDDEN_DIM, LATENT_DIM)
enc_b_logvar = np.zeros((1, LATENT_DIM))
def encoder_forward(x):
    """
    Forward pass through encoder.
    x: shape (batch, 2)
    Returns: (mu, logvar, cache)
        mu: shape (batch, 2) -- mean of q(z|x)
        logvar: shape (batch, 2) -- log variance of q(z|x)
        cache: values needed for backward pass
    """
    # TODO: Implement forward pass
    # Layer 1: linear + tanh
    # Output heads: mu (linear) and logvar (linear)
    pass
def encoder_backward(d_mu, d_logvar, cache):
    """
    Backward pass through encoder.
    d_mu: gradient of loss w.r.t. mu, shape (batch, 2)
    d_logvar: gradient of loss w.r.t. logvar, shape (batch, 2)
    cache: saved values from forward pass
    Returns: dict of gradients for each weight matrix and bias
    """
    # TODO: Implement backward pass
    # Combine gradients from both output heads through the shared hidden layer
    pass
# ── Decoder ──────────────────────────────────────────────────────
# Architecture: latent(2) -> hidden(16) -> output(2)
dec_W1 = init_weights(LATENT_DIM, HIDDEN_DIM)
dec_b1 = np.zeros((1, HIDDEN_DIM))
dec_W2 = init_weights(HIDDEN_DIM, INPUT_DIM)
dec_b2 = np.zeros((1, INPUT_DIM))
def decoder_forward(z):
    """
    Forward pass through decoder (Gaussian with fixed variance).
    z: shape (batch, 2)
    Returns: (x_recon, cache)
        x_recon: shape (batch, 2) -- reconstructed mean
        cache: values needed for backward pass
    """
    # TODO: Implement forward pass
    # Layer 1: linear + tanh
    # Layer 2: linear (no activation -- output is the Gaussian mean)
    pass
def decoder_backward(d_output, cache):
    """
    Backward pass through decoder.
    d_output: gradient of loss w.r.t. decoder output, shape (batch, 2)
    cache: saved values from forward pass
    Returns: (dict of gradients, d_z) where d_z is gradient w.r.t. input z
    """
    # TODO: Implement backward pass
    pass
# ── Loss computation ─────────────────────────────────────────────
def compute_loss(x, x_recon, mu, logvar):
"""
Compute the negative ELBO (the loss to minimize).
Returns: (loss, d_x_recon, d_mu, d_logvar)
loss: scalar, the negative ELBO averaged over the batch
d_x_recon: gradient of loss w.r.t. x_recon
d_mu: gradient of loss w.r.t. mu
d_logvar: gradient of loss w.r.t. logvar
"""
batch_size = x.shape[0]
# TODO: Compute reconstruction loss (Gaussian log-likelihood with fixed variance)
# recon_loss = 0.5 * sum((x - x_recon)^2) / DATA_VARIANCE
# This is -log p(x|z) up to a constant
# TODO: Compute KL divergence (closed form for diagonal Gaussian vs N(0,I))
# kl_loss = -0.5 * sum(1 + logvar - mu^2 - exp(logvar))
# TODO: Total loss = (recon_loss + kl_loss) / batch_size
# TODO: Compute gradients of loss w.r.t. x_recon, mu, logvar
# d_x_recon = (x_recon - x) / (DATA_VARIANCE * batch_size)
# d_mu = mu / batch_size
# d_logvar = 0.5 * (exp(logvar) - 1) / batch_size
pass
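The formulas in the comments translate almost line-for-line into code. A sketch, with `DATA_VAR` as a stand-in for the `DATA_VARIANCE` constant defined earlier (the value 0.1 here is an assumption for illustration):

```python
import numpy as np

DATA_VAR = 0.1  # assumed stand-in for DATA_VARIANCE

def compute_loss_sketch(x, x_recon, mu, logvar):
    n = x.shape[0]
    # Reconstruction: -log p(x|z) up to a constant, for a fixed-variance Gaussian
    recon_loss = 0.5 * np.sum((x - x_recon) ** 2) / DATA_VAR
    # KL( N(mu, diag(exp(logvar))) || N(0, I) ), closed form
    kl_loss = -0.5 * np.sum(1 + logvar - mu ** 2 - np.exp(logvar))
    loss = (recon_loss + kl_loss) / n
    # Gradients of the (batch-averaged) loss
    d_x_recon = (x_recon - x) / (DATA_VAR * n)
    d_mu = mu / n
    d_logvar = 0.5 * (np.exp(logvar) - 1) / n
    return loss, d_x_recon, d_mu, d_logvar
```

A quick sanity check: with a perfect reconstruction and `mu = 0`, `logvar = 0` (i.e. q(z|x) equal to the prior), both terms and all gradients are exactly zero.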
# ── Reparameterization trick ────────────────────────────────────
def reparameterize(mu, logvar):
"""
Sample z = mu + sigma * epsilon, where epsilon ~ N(0, I).
Returns: (z, epsilon)
"""
# TODO: Implement the reparameterization trick
pass
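The trick itself is two lines: all the randomness goes into `epsilon`, so `z` is a deterministic, differentiable function of `mu` and `logvar` given `epsilon`. A sketch:

```python
import numpy as np

def reparameterize_sketch(mu, logvar):
    # epsilon carries all the randomness
    epsilon = np.random.randn(*mu.shape)
    # sigma = exp(0.5 * logvar), so z ~ N(mu, sigma^2) elementwise
    z = mu + np.exp(0.5 * logvar) * epsilon
    return z, epsilon
```

As `logvar` goes to negative infinity, `sigma` goes to zero and `z` collapses to `mu`, which is a handy property for testing.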
# ── Training loop ────────────────────────────────────────────────
for step in range(TRAIN_STEPS):
# Sample minibatch
x = sample_data(BATCH_SIZE)
# ── Forward pass ──────────────────────────────────────────
# 1. Encode: x -> mu, logvar
# 2. Reparameterize: sample z
# 3. Decode: z -> x_recon
# 4. Compute loss and gradients
# TODO: Implement forward pass and loss computation
# ── Backward pass ─────────────────────────────────────────
# 1. Backprop through decoder to get gradients for decoder weights and d_z
# 2. Backprop through reparameterization: d_mu += d_z, d_logvar += d_z * 0.5 * epsilon * exp(0.5 * logvar)
# 3. Backprop through encoder to get gradients for encoder weights
# TODO: Implement backward pass
# ── Parameter update (SGD) ────────────────────────────────
# TODO: Update all weights and biases with learning rate LR
# ── Logging ───────────────────────────────────────────────
if step % 500 == 0:
# Compute loss on a larger batch for stable reporting
x_eval = sample_data(1000)
mu_e, logvar_e, _ = encoder_forward(x_eval)
z_e, _ = reparameterize(mu_e, logvar_e)
x_recon_e, _ = decoder_forward(z_e)
recon = 0.5 * np.sum((x_eval - x_recon_e) ** 2) / (DATA_VARIANCE * 1000)
kl = -0.5 * np.sum(1 + logvar_e - mu_e ** 2 - np.exp(logvar_e)) / 1000
print(f"Step {step:5d} | Loss: {recon + kl:.3f} | "
f"Recon: {recon:.3f} | KL: {kl:.3f}")
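The reparameterization backprop rule in step 2 of the backward pass (`d_mu += d_z`, `d_logvar += d_z * 0.5 * epsilon * exp(0.5 * logvar)`) can be verified against finite differences. A standalone check with made-up values; the downstream loss `sum(z^2)` is hypothetical, chosen only to give a concrete `d_z`:

```python
import numpy as np

mu = np.array([[0.3, -0.5]])
logvar = np.array([[0.2, -0.1]])
epsilon = np.array([[1.1, -0.4]])

def z_of(mu, logvar):
    # z = mu + sigma * epsilon, with epsilon held fixed
    return mu + np.exp(0.5 * logvar) * epsilon

# Pretend the downstream loss is sum(z^2), so d_z = 2 * z
z = z_of(mu, logvar)
d_z = 2 * z

# Chain rule, exactly as in the training-loop comments
d_mu = d_z                                             # dz/dmu = 1
d_logvar = d_z * 0.5 * epsilon * np.exp(0.5 * logvar)  # dz/dlogvar

# Finite-difference check on logvar[0, 0]
h = 1e-6
lv_plus = logvar.copy();  lv_plus[0, 0] += h
lv_minus = logvar.copy(); lv_minus[0, 0] -= h
num = (np.sum(z_of(mu, lv_plus) ** 2) - np.sum(z_of(mu, lv_minus) ** 2)) / (2 * h)
assert abs(num - d_logvar[0, 0]) < 1e-4
```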
# ── Generation ───────────────────────────────────────────────────
print("\nGenerating new samples from the prior...")
z_samples = np.random.randn(10, LATENT_DIM)
generated, _ = decoder_forward(z_samples)
print("Generated points:")
for i in range(10):
print(f" z=[{z_samples[i,0]:+.2f}, {z_samples[i,1]:+.2f}] -> "
f"x=[{generated[i,0]:+.2f}, {generated[i,1]:+.2f}]")
# ── Verify cluster structure ─────────────────────────────────────
print("\nCluster centers in data space:", CENTERS.tolist())
print("Do generated points cluster near the three centers?")

After 5000 training steps, the VAE should learn the three-cluster structure:
Step 0 | Loss: 23.417 | Recon: 23.105 | KL: 0.312
Step 500 | Loss: 5.832 | Recon: 4.291 | KL: 1.541
Step 1000 | Loss: 3.215 | Recon: 1.824 | KL: 1.391
Step 1500 | Loss: 2.647 | Recon: 1.318 | KL: 1.329
Step 2000 | Loss: 2.401 | Recon: 1.095 | KL: 1.306
Step 2500 | Loss: 2.289 | Recon: 0.988 | KL: 1.301
Step 3000 | Loss: 2.221 | Recon: 0.923 | KL: 1.298
Step 3500 | Loss: 2.184 | Recon: 0.887 | KL: 1.297
Step 4000 | Loss: 2.159 | Recon: 0.862 | KL: 1.296
Step 4500 | Loss: 2.142 | Recon: 0.846 | KL: 1.296
Generating new samples from the prior...
Generated points:
z=[+0.50, -0.14] -> x=[+2.87, -0.12]
z=[-0.57, +1.02] -> x=[-1.32, +2.41]
z=[+0.33, +0.75] -> x=[+1.15, +1.83]
z=[-1.10, -0.21] -> x=[-1.62, -0.45]
z=[+1.24, -0.38] -> x=[+3.21, -0.53]
...
Key things to verify:
- The total loss decreases steadily, with the reconstruction term doing most of the falling while the KL term settles near a stable value (about 1.3 here).
- Samples drawn from the prior map to points near the three cluster centers rather than spreading uniformly.

Note: exact numbers will vary due to random initialization.