Before we can understand GANs, we need to understand what a discriminator does. A discriminator is just a binary classifier: given an input, it answers “real or fake?” with a number between 0 and 1.
Think of a quality inspector at a factory. Products come down a conveyor belt – some are genuine, some are defective counterfeits that slipped in. The inspector examines each product and decides: “genuine” or “counterfeit.” A perfect inspector correctly identifies every item. A useless inspector just guesses randomly (50/50).
A discriminator \(D(x)\) is a function that takes a data sample \(x\) and outputs a probability: \(D(x) = 0.9\) means “I’m 90% sure this is real.” We want \(D(x)\) close to 1 for real data and close to 0 for fake data.
How do we train such a classifier? We use binary cross-entropy loss (also called log-loss). If we have a sample \(x\) with a true label \(y\) (1 for real, 0 for fake), the loss for that sample is:
\[L(D(x), y) = -[y \log D(x) + (1 - y) \log(1 - D(x))]\]
The discriminator minimizes this loss over a batch of real and fake samples. Equivalently, it maximizes:
\[\frac{1}{m} \sum_{i=1}^{m} \left[\log D(x^{(i)}) + \log(1 - D(x_{\text{fake}}^{(i)}))\right]\]
where \(x^{(i)}\) are real samples and \(x_{\text{fake}}^{(i)}\) are fake samples.
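As a concrete (and entirely made-up) miniature example, here is how the loss and the equivalent objective look in numpy for a batch of two real and two fake samples; the discriminator outputs are invented for illustration:

```python
import numpy as np

def bce_loss(d_out, y):
    """Binary cross-entropy: y = 1 for real, y = 0 for fake."""
    return -(y * np.log(d_out) + (1 - y) * np.log(1 - d_out))

# Hypothetical discriminator outputs on two real and two fake samples.
d_real = np.array([0.9, 0.7])   # D(x) on real samples
d_fake = np.array([0.2, 0.4])   # D(x_fake) on fake samples

loss = bce_loss(d_real, 1).mean() + bce_loss(d_fake, 0).mean()
objective = (np.log(d_real) + np.log(1 - d_fake)).mean()  # what D maximizes
print(loss, objective)  # loss == -objective: minimizing one maximizes the other
```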
The simplest discriminator is a single-layer neural network: a linear transformation followed by a sigmoid activation:
\[D(x) = \sigma(w^T x + b) = \frac{1}{1 + e^{-(w^T x + b)}}\]
Let’s build a discriminator for 2D data. Suppose our real data comes from a cluster centered at \([2, 2]\) and our fake data comes from a cluster centered at \([0, 0]\).
We have weights \(w = [0.5, 0.5]\) and bias \(b = -1.0\).
Real sample: \(x = [2.1, 1.9]\)
\[w^T x + b = 0.5 \times 2.1 + 0.5 \times 1.9 + (-1.0) = 1.05 + 0.95 - 1.0 = 1.0\] \[D(x) = \sigma(1.0) = \frac{1}{1 + e^{-1.0}} = \frac{1}{1 + 0.368} = 0.731\]
The discriminator says this is 73.1% likely to be real. Not bad, but not confident yet.
Fake sample: \(x_{\text{fake}} = [0.3, -0.2]\)
\[w^T x_{\text{fake}} + b = 0.5 \times 0.3 + 0.5 \times (-0.2) + (-1.0) = 0.15 - 0.1 - 1.0 = -0.95\] \[D(x_{\text{fake}}) = \sigma(-0.95) = \frac{1}{1 + e^{0.95}} = \frac{1}{1 + 2.586} = 0.279\]
The discriminator says this is 27.9% likely to be real (72.1% likely to be fake). Reasonable.
Loss for this pair:
\[L = -[\log(0.731) + \log(1 - 0.279)] = -[-0.313 + (-0.327)] = 0.640\]
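The arithmetic above is easy to check in a few lines of numpy; this sketch just reproduces the worked example, using the same weights and samples:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w, b = np.array([0.5, 0.5]), -1.0
x_real = np.array([2.1, 1.9])
x_fake = np.array([0.3, -0.2])

d_real = sigmoid(w @ x_real + b)            # ~0.731
d_fake = sigmoid(w @ x_fake + b)            # ~0.279
loss = -(np.log(d_real) + np.log(1 - d_fake))
print(d_real, d_fake, loss)                 # ~0.731, ~0.279, ~0.640
```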
Gradient update: The gradient of the loss with respect to \(w\) points in the direction that would reduce the loss. Training moves \(w\) to make \(D\) output higher values for real samples and lower values for fake samples. After many updates, the decision boundary shifts until the discriminator becomes more confident.
Recall: What is the range of the sigmoid function \(\sigma(z)\)? What value does \(\sigma(0)\) return?
Apply: Given \(w = [1.0, -0.5]\), \(b = 0.0\), compute \(D(x)\) for \(x = [1.0, 2.0]\). Is the discriminator confident that this sample is real or fake?
Extend: If the real and fake data distributions overlap completely (identical distributions), what would the optimal discriminator output for every input? Why? (Hint: think about what a Bayesian classifier would do when \(P(\text{real}|x) = P(\text{fake}|x)\).)
Now that we know what a discriminator does, we need something to generate the fake samples. The generator takes random noise and transforms it into something that looks like real data.
Think of an artist who creates paintings. The artist doesn’t copy existing paintings – they start with a blank canvas (the random noise) and apply their skills (the neural network weights) to produce something new. A beginner artist produces scribbles that are obviously not real paintings. A master artist produces works that could fool an art critic.
A generator \(G(z)\) takes a noise vector \(z\) drawn from a simple distribution (like a standard Gaussian \(z \sim \mathcal{N}(0, I)\)) and transforms it into a data sample in the same space as the real data. If the real data is 2D points, \(G\) outputs 2D points. If the real data is 28x28 images, \(G\) outputs 28x28 images.
The simplest generator is a linear transformation followed by an activation function:
\[G(z) = \tanh(W_g z + b_g)\]
The key idea is that different noise vectors \(z\) produce different outputs \(G(z)\). As training progresses, the generator learns a mapping that transforms the simple noise distribution into the complex data distribution. Points that are close in \(z\)-space produce similar outputs, so the generator learns a smooth manifold of data-like samples.
Let’s build a generator that maps 2D noise to 2D data. Our target is data centered around \([2, 2]\).
Starting weights: \(W_g = [[0.5, 0.3], [0.2, 0.7]]\), \(b_g = [1.0, 1.0]\).
Noise sample: \(z = [0.5, -0.3]\) (drawn from standard Gaussian)
\[W_g z + b_g = \begin{bmatrix} 0.5 & 0.3 \\ 0.2 & 0.7 \end{bmatrix} \begin{bmatrix} 0.5 \\ -0.3 \end{bmatrix} + \begin{bmatrix} 1.0 \\ 1.0 \end{bmatrix}\]
\[= \begin{bmatrix} 0.5 \times 0.5 + 0.3 \times (-0.3) \\ 0.2 \times 0.5 + 0.7 \times (-0.3) \end{bmatrix} + \begin{bmatrix} 1.0 \\ 1.0 \end{bmatrix} = \begin{bmatrix} 0.25 - 0.09 \\ 0.10 - 0.21 \end{bmatrix} + \begin{bmatrix} 1.0 \\ 1.0 \end{bmatrix} = \begin{bmatrix} 1.16 \\ 0.89 \end{bmatrix}\]
\[G(z) = \tanh\left(\begin{bmatrix} 1.16 \\ 0.89 \end{bmatrix}\right) = \begin{bmatrix} 0.821 \\ 0.712 \end{bmatrix}\]
The generator produced the point \([0.821, 0.712]\). The real data is centered at \([2, 2]\), so this fake is far off – a beginner counterfeiter. Through training, the weights \(W_g\) and \(b_g\) will shift so that the output moves toward \([2, 2]\).
Second noise sample: \(z = [-0.8, 1.2]\)
\[W_g z + b_g = \begin{bmatrix} 0.5 \times (-0.8) + 0.3 \times 1.2 \\ 0.2 \times (-0.8) + 0.7 \times 1.2 \end{bmatrix} + \begin{bmatrix} 1.0 \\ 1.0 \end{bmatrix} = \begin{bmatrix} -0.04 \\ 0.68 \end{bmatrix} + \begin{bmatrix} 1.0 \\ 1.0 \end{bmatrix} = \begin{bmatrix} 0.96 \\ 1.68 \end{bmatrix}\]
\[G(z) = \tanh\left(\begin{bmatrix} 0.96 \\ 1.68 \end{bmatrix}\right) = \begin{bmatrix} 0.744 \\ 0.932 \end{bmatrix}\]
Different noise \(\rightarrow\) different output. Both are still far from \([2, 2]\), but they are different from each other, showing that the generator is not collapsed to a single point.
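For reference, both forward passes can be reproduced with a short numpy snippet; the weights and noise vectors are the ones used in the example above:

```python
import numpy as np

W_g = np.array([[0.5, 0.3], [0.2, 0.7]])
b_g = np.array([1.0, 1.0])

def G(z):
    """Toy generator: linear map plus tanh, as in the example."""
    return np.tanh(W_g @ z + b_g)

print(G(np.array([0.5, -0.3])))   # ~[0.821, 0.712]
print(G(np.array([-0.8, 1.2])))   # ~[0.744, 0.932]
```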
Recall: Why does the generator take random noise as input instead of, say, a fixed vector? What would happen if every call to \(G\) used the same input?
Apply: Given \(W_g = [[1.0, 0.0], [0.0, 1.0]]\) (identity matrix) and \(b_g = [2.0, 2.0]\), compute \(G(z)\) for \(z = [0.1, -0.1]\) using \(G(z) = \tanh(W_g z + b_g)\). How close is the output to \([2, 2]\)?
Extend: The \(\tanh\) function constrains outputs to \((-1, 1)\). What problem does this create if the real data has values outside this range (e.g., pixel values in \([0, 255]\))? How might you fix this?
Before we can analyze GAN training theoretically, we need to know: for a fixed generator \(G\), what is the best possible discriminator?
Imagine you’re a statistician given two piles of coins: one pile is genuine currency, the other is counterfeits from a specific counterfeiter. You want to build the best possible classifier to distinguish them. With enough coins from each pile, you can estimate the probability of seeing any particular feature (weight, color, texture) in each pile. The optimal strategy is Bayes’ rule: for a coin with feature \(x\), classify it as genuine if \(P(\text{genuine} | x) > P(\text{counterfeit} | x)\).
For GANs, the same principle applies. The real data has distribution \(p_\text{data}(x)\) and the generator’s output has distribution \(p_g(x)\). The optimal discriminator applies Bayes’ rule:
\[D^*_G(x) = \frac{p_\text{data}(x)}{p_\text{data}(x) + p_g(x)}\]
This comes from maximizing the discriminator’s objective. The discriminator wants to maximize:
\[V(D) = \int_x p_\text{data}(x) \log D(x) + p_g(x) \log(1 - D(x)) \, dx\]
For each point \(x\), this is a function of the form \(a \log(y) + b \log(1 - y)\), where \(a = p_\text{data}(x)\) and \(b = p_g(x)\). Taking the derivative with respect to \(y\), setting it to zero, and solving gives \(y = a/(a+b)\).
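Spelling out that calculus step:

\[\frac{d}{dy}\left[a \log y + b \log(1 - y)\right] = \frac{a}{y} - \frac{b}{1 - y} = 0 \;\Longrightarrow\; a(1 - y) = b y \;\Longrightarrow\; y = \frac{a}{a + b}.\]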
The critical insight: when the generator perfectly matches the data (\(p_g = p_\text{data}\)), the optimal discriminator outputs \(D^*(x) = 1/2\) for every \(x\). It cannot tell real from fake because there is no difference. This is the equilibrium point.
Suppose both the real data and the generator produce 1D data. At a specific point \(x = 3.0\), the real data density is \(p_\text{data}(3.0) = 0.25\) and the generator density is \(p_g(3.0) = 0.10\).
The optimal discriminator at this point:
\[D^*(3.0) = \frac{0.25}{0.25 + 0.10} = \frac{0.25}{0.35} = 0.714\]
The discriminator is 71.4% confident this point is real. This makes sense – real data is 2.5x more likely to produce this value than the generator.
Now suppose the generator improves and matches the data at this point: \(p_g(3.0) = 0.25\).
\[D^*(3.0) = \frac{0.25}{0.25 + 0.25} = \frac{0.25}{0.50} = 0.500\]
The discriminator is now at 50% – maximum uncertainty. It genuinely cannot tell.
Let’s verify the optimality by checking nearby values. At \(D = 0.5\), the objective per-point is:
\[0.25 \times \log(0.5) + 0.25 \times \log(0.5) = 0.25 \times (-0.693) + 0.25 \times (-0.693) = -0.347\]
At \(D = 0.6\):
\[0.25 \times \log(0.6) + 0.25 \times \log(0.4) = 0.25 \times (-0.511) + 0.25 \times (-0.916) = -0.357\]
Indeed, \(D = 0.5\) gives a higher (less negative) objective than \(D = 0.6\), confirming optimality.
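The same optimality check can be done numerically by sweeping \(D\) over a grid at this one point; this standalone sketch uses the densities from the example:

```python
import numpy as np

p_data, p_g = 0.25, 0.10   # densities at x = 3.0 before the generator improves

def per_point_objective(d):
    return p_data * np.log(d) + p_g * np.log(1 - d)

d_grid = np.linspace(0.01, 0.99, 99)
best_d = d_grid[np.argmax(per_point_objective(d_grid))]
print(best_d, p_data / (p_data + p_g))   # both ~0.71, matching D*(x)
```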
Recall: When the generator perfectly matches the data distribution, what does the optimal discriminator output? What does this mean intuitively?
Apply: At point \(x = 5.0\), the real data density is \(p_\text{data}(5.0) = 0.05\) and the generator density is \(p_g(5.0) = 0.20\). Compute \(D^*(5.0)\). Is this point more likely to be real or fake?
Extend: Suppose the generator puts zero probability mass on some region (i.e., \(p_g(x) = 0\) for some \(x\)). What does \(D^*(x)\) equal in that region? What problem might this create for training? (Hint: think about what gradient signal \(G\) receives for samples it never produces.)
We now have all the pieces: a discriminator, a generator, and the optimal discriminator formula. In this lesson, we put them together into the adversarial game and discover what it’s really optimizing.
Think of a thermostat and a heater. The thermostat measures temperature (like the discriminator measuring “realness”) and the heater adjusts its output (like the generator adjusting its samples). When they are in equilibrium, the room temperature matches the target – neither the thermostat nor the heater needs to change. The minimax game in GANs works similarly: the two networks push each other until they reach an equilibrium where the generator’s distribution matches the data.
The GAN training objective is:
\[\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_\text{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log (1 - D(G(z)))]\]
Reading from right to left: for a fixed generator \(G\), the discriminator \(D\) maximizes \(V\) (it tries to correctly classify real vs. fake). Then the generator \(G\) minimizes the result (it tries to fool the best possible discriminator).
When we plug in the optimal discriminator \(D^*_G\), the generator’s objective simplifies to:
\[C(G) = -\log(4) + 2 \cdot JSD(p_\text{data} \| p_g)\]
The Jensen-Shannon divergence (JSD) measures how different two probability distributions are. It is defined as:
\[JSD(p \| q) = \frac{1}{2} KL\left(p \left\| \frac{p + q}{2}\right.\right) + \frac{1}{2} KL\left(q \left\| \frac{p + q}{2}\right.\right)\]
where \(KL(p \| q) = \int p(x) \log \frac{p(x)}{q(x)} dx\) is the Kullback-Leibler divergence.
Key properties of JSD:

- Always \(\geq 0\)
- Equals 0 if and only if \(p = q\) (the distributions are identical)
- Symmetric: \(JSD(p \| q) = JSD(q \| p)\) (unlike KL divergence)
- Bounded: \(0 \leq JSD(p \| q) \leq \log 2\)
Since \(C(G) = -\log 4 + 2 \cdot JSD\), and JSD is minimized (equals 0) only when \(p_g = p_\text{data}\), the global minimum is \(C(G) = -\log 4 \approx -1.386\), achieved when the generator perfectly matches the data.
This is the paper’s central theoretical result: the minimax game has a unique global optimum, and it corresponds to the generator learning the true data distribution.
Let’s compute JSD for two simple discrete distributions.
Distribution P (real data): \(P = [0.7, 0.2, 0.1]\) over three outcomes
Distribution Q (generator): \(Q = [0.3, 0.3, 0.4]\) over three outcomes
First, compute the mixture \(M = (P + Q) / 2 = [0.5, 0.25, 0.25]\).
KL(P || M):
\[KL(P \| M) = 0.7 \log\frac{0.7}{0.5} + 0.2 \log\frac{0.2}{0.25} + 0.1 \log\frac{0.1}{0.25}\]
\[= 0.7 \times 0.336 + 0.2 \times (-0.223) + 0.1 \times (-0.916)\]
\[= 0.235 - 0.045 - 0.092 = 0.099\]
KL(Q || M):
\[KL(Q \| M) = 0.3 \log\frac{0.3}{0.5} + 0.3 \log\frac{0.3}{0.25} + 0.4 \log\frac{0.4}{0.25}\]
\[= 0.3 \times (-0.511) + 0.3 \times 0.182 + 0.4 \times 0.470\]
\[= -0.153 + 0.055 + 0.188 = 0.089\]
JSD(P || Q):
\[JSD = \frac{1}{2}(0.099 + 0.089) = 0.094\]
Generator cost:
\[C(G) = -\log 4 + 2 \times 0.094 = -1.386 + 0.188 = -1.198\]
This is higher than the optimum \(-1.386\), confirming the generator hasn’t matched the data yet. The difference \(0.188\) is the “price” the generator pays for not matching \(P\).
Now suppose \(Q\) improves to \(Q' = [0.7, 0.2, 0.1] = P\):
\[JSD(P \| Q') = 0, \quad C(G) = -1.386\]
The minimum is reached.
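The same numbers are easy to reproduce in numpy; this standalone sketch assumes neither distribution has zero entries (plain KL is undefined there):

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

def jsd(p, q):
    m = (p + q) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.3, 0.3, 0.4])

j = jsd(P, Q)
print(j, -np.log(4) + 2 * j)   # ~0.094 and ~-1.198
print(jsd(P, P.copy()))        # 0.0: the generator cost hits -log(4) at the optimum
```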
Recall: What quantity does GAN training implicitly minimize? Is this quantity symmetric (does \(JSD(p \| q) = JSD(q \| p)\))?
Apply: Compute \(JSD(P \| Q)\) for \(P = [0.5, 0.5]\) and \(Q = [0.5, 0.5]\) (identical distributions). Then compute it for \(P = [1.0, 0.0]\) and \(Q = [0.0, 1.0]\) (maximally different). Use natural log.
Extend: The KL divergence \(KL(p \| q)\) is undefined when \(q(x) = 0\) but \(p(x) > 0\) (because \(\log(p/0) = \infty\)). Explain why the JSD avoids this problem. Why does this make JSD a better choice for GAN training than KL divergence?
This is the payoff lesson. We combine the discriminator, generator, optimal discriminator theory, and minimax game into the actual training algorithm. We also confront the practical challenges that make GAN training notoriously difficult.
Recall the counterfeiter and detective analogy. Training a GAN is like running this competition in rounds:
Round 1 (Update D):

1. Sample \(m\) real data points from the training set
2. Sample \(m\) noise vectors and pass them through \(G\) to get \(m\) fake data points
3. Update \(D\)’s weights to better distinguish real from fake (gradient ascent on \(D\)’s objective)
4. Repeat for \(k\) steps (the paper uses \(k = 1\))
Round 2 (Update G):

1. Sample \(m\) noise vectors and pass them through \(G\)
2. Pass the fakes through \(D\) to get scores
3. Update \(G\)’s weights to make \(D\) output higher scores for the fakes (gradient descent on \(G\)’s objective)
The discriminator update computes this gradient and ascends:
\[\nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^m \left[ \log D(x^{(i)}) + \log(1 - D(G(z^{(i)}))) \right]\]
The generator update computes this gradient and descends:
\[\nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^m \log(1 - D(G(z^{(i)})))\]
The practical trick: Early in training, \(D\) easily distinguishes the terrible fakes, so \(D(G(z)) \approx 0\) and \(\log(1 - D(G(z))) \approx \log(1) = 0\). The gradient is nearly zero – the generator receives almost no learning signal. The fix: instead of minimizing \(\log(1 - D(G(z)))\), maximize \(\log D(G(z))\). When \(D(G(z)) \approx 0\), the gradient of \(\log D(G(z))\) is very large, providing a strong push for \(G\) to improve.
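To put numbers on this, here is a small standalone check; the pre-activation value \(-4.6\) is just an illustrative choice that makes \(D(G(z)) \approx 0.01\):

```python
import numpy as np

sigmoid = lambda s: 1 / (1 + np.exp(-s))

# Early in training the discriminator's pre-activation for a fake is very
# negative, so D(G(z)) = sigmoid(s) is close to 0.
s = -4.6
d_fake = sigmoid(s)                 # ~0.01

# d/ds log(1 - sigmoid(s)) = -sigmoid(s): saturates near zero
grad_saturating = -d_fake           # ~ -0.01
# d/ds log(sigmoid(s)) = 1 - sigmoid(s): stays large
grad_nonsaturating = 1 - d_fake     # ~ 0.99

print(grad_saturating, grad_nonsaturating)  # the trick gives ~100x more signal
```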
GANs are notoriously hard to train. Two major failure modes:
Mode collapse (the “Helvetica scenario”): The generator finds one output that fools the discriminator and maps every noise vector \(z\) to that same output. For image generation, this means producing the same image regardless of input noise. The generator has “collapsed” to a single mode of the data distribution.
Oscillation: Instead of converging, \(G\) and \(D\) chase each other in circles. \(G\) learns to fool \(D\) in one way, \(D\) adapts, \(G\) switches to a different strategy, and neither settles down. The loss oscillates without converging.
Both failures stem from the same root cause: the theoretical convergence proof requires the discriminator to be optimal at each step, but in practice we only take a few gradient steps on \(D\) before updating \(G\). This mismatch between theory and practice is the fundamental challenge of GAN training.
Let’s trace one training step with concrete numbers.
Setup: 1D data, real distribution centered at \(\mu = 3.0\). Generator and discriminator are single-layer networks.
Discriminator: \(D(x) = \sigma(w_d \cdot x + b_d)\) with \(w_d = 0.5\), \(b_d = -1.0\)
Generator: \(G(z) = w_g \cdot z + b_g\) with \(w_g = 0.3\), \(b_g = 0.5\) (linear, no activation for simplicity)
Step 1: Sample data

- Real sample: \(x = 3.2\) (drawn from \(\mathcal{N}(3.0, 0.5)\))
- Noise sample: \(z = 0.8\) (drawn from \(\mathcal{N}(0, 1)\))
- Generated fake: \(G(z) = 0.3 \times 0.8 + 0.5 = 0.74\)
Step 2: Compute discriminator outputs

- \(D(x) = \sigma(0.5 \times 3.2 - 1.0) = \sigma(0.6) = 0.646\)
- \(D(G(z)) = \sigma(0.5 \times 0.74 - 1.0) = \sigma(-0.63) = 0.347\)
The discriminator is somewhat correct: it gives the real sample 64.6% and the fake 34.7%.
Step 3: Compute discriminator loss gradient
The objective (to maximize) is \(\log D(x) + \log(1 - D(G(z)))\):
\[\log(0.646) + \log(1 - 0.347) = -0.437 + (-0.426) = -0.863\]
Gradient with respect to \(w_d\) (using the chain rule through the sigmoid, spelled out below the list):

- From real term: \(\frac{\partial}{\partial w_d} \log D(x) = (1 - D(x)) \cdot x = (1 - 0.646) \times 3.2 = 1.133\)
- From fake term: \(\frac{\partial}{\partial w_d} \log(1 - D(G(z))) = -D(G(z)) \cdot G(z) = -0.347 \times 0.74 = -0.257\)
- Total: \(1.133 + (-0.257) = 0.876\)
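Both per-term expressions follow from the standard sigmoid identities, with pre-activation \(s = w_d \cdot u + b_d\), where \(u\) is the discriminator’s input (\(x\) for the real term, \(G(z)\) for the fake term):

\[\frac{d}{ds}\log \sigma(s) = 1 - \sigma(s), \qquad \frac{d}{ds}\log(1 - \sigma(s)) = -\sigma(s), \qquad \frac{\partial s}{\partial w_d} = u.\]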
With learning rate \(\alpha = 0.1\): \(w_d \leftarrow 0.5 + 0.1 \times 0.876 = 0.588\)
Step 4: Compute generator loss gradient
The generator wants to maximize \(\log D(G(z))\) (the practical trick). The chain rule gives:

- \(\frac{\partial \log D(G(z))}{\partial w_g} = \frac{1}{D(G(z))} \cdot D(G(z))(1 - D(G(z))) \cdot w_d \cdot z\)
- \(= (1 - D(G(z))) \cdot w_d \cdot z\)
- \(= (1 - 0.347) \times 0.5 \times 0.8 = 0.261\)
With learning rate \(\alpha = 0.1\): \(w_g \leftarrow 0.3 + 0.1 \times 0.261 = 0.326\)
After this step, \(w_g\) increased, which means \(G(z)\) will produce larger values – shifting the generated distribution closer to the real data centered at 3.0. That’s exactly what we want.
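The whole trace fits in a few lines of numpy. This sketch simply reproduces the numbers above; to match the worked example, both gradients are evaluated from the same forward pass (with the initial \(w_d = 0.5\)) before the updates are applied:

```python
import numpy as np

sigmoid = lambda s: 1 / (1 + np.exp(-s))

w_d, b_d = 0.5, -1.0          # discriminator: D(x) = sigmoid(w_d * x + b_d)
w_g, b_g = 0.3, 0.5           # generator:     G(z) = w_g * z + b_g
x, z, lr = 3.2, 0.8, 0.1      # real sample, noise sample, learning rate

g_z = w_g * z + b_g                           # 0.74
d_real = sigmoid(w_d * x + b_d)               # ~0.646
d_fake = sigmoid(w_d * g_z + b_d)             # ~0.347

# Discriminator gradient: ascent on log D(x) + log(1 - D(G(z)))
grad_wd = (1 - d_real) * x - d_fake * g_z     # ~0.876
# Generator gradient: ascent on log D(G(z)), using the current w_d
grad_wg = (1 - d_fake) * w_d * z              # ~0.261

w_d += lr * grad_wd                           # -> ~0.588
w_g += lr * grad_wg                           # -> ~0.326
print(w_d, w_g)
```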
Recall: In Algorithm 1, the discriminator is updated for \(k\) steps before the generator is updated once. Why not update both simultaneously? What goes wrong if \(k\) is too small?
Apply: Given \(D(x) = \sigma(2x - 3)\), compute \(D(x)\) for \(x = 2.0\) (real sample) and \(x = 0.5\) (fake sample). Then compute the discriminator’s objective \(\log D(2.0) + \log(1 - D(0.5))\).
Extend: The paper mentions that early in training, \(\log(1 - D(G(z)))\) saturates. Compute \(\log(1 - D(G(z)))\) and its derivative with respect to \(D(G(z))\) when \(D(G(z)) = 0.01\) (early training, discriminator easily rejects fakes). Then compute \(\log D(G(z))\) and its derivative at the same point. Which gradient is larger? By how much?
Why can’t a GAN compute the probability of a given input? Explain the difference between an implicit generative model (like a GAN) and an explicit one (like a Gaussian mixture model).
Why does the GAN objective use \(\log\) instead of, say, squared error? What would happen if the discriminator’s objective were \((D(x) - 1)^2 + (D(G(z)) - 0)^2\) instead of \(\log D(x) + \log(1 - D(G(z)))\)? (This is actually a valid variant – least-squares GAN.)
The convergence proof requires the discriminator to be optimal at every step. In practice, we only take \(k\) gradient steps on \(D\). Why doesn’t this break training entirely? What balancing act must \(k\) satisfy?
Explain mode collapse in your own words. Give a concrete example: if training on MNIST digits, what would mode collapse look like? Why doesn’t the minimax objective prevent it?
How does the GAN’s approach to generation differ from the VAE’s? What tradeoff does each approach make? (For context: the VAE, covered later in this collection, maximizes a variational lower bound on \(\log p(x)\).)
Implement a GAN from scratch using numpy to generate samples from a 1D Gaussian distribution.
import numpy as np
np.random.seed(42)
# ── Hyperparameters ──────────────────────────────────────────────
NOISE_DIM = 1
HIDDEN_DIM = 16
BATCH_SIZE = 128
LR_D = 0.01
LR_G = 0.01
TRAIN_STEPS = 5000
REAL_MEAN = 4.0
REAL_STD = 1.25
# ── Activation functions ─────────────────────────────────────────
def sigmoid(x):
"""Numerically stable sigmoid."""
return np.where(x >= 0,
1 / (1 + np.exp(-x)),
np.exp(x) / (1 + np.exp(x)))
def tanh(x):
return np.tanh(x)
def dtanh(x):
"""Derivative of tanh, given tanh output."""
return 1 - x ** 2
# ── Weight initialization ───────────────────────────────────────
def init_weights(fan_in, fan_out):
"""Xavier initialization."""
scale = np.sqrt(2.0 / (fan_in + fan_out))
return np.random.randn(fan_in, fan_out) * scale
# ── Discriminator ────────────────────────────────────────────────
# Architecture: input(1) -> hidden(HIDDEN_DIM) -> output(1)
d_W1 = init_weights(1, HIDDEN_DIM)
d_b1 = np.zeros((1, HIDDEN_DIM))
d_W2 = init_weights(HIDDEN_DIM, 1)
d_b2 = np.zeros((1, 1))
def discriminator_forward(x):
"""
Forward pass through discriminator.
x: shape (batch, 1)
Returns: (output, cache) where output is shape (batch, 1) in (0, 1)
"""
# TODO: Implement forward pass
# Layer 1: linear + tanh
# Layer 2: linear + sigmoid
# Return output and cache needed for backward pass
pass
def discriminator_backward(d_output, cache):
"""
Backward pass through discriminator.
d_output: gradient of loss w.r.t. discriminator output, shape (batch, 1)
cache: saved values from forward pass
Returns: gradients (d_W1, d_b1, d_W2, d_b2, d_input)
"""
# TODO: Implement backward pass using chain rule
pass
# ── Generator ────────────────────────────────────────────────────
# Architecture: input(NOISE_DIM) -> hidden(HIDDEN_DIM) -> output(1)
g_W1 = init_weights(NOISE_DIM, HIDDEN_DIM)
g_b1 = np.zeros((1, HIDDEN_DIM))
g_W2 = init_weights(HIDDEN_DIM, 1)
g_b2 = np.zeros((1, 1))
def generator_forward(z):
"""
Forward pass through generator.
z: shape (batch, NOISE_DIM)
Returns: (output, cache) where output is shape (batch, 1)
"""
# TODO: Implement forward pass
# Layer 1: linear + tanh
# Layer 2: linear (no activation -- output can be any real number)
pass
def generator_backward(d_output, cache):
"""
Backward pass through generator.
d_output: gradient of loss w.r.t. generator output, shape (batch, 1)
cache: saved values from forward pass
Returns: gradients (g_W1, g_b1, g_W2, g_b2)
"""
# TODO: Implement backward pass using chain rule
pass
# ── Training loop ────────────────────────────────────────────────
for step in range(TRAIN_STEPS):
# ── Sample real data and noise ───────────────────────────
real_data = np.random.normal(REAL_MEAN, REAL_STD, (BATCH_SIZE, 1))
noise = np.random.randn(BATCH_SIZE, NOISE_DIM)
# ── Train Discriminator ──────────────────────────────────
# 1. Forward pass: compute D(real) and D(G(z))
# 2. Compute gradients of: log D(real) + log(1 - D(G(z)))
# 3. Update discriminator weights (gradient ASCENT -- we maximize)
# TODO: Implement discriminator training step
# ── Train Generator ──────────────────────────────────────
# 1. Sample fresh noise
# 2. Forward pass: z -> G(z) -> D(G(z))
# 3. Compute gradients of: log D(G(z)) (the practical trick)
# 4. Update generator weights (gradient ASCENT -- we maximize log D(G(z)))
# TODO: Implement generator training step
# ── Logging ──────────────────────────────────────────────
if step % 500 == 0:
fake_samples = generator_forward(np.random.randn(1000, NOISE_DIM))[0]
fake_mean = fake_samples.mean()
fake_std = fake_samples.std()
print(f"Step {step:5d} | "
f"D(real)={0:.3f} D(fake)={0:.3f} | " # TODO: fill in actual values
f"G mean={fake_mean:.3f} std={fake_std:.3f} | "
f"Target mean={REAL_MEAN:.3f} std={REAL_STD:.3f}")
# ── Final evaluation ─────────────────────────────────────────────
final_noise = np.random.randn(10000, NOISE_DIM)
final_samples = generator_forward(final_noise)[0]
print(f"\nFinal generator distribution:")
print(f" Mean: {final_samples.mean():.3f} (target: {REAL_MEAN})")
print(f" Std: {final_samples.std():.3f} (target: {REAL_STD})")
print(f" Min: {final_samples.min():.3f}")
print(f"  Max:  {final_samples.max():.3f}")

After 5000 training steps, the generator should approximately match the target distribution:
Step 0 | D(real)=0.500 D(fake)=0.500 | G mean=-0.012 std=0.347 | Target mean=4.000 std=1.250
Step 500 | D(real)=0.721 D(fake)=0.334 | G mean=1.832 std=0.654 | Target mean=4.000 std=1.250
Step 1000 | D(real)=0.643 D(fake)=0.398 | G mean=2.891 std=0.923 | Target mean=4.000 std=1.250
Step 1500 | D(real)=0.589 D(fake)=0.432 | G mean=3.412 std=1.087 | Target mean=4.000 std=1.250
Step 2000 | D(real)=0.554 D(fake)=0.461 | G mean=3.721 std=1.156 | Target mean=4.000 std=1.250
Step 2500 | D(real)=0.531 D(fake)=0.478 | G mean=3.876 std=1.201 | Target mean=4.000 std=1.250
Step 3000 | D(real)=0.519 D(fake)=0.488 | G mean=3.944 std=1.228 | Target mean=4.000 std=1.250
Step 3500 | D(real)=0.511 D(fake)=0.494 | G mean=3.971 std=1.239 | Target mean=4.000 std=1.250
Step 4000 | D(real)=0.506 D(fake)=0.498 | G mean=3.989 std=1.244 | Target mean=4.000 std=1.250
Step 4500 | D(real)=0.503 D(fake)=0.499 | G mean=3.996 std=1.248 | Target mean=4.000 std=1.250
Final generator distribution:
Mean: 3.998 (target: 4.0)
Std: 1.249 (target: 1.25)
Min: -0.312
Max: 8.274
Key things to verify:

- \(D(\text{real})\) and \(D(\text{fake})\) both converge toward 0.5 (the discriminator can’t tell the difference)
- Generator mean converges toward 4.0 and std converges toward 1.25
- The convergence is gradual – the generator slowly shifts its output distribution
Note: exact numbers will vary due to random initialization, but the overall trend should match.