Before we can understand GANs, we need to understand what a discriminator does. A discriminator is just a binary classifier: given an input, it answers “real or fake?” with a number between 0 and 1.
Think of a quality inspector at a factory. Products come down a conveyor belt – some are genuine, some are defective counterfeits that slipped in. The inspector examines each product and decides: “genuine” or “counterfeit.” A perfect inspector correctly identifies every item. A useless inspector just guesses randomly (50/50).
A discriminator \(D(x)\) is a function that takes a data sample \(x\) and outputs a probability: \(D(x) = 0.9\) means “I’m 90% sure this is real.” We want \(D(x)\) close to 1 for real data and close to 0 for fake data.
How do we train such a classifier? We use binary cross-entropy loss (also called log-loss). If we have a sample \(x\) with a true label \(y\) (1 for real, 0 for fake), the loss for that sample is:
\[L(D(x), y) = -[y \log D(x) + (1 - y) \log(1 - D(x))]\]
The discriminator minimizes this loss over a batch of real and fake samples. Equivalently, it maximizes:
\[\frac{1}{m} \sum_{i=1}^{m} \left[\log D(x^{(i)}) + \log(1 - D(x_{\text{fake}}^{(i)}))\right]\]
where \(x^{(i)}\) are real samples and \(x_{\text{fake}}^{(i)}\) are fake samples.
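As a concrete (and entirely made-up) miniature example, here is how the loss and the equivalent objective look in numpy for a batch of two real and two fake samples; the discriminator outputs are invented for illustration:

```python
import numpy as np

def bce_loss(d_out, y):
    """Binary cross-entropy: y = 1 for real, y = 0 for fake."""
    return -(y * np.log(d_out) + (1 - y) * np.log(1 - d_out))

# Hypothetical discriminator outputs on two real and two fake samples.
d_real = np.array([0.9, 0.7])   # D(x) on real samples
d_fake = np.array([0.2, 0.4])   # D(x_fake) on fake samples

loss = bce_loss(d_real, 1).mean() + bce_loss(d_fake, 0).mean()
objective = (np.log(d_real) + np.log(1 - d_fake)).mean()  # what D maximizes
print(loss, objective)  # loss == -objective: minimizing one maximizes the other
```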
The simplest discriminator is a single-layer neural network: a linear transformation followed by a sigmoid activation:
\[D(x) = \sigma(w^T x + b) = \frac{1}{1 + e^{-(w^T x + b)}}\]
Let’s build a discriminator for 2D data. Suppose our real data comes from a cluster centered at \([2, 2]\) and our fake data comes from a cluster centered at \([0, 0]\).
We have weights \(w = [0.5, 0.5]\) and bias \(b = -1.0\).
Real sample: \(x = [2.1, 1.9]\)
\[w^T x + b = 0.5 \times 2.1 + 0.5 \times 1.9 + (-1.0) = 1.05 + 0.95 - 1.0 = 1.0\] \[D(x) = \sigma(1.0) = \frac{1}{1 + e^{-1.0}} = \frac{1}{1 + 0.368} = 0.731\]
The discriminator says this is 73.1% likely to be real. Not bad, but not confident yet.
Fake sample: \(x_{\text{fake}} = [0.3, -0.2]\)
\[w^T x_{\text{fake}} + b = 0.5 \times 0.3 + 0.5 \times (-0.2) + (-1.0) = 0.15 - 0.1 - 1.0 = -0.95\] \[D(x_{\text{fake}}) = \sigma(-0.95) = \frac{1}{1 + e^{0.95}} = \frac{1}{1 + 2.586} = 0.279\]
The discriminator says this is 27.9% likely to be real (72.1% likely to be fake). Reasonable.
Loss for this pair:
\[L = -[\log(0.731) + \log(1 - 0.279)] = -[-0.313 + (-0.327)] = 0.640\]
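The arithmetic above is easy to check in a few lines of numpy; this sketch just reproduces the worked example, using the same weights and samples:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w, b = np.array([0.5, 0.5]), -1.0
x_real = np.array([2.1, 1.9])
x_fake = np.array([0.3, -0.2])

d_real = sigmoid(w @ x_real + b)            # ~0.731
d_fake = sigmoid(w @ x_fake + b)            # ~0.279
loss = -(np.log(d_real) + np.log(1 - d_fake))
print(d_real, d_fake, loss)                 # ~0.731, ~0.279, ~0.640
```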
Gradient update: The gradient of the loss with respect to \(w\) points in the direction that would reduce the loss. Training moves \(w\) to make \(D\) output higher values for real samples and lower values for fake samples. After many updates, the decision boundary shifts until the discriminator becomes more confident.
Recall: What is the range of the sigmoid function \(\sigma(z)\)? What value does \(\sigma(0)\) return?
Apply: Given \(w = [1.0, -0.5]\), \(b = 0.0\), compute \(D(x)\) for \(x = [1.0, 2.0]\). Is the discriminator confident that this sample is real or fake?
Extend: If the real and fake data distributions overlap completely (identical distributions), what would the optimal discriminator output for every input? Why? (Hint: think about what a Bayesian classifier would do when \(P(\text{real}|x) = P(\text{fake}|x)\).)
Now that we know what a discriminator does, we need something to generate the fake samples. The generator takes random noise and transforms it into something that looks like real data.
Think of an artist who creates paintings. The artist doesn’t copy existing paintings – they start with a blank canvas (the random noise) and apply their skills (the neural network weights) to produce something new. A beginner artist produces scribbles that are obviously not real paintings. A master artist produces works that could fool an art critic.
A generator \(G(z)\) takes a noise vector \(z\) drawn from a simple distribution (like a standard Gaussian \(z \sim \mathcal{N}(0, I)\)) and transforms it into a data sample in the same space as the real data. If the real data is 2D points, \(G\) outputs 2D points. If the real data is 28x28 images, \(G\) outputs 28x28 images.
The simplest generator is a linear transformation followed by an activation function:
\[G(z) = \tanh(W_g z + b_g)\]
The key idea is that different noise vectors \(z\) produce different outputs \(G(z)\). As training progresses, the generator learns a mapping that transforms the simple noise distribution into the complex data distribution. Points that are close in \(z\)-space produce similar outputs, so the generator learns a smooth manifold of data-like samples.
Let’s build a generator that maps 2D noise to 2D data. Our target is data centered around \([2, 2]\).
Starting weights: \(W_g = [[0.5, 0.3], [0.2, 0.7]]\), \(b_g = [1.0, 1.0]\).
Noise sample: \(z = [0.5, -0.3]\) (drawn from standard Gaussian)
\[W_g z + b_g = \begin{bmatrix} 0.5 & 0.3 \\ 0.2 & 0.7 \end{bmatrix} \begin{bmatrix} 0.5 \\ -0.3 \end{bmatrix} + \begin{bmatrix} 1.0 \\ 1.0 \end{bmatrix}\]
\[= \begin{bmatrix} 0.5 \times 0.5 + 0.3 \times (-0.3) \\ 0.2 \times 0.5 + 0.7 \times (-0.3) \end{bmatrix} + \begin{bmatrix} 1.0 \\ 1.0 \end{bmatrix} = \begin{bmatrix} 0.25 - 0.09 \\ 0.10 - 0.21 \end{bmatrix} + \begin{bmatrix} 1.0 \\ 1.0 \end{bmatrix} = \begin{bmatrix} 1.16 \\ 0.89 \end{bmatrix}\]
\[G(z) = \tanh\left(\begin{bmatrix} 1.16 \\ 0.89 \end{bmatrix}\right) = \begin{bmatrix} 0.821 \\ 0.712 \end{bmatrix}\]
The generator produced the point \([0.821, 0.712]\). The real data is centered at \([2, 2]\), so this fake is far off – a beginner counterfeiter. Through training, the weights \(W_g\) and \(b_g\) will shift so that the output moves toward \([2, 2]\).
Second noise sample: \(z = [-0.8, 1.2]\)
\[W_g z + b_g = \begin{bmatrix} 0.5 \times (-0.8) + 0.3 \times 1.2 \\ 0.2 \times (-0.8) + 0.7 \times 1.2 \end{bmatrix} + \begin{bmatrix} 1.0 \\ 1.0 \end{bmatrix} = \begin{bmatrix} -0.04 \\ 0.68 \end{bmatrix} + \begin{bmatrix} 1.0 \\ 1.0 \end{bmatrix} = \begin{bmatrix} 0.96 \\ 1.68 \end{bmatrix}\]
\[G(z) = \tanh\left(\begin{bmatrix} 0.96 \\ 1.68 \end{bmatrix}\right) = \begin{bmatrix} 0.744 \\ 0.932 \end{bmatrix}\]
Different noise \(\rightarrow\) different output. Both are still far from \([2, 2]\), but they are different from each other, showing that the generator is not collapsed to a single point.
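For reference, both forward passes can be reproduced with a short numpy snippet; the weights and noise vectors are the ones used in the example above:

```python
import numpy as np

W_g = np.array([[0.5, 0.3], [0.2, 0.7]])
b_g = np.array([1.0, 1.0])

def G(z):
    """Toy generator: linear map plus tanh, as in the example."""
    return np.tanh(W_g @ z + b_g)

print(G(np.array([0.5, -0.3])))   # ~[0.821, 0.712]
print(G(np.array([-0.8, 1.2])))   # ~[0.744, 0.932]
```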
Recall: Why does the generator take random noise as input instead of, say, a fixed vector? What would happen if every call to \(G\) used the same input?
Apply: Given \(W_g = [[1.0, 0.0], [0.0, 1.0]]\) (identity matrix) and \(b_g = [2.0, 2.0]\), compute \(G(z)\) for \(z = [0.1, -0.1]\) using \(G(z) = \tanh(W_g z + b_g)\). How close is the output to \([2, 2]\)?
Extend: The \(\tanh\) function constrains outputs to \((-1, 1)\). What problem does this create if the real data has values outside this range (e.g., pixel values in \([0, 255]\))? How might you fix this?
Before we can analyze GAN training theoretically, we need to know: for a fixed generator \(G\), what is the best possible discriminator?
Imagine you’re a statistician given two piles of coins: one pile is genuine currency, the other is counterfeits from a specific counterfeiter. You want to build the best possible classifier to distinguish them. With enough coins from each pile, you can estimate the probability of seeing any particular feature (weight, color, texture) in each pile. The optimal strategy is Bayes’ rule: for a coin with feature \(x\), classify it as genuine if \(P(\text{genuine} | x) > P(\text{counterfeit} | x)\).
For GANs, the same principle applies. The real data has distribution \(p_\text{data}(x)\) and the generator’s output has distribution \(p_g(x)\). The optimal discriminator applies Bayes’ rule:
\[D^*_G(x) = \frac{p_\text{data}(x)}{p_\text{data}(x) + p_g(x)}\]
This comes from maximizing the discriminator’s objective. The discriminator wants to maximize:
\[V(D) = \int_x p_\text{data}(x) \log D(x) + p_g(x) \log(1 - D(x)) \, dx\]
For each point \(x\), this is a function of the form \(a \log(y) + b \log(1 - y)\), where \(a = p_\text{data}(x)\) and \(b = p_g(x)\). Taking the derivative with respect to \(y\), setting it to zero, and solving gives \(y = a/(a+b)\).
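Spelling out that calculus step:

\[\frac{d}{dy}\left[a \log y + b \log(1 - y)\right] = \frac{a}{y} - \frac{b}{1 - y} = 0 \;\Longrightarrow\; a(1 - y) = b y \;\Longrightarrow\; y = \frac{a}{a + b}.\]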
The critical insight: when the generator perfectly matches the data (\(p_g = p_\text{data}\)), the optimal discriminator outputs \(D^*(x) = 1/2\) for every \(x\). It cannot tell real from fake because there is no difference. This is the equilibrium point.
Suppose both the real data and the generator produce 1D data. At a specific point \(x = 3.0\), the real data density is \(p_\text{data}(3.0) = 0.25\) and the generator density is \(p_g(3.0) = 0.10\).
The optimal discriminator at this point:
\[D^*(3.0) = \frac{0.25}{0.25 + 0.10} = \frac{0.25}{0.35} = 0.714\]
The discriminator is 71.4% confident this point is real. This makes sense – real data is 2.5x more likely to produce this value than the generator.
Now suppose the generator improves and matches the data at this point: \(p_g(3.0) = 0.25\).
\[D^*(3.0) = \frac{0.25}{0.25 + 0.25} = \frac{0.25}{0.50} = 0.500\]
The discriminator is now at 50% – maximum uncertainty. It genuinely cannot tell.
Let’s verify the optimality by checking nearby values. At \(D = 0.5\), the objective per-point is:
\[0.25 \times \log(0.5) + 0.25 \times \log(0.5) = 0.25 \times (-0.693) + 0.25 \times (-0.693) = -0.347\]
At \(D = 0.6\):
\[0.25 \times \log(0.6) + 0.25 \times \log(0.4) = 0.25 \times (-0.511) + 0.25 \times (-0.916) = -0.357\]
Indeed, \(D = 0.5\) gives a higher (less negative) objective than \(D = 0.6\), confirming optimality.
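The same optimality check can be done numerically by sweeping \(D\) over a grid at this one point; this standalone sketch uses the densities from the example:

```python
import numpy as np

p_data, p_g = 0.25, 0.10   # densities at x = 3.0 before the generator improves

def per_point_objective(d):
    return p_data * np.log(d) + p_g * np.log(1 - d)

d_grid = np.linspace(0.01, 0.99, 99)
best_d = d_grid[np.argmax(per_point_objective(d_grid))]
print(best_d, p_data / (p_data + p_g))   # both ~0.71, matching D*(x)
```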
Recall: When the generator perfectly matches the data distribution, what does the optimal discriminator output? What does this mean intuitively?
Apply: At point \(x = 5.0\), the real data density is \(p_\text{data}(5.0) = 0.05\) and the generator density is \(p_g(5.0) = 0.20\). Compute \(D^*(5.0)\). Is this point more likely to be real or fake?
Extend: Suppose the generator puts zero probability mass on some region (i.e., \(p_g(x) = 0\) for some \(x\)). What does \(D^*(x)\) equal in that region? What problem might this create for training? (Hint: think about what gradient signal \(G\) receives for samples it never produces.)
We now have all the pieces: a discriminator, a generator, and the optimal discriminator formula. In this lesson, we put them together into the adversarial game and discover what it’s really optimizing.
Think of a thermostat and a heater. The thermostat measures temperature (like the discriminator measuring “realness”) and the heater adjusts its output (like the generator adjusting its samples). When they are in equilibrium, the room temperature matches the target – neither the thermostat nor the heater needs to change. The minimax game in GANs works similarly: the two networks push each other until they reach an equilibrium where the generator’s distribution matches the data.
The GAN training objective is:
\[\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_\text{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log (1 - D(G(z)))]\]
Reading from right to left: for a fixed generator \(G\), the discriminator \(D\) maximizes \(V\) (it tries to correctly classify real vs. fake). Then the generator \(G\) minimizes the result (it tries to fool the best possible discriminator).
When we plug in the optimal discriminator \(D^*_G\), the generator’s objective simplifies to:
\[C(G) = -\log(4) + 2 \cdot JSD(p_\text{data} \| p_g)\]
The Jensen-Shannon divergence (JSD) measures how different two probability distributions are. It is defined as:
\[JSD(p \| q) = \frac{1}{2} KL\left(p \left\| \frac{p + q}{2}\right.\right) + \frac{1}{2} KL\left(q \left\| \frac{p + q}{2}\right.\right)\]
where \(KL(p \| q) = \int p(x) \log \frac{p(x)}{q(x)} dx\) is the Kullback-Leibler divergence.
Key properties of JSD:

- Always \(\geq 0\)
- Equals 0 if and only if \(p = q\) (the distributions are identical)
- Symmetric: \(JSD(p \| q) = JSD(q \| p)\) (unlike KL divergence)
- Bounded: \(0 \leq JSD(p \| q) \leq \log 2\)
Since \(C(G) = -\log 4 + 2 \cdot JSD\), and JSD is minimized (equals 0) only when \(p_g = p_\text{data}\), the global minimum is \(C(G) = -\log 4 \approx -1.386\), achieved when the generator perfectly matches the data.
This is the paper’s central theoretical result: the minimax game has a unique global optimum, and it corresponds to the generator learning the true data distribution.
Let’s compute JSD for two simple discrete distributions.
Distribution P (real data): \(P = [0.7, 0.2, 0.1]\) over three outcomes
Distribution Q (generator): \(Q = [0.3, 0.3, 0.4]\) over three outcomes
First, compute the mixture \(M = (P + Q) / 2 = [0.5, 0.25, 0.25]\).
KL(P || M):
\[KL(P \| M) = 0.7 \log\frac{0.7}{0.5} + 0.2 \log\frac{0.2}{0.25} + 0.1 \log\frac{0.1}{0.25}\]
\[= 0.7 \times 0.336 + 0.2 \times (-0.223) + 0.1 \times (-0.916)\]
\[= 0.235 - 0.045 - 0.092 = 0.099\]
KL(Q || M):
\[KL(Q \| M) = 0.3 \log\frac{0.3}{0.5} + 0.3 \log\frac{0.3}{0.25} + 0.4 \log\frac{0.4}{0.25}\]
\[= 0.3 \times (-0.511) + 0.3 \times 0.182 + 0.4 \times 0.470\]
\[= -0.153 + 0.055 + 0.188 = 0.089\]
JSD(P || Q):
\[JSD = \frac{1}{2}(0.099 + 0.089) = 0.094\]
Generator cost:
\[C(G) = -\log 4 + 2 \times 0.094 = -1.386 + 0.188 = -1.198\]
This is higher than the optimum \(-1.386\), confirming the generator hasn’t matched the data yet. The difference \(0.188\) is the “price” the generator pays for not matching \(P\).
Now suppose \(Q\) improves to \(Q' = [0.7, 0.2, 0.1] = P\):
\[JSD(P \| Q') = 0, \quad C(G) = -1.386\]
The minimum is reached.
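The same numbers are easy to reproduce in numpy; this standalone sketch assumes neither distribution has zero entries (plain KL is undefined there):

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

def jsd(p, q):
    m = (p + q) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.3, 0.3, 0.4])

j = jsd(P, Q)
print(j, -np.log(4) + 2 * j)   # ~0.094 and ~-1.198
print(jsd(P, P.copy()))        # 0.0: the generator cost hits -log(4) at the optimum
```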
Recall: What quantity does GAN training implicitly minimize? Is this quantity symmetric (does \(JSD(p \| q) = JSD(q \| p)\))?
Apply: Compute \(JSD(P \| Q)\) for \(P = [0.5, 0.5]\) and \(Q = [0.5, 0.5]\) (identical distributions). Then compute it for \(P = [1.0, 0.0]\) and \(Q = [0.0, 1.0]\) (maximally different). Use natural log.
Extend: The KL divergence \(KL(p \| q)\) is undefined when \(q(x) = 0\) but \(p(x) > 0\) (because \(\log(p/0) = \infty\)). Explain why the JSD avoids this problem. Why does this make JSD a better choice for GAN training than KL divergence?
This is the payoff lesson. We combine the discriminator, generator, optimal discriminator theory, and minimax game into the actual training algorithm. We also confront the practical challenges that make GAN training notoriously difficult.
Recall the counterfeiter and detective analogy. Training a GAN is like running this competition in rounds:
Round 1 (Update D):

1. Sample \(m\) real data points from the training set
2. Sample \(m\) noise vectors and pass them through \(G\) to get \(m\) fake data points
3. Update \(D\)’s weights to better distinguish real from fake (gradient ascent on \(D\)’s objective)
4. Repeat for \(k\) steps (the paper uses \(k = 1\))
Round 2 (Update G):

1. Sample \(m\) noise vectors and pass them through \(G\)
2. Pass the fakes through \(D\) to get scores
3. Update \(G\)’s weights to make \(D\) output higher scores for the fakes (gradient descent on \(G\)’s objective)
The discriminator update computes this gradient and ascends:
\[\nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^m \left[ \log D(x^{(i)}) + \log(1 - D(G(z^{(i)}))) \right]\]
The generator update computes this gradient and descends:
\[\nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^m \log(1 - D(G(z^{(i)})))\]
The practical trick: Early in training, \(D\) easily distinguishes the terrible fakes, so \(D(G(z)) \approx 0\) and \(\log(1 - D(G(z))) \approx \log(1) = 0\). The gradient is nearly zero – the generator receives almost no learning signal. The fix: instead of minimizing \(\log(1 - D(G(z)))\), maximize \(\log D(G(z))\). When \(D(G(z)) \approx 0\), the gradient of \(\log D(G(z))\) is very large, providing a strong push for \(G\) to improve.
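To put numbers on this, here is a small standalone check; the pre-activation value \(-4.6\) is just an illustrative choice that makes \(D(G(z)) \approx 0.01\):

```python
import numpy as np

sigmoid = lambda s: 1 / (1 + np.exp(-s))

# Early in training the discriminator's pre-activation for a fake is very
# negative, so D(G(z)) = sigmoid(s) is close to 0.
s = -4.6
d_fake = sigmoid(s)                 # ~0.01

# d/ds log(1 - sigmoid(s)) = -sigmoid(s): saturates near zero
grad_saturating = -d_fake           # ~ -0.01
# d/ds log(sigmoid(s)) = 1 - sigmoid(s): stays large
grad_nonsaturating = 1 - d_fake     # ~ 0.99

print(grad_saturating, grad_nonsaturating)  # the trick gives ~100x more signal
```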
GANs are notoriously hard to train. Two major failure modes:
Mode collapse (the “Helvetica scenario”): The generator finds one output that fools the discriminator and maps every noise vector \(z\) to that same output. For image generation, this means producing the same image regardless of input noise. The generator has “collapsed” to a single mode of the data distribution.
Oscillation: Instead of converging, \(G\) and \(D\) chase each other in circles. \(G\) learns to fool \(D\) in one way, \(D\) adapts, \(G\) switches to a different strategy, and neither settles down. The loss oscillates without converging.
Both failures stem from the same root cause: the theoretical convergence proof requires the discriminator to be optimal at each step, but in practice we only take a few gradient steps on \(D\) before updating \(G\). This mismatch between theory and practice is the fundamental challenge of GAN training.
Let’s trace one training step with concrete numbers.
Setup: 1D data, real distribution centered at \(\mu = 3.0\). Generator and discriminator are single-layer networks.
Discriminator: \(D(x) = \sigma(w_d \cdot x + b_d)\) with \(w_d = 0.5\), \(b_d = -1.0\)
Generator: \(G(z) = w_g \cdot z + b_g\) with \(w_g = 0.3\), \(b_g = 0.5\) (linear, no activation for simplicity)
Step 1: Sample data

- Real sample: \(x = 3.2\) (drawn from \(\mathcal{N}(3.0, 0.5)\))
- Noise sample: \(z = 0.8\) (drawn from \(\mathcal{N}(0, 1)\))
- Generated fake: \(G(z) = 0.3 \times 0.8 + 0.5 = 0.74\)
Step 2: Compute discriminator outputs

- \(D(x) = \sigma(0.5 \times 3.2 - 1.0) = \sigma(0.6) = 0.646\)
- \(D(G(z)) = \sigma(0.5 \times 0.74 - 1.0) = \sigma(-0.63) = 0.347\)
The discriminator is somewhat correct: it gives the real sample 64.6% and the fake 34.7%.
Step 3: Compute discriminator loss gradient
The objective (to maximize) is \(\log D(x) + \log(1 - D(G(z)))\):
\[\log(0.646) + \log(1 - 0.347) = -0.437 + (-0.426) = -0.863\]
Gradient with respect to \(w_d\) (using the chain rule through the sigmoid, spelled out below the list):

- From real term: \(\frac{\partial}{\partial w_d} \log D(x) = (1 - D(x)) \cdot x = (1 - 0.646) \times 3.2 = 1.133\)
- From fake term: \(\frac{\partial}{\partial w_d} \log(1 - D(G(z))) = -D(G(z)) \cdot G(z) = -0.347 \times 0.74 = -0.257\)
- Total: \(1.133 + (-0.257) = 0.876\)
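Both per-term expressions follow from the standard sigmoid identities, with pre-activation \(s = w_d \cdot u + b_d\), where \(u\) is the discriminator’s input (\(x\) for the real term, \(G(z)\) for the fake term):

\[\frac{d}{ds}\log \sigma(s) = 1 - \sigma(s), \qquad \frac{d}{ds}\log(1 - \sigma(s)) = -\sigma(s), \qquad \frac{\partial s}{\partial w_d} = u.\]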
With learning rate \(\alpha = 0.1\): \(w_d \leftarrow 0.5 + 0.1 \times 0.876 = 0.588\)
Step 4: Compute generator loss gradient
The generator wants to maximize \(\log D(G(z))\) (the practical trick). The chain rule gives:

- \(\frac{\partial \log D(G(z))}{\partial w_g} = \frac{1}{D(G(z))} \cdot D(G(z))(1 - D(G(z))) \cdot w_d \cdot z\)
- \(= (1 - D(G(z))) \cdot w_d \cdot z\)
- \(= (1 - 0.347) \times 0.5 \times 0.8 = 0.261\)
With learning rate \(\alpha = 0.1\): \(w_g \leftarrow 0.3 + 0.1 \times 0.261 = 0.326\)
After this step, \(w_g\) increased, which means \(G(z)\) will produce larger values – shifting the generated distribution closer to the real data centered at 3.0. That’s exactly what we want.
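The whole trace fits in a few lines of numpy. This sketch simply reproduces the numbers above; to match the worked example, both gradients are evaluated from the same forward pass (with the initial \(w_d = 0.5\)) before the updates are applied:

```python
import numpy as np

sigmoid = lambda s: 1 / (1 + np.exp(-s))

w_d, b_d = 0.5, -1.0          # discriminator: D(x) = sigmoid(w_d * x + b_d)
w_g, b_g = 0.3, 0.5           # generator:     G(z) = w_g * z + b_g
x, z, lr = 3.2, 0.8, 0.1      # real sample, noise sample, learning rate

g_z = w_g * z + b_g                           # 0.74
d_real = sigmoid(w_d * x + b_d)               # ~0.646
d_fake = sigmoid(w_d * g_z + b_d)             # ~0.347

# Discriminator gradient: ascent on log D(x) + log(1 - D(G(z)))
grad_wd = (1 - d_real) * x - d_fake * g_z     # ~0.876
# Generator gradient: ascent on log D(G(z)), using the current w_d
grad_wg = (1 - d_fake) * w_d * z              # ~0.261

w_d += lr * grad_wd                           # -> ~0.588
w_g += lr * grad_wg                           # -> ~0.326
print(w_d, w_g)
```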
Recall: In Algorithm 1, the discriminator is updated for \(k\) steps before the generator is updated once. Why not update both simultaneously? What goes wrong if \(k\) is too small?
Apply: Given \(D(x) = \sigma(2x - 3)\), compute \(D(x)\) for \(x = 2.0\) (real sample) and \(x = 0.5\) (fake sample). Then compute the discriminator’s objective \(\log D(2.0) + \log(1 - D(0.5))\).
Extend: The paper mentions that early in training, \(\log(1 - D(G(z)))\) saturates. Compute \(\log(1 - D(G(z)))\) and its derivative with respect to \(D(G(z))\) when \(D(G(z)) = 0.01\) (early training, discriminator easily rejects fakes). Then compute \(\log D(G(z))\) and its derivative at the same point. Which gradient is larger? By how much?
Why can’t a GAN compute the probability of a given input? Explain the difference between an implicit generative model (like a GAN) and an explicit one (like a Gaussian mixture model).
Why does the GAN objective use \(\log\) instead of, say, squared error? What would happen if the discriminator’s objective were \((D(x) - 1)^2 + (D(G(z)) - 0)^2\) instead of \(\log D(x) + \log(1 - D(G(z)))\)? (This is actually a valid variant – least-squares GAN.)
The convergence proof requires the discriminator to be optimal at every step. In practice, we only take \(k\) gradient steps on \(D\). Why doesn’t this break training entirely? What balancing act must \(k\) satisfy?
Explain mode collapse in your own words. Give a concrete example: if training on MNIST digits, what would mode collapse look like? Why doesn’t the minimax objective prevent it?
How does the GAN’s approach to generation differ from the VAE’s? What tradeoff does each approach make? (For context: the VAE, covered later in this collection, maximizes a variational lower bound on \(\log p(x)\).)
Implement a GAN from scratch using numpy to generate samples from a 1D Gaussian distribution.
import numpy as np
np.random.seed(42)
# ── Hyperparameters ──────────────────────────────────────────────
NOISE_DIM = 1
HIDDEN_DIM = 16
BATCH_SIZE = 128
LR_D = 0.01
LR_G = 0.01
TRAIN_STEPS = 5000
REAL_MEAN = 4.0
REAL_STD = 1.25
# ── Activation functions ─────────────────────────────────────────
def sigmoid(x):
"""Numerically stable sigmoid."""
return np.where(x >= 0,
1 / (1 + np.exp(-x)),
np.exp(x) / (1 + np.exp(x)))
def tanh(x):
return np.tanh(x)
def dtanh(x):
"""Derivative of tanh, given tanh output."""
return 1 - x ** 2
# ── Weight initialization ───────────────────────────────────────
def init_weights(fan_in, fan_out):
"""Xavier initialization."""
scale = np.sqrt(2.0 / (fan_in + fan_out))
return np.random.randn(fan_in, fan_out) * scale
# ── Discriminator ────────────────────────────────────────────────
# Architecture: input(1) -> hidden(HIDDEN_DIM) -> output(1)
d_W1 = init_weights(1, HIDDEN_DIM)
d_b1 = np.zeros((1, HIDDEN_DIM))
d_W2 = init_weights(HIDDEN_DIM, 1)
d_b2 = np.zeros((1, 1))
def discriminator_forward(x):
"""
Forward pass through discriminator.
x: shape (batch, 1)
Returns: (output, cache) where output is shape (batch, 1) in (0, 1)
"""
# TODO: Implement forward pass
# Layer 1: linear + tanh
# Layer 2: linear + sigmoid
# Return output and cache needed for backward pass
pass
def discriminator_backward(d_output, cache):
"""
Backward pass through discriminator.
d_output: gradient of loss w.r.t. discriminator output, shape (batch, 1)
cache: saved values from forward pass
Returns: gradients (d_W1, d_b1, d_W2, d_b2, d_input)
"""
# TODO: Implement backward pass using chain rule
pass
# ── Generator ────────────────────────────────────────────────────
# Architecture: input(NOISE_DIM) -> hidden(HIDDEN_DIM) -> output(1)
g_W1 = init_weights(NOISE_DIM, HIDDEN_DIM)
g_b1 = np.zeros((1, HIDDEN_DIM))
g_W2 = init_weights(HIDDEN_DIM, 1)
g_b2 = np.zeros((1, 1))
def generator_forward(z):
"""
Forward pass through generator.
z: shape (batch, NOISE_DIM)
Returns: (output, cache) where output is shape (batch, 1)
"""
# TODO: Implement forward pass
# Layer 1: linear + tanh
# Layer 2: linear (no activation -- output can be any real number)
pass
def generator_backward(d_output, cache):
"""
Backward pass through generator.
d_output: gradient of loss w.r.t. generator output, shape (batch, 1)
cache: saved values from forward pass
Returns: gradients (g_W1, g_b1, g_W2, g_b2)
"""
# TODO: Implement backward pass using chain rule
pass
# ── Training loop ────────────────────────────────────────────────
for step in range(TRAIN_STEPS):
# ── Sample real data and noise ───────────────────────────
real_data = np.random.normal(REAL_MEAN, REAL_STD, (BATCH_SIZE, 1))
noise = np.random.randn(BATCH_SIZE, NOISE_DIM)
# ── Train Discriminator ──────────────────────────────────
# 1. Forward pass: compute D(real) and D(G(z))
# 2. Compute gradients of: log D(real) + log(1 - D(G(z)))
# 3. Update discriminator weights (gradient ASCENT -- we maximize)
# TODO: Implement discriminator training step
# ── Train Generator ──────────────────────────────────────
# 1. Sample fresh noise
# 2. Forward pass: z -> G(z) -> D(G(z))
# 3. Compute gradients of: log D(G(z)) (the practical trick)
# 4. Update generator weights (gradient ASCENT -- we maximize log D(G(z)))
# TODO: Implement generator training step
# ── Logging ──────────────────────────────────────────────
if step % 500 == 0:
fake_samples = generator_forward(np.random.randn(1000, NOISE_DIM))[0]
fake_mean = fake_samples.mean()
fake_std = fake_samples.std()
print(f"Step {step:5d} | "
f"D(real)={0:.3f} D(fake)={0:.3f} | " # TODO: fill in actual values
f"G mean={fake_mean:.3f} std={fake_std:.3f} | "
f"Target mean={REAL_MEAN:.3f} std={REAL_STD:.3f}")
# ── Final evaluation ─────────────────────────────────────────────
final_noise = np.random.randn(10000, NOISE_DIM)
final_samples = generator_forward(final_noise)[0]
print(f"\nFinal generator distribution:")
print(f" Mean: {final_samples.mean():.3f} (target: {REAL_MEAN})")
print(f" Std: {final_samples.std():.3f} (target: {REAL_STD})")
print(f" Min: {final_samples.min():.3f}")
print(f"  Max:  {final_samples.max():.3f}")

After 5000 training steps, the generator should approximately match the target distribution:
Step 0 | D(real)=0.500 D(fake)=0.500 | G mean=-0.012 std=0.347 | Target mean=4.000 std=1.250
Step 500 | D(real)=0.721 D(fake)=0.334 | G mean=1.832 std=0.654 | Target mean=4.000 std=1.250
Step 1000 | D(real)=0.643 D(fake)=0.398 | G mean=2.891 std=0.923 | Target mean=4.000 std=1.250
Step 1500 | D(real)=0.589 D(fake)=0.432 | G mean=3.412 std=1.087 | Target mean=4.000 std=1.250
Step 2000 | D(real)=0.554 D(fake)=0.461 | G mean=3.721 std=1.156 | Target mean=4.000 std=1.250
Step 2500 | D(real)=0.531 D(fake)=0.478 | G mean=3.876 std=1.201 | Target mean=4.000 std=1.250
Step 3000 | D(real)=0.519 D(fake)=0.488 | G mean=3.944 std=1.228 | Target mean=4.000 std=1.250
Step 3500 | D(real)=0.511 D(fake)=0.494 | G mean=3.971 std=1.239 | Target mean=4.000 std=1.250
Step 4000 | D(real)=0.506 D(fake)=0.498 | G mean=3.989 std=1.244 | Target mean=4.000 std=1.250
Step 4500 | D(real)=0.503 D(fake)=0.499 | G mean=3.996 std=1.248 | Target mean=4.000 std=1.250
Final generator distribution:
Mean: 3.998 (target: 4.0)
Std: 1.249 (target: 1.25)
Min: -0.312
Max: 8.274
Key things to verify:

- \(D(\text{real})\) and \(D(\text{fake})\) both converge toward 0.5 (the discriminator can’t tell the difference)
- Generator mean converges toward 4.0 and std converges toward 1.25
- The convergence is gradual – the generator slowly shifts its output distribution
Note: exact numbers will vary due to random initialization, but the overall trend should match.