course

Imagine a restaurant kitchen with one master chef who knows every recipe. Now imagine that for each new customer who orders a different dish, you must clone the entire chef – their years of training, all their knowledge – just to change how they plate the food. That is what full fine-tuning does: it copies every single parameter of a large model to change the behavior on each new task.

Explanation

In transfer learning, you take a model pre-trained on a large dataset and adapt it to a new, usually smaller, task. By 2019, the standard approach was full fine-tuning: copy all the weights of a model like BERT, then update every weight on the new task’s training data. BERT-Large has 330 million parameters. If you fine-tune it for 9 different tasks (the GLUE benchmark – a collection of nine natural language understanding tasks used to evaluate general-purpose language representations), you need 9 separate copies: \(9 \times 330\text{M} = 2.97\) billion parameters in total.

Formally, consider a pre-trained network \(\phi_w(x)\) with parameters \(w\). Fine-tuning updates \(w\) for each task, producing task-specific parameter sets \(w_1, w_2, \ldots, w_N\). The total storage is:

where \(N\) is the number of tasks and \(|w|\) is the number of parameters in the pre-trained model.

There is an alternative: feature-based transfer. Instead of changing the pre-trained weights, you freeze them and train a new, small function \(\chi_v\) on top:

where only \(v\) is trained and \(w\) stays frozen. This is parameter efficient – you only store \(|w| + N \times |v|\) total parameters – but it performs worse because the pre-trained features cannot adapt to the task at all.

The paper frames the problem crisply: feature-based transfer is cheap but weak, fine-tuning is strong but expensive. The field needs a method that gets fine-tuning-level performance at feature-extraction-level cost.

There is a third practical problem beyond storage: catastrophic forgetting. When you fine-tune a model on Task A and then fine-tune the same copy on Task B, the model loses its ability to do Task A. Each task’s fine-tuning destructively overwrites the shared weights. You cannot incrementally add new tasks – each one starts from scratch.

Worked Example

Suppose you run a cloud service with BERT-Base (110M parameters) and need to serve 17 text classification tasks. Calculate the total parameter budget under each strategy:

Feature-based transfer: freeze BERT-Base and train a linear classifier per task. A classifier for a task with \(C = 4\) classes on BERT-Base’s 768-dimensional output has \(768 \times 4 + 4 = 3{,}076\) parameters.

Adapter tuning (what this paper proposes): freeze BERT-Base, add small adapter modules per task. With adapters using 1.14% of parameters per task:

The jump from \(17\times\) (fine-tuning) to \(1.19\times\) (adapters) while losing less than 0.5% accuracy is the paper’s central result.

Exercises

Recall: What are the two practical problems with full fine-tuning when serving many tasks? Name both.

Apply: You have a BERT-Large model (330M parameters) and need to serve 50 tasks. How many total parameters does full fine-tuning require? How about adapter tuning at 3.6% parameters per task? Express both as multiples of the base model size.

Extend: Suppose a company serves 1,000 customer-specific models. At what point does the difference between fine-tuning and adapter tuning start to matter for a budget – when models are 100M parameters? 1B? 100B? What factors beyond parameter count matter (think about GPU memory, loading time, and serving infrastructure)?

Lesson 2: The Bottleneck – Compress, Transform, Expand

A bottleneck in engineering is a narrow passage that restricts flow. In a factory, you might funnel thousands of parts through a single inspection point, then fan them back out to their respective assembly lines. The bottleneck forces everything through a compact representation. In neural networks, a bottleneck layer compresses a high-dimensional representation into a much smaller one, applies a transformation, then expands it back. This compression forces the network to learn a compact encoding of what matters.

Explanation

The adapter module is a two-layer neural network with a bottleneck shape. It takes an input vector \(h\) of dimension \(d\) (the hidden size of the Transformer – 768 for BERT-Base), squeezes it down to a much smaller dimension \(m\), and then expands it back to \(d\).

where \(h \in \mathbb{R}^d\) is the input, \(W_{\text{down}} \in \mathbb{R}^{d \times m}\) is the down-projection weight matrix, and \(z \in \mathbb{R}^m\) is the compressed representation.

where \(f\) is a nonlinearity. The paper uses ReLU (Rectified Linear Unit), which simply sets all negative values to zero: \(f(x) = \max(0, x)\). Without this nonlinearity, the two projections would collapse into a single linear transformation and the adapter could not learn nonlinear task-specific features.

where \(W_{\text{up}} \in \mathbb{R}^{m \times d}\) is the up-projection weight matrix and \(\hat{h} \in \mathbb{R}^d\) is the output.

where \(2md\) accounts for the two weight matrices (\(d \times m\) down-projection and \(m \times d\) up-projection), \(m\) is the bias for the down-projection, and \(d\) is the bias for the up-projection.

The bottleneck dimension \(m\) controls the trade-off between capacity and efficiency. The paper tests values from 2 to 256. Setting \(m \ll d\) ensures the adapter is small relative to the main model.

Worked Example

Let \(d = 6\) (hidden dimension) and \(m = 2\) (bottleneck dimension). Our input vector is:

Step 1: Down-projection. Multiply by \(W_{\text{down}}\) (a \(6 \times 2\) matrix) and add bias \(b_{\text{down}}\):

\[W_{\text{down}} = \begin{bmatrix} 0.2 & -0.1 \\ 0.3 & 0.4 \\ -0.1 & 0.2 \\ 0.5 & -0.3 \\ 0.1 & 0.6 \\ -0.2 & 0.1 \end{bmatrix}, \quad b_{\text{down}} = [0.0, 0.0]\]

Step 3: Up-projection. Multiply by \(W_{\text{up}}\) (a \(2 \times 6\) matrix) and add bias \(b_{\text{up}}\):

Since \(a = [0.0, 0.0]\), the result is simply the bias: \(\hat{h} = b_{\text{up}}\).

This illustrates an important point: when the adapter weights are small (or produce negative activations), the output of the adapter is near zero. This is exactly what happens at initialization.

Parameter count: \(2 \times 6 \times 2 + 6 + 2 = 24 + 8 = 32\) parameters for this toy adapter. For BERT-Base with \(d = 768\) and \(m = 64\): \(2 \times 768 \times 64 + 768 + 64 = 98{,}304 + 832 = 99{,}136\) parameters.

Exercises

Recall: Why is a nonlinear activation function (like ReLU) necessary between the down-projection and up-projection? What would happen without it?

Apply: Calculate the parameter count for a single adapter module with \(d = 1024\) (BERT-Large) and \(m = 8\). Then calculate it for \(m = 256\). What is the ratio between them?

Extend: The bottleneck forces the adapter to learn a compressed representation in \(m\) dimensions. If your task is very simple (binary sentiment classification), would you expect a small or large \(m\) to suffice? What about a complex task with 157 classes? The paper’s results show adapter sizes of 8 to 256 across tasks – what does this range tell you about task complexity?

Lesson 3: Skip Connections and Near-Identity Initialization

Think of a highway bypass. When a new stretch of road is under construction, traffic can still flow on the original highway. The new section carries no traffic initially – it starts “closed.” As construction finishes, traffic gradually shifts to use the new route. This is exactly what a skip connection does for an adapter: it lets the original signal pass through unmodified while the new adapter parameters start from near-zero, initially contributing nothing and gradually learning task-specific adjustments.

Explanation

where \(h \in \mathbb{R}^d\) is the input hidden representation, \(W_{\text{down}} \in \mathbb{R}^{d \times m}\) is the down-projection, \(W_{\text{up}} \in \mathbb{R}^{m \times d}\) is the up-projection, and \(f\) is the ReLU activation. The critical piece is the leading “\(h +\)” – the skip connection (also called a residual connection). The adapter’s output is added to its input, rather than replacing it.

This skip connection enables near-identity initialization. If the adapter weights \(W_{\text{down}}\) and \(W_{\text{up}}\) are initialized near zero, then \(f(h W_{\text{down}}) W_{\text{up}} \approx 0\), and the output is approximately:

The adapter acts as the identity function – it passes the input through unchanged. This means that when training begins, the modified network \(\psi_{w,v}(x)\) behaves almost identically to the original pre-trained network \(\phi_w(x)\):

The paper initializes adapter weights by drawing from a zero-mean Gaussian with standard deviation \(10^{-2}\), truncated to two standard deviations. Their analysis shows that performance is stable for initialization scales below \(10^{-2}\), but degrades when the scale is too large – confirming that near-identity initialization matters.

Worked Example

\[W_{\text{down}} = \begin{bmatrix} 0.005 & -0.008 \\ 0.012 & 0.003 \\ -0.007 & 0.011 \\ 0.009 & -0.006 \end{bmatrix}\]

\[W_{\text{up}} = \begin{bmatrix} 0.003 & -0.01 & 0.007 & 0.002 \\ -0.006 & 0.004 & 0.009 & -0.005 \end{bmatrix}\]

Up-project: \(\hat{h} = [0.0193 \times 0.003, \; 0.0193 \times (-0.01), \; 0.0193 \times 0.007, \; 0.0193 \times 0.002]\) \(= [0.0001, -0.0002, 0.0001, 0.00004]\)

After skip connection: \(h_{\text{out}} = h + \hat{h} = [1.0001, 0.4998, -0.7999, 0.30004]\)

The output is almost identical to the input. The adapter changes the representation by less than 0.03%.

If the weights were 100 times larger, the adapter output \(\hat{h}\) would be roughly 100 times larger too. Instead of a 0.03% perturbation, you would get a 3% perturbation – enough to destabilize a pre-trained model that expects its internal representations to stay within learned ranges.

Exercises

Recall: What two properties of the adapter design enable the near-identity initialization? Name both.

Apply: An adapter with \(d = 768\) and \(m = 64\) has weights initialized from \(\mathcal{N}(0, 0.01^2)\). Estimate the magnitude of a typical adapter output element before the skip connection. (Hint: each output element is the dot product of a 64-dimensional vector of ReLU-activated values with a 64-dimensional column of \(W_{\text{up}}\). Both have entries on the order of \(0.01\).)

Extend: The paper found that initialization scales above \(10^{-2}\) hurt performance, especially on smaller datasets like CoLA. Why might smaller datasets be more sensitive to initialization scale? Think about the relationship between the amount of training data and the model’s ability to recover from a bad starting point.

Lesson 4: Placing Adapters Inside the Transformer

You know how adapters work in isolation. Now the question is: where do you insert them in the Transformer? A Transformer layer has two main sub-layers (attention and feedforward), each followed by a residual connection and layer normalization. The placement of adapters relative to these components determines what representations the adapters can modify and how training gradients flow.

Explanation

Recall the standard Transformer layer structure (see Attention Is All You Need). Each layer has two sub-layers:

The adapter sits between the sub-layer’s output projection and the Transformer’s own residual connection. This placement means the adapter can modify what gets added to the residual stream without interfering with the residual connection itself. The adapter has its own internal skip connection (from Lesson 3), and the Transformer has its external skip connection. These are separate mechanisms – the adapter’s skip connection enables near-identity initialization, while the Transformer’s skip connection preserves information flow across layers.

During training, the adapter parameters and layer normalization parameters are updated. Everything else – all attention weights, feedforward weights, and embedding weights – stays frozen. The layer normalization parameters are re-trained because they are small (\(2d\) parameters per layer, compared to millions in the attention and feedforward weights) and adapting them helps the model adjust its internal statistics.

The paper’s ablation study (systematically removing or varying one component at a time to measure its contribution) reveals an asymmetry: adapters in the lower layers (layers 0-4 in BERT-Base’s 12 layers) contribute much less than adapters in the higher layers. Removing adapters from layers 0-4 barely affects accuracy on MNLI, while removing all adapters drops performance to the majority-class baseline. This aligns with a general finding in deep networks: lower layers learn generic features (syntax, word relationships) shared across tasks, while higher layers learn task-specific features (sentiment polarity, entailment relationships).

Worked Example

Consider a simplified 2-layer Transformer with \(d = 4\) and adapter bottleneck \(m = 2\). Count the trainable parameters for a single task.

Whole model (2 layers): \(2 \times 60 = 120\) adapter and layer norm parameters

Frozen for 2 layers: \({\approx}384\) parameters. Trainable: 120 parameters. The ratio is \(120 / (384 + 120) \approx 24\%\). In our tiny toy example this seems high, but the ratio shrinks dramatically at real scale because the frozen attention and feedforward weights grow as \(O(d^2)\) while the adapter parameters grow as \(O(md)\) with \(m \ll d\).

Exercises

Recall: At what two points within each Transformer layer does the paper insert adapter modules? What comes after the adapter in the processing pipeline?

Apply: BERT-Large has 24 layers and \(d = 1024\). Calculate the total number of adapter parameters when using bottleneck size \(m = 256\). Express as a percentage of BERT-Large’s 330M parameters.

Extend: The ablation study shows lower-layer adapters contribute less. The paper inserts adapters at every layer anyway. Can you think of a modified design that exploits this finding? What are the trade-offs of skipping adapters in lower layers versus keeping them?

Lesson 5: Adapter Tuning – The Full System

We have all the components: the bottleneck architecture (Lesson 2), the near-identity initialization with skip connections (Lesson 3), and the placement strategy within the Transformer (Lesson 4). Now we assemble the full adapter tuning system and examine how it performs against the alternatives.

Explanation

The adapter-based transfer approach can be summarized by a single equation. Given a pre-trained network \(\phi_w\) with parameters \(w\), we define a new network:

where \(v\) represents all the adapter parameters and layer normalization parameters. During training, \(w\) is frozen and only \(v\) is updated using standard supervised learning on the downstream task.

\[|v| \ll |w| \implies \text{total for } N \text{ tasks} \approx |w| + N \cdot |v| \approx |w|\]

When \(|v|\) is small relative to \(|w|\), the total model size for \(N\) tasks stays close to \(|w|\) instead of growing as \(N \times |w|\).

Training procedure. The paper follows the standard BERT fine-tuning recipe: optimize with Adam (an adaptive learning rate optimizer that maintains per-parameter momentum and scaling), linearly warm up the learning rate for the first 10% of steps, then linearly decay to zero. The only adapter-specific hyperparameter is the bottleneck dimension \(m\), chosen from \(\{8, 64, 256\}\) per task. Due to training instability, the paper runs each configuration with 5 random seeds and selects the best on the validation set.

Method	Trained params/task	Total params (9 tasks)	Average Score
Full fine-tuning	100%	9.0x	80.4
Adapters (best \(m\) per task)	3.6%	1.3x	80.0
Adapters (fixed \(m = 64\))	2.1%	1.2x	79.6

Adapters reach within 0.4 points of full fine-tuning while requiring 7 times fewer total parameters. The optimal adapter size varies by task: MNLI (a large dataset with 393K examples) uses \(m = 256\), while RTE (a small dataset with 2.5K examples) uses \(m = 8\). Larger tasks benefit from more adapter capacity.

On 17 additional classification tasks using BERT-Base, adapters at 1.14% parameters per task achieve 73.3% average accuracy versus 73.7% for full fine-tuning – a gap of just 0.4% while reducing total model size from \(17\times\) to \(1.19\times\).

On SQuAD v1.1 (extractive question answering), adapters of size 64 (2% of parameters) score 90.4% F1 (the harmonic mean of precision and recall, balancing how many correct answers the model finds with how many of its answers are correct), compared to 90.7% for full fine-tuning. Even adapters of size 2 (0.1% of parameters) score 89.9% F1.

What the adapters learn. The ablation study removes adapters from continuous spans of layers and measures accuracy. Removing adapters from layers 0-4 barely affects MNLI accuracy. Removing all adapters drops performance to the majority-class baseline (37% on MNLI). No single adapter is critical – removing any one layer’s adapters drops accuracy by at most 2% – but their collective effect is large. The adapters have learned to distribute task-specific knowledge across many layers, with higher layers carrying more of the load.

Worked Example

Walk through the full adapter tuning pipeline for a single task: binary sentiment classification on a tiny dataset.

Step 1: Initialize. Insert 4 adapter modules (2 per layer). Each adapter has \(2(2)(4) + 4 + 2 = 22\) parameters. Total adapter parameters: \(4 \times 22 = 88\). Initialize all adapter weights from \(\mathcal{N}(0, 0.01^2)\). The pre-trained Transformer weights are frozen.

Step 2: Forward pass (before training). The input flows through the Transformer. At each adapter, the near-zero weights produce a near-zero output. The skip connection passes the input through unchanged. The model produces the same output as the original pre-trained model – no task-specific behavior yet.

Step 3: Compute loss. The classification head (a linear layer mapping from \(d = 4\) to 2 classes) produces logits (raw unnormalized scores before conversion to probabilities). Suppose the model outputs \([0.2, -0.1]\) for classes [positive, negative]. The true label is “positive” (class 0). The cross-entropy loss (a loss function that measures how far the model’s predicted probability distribution is from the true label, reaching zero when the model assigns probability 1.0 to the correct class) is:

\[L = -\log\left(\frac{e^{0.2}}{e^{0.2} + e^{-0.1}}\right) = -\log\left(\frac{1.221}{1.221 + 0.905}\right) = -\log(0.574) = 0.555\]

Step 4: Backward pass. Gradients flow through the classification head, through the adapters, and through the Transformer – but only the adapter weights, layer normalization parameters, and classification head weights are updated. The pre-trained Transformer weights receive gradients but those gradients are discarded (weights frozen).

Step 5: After training. After many steps of gradient updates, the adapter weights have grown from near-zero to task-specific values. The adapters in higher layers have learned to emphasize features relevant to sentiment. The model now outputs high confidence for the correct sentiment class. The pre-trained weights remain exactly as they were.

Step 6: Add another task. To serve a second task (topic classification), create a new set of 88 adapter parameters. The 2-layer Transformer is shared. Total parameter overhead for the second task: 88 parameters (not a full copy of the Transformer).

Exercises

Recall: What three categories of parameters are updated during adapter tuning? What stays frozen?

Apply: You deploy BERT-Base (110M parameters) with adapters (\(m = 64\), roughly 2.4M adapter parameters per task) to serve 100 tasks. Calculate the total parameters for: (a) full fine-tuning, (b) adapter tuning. What is the storage savings in gigabytes, assuming 4 bytes per parameter (float32)?

Extend: The paper reports that on SQuAD, adapters of size 2 (0.1% of parameters) still achieve 89.9% F1 versus 90.7% for full fine-tuning. Why might extractive question answering need so few task-specific parameters compared to tasks like CoLA (grammatical acceptability)? Think about what the pre-trained model already knows versus what each task demands.

Comprehension Questions

Hands-On Project

Goal

Build a minimal adapter module and demonstrate that it starts as a near-identity function, can be trained to modify a frozen network’s behavior, and scales efficiently across multiple tasks.

Specification

Implement a bottleneck adapter layer in numpy. Insert it into a frozen two-layer feedforward network (standing in for a Transformer). Train adapter parameters on two separate toy classification tasks, showing that:

Starter Code

import numpy as np

np.random.seed(42)

# --- Frozen pre-trained network (simulates a 2-layer Transformer) ---
# Two feedforward layers with ReLU, pretrained weights are FIXED
d = 8        # hidden dimension
d_ff = 16    # feedforward intermediate dimension

# "Pre-trained" weights (frozen, do not modify these during training)
W1 = np.random.randn(d, d_ff) * 0.3
b1 = np.zeros(d_ff)
W2 = np.random.randn(d_ff, d) * 0.3
b2 = np.zeros(d)

def frozen_layer(x, W, b):
    """One frozen feedforward layer: linear + ReLU."""
    return np.maximum(0, x @ W + b)

def frozen_network(x):
    """Two-layer frozen network."""
    h = frozen_layer(x, W1, b1)
    h = h @ W2 + b2
    return h

# --- TODO: Implement adapter module ---
class Adapter:
    def __init__(self, d, m, init_scale=1e-2):
        """
        Bottleneck adapter with skip connection.
        Args:
            d: input/output dimension
            m: bottleneck dimension
            init_scale: standard deviation for weight initialization
        """
        # TODO: Initialize W_down (d x m), b_down (m,),
        #       W_up (m x d), b_up (d,)
        #       Use np.random.randn(...) * init_scale for weights
        #       Use np.zeros(...) for biases
        pass

    def forward(self, h):
        """
        Forward pass: h + ReLU(h @ W_down + b_down) @ W_up + b_up
        Returns:
            output: adapter output (same shape as h)
            cache: tuple of intermediate values needed for backward pass
        """
        # TODO: Implement the bottleneck with skip connection
        # 1. Down-project: z = h @ W_down + b_down
        # 2. ReLU: a = max(0, z)
        # 3. Up-project: delta = a @ W_up + b_up
        # 4. Skip connection: output = h + delta
        # 5. Return output and cache = (h, z, a) for backprop
        pass

    def backward(self, grad_output, cache):
        """
        Backward pass: compute gradients for W_down, b_down, W_up, b_up.
        Args:
            grad_output: gradient of loss w.r.t. adapter output, shape (batch, d)
            cache: (h, z, a) from forward pass
        Returns:
            None (stores gradients as self.dW_down, self.db_down, etc.)
        """
        h, z, a = cache
        batch_size = h.shape[0]

        # TODO: Implement backpropagation through the adapter
        # grad_output flows through both the skip connection and the bottleneck
        #
        # Hints:
        # - The skip connection means grad_output passes through directly to the
        #   input, AND also flows through the bottleneck path
        # - For the bottleneck path:
        #   1. Gradient through W_up: self.dW_up = a.T @ grad_output / batch_size
        #   2. self.db_up = grad_output.mean(axis=0)
        #   3. Gradient through ReLU: multiply by (z > 0) mask
        #   4. Gradient through W_down: self.dW_down = h.T @ grad_relu / batch_size
        #   5. self.db_down = grad_relu.mean(axis=0)
        pass

    def update(self, lr):
        """Update adapter parameters using stored gradients."""
        # TODO: Subtract lr * gradient from each parameter
        pass


# --- TODO: Implement adapted network ---
def adapted_network(x, adapter_1, adapter_2):
    """
    Frozen network with adapters inserted after each layer.
    Layer 1 -> Adapter 1 -> Layer 2 -> Adapter 2
    """
    # TODO: Pass x through frozen_layer(x, W1, b1),
    #       then adapter_1, then (h @ W2 + b2), then adapter_2
    # Return final output and list of caches
    pass


# --- Toy classification data ---
# Task A: separate points based on whether sum of features > 0
# Task B: separate points based on whether first feature > 0
n_samples = 200
X = np.random.randn(n_samples, d)
y_A = (X.sum(axis=1) > 0).astype(float)
y_B = (X[:, 0] > 0).astype(float)

# --- Classification head (trainable per task) ---
class ClassificationHead:
    def __init__(self, d):
        self.W = np.random.randn(d, 1) * 0.01
        self.b = np.zeros(1)

    def forward(self, h):
        logits = h @ self.W + self.b
        probs = 1 / (1 + np.exp(-logits))  # sigmoid
        return probs.squeeze()

    def backward(self, probs, y, h):
        grad = (probs - y).reshape(-1, 1)   # (batch, 1)
        self.dW = h.T @ grad / len(y)
        self.db = grad.mean(axis=0)
        return grad @ self.W.T              # gradient w.r.t. h

    def update(self, lr):
        self.W -= lr * self.dW
        self.b -= lr * self.db


def binary_cross_entropy(probs, y):
    eps = 1e-7
    return -np.mean(y * np.log(probs + eps) + (1 - y) * np.log(1 - probs + eps))


# --- TODO: Training loop ---
def train_task(X, y, adapter_1, adapter_2, head, epochs=200, lr=0.05):
    """
    Train adapters and classification head on a single task.
    The frozen network weights (W1, b1, W2, b2) must NOT be modified.
    """
    for epoch in range(epochs):
        # TODO:
        # 1. Forward: adapted_network(X, adapter_1, adapter_2)
        # 2. Forward: head.forward(output)
        # 3. Compute loss: binary_cross_entropy(probs, y)
        # 4. Backward: head.backward(probs, y, output)
        # 5. Backward through adapter_2, then through frozen layer 2 (gradient only),
        #    then adapter_1
        # 6. Update adapters and head
        # 7. Print loss every 50 epochs
        pass


# --- Main: demonstrate adapter tuning ---
if __name__ == "__main__":
    m = 4  # bottleneck dimension

    # Verify near-identity initialization
    print("=== Near-Identity Check ===")
    adapter_1 = Adapter(d, m)
    adapter_2 = Adapter(d, m)
    test_x = np.random.randn(5, d)
    original_out = frozen_network(test_x)
    adapted_out, _ = adapted_network(test_x, adapter_1, adapter_2)
    diff = np.abs(original_out - adapted_out).max()
    print(f"Max difference at init: {diff:.6f}")
    print(f"Near-identity: {'PASS' if diff < 0.01 else 'FAIL'}")

    # Save frozen weights to verify they don't change
    W1_copy = W1.copy()
    W2_copy = W2.copy()

    # Train Task A
    print("\n=== Training Task A (sum > 0) ===")
    adapter_A1 = Adapter(d, m)
    adapter_A2 = Adapter(d, m)
    head_A = ClassificationHead(d)
    train_task(X, y_A, adapter_A1, adapter_A2, head_A)

    # Verify frozen weights unchanged
    print(f"\nFrozen weights changed: {not (np.array_equal(W1, W1_copy) and np.array_equal(W2, W2_copy))}")

    # Evaluate Task A
    out_A, _ = adapted_network(X, adapter_A1, adapter_A2)
    preds_A = head_A.forward(out_A) > 0.5
    acc_A = (preds_A == y_A).mean()
    print(f"Task A accuracy: {acc_A:.1%}")

    # Train Task B (separate adapters, same frozen network)
    print("\n=== Training Task B (x[0] > 0) ===")
    adapter_B1 = Adapter(d, m)
    adapter_B2 = Adapter(d, m)
    head_B = ClassificationHead(d)
    train_task(X, y_B, adapter_B1, adapter_B2, head_B)

    out_B, _ = adapted_network(X, adapter_B1, adapter_B2)
    preds_B = head_B.forward(out_B) > 0.5
    acc_B = (preds_B == y_B).mean()
    print(f"Task B accuracy: {acc_B:.1%}")

    # Verify Task A still works with its own adapters
    out_A_verify, _ = adapted_network(X, adapter_A1, adapter_A2)
    preds_A_verify = head_A.forward(out_A_verify) > 0.5
    acc_A_verify = (preds_A_verify == y_A).mean()
    print(f"\nTask A accuracy after Task B training: {acc_A_verify:.1%}")
    print(f"No catastrophic forgetting: {'PASS' if acc_A_verify == acc_A else 'FAIL'}")

    # Parameter count comparison
    frozen_params = W1.size + b1.size + W2.size + b2.size
    adapter_params = 4 * (2 * d * m + d + m)  # 4 adapters total per task
    head_params = d + 1
    print(f"\n=== Parameter Efficiency ===")
    print(f"Frozen network params: {frozen_params}")
    print(f"Adapter params per task: {adapter_params}")
    print(f"Head params per task: {head_params}")
    print(f"Total for 2 tasks: {frozen_params + 2 * (adapter_params + head_params)}")
    print(f"Full fine-tuning for 2 tasks: {2 * (frozen_params + head_params)}")

Expected Output

Note: at this toy scale (\(d = 8\), \(m = 4\)), adapters actually use more task-specific parameters than the frozen network. The efficiency advantage only appears at real scale where \(d\) is 768+ and adapters use \(m \ll d\). The project demonstrates the mechanism (near-identity init, frozen backbone, no forgetting), not the scale advantage.

Parameter-Efficient Transfer Learning for NLP: Course

Learning Objectives

Prerequisites

Lesson 1: The Cost of Full Fine-Tuning

Explanation

Worked Example

Exercises

Lesson 2: The Bottleneck – Compress, Transform, Expand

Explanation

Worked Example

Exercises

Lesson 3: Skip Connections and Near-Identity Initialization

Explanation

Worked Example

Exercises

Lesson 4: Placing Adapters Inside the Transformer

Explanation

Worked Example

Exercises

Lesson 5: Adapter Tuning – The Full System

Explanation

Worked Example

Exercises

Comprehension Questions

Hands-On Project

Goal

Specification

Starter Code

Expected Output

Further Reading

Looking Ahead