Authors: Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski et al. Year: 2019. Source: arXiv:1902.00751
Instead of creating a full copy of a large language model for every new task, this paper shows you can insert tiny “adapter” modules into the existing model and train only those, achieving nearly the same accuracy while updating less than 4% of the parameters.
By 2019, the dominant approach to solving a new text task – sentiment analysis, question answering, textual similarity – was to take a large pre-trained model like BERT (see BERT: Pre-training of Deep Bidirectional Transformers) and fine-tune all of its parameters on the new task. BERT-Large has 330 million parameters. If you need to serve ten different tasks, you need ten separate copies of those 330 million parameters, because each task’s fine-tuning changes the weights differently. That is 3.3 billion parameters total.
This creates two practical problems. First, storage and serving cost: cloud services that handle many customer tasks would need to store and load entirely separate models for each one. Second, extensibility: when a new task arrives, you start from scratch with another full copy. There is no way to add a new capability to the existing model without risking damage to what it already knows – a problem called catastrophic forgetting (where training on new data causes the network to lose its ability to perform previously learned tasks).
The two established transfer learning approaches each had drawbacks. Feature-based transfer (extracting fixed representations from a pre-trained model and feeding them into a new, small task-specific model) was parameter efficient but performed worse. Fine-tuning (updating all the pre-trained weights) performed better but required a complete copy of the model per task. The field needed a method that achieved fine-tuning-level performance with feature-extraction-level efficiency.
Think of a large pre-trained model like a Swiss Army knife that already has all the core tools built in – blade, screwdriver, scissors. Fine-tuning is like melting down the entire knife and recasting it into a slightly different shape for every job. The adapter approach instead snaps on a small, specialized attachment – like a bottle opener clip – that redirects how the existing tools are used without modifying the tools themselves. Each new task gets its own tiny clip, while the knife stays unchanged and shared.
Technically, the paper introduces bottleneck adapter modules: small two-layer neural networks inserted at specific points inside each layer of a Transformer. The key design has two properties. First, the adapters are small – they compress the representation down to a low dimension and then expand it back, so they contain very few parameters relative to the main model. Second, the adapters are initialized to approximate the identity function (an operation that outputs exactly what it receives as input), meaning the model behaves identically to the original pre-trained model before any task-specific training begins. Only the adapter parameters (and the layer normalization parameters) are trained; the original pre-trained weights stay frozen.
This approach achieves a clean separation: the frozen base model provides general language understanding shared across all tasks, while each set of tiny adapters encodes what is unique about a particular task. Because the base model never changes, you cannot forget previous tasks, and the total model size grows very slowly as you add more tasks.
Figure 2: Left: The adapter module is inserted twice per Transformer layer – after the multi-head attention projection and after the feedforward sub-layer. Right: Each adapter is a bottleneck that projects down to a small dimension \(m\), applies a nonlinearity, and projects back up, with a skip-connection. Only the adapter parameters, layer normalization, and the final classification layer (shown in green) are trained; everything else stays frozen.
The method builds on the standard Transformer architecture (see Attention Is All You Need). Each Transformer layer contains two sub-layers: a multi-head self-attention block and a position-wise feedforward block. Each sub-layer has a residual connection (also called a skip-connection, where the input is added to the output) followed by layer normalization (a technique that stabilizes training by normalizing activations).
The adapter module is inserted twice per Transformer layer: once after the multi-head attention projection, and once after the feedforward sub-layer. In both cases, the adapter sits after the sub-layer’s output projection but before the residual addition and layer normalization. The placement is deliberate: by inserting adapters at these two points, the method can modify the representations flowing through both the attention pathway and the feedforward pathway.
Each adapter module itself is a bottleneck. It takes the \(d\)-dimensional output of the sub-layer (for BERT-Base, \(d = 768\)), projects it down to a much smaller dimension \(m\) using a learned weight matrix (the “down-projection”), applies a nonlinear activation function (the paper uses a ReLU, which simply sets all negative values to zero), then projects back up to \(d\) dimensions with another weight matrix (the “up-projection”). The adapter also has its own internal skip-connection: the original \(d\)-dimensional input is added to the adapter’s output. This internal skip-connection is what enables the near-identity initialization – if the projection weights start near zero, the adapter output is approximately equal to its input, and the whole module acts like it is not there.
For a concrete example: suppose BERT-Base has \(d = 768\) and we choose an adapter bottleneck size of \(m = 64\). The down-projection is a \(768 \times 64\) matrix (49,152 weights), plus 64 bias terms. The up-projection is a \(64 \times 768\) matrix (49,152 weights), plus 768 bias terms. That is 98,304 weights plus 832 biases = 99,136 parameters per adapter. With two adapters per Transformer layer and 12 layers in BERT-Base, that is \(12 \times 2 \times 99{,}136 = 2{,}379{,}264\) total adapter parameters – about 2.1% of BERT-Base’s 110 million parameters.
During training, only three things are updated: the adapter module parameters, the layer normalization parameters (which are re-trained per task since they contain only \(2d\) parameters per layer), and the final classification head. The entire pre-trained BERT model remains frozen. The weights in the adapter modules are initialized by drawing from a zero-mean Gaussian distribution with standard deviation \(10^{-2}\), truncated to two standard deviations. This small initialization ensures the adapters start near the identity function.
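The initialization scheme described above can be sketched in a few lines of numpy. The `truncated_normal` helper is a hypothetical illustration, not the paper's code: it realizes the truncation by resampling any draw that falls beyond two standard deviations.

```python
import numpy as np

def truncated_normal(shape, std=1e-2, trunc=2.0, rng=None):
    """Draw from a zero-mean Gaussian with the given std, resampling any
    value beyond `trunc` standard deviations (one way to truncate)."""
    rng = np.random.default_rng(0) if rng is None else rng
    w = rng.normal(0.0, std, size=shape)
    bad = np.abs(w) > trunc * std
    while bad.any():
        w[bad] = rng.normal(0.0, std, size=int(bad.sum()))
        bad = np.abs(w) > trunc * std
    return w

d, m = 768, 64                      # BERT-Base hidden size, bottleneck size
W_down = truncated_normal((d, m))   # down-projection weights
W_up = truncated_normal((m, d))     # up-projection weights
# Every weight lies within 2 standard deviations of zero, so the adapter's
# residual branch contributes almost nothing before task training begins.
```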
This paper is notable for its simplicity – there are no complex loss functions or training objectives beyond standard supervised fine-tuning. The mathematical content focuses on formalizing the three transfer learning strategies and the adapter’s parameter count.
1. Feature-based transfer
\[\chi_v(\phi_w(x))\]
where \(\phi_w\) is the pre-trained network with frozen parameters \(w\), \(\chi_v\) is a new task-specific function with parameters \(v\), and \(x\) is the input. Only \(v\) is trained. In plain language: you run the input through the frozen pre-trained model to extract features, then feed those features into a separately trained task-specific model. This is parameter efficient (you only train \(v\)) but limits how much the system can adapt, since the features produced by \(\phi_w\) cannot be modified for the task.
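As a toy illustration of this composition, here is a minimal numpy sketch. The shapes, `W_frozen`, and the tanh feature map are illustrative stand-ins, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# phi_w: frozen pre-trained feature extractor (parameters w, never updated).
W_frozen = rng.normal(size=(16, 8))     # stand-in for pre-trained weights w

def phi_w(x):
    return np.tanh(x @ W_frozen)        # frozen feature extraction

# chi_v: small task-specific head whose parameters v are the only thing trained.
V = np.zeros((8, 2))                    # task head parameters v (trainable)

def chi_v(features):
    return features @ V                 # task-specific scores

x = rng.normal(size=(4, 16))            # a batch of 4 inputs
scores = chi_v(phi_w(x))                # the composition chi_v(phi_w(x))
# Gradient steps would update only V; W_frozen stays fixed throughout.
```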
2. Adapter-based transfer
\[\psi_{w,v}(x) \quad \text{where} \quad \psi_{w,v_0}(x) \approx \phi_w(x)\]
where \(\psi_{w,v}\) is a modified network that incorporates both the original parameters \(w\) (frozen, copied from pre-training) and new adapter parameters \(v\). The initial adapter parameters \(v_0\) are set so the modified network behaves approximately like the original. During training, only \(v\) is updated. This matters because it means training can start from a stable, known-good state rather than from a random modification of the pre-trained model.
3. Parameter efficiency constraint
\[|v| \ll |w| \implies \text{total parameters for } N \text{ tasks} \approx |w| + N \cdot |v| \approx |w|\]
where \(|v|\) is the number of adapter parameters and \(|w|\) is the number of pre-trained parameters. If \(|v|\) is much smaller than \(|w|\), the total model size for \(N\) tasks stays close to \(|w|\) instead of growing as \(N \times |w|\) (which is what full fine-tuning requires). For example, with adapters using 3.6% task-specific parameters, 9 GLUE (General Language Understanding Evaluation, a suite of 9 natural language understanding benchmarks) tasks require \(1.3 \times |w|\) total, compared to \(9 \times |w|\) for fine-tuning.
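The scaling arithmetic is easy to check directly. The numbers below follow the text (BERT-Large as the shared base, 3.6% task-specific parameters per task, 9 GLUE tasks):

```python
w = 330_000_000            # |w|: frozen BERT-Large parameters, stored once
v = 11_880_000             # |v|: 3.6% of |w|, adapter parameters per task
N = 9                      # number of GLUE tasks

full_fine_tuning = N * w   # a complete model copy per task
with_adapters = w + N * v  # one shared copy plus N small adapter sets

print(full_fine_tuning / w)  # → 9.0
print(with_adapters / w)     # → 1.324, matching the ~1.3x figure in the text
```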
4. Adapter parameter count per layer
\[P_{\text{adapter}} = 2md + d + m\]
where \(d\) is the model’s hidden dimension (768 for BERT-Base, 1024 for BERT-Large), \(m\) is the bottleneck dimension (a hyperparameter, typically 8 to 256), \(2md\) accounts for the two projection matrices (down-projection: \(d \times m\), up-projection: \(m \times d\)), and \(d + m\) accounts for the bias terms. For a worked example: with \(d = 768\) and \(m = 64\), one adapter has \(2 \times 64 \times 768 + 768 + 64 = 98{,}304 + 832 = 99{,}136\) parameters.
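The formula can be verified against the worked example in a few lines:

```python
def adapter_params(d, m):
    """Parameters in one bottleneck adapter: two projection matrices of
    d*m weights each, plus bias vectors of size m (down) and d (up)."""
    return 2 * m * d + d + m

# Worked example from the text: BERT-Base hidden size, bottleneck of 64.
print(adapter_params(768, 64))          # → 99136

# Two adapters per Transformer layer, 12 layers in BERT-Base:
total = 12 * 2 * adapter_params(768, 64)
print(total)                            # → 2379264, about 2.1% of 110M
```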
5. Adapter computation (bottleneck transformation)
\[h \leftarrow h + f(h W_{\text{down}}) W_{\text{up}}\]
where \(h \in \mathbb{R}^d\) is the input hidden representation, \(W_{\text{down}} \in \mathbb{R}^{d \times m}\) is the down-projection matrix, \(W_{\text{up}} \in \mathbb{R}^{m \times d}\) is the up-projection matrix, and \(f\) is a nonlinear activation function (ReLU in the paper’s experiments). The addition of \(h\) on the left side is the skip-connection. This equation describes the full forward pass of a single adapter: compress, apply nonlinearity, expand, add residual. When \(W_{\text{down}}\) and \(W_{\text{up}}\) are initialized near zero, the output is approximately \(h + 0 = h\), giving the near-identity behavior.
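The full forward pass and its near-identity behavior at initialization can be demonstrated with a short numpy sketch. A plain small-variance Gaussian stands in for the paper's truncated initialization; shapes follow BERT-Base:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 768, 64                                # hidden size, bottleneck size

# Near-zero initialization (stand-in for the paper's truncated Gaussian).
W_down = rng.normal(0.0, 1e-2, size=(d, m))   # down-projection
W_up = rng.normal(0.0, 1e-2, size=(m, d))     # up-projection

def adapter(h):
    """h <- h + ReLU(h @ W_down) @ W_up: compress, nonlinearity, expand,
    then add the internal skip-connection."""
    return h + np.maximum(h @ W_down, 0.0) @ W_up

h = rng.normal(size=(d,))                     # a sub-layer output
out = adapter(h)
# With near-zero projections the residual branch is tiny, so the adapter
# starts out approximately equal to the identity function.
print(np.abs(out - h).max())                  # small relative to |h| entries
```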
The paper evaluates adapter tuning against full fine-tuning across three benchmarks: GLUE (9 tasks), 17 additional text classification tasks, and SQuAD v1.1 (Stanford Question Answering Dataset) question answering.
GLUE Benchmark (using BERT-Large, 330M parameters):
| Method | Trained params/task | Total params (9 tasks) | Average GLUE Score |
|---|---|---|---|
| Full fine-tuning | 100% | 9.0x | 80.4 |
| Adapters (best size per task: 8-256) | 3.6% | 1.3x | 80.0 |
| Adapters (fixed size 64) | 2.1% | 1.2x | 79.6 |
The adapter approach reaches within 0.4 points of full fine-tuning while requiring 7x fewer total parameters. Even fixing a single adapter size across all tasks (instead of tuning it per task) only costs an additional 0.4 points.
17 Additional Classification Tasks (using BERT-Base, 110M parameters):
| Method | Trained params/task | Total params (17 tasks) | Average Accuracy |
|---|---|---|---|
| Full fine-tuning | 100% | 17x | 73.7 |
| Variable fine-tuning (top n layers) | 52.9% | 9.9x | 74.0 |
| Adapters | 1.14% | 1.19x | 73.3 |
Adapters fall only 0.4 points behind full fine-tuning, but the total model size for all 17 tasks is just 1.19x the base model – compared to 17x for full fine-tuning. Variable fine-tuning (freezing lower layers) offers a middle ground but still requires nearly 10x the parameters.
SQuAD v1.1 (question answering, BERT-Base): Adapters of size 64 (2% of parameters) achieve an F1 score (the harmonic mean of precision and recall) of 90.4%, compared to 90.7% for full fine-tuning. Even extremely small adapters (size 2, just 0.1% of parameters) achieve 89.9% F1.
The ablation analysis reveals that adapters in the lower Transformer layers contribute less to performance than those in higher layers. Removing adapters from layers 0-4 barely affects MNLI (Multi-Genre Natural Language Inference) accuracy, while removing all adapters drops performance to majority-class baseline (37% on MNLI). This aligns with the intuition that lower layers learn general features shared across tasks, while higher layers learn task-specific features.
This paper established adapter modules as a practical paradigm for parameter-efficient fine-tuning (PEFT) of large pre-trained models. Before this work, the transfer learning community largely viewed fine-tuning and feature extraction as the only two options. Adapters opened a third path – one that modifies the internal computation of a frozen model through small, trainable inserts – and this idea spawned an entire research direction.
The paper’s bottleneck architecture became the prototype for a family of methods. LoRA (Low-Rank Adaptation, Hu et al., 2021) refined the idea by reparameterizing the weight updates as low-rank matrices applied in parallel rather than as serial bottleneck layers, eliminating inference latency overhead. Prefix tuning and prompt tuning offered even more minimal approaches by modifying only the input representations. The broader PEFT ecosystem (including libraries like Hugging Face’s `peft` package) traces its lineage directly to this paper’s core insight: you can adapt a large model effectively by training a small fraction of carefully placed parameters.
The practical impact has been enormous. As language models grew from hundreds of millions to hundreds of billions of parameters, full fine-tuning became impractical not just for storage but for compute. Parameter-efficient methods became essential infrastructure for deploying large models across many tasks in production. The paper’s framing around cloud services and sequential task arrival proved prescient – this is exactly the setting where modern LLM-based services operate. Every major LLM serving platform now supports some form of adapter or PEFT-based customization, and the adapter pattern has expanded beyond NLP into vision, multimodal, and speech models.
To fully understand this paper, you need: