LoRA: Low-Rank Adaptation of Large Language Models

Authors: Edward Hu, Yelong Shen, Phillip Wallis et al. Year: 2021 Source: arXiv 2106.09685

One-Sentence Summary

LoRA makes it possible to customize a massive language model for a new task by training only a tiny pair of matrices that get added to the original frozen weights, reducing trainable parameters by up to 10,000 times with no loss in quality and no extra inference cost.

Problem Statement

By 2021, the standard recipe in natural language processing was: take a large pre-trained language model (a neural network already trained on huge amounts of text to predict the next word), then fine-tune it (continue training on your specific task data so it specializes). This fine-tuning updates every single parameter in the model. For GPT-3, that means adjusting all 175 billion numbers that define the model’s behavior.

The cost problem is severe. Each fine-tuned copy of GPT-3 produces a 350 GB checkpoint (the file storing all those parameters). If you want 100 customized versions – one for customer support, one for legal documents, one for medical Q&A, and so on – you need to store 100 separate 350 GB files, totaling 35 terabytes. Switching between tasks means loading an entirely new 350 GB model into GPU memory, which takes significant time.

Several approaches had already attempted to fix this. Adapter methods (see Parameter-Efficient Transfer Learning for NLP) insert small trainable bottleneck layers between the existing layers of the model, reducing trainable parameters but adding extra computation at inference time. Prefix tuning prepends learnable tokens to the input, but those tokens eat into the limited sequence length and are difficult to optimize. Neither approach could match the quality of full fine-tuning consistently. The field needed a method that was parameter-efficient, added zero inference latency, and matched or beat full fine-tuning quality.

Key Innovation

Think of a pre-trained language model as a massive reference textbook with 175 billion precisely positioned words. Full fine-tuning is like reprinting the entire textbook with edits on every page for each new course you teach – expensive and wasteful. LoRA’s insight is that you can instead attach a thin, lightweight overlay to each page that shifts the meaning just enough for the new course. When you need a different course, you peel off one overlay and attach another. The overlays are so thin (35 MB vs 350 GB) that you can store thousands of them alongside a single copy of the textbook.

The technical realization of this analogy is rooted in a hypothesis about intrinsic dimensionality. Prior work had shown that large pre-trained models live on a low-dimensional manifold – they have far fewer “real” degrees of freedom than their parameter count suggests. LoRA extends this observation: the change in weights needed to adapt a model to a new task also has low intrinsic rank. Instead of allowing the weight update to be an arbitrary large matrix, LoRA forces it to be the product of two small matrices, which by construction produces a low-rank result.

The key engineering advantage is that these two small matrices can be merged back into the original weight matrix at deployment time. The adapted model \(W = W_0 + BA\) is just a regular weight matrix of the same shape, so inference runs at exactly the same speed as the original model – no extra layers, no extra computation, no architectural changes visible at serving time. This stands in sharp contrast to adapter layers, which remain separate modules that the data must flow through sequentially.

Architecture / Method


Figure 1: The LoRA reparameterization. The pre-trained weights \(W \in \mathbb{R}^{d \times d}\) are frozen (no gradient updates). A low-rank update \(\Delta W = BA\) is learned through two small matrices: \(A \in \mathbb{R}^{r \times d}\) (initialized from a Gaussian) and \(B \in \mathbb{R}^{d \times r}\) (initialized to zero). At deployment, \(W + BA\) is merged into a single matrix, adding zero inference latency.

LoRA modifies the forward pass of specific weight matrices inside a Transformer model (see Attention Is All You Need for the Transformer architecture). The Transformer’s self-attention mechanism contains four weight matrices (\(W_q\), \(W_k\), \(W_v\), \(W_o\)) that project the input into queries, keys, values, and an output space. Each of these is a matrix of size \(d_{\text{model}} \times d_{\text{model}}\) – for GPT-3, that is \(12{,}288 \times 12{,}288\), or about 150 million numbers per matrix.

For each weight matrix the authors choose to adapt (they find that adapting \(W_q\) and \(W_v\) works best), LoRA adds a parallel low-rank branch. The original weight matrix \(W_0\) is frozen – it never receives gradient updates during training. Alongside it, two new small matrices are introduced: \(B \in \mathbb{R}^{d \times r}\) and \(A \in \mathbb{R}^{r \times k}\), where \(r\) is a small rank (often 1, 2, 4, or 8) and \(d\) and \(k\) are the dimensions of the original weight matrix. Their product \(BA\) has the same shape as \(W_0\) but is constrained to rank \(r\).

The initialization is deliberate: \(A\) is initialized from a random Gaussian distribution (so the low-rank subspace starts in a random direction), and \(B\) is initialized to all zeros. This means \(\Delta W = BA = 0\) at the start of training, so the model begins behaving exactly like the pre-trained original. As training proceeds, \(A\) and \(B\) are updated by gradients, and \(\Delta W\) gradually drifts away from zero to encode task-specific knowledge.
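The frozen base, the deliberate initialization, and the parallel low-rank branch can be sketched in a few lines of NumPy (a toy illustration, not the paper's implementation; the dimensions and the Gaussian scale are arbitrary choices here):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 4                       # toy sizes; GPT-3 uses d = k = 12288

W0 = rng.normal(size=(d, k))              # frozen pre-trained weight (no gradients)
A = rng.normal(scale=0.01, size=(r, k))   # Gaussian init: random low-rank subspace
B = np.zeros((d, r))                      # zero init, so Delta W = BA = 0 at the start

x = rng.normal(size=(k,))
h_base = W0 @ x                           # original forward pass
h_lora = W0 @ x + B @ (A @ x)             # LoRA adds the parallel low-rank path

# Because B = 0, the adapted model initially behaves exactly like the original.
assert np.allclose(h_base, h_lora)
```

During training, only `A` and `B` would receive gradient updates, gradually pulling `h_lora` away from `h_base`.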

To make the method robust to different choices of rank, the authors scale the low-rank update by \(\frac{\alpha}{r}\), where \(\alpha\) is a constant typically set equal to the first rank tried. This ensures that if you later experiment with a different rank, you do not need to re-tune the learning rate – the scaling automatically compensates.

At deployment time, the adapted weight is computed once as \(W = W_0 + BA\) and stored. From that point forward, inference uses \(W\) directly, with no knowledge that it was constructed from a frozen base plus a low-rank update. To switch tasks, you subtract the current \(BA\), add the new \(B'A'\), and you are serving a different specialization – an operation that takes milliseconds.
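The merge-and-swap workflow can be sketched as follows (a minimal NumPy illustration; the two "trained" adapter pairs are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 32, 2
W0 = rng.normal(size=(d, d))              # shared frozen base

# Two task-specific adapter pairs (random stand-ins for trained matrices)
B1, A1 = rng.normal(size=(d, r)), rng.normal(size=(r, d))
B2, A2 = rng.normal(size=(d, r)), rng.normal(size=(r, d))

x = rng.normal(size=(d,))

# Deploy task 1: merge once, then serve with a single matmul per layer
W = W0 + B1 @ A1
assert np.allclose(W @ x, W0 @ x + B1 @ (A1 @ x))   # merged == unmerged path

# Hot-swap to task 2: subtract the old update, add the new one
W = W - B1 @ A1 + B2 @ A2
assert np.allclose(W, W0 + B2 @ A2)
```

The swap touches only the small `B @ A` products, never the full checkpoint, which is why switching specializations is so cheap.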

Mathematical Foundations

1. Full Fine-Tuning Objective

\[\max_{\Phi} \sum_{(x,y) \in \mathcal{Z}} \sum_{t=1}^{|y|} \log\left(P_{\Phi}(y_t \mid x, y_{<t})\right)\]

This is the standard language modeling objective: find the parameters \(\Phi\) that maximize the probability of producing the correct output tokens one at a time. In full fine-tuning, \(\Phi\) has the same dimension as the pre-trained weights \(\Phi_0\) – you update everything. For GPT-3, this means storing and optimizing over 175 billion parameters, which requires over 1.2 TB of GPU memory when using the Adam optimizer (which stores first- and second-moment estimates – two extra values per parameter – on top of the weights and gradients).

2. LoRA Re-Parameterized Objective

\[\max_{\Theta} \sum_{(x,y) \in \mathcal{Z}} \sum_{t=1}^{|y|} \log\left(P_{\Phi_0 + \Delta\Phi(\Theta)}(y_t \mid x, y_{<t})\right)\]

This is the same objective, but now optimization runs over \(\Theta\) instead of \(\Phi\). The frozen pre-trained weights \(\Phi_0\) provide the base behavior; \(\Delta\Phi(\Theta)\) encodes only what needs to change for the new task. For GPT-3 with rank \(r = 1\) applied to query and value matrices, \(|\Theta|\) drops from 175 billion to about 4.7 million – a roughly 37,000x reduction.

3. LoRA Forward Pass

\[h = W_0 x + \Delta W x = W_0 x + BAx\]

This is the core equation of LoRA. The original forward pass computes \(h = W_0 x\). LoRA adds a parallel path: multiply \(x\) by \(A\) (projecting from dimension \(k\) down to \(r\)), then multiply by \(B\) (projecting back up from \(r\) to \(d\)). The result is added to the original output. Because \(B\) starts at zero, the initial behavior is identical to the pre-trained model. At inference time, you compute \(W = W_0 + BA\) once and discard \(A\) and \(B\) – the model runs at exactly the same speed.
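One practical consequence of this compute order: during training the \(d \times k\) product \(BA\) never needs to be materialized – applying \(A\) first and then \(B\) gives the same result while only ever touching the small factors. A quick NumPy check (toy sizes, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
d = k = 1024
r = 4
B, A = rng.normal(size=(d, r)), rng.normal(size=(r, k))
x = rng.normal(size=(k,))

# Same result, very different cost:
# (B @ A) @ x materializes a d x k matrix (~1M entries here), while
# B @ (A @ x) only ever touches the d*r + r*k = 8,192 small-factor entries.
assert np.allclose((B @ A) @ x, B @ (A @ x))
```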

To build concrete intuition, consider a weight matrix in GPT-3 with \(d = k = 12{,}288\) and \(r = 4\). The original \(W_0\) has \(12{,}288 \times 12{,}288 = 150{,}994{,}944\) parameters. The LoRA matrices have \(12{,}288 \times 4 + 4 \times 12{,}288 = 98{,}304\) parameters – just 0.065% of the original. Yet this tiny update is enough to specialize the model for a new task.

4. Scaling Factor

\[h = W_0 x + \frac{\alpha}{r} BAx\]

The scaling factor \(\frac{\alpha}{r}\) normalizes the magnitude of the low-rank update relative to the rank. Without it, doubling \(r\) would roughly double the magnitude of \(\Delta W x\), requiring you to cut the learning rate in half. With this scaling, the effective learning rate stays approximately constant across different rank choices, making hyperparameter tuning easier. In practice, the authors set \(\alpha\) equal to the first \(r\) they try and leave it fixed.

5. Number of Trainable Parameters

\[|\Theta| = 2 \times \hat{L}_{\text{LoRA}} \times d_{\text{model}} \times r\]

For GPT-3 with \(r = 4\), adapting \(W_q\) and \(W_v\) across all 96 layers gives \(\hat{L}_{\text{LoRA}} = 192\) adapted matrices, so \(|\Theta| = 2 \times 192 \times 12{,}288 \times 4 = 18{,}874{,}368 \approx 18.9\text{M}\). This is 0.011% of the original 175 billion parameters. Even at \(r = 1\), the model performs competitively.
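The count is easy to verify numerically (a small sketch; `lora_params` is a hypothetical helper written for this note, not from the paper):

```python
# Trainable-parameter count for LoRA on GPT-3: d_model = 12288, 96 layers,
# adapting W_q and W_v gives 192 adapted weight matrices in total.
def lora_params(n_matrices, d_model, r):
    # Each adapted matrix gets A (r x d_model) and B (d_model x r):
    # 2 * d_model * r trainable parameters per matrix.
    return 2 * n_matrices * d_model * r

assert lora_params(192, 12288, 4) == 18_874_368   # ~18.9M at r = 4
assert lora_params(192, 12288, 1) == 4_718_592    # ~4.7M at r = 1
```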

Results

LoRA was evaluated on four model families spanning different scales: RoBERTa (125M and 355M parameters), DeBERTa XXL (1.5B), GPT-2 (345M and 774M), and GPT-3 (175B).

On the GLUE natural language understanding benchmark with RoBERTa base, LoRA achieved 87.2% average accuracy with only 0.3M trainable parameters, outperforming full fine-tuning (86.4% with 125M parameters) and adapter-based methods at the same parameter budget. The DeBERTa XXL results were even more striking:

| Method | Trainable Params | MNLI | SST-2 | MRPC | CoLA | RTE | STS-B | Avg. |
|---|---|---|---|---|---|---|---|---|
| DeBERTa XXL (FT) | 1,500M | 91.8 | 97.2 | 92.0 | 72.0 | 93.9 | 92.9 | 91.1 |
| DeBERTa XXL (LoRA) | 4.7M | 91.9 | 96.9 | 92.6 | 72.4 | 94.9 | 93.0 | 91.3 |

LoRA matched or exceeded the fully fine-tuned DeBERTa XXL on every task while training only 0.3% of the parameters.

On GPT-3 175B, LoRA outperformed all other parameter-efficient methods and matched or exceeded full fine-tuning:

| Method | Trainable Params | WikiSQL | MNLI-m | SAMSum (R1/R2/RL) |
|---|---|---|---|---|
| GPT-3 (FT) | 175,255.8M | 73.8 | 89.5 | 52.0/28.0/44.5 |
| GPT-3 (AdapterH) | 7.1M | 71.9 | 89.8 | 53.0/28.9/44.8 |
| GPT-3 (LoRA) | 4.7M | 73.4 | 91.7 | 53.8/29.8/45.9 |

LoRA with 4.7M parameters (0.003% of the model) beat full fine-tuning on MNLI and SAMSum, and came within 0.4% on WikiSQL. The inference latency was identical to the base model, while adapter methods added 5-30% overhead depending on batch size and sequence length.

A particularly surprising finding came from the analysis of optimal rank: a rank as small as \(r = 1\) was sufficient for strong performance when adapting both \(W_q\) and \(W_v\). The authors showed through singular value decomposition that the top singular direction was shared between models trained with \(r = 8\) and \(r = 64\), while other directions contained mostly noise – strong evidence that the weight updates during adaptation truly are rank-deficient.

Limitations

Impact and Legacy

LoRA became one of the most widely adopted techniques in machine learning within two years of publication. It fundamentally changed how practitioners think about model customization: rather than training and storing separate copies of a large model, they train and swap lightweight “adapters” (LoRA modules) on top of a shared frozen base. Services like Hugging Face’s PEFT library, which implements LoRA as its primary method, enabled this workflow to reach millions of developers.

The impact extended well beyond NLP. LoRA was adopted for fine-tuning image generation models (Stable Diffusion LoRA modules became a community standard), speech models, and multi-modal systems. The technique spawned numerous variants: QLoRA combined LoRA with 4-bit quantization to fine-tune 65B-parameter models on a single 48GB GPU; DoRA decomposed weights into magnitude and direction components before applying low-rank updates; LoRA+ proposed different learning rates for \(A\) and \(B\); and rank-adaptive methods like AdaLoRA dynamically allocated rank budget across layers.

Perhaps most importantly, LoRA democratized large model customization. Before LoRA, fine-tuning a model like GPT-3 required enterprise-grade GPU clusters. After LoRA, individuals could fine-tune models with billions of parameters on consumer hardware. This shift – from a world where only large organizations could customize frontier models to one where anyone with a single GPU could do it – is arguably the paper’s most lasting contribution to the field.

Prerequisites

To understand this paper, the reader should be familiar with:

Connections