LoRA: Low-Rank Adaptation of Large Language Models

Authors: Edward Hu, Yelong Shen, Phillip Wallis et al. Year: 2021 Source: arXiv 2106.09685

One-Sentence Summary

LoRA makes it possible to customize a massive language model for a new task by training only a tiny pair of matrices that get added to the original frozen weights, reducing trainable parameters by up to 10,000 times with no loss in quality and no extra inference cost.

Problem Statement

By 2021, the standard recipe in natural language processing was: take a large pre-trained language model (a neural network already trained on huge amounts of text to predict the next word), then fine-tune it (continue training on your specific task data so it specializes). This fine-tuning updates every single parameter in the model. For GPT-3, that means adjusting all 175 billion numbers that define the model’s behavior.

The cost problem is severe. Each fine-tuned copy of GPT-3 produces a 350 GB checkpoint (the file storing all those parameters). If you want 100 customized versions – one for customer support, one for legal documents, one for medical Q&A, and so on – you need to store 100 separate 350 GB files, totaling 35 terabytes. Switching between tasks means loading an entirely new 350 GB model into GPU memory, which takes significant time.

Several approaches had already attempted to fix this. Adapter methods (see Parameter-Efficient Transfer Learning for NLP) insert small trainable bottleneck layers between the existing layers of the model, reducing trainable parameters but adding extra computation at inference time. Prefix tuning prepends learnable tokens to the input, but those tokens eat into the limited sequence length and are difficult to optimize. Neither approach could match the quality of full fine-tuning consistently. The field needed a method that was parameter-efficient, added zero inference latency, and matched or beat full fine-tuning quality.

Key Innovation

Think of a pre-trained language model as a massive reference textbook with 175 billion precisely positioned words. Full fine-tuning is like reprinting the entire textbook with edits on every page for each new course you teach – expensive and wasteful. LoRA’s insight is that you can instead attach a thin, lightweight overlay to each page that shifts the meaning just enough for the new course. When you need a different course, you peel off one overlay and attach another. The overlays are so thin (35 MB vs 350 GB) that you can store thousands of them alongside a single copy of the textbook.

The technical realization of this analogy is rooted in a hypothesis about intrinsic dimensionality. Prior work had shown that large pre-trained models live on a low-dimensional manifold – they have far fewer “real” degrees of freedom than their parameter count suggests. LoRA extends this observation: the change in weights needed to adapt a model to a new task also has low intrinsic rank. Instead of allowing the weight update to be an arbitrary large matrix, LoRA forces it to be the product of two small matrices, which by construction produces a low-rank result.

The key engineering advantage is that these two small matrices can be merged back into the original weight matrix at deployment time. The adapted model \(W = W_0 + BA\) is just a regular weight matrix of the same shape, so inference runs at exactly the same speed as the original model – no extra layers, no extra computation, no architectural changes visible at serving time. This stands in sharp contrast to adapter layers, which remain separate modules that the data must flow through sequentially.

Architecture / Method


Figure 1: The LoRA reparameterization. The pre-trained weights \(W \in \mathbb{R}^{d \times d}\) are frozen (no gradient updates). A low-rank update \(\Delta W = BA\) is learned through two small matrices: \(A \in \mathbb{R}^{r \times d}\) (initialized from a Gaussian) and \(B \in \mathbb{R}^{d \times r}\) (initialized to zero). At deployment, \(W + BA\) is merged into a single matrix, adding zero inference latency.

LoRA modifies the forward pass of specific weight matrices inside a Transformer model (see Attention Is All You Need for the Transformer architecture). The Transformer’s self-attention mechanism contains four weight matrices (\(W_q\), \(W_k\), \(W_v\), \(W_o\)) that project the input into queries, keys, values, and an output space. Each of these is a matrix of size \(d_{\text{model}} \times d_{\text{model}}\) – for GPT-3, that is \(12{,}288 \times 12{,}288\), or about 150 million numbers per matrix.

For each weight matrix the authors choose to adapt (they find that adapting \(W_q\) and \(W_v\) works best), LoRA adds a parallel low-rank branch. The original weight matrix \(W_0\) is frozen – it never receives gradient updates during training. Alongside it, two new small matrices are introduced: \(B \in \mathbb{R}^{d \times r}\) and \(A \in \mathbb{R}^{r \times k}\), where \(r\) is a small rank (often 1, 2, 4, or 8) and \(d\) and \(k\) are the dimensions of the original weight matrix. Their product \(BA\) has the same shape as \(W_0\) but is constrained to rank \(r\).

The initialization is deliberate: \(A\) is initialized from a random Gaussian distribution (so the low-rank subspace starts in a random direction), and \(B\) is initialized to all zeros. This means \(\Delta W = BA = 0\) at the start of training, so the model begins behaving exactly like the pre-trained original. As training proceeds, \(A\) and \(B\) are updated by gradients, and \(\Delta W\) gradually drifts away from zero to encode task-specific knowledge.
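The frozen base, the deliberate initialization, and the parallel low-rank branch can be sketched in a few lines of NumPy (a toy illustration, not the paper's implementation; the dimensions and the Gaussian scale are arbitrary choices here):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 4                       # toy sizes; GPT-3 uses d = k = 12288

W0 = rng.normal(size=(d, k))              # frozen pre-trained weight (no gradients)
A = rng.normal(scale=0.01, size=(r, k))   # Gaussian init: random low-rank subspace
B = np.zeros((d, r))                      # zero init, so Delta W = BA = 0 at the start

x = rng.normal(size=(k,))
h_base = W0 @ x                           # original forward pass
h_lora = W0 @ x + B @ (A @ x)             # LoRA adds the parallel low-rank path

# Because B = 0, the adapted model initially behaves exactly like the original.
assert np.allclose(h_base, h_lora)
```

During training, only `A` and `B` would receive gradient updates, gradually pulling `h_lora` away from `h_base`.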

To make the method robust to different choices of rank, the authors scale the low-rank update by \(\frac{\alpha}{r}\), where \(\alpha\) is a constant typically set equal to the first rank tried. This ensures that if you later experiment with a different rank, you do not need to re-tune the learning rate – the scaling automatically compensates.

At deployment time, the adapted weight is computed once as \(W = W_0 + BA\) and stored. From that point forward, inference uses \(W\) directly, with no knowledge that it was constructed from a frozen base plus a low-rank update. To switch tasks, you subtract the current \(BA\), add the new \(B'A'\), and you are serving a different specialization – an operation that takes milliseconds.
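The merge-and-swap workflow can be sketched as follows (a minimal NumPy illustration; the two "trained" adapter pairs are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 32, 2
W0 = rng.normal(size=(d, d))              # shared frozen base

# Two task-specific adapter pairs (random stand-ins for trained matrices)
B1, A1 = rng.normal(size=(d, r)), rng.normal(size=(r, d))
B2, A2 = rng.normal(size=(d, r)), rng.normal(size=(r, d))

x = rng.normal(size=(d,))

# Deploy task 1: merge once, then serve with a single matmul per layer
W = W0 + B1 @ A1
assert np.allclose(W @ x, W0 @ x + B1 @ (A1 @ x))   # merged == unmerged path

# Hot-swap to task 2: subtract the old update, add the new one
W = W - B1 @ A1 + B2 @ A2
assert np.allclose(W, W0 + B2 @ A2)
```

The swap touches only the small `B @ A` products, never the full checkpoint, which is why switching specializations is so cheap.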

Mathematical Foundations

1. Full Fine-Tuning Objective

\[\max_{\Phi} \sum_{(x,y) \in \mathcal{Z}} \sum_{t=1}^{|y|} \log\left(P_{\Phi}(y_t \mid x, y_{<t})\right)\]

This is the standard language modeling objective: find the parameters \(\Phi\) that maximize the probability of producing the correct output tokens one at a time. In full fine-tuning, \(\Phi\) has the same dimension as the pre-trained weights \(\Phi_0\) – you update everything. For GPT-3, this means storing and optimizing over 175 billion parameters, which requires over 1.2 TB of GPU memory when using the Adam optimizer (which stores first- and second-moment estimates – two extra values per parameter – on top of the weights and gradients).

2. LoRA Re-Parameterized Objective

\[\max_{\Theta} \sum_{(x,y) \in \mathcal{Z}} \sum_{t=1}^{|y|} \log\left(P_{\Phi_0 + \Delta\Phi(\Theta)}(y_t \mid x, y_{<t})\right)\]

This is the same objective, but now optimization runs over \(\Theta\) instead of \(\Phi\). The frozen pre-trained weights \(\Phi_0\) provide the base behavior; \(\Delta\Phi(\Theta)\) encodes only what needs to change for the new task. For GPT-3 with rank \(r = 1\) applied to query and value matrices, \(|\Theta|\) drops from 175 billion to about 4.7 million – a roughly 37,000x reduction.

3. LoRA Forward Pass

\[h = W_0 x + \Delta W x = W_0 x + BAx\]

This is the core equation of LoRA. The original forward pass computes \(h = W_0 x\). LoRA adds a parallel path: multiply \(x\) by \(A\) (projecting from dimension \(k\) down to \(r\)), then multiply by \(B\) (projecting back up from \(r\) to \(d\)). The result is added to the original output. Because \(B\) starts at zero, the initial behavior is identical to the pre-trained model. At inference time, you compute \(W = W_0 + BA\) once and discard \(A\) and \(B\) – the model runs at exactly the same speed.
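One practical consequence of this compute order: during training the \(d \times k\) product \(BA\) never needs to be materialized – applying \(A\) first and then \(B\) gives the same result while only ever touching the small factors. A quick NumPy check (toy sizes, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
d = k = 1024
r = 4
B, A = rng.normal(size=(d, r)), rng.normal(size=(r, k))
x = rng.normal(size=(k,))

# Same result, very different cost:
# (B @ A) @ x materializes a d x k matrix (~1M entries here), while
# B @ (A @ x) only ever touches the d*r + r*k = 8,192 small-factor entries.
assert np.allclose((B @ A) @ x, B @ (A @ x))
```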

To build concrete intuition, consider a weight matrix in GPT-3 with \(d = k = 12{,}288\) and \(r = 4\). The original \(W_0\) has \(12{,}288 \times 12{,}288 = 150{,}994{,}944\) parameters. The LoRA matrices have \(12{,}288 \times 4 + 4 \times 12{,}288 = 98{,}304\) parameters – just 0.065% of the original. Yet this tiny update is enough to specialize the model for a new task.

4. Scaling Factor

\[h = W_0 x + \frac{\alpha}{r} BAx\]

The scaling factor \(\frac{\alpha}{r}\) normalizes the magnitude of the low-rank update relative to the rank. Without it, doubling \(r\) would roughly double the magnitude of \(\Delta W x\), requiring you to cut the learning rate in half. With this scaling, the effective learning rate stays approximately constant across different rank choices, making hyperparameter tuning easier. In practice, the authors set \(\alpha\) equal to the first \(r\) they try and leave it fixed.

5. Number of Trainable Parameters

\[|\Theta| = 2 \times \hat{L}_{\text{LoRA}} \times d_{\text{model}} \times r\]

For GPT-3 with \(r = 4\), adapting \(W_q\) and \(W_v\) across all 96 layers gives \(\hat{L}_{\text{LoRA}} = 192\) adapted matrices, so \(|\Theta| = 2 \times 192 \times 12{,}288 \times 4 = 18{,}874{,}368 \approx 18.9\text{M}\). This is 0.011% of the original 175 billion parameters. Even at \(r = 1\), the model performs competitively.
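The count is easy to verify numerically (a small sketch; `lora_params` is a hypothetical helper written for this note, not from the paper):

```python
# Trainable-parameter count for LoRA on GPT-3: d_model = 12288, 96 layers,
# adapting W_q and W_v gives 192 adapted weight matrices in total.
def lora_params(n_matrices, d_model, r):
    # Each adapted matrix gets A (r x d_model) and B (d_model x r):
    # 2 * d_model * r trainable parameters per matrix.
    return 2 * n_matrices * d_model * r

assert lora_params(192, 12288, 4) == 18_874_368   # ~18.9M at r = 4
assert lora_params(192, 12288, 1) == 4_718_592    # ~4.7M at r = 1
```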

Results

LoRA was evaluated on four model families spanning different scales: RoBERTa (125M and 355M parameters), DeBERTa XXL (1.5B), GPT-2 (345M and 774M), and GPT-3 (175B).

On the GLUE natural language understanding benchmark with RoBERTa base, LoRA achieved 87.2% average accuracy with only 0.3M trainable parameters, outperforming full fine-tuning (86.4% with 125M parameters) and adapter-based methods at the same parameter budget. The DeBERTa XXL results were even more striking:

| Method | Trainable Params | MNLI | SST-2 | MRPC | CoLA | RTE | STS-B | Avg. |
|---|---|---|---|---|---|---|---|---|
| DeBERTa XXL (FT) | 1,500M | 91.8 | 97.2 | 92.0 | 72.0 | 93.9 | 92.9 | 91.1 |
| DeBERTa XXL (LoRA) | 4.7M | 91.9 | 96.9 | 92.6 | 72.4 | 94.9 | 93.0 | 91.3 |

LoRA matched or exceeded the fully fine-tuned DeBERTa XXL on every task while training only 0.3% of the parameters.

On GPT-3 175B, LoRA outperformed all other parameter-efficient methods and matched or exceeded full fine-tuning:

| Method | Trainable Params | WikiSQL | MNLI-m | SAMSum (R1/R2/RL) |
|---|---|---|---|---|
| GPT-3 (FT) | 175,255.8M | 73.8 | 89.5 | 52.0/28.0/44.5 |
| GPT-3 (AdapterH) | 7.1M | 71.9 | 89.8 | 53.0/28.9/44.8 |
| GPT-3 (LoRA) | 4.7M | 73.4 | 91.7 | 53.8/29.8/45.9 |

LoRA with 4.7M parameters (0.003% of the model) beat full fine-tuning on MNLI and SAMSum, and came within 0.4% on WikiSQL. The inference latency was identical to the base model, while adapter methods added 5-30% overhead depending on batch size and sequence length.

A particularly surprising finding came from the analysis of optimal rank: a rank as small as \(r = 1\) was sufficient for strong performance when adapting both \(W_q\) and \(W_v\). The authors showed through singular value decomposition that the top singular direction was shared between models trained with \(r = 8\) and \(r = 64\), while other directions contained mostly noise – strong evidence that the weight updates during adaptation truly are rank-deficient.

Limitations

Impact and Legacy

LoRA became one of the most widely adopted techniques in machine learning within two years of publication. It fundamentally changed how practitioners think about model customization: rather than training and storing separate copies of a large model, they train and swap lightweight “adapters” (LoRA modules) on top of a shared frozen base. Services like Hugging Face’s PEFT library, which implements LoRA as its primary method, enabled this workflow to reach millions of developers.

The impact extended well beyond NLP. LoRA was adopted for fine-tuning image generation models (Stable Diffusion LoRA modules became a community standard), speech models, and multi-modal systems. The technique spawned numerous variants: QLoRA combined LoRA with 4-bit quantization to fine-tune 65B-parameter models on a single 48GB GPU; DoRA decomposed weights into magnitude and direction components before applying low-rank updates; LoRA+ proposed different learning rates for \(A\) and \(B\); and rank-adaptive methods like AdaLoRA dynamically allocated rank budget across layers.

Perhaps most importantly, LoRA democratized large model customization. Before LoRA, fine-tuning a model like GPT-3 required enterprise-grade GPU clusters. After LoRA, individuals could fine-tune models with billions of parameters on consumer hardware. This shift – from a world where only large organizations could customize frontier models to one where anyone with a single GPU could do it – is arguably the paper’s most lasting contribution to the field.

Prerequisites

To understand this paper, the reader should be familiar with:

Connections