Scaling Laws for Neural Language Models

Authors: Jared Kaplan, Sam McCandlish, Tom Henighan et al. Year: 2020 Source: 2001.08361

One-Sentence Summary

The performance of language models follows simple, predictable power-law relationships with model size, dataset size, and training compute, and the most efficient way to spend a larger compute budget is to train a bigger model rather than to train longer on more data.

Problem Statement

Before this paper, researchers building language models faced a practical question with no principled answer: given a fixed budget of compute (GPU hours, money, electricity), how should you allocate it? Should you train a small model for a long time on lots of data, or a large model for a short time on less data? Should you make the model deep or wide? How much data do you need?

These questions mattered enormously because training large neural networks is expensive. A single training run can cost millions of dollars in compute. Without guidance, teams were forced to rely on intuition, rules of thumb, and expensive trial-and-error. Earlier work had observed some relationships between model size and performance, but nobody had systematically mapped out how performance depends on all the key variables – model size, dataset size, compute, and architecture shape – simultaneously and across many orders of magnitude.

The situation was made worse by a common assumption: that you should train your model until it converges (meaning the loss stops decreasing). This paper overturns that assumption entirely.

Key Innovation

Think of baking a cake. You might assume the perfect cake requires getting the oven temperature, baking time, and ingredient quantities all exactly right – and that the relationship between these variables is complicated and unpredictable. This paper discovers that for language models, the relationship between the “ingredients” (model size, data, compute) and the “quality” (loss) is surprisingly simple: each follows a power law, which is just a straight line on a log-log plot. Even better, these power laws tell you the optimal recipe for any budget.

The core discovery is that language model performance (measured by cross-entropy loss, a number that goes down as the model gets better at predicting the next word) scales as a power law with three factors: the number of parameters \(N\) (excluding embeddings), the dataset size \(D\) in tokens, and the compute budget \(C\). These trends hold across more than seven orders of magnitude – from tiny models with a thousand parameters to large models with a billion – and the specific architectural details (how deep the network is, how wide, how many attention heads) barely matter, as long as the total parameter count stays the same.

The most surprising and practically important finding: when you have a fixed compute budget, the best strategy is to train a very large model and stop well before convergence. This contradicts the standard practice of training smaller models to convergence. In the optimal allocation, most of the compute increase goes to making the model bigger, with only a tiny increase in training steps. The paper summarizes this as “big models may be more important than big data.”

Architecture / Method

Power law relationships between test loss and compute, dataset size, and model parameters

Figure 1: Language modeling performance (test loss) improves as a smooth power law when scaling compute, dataset size, or model parameters independently. Each relationship follows \(L \propto x^{-\alpha}\) over more than six orders of magnitude. These trends are the paper’s central empirical finding.

This paper is not proposing a new architecture. Instead, it studies the Transformer architecture (see Attention Is All You Need) in its decoder-only form, the same kind used in GPT (see Improving Language Understanding by Generative Pre-Training). The models are trained to predict the next token (next-word prediction) on WebText2, a dataset of about 23 billion tokens scraped from Reddit-linked web pages.

The authors define model size \(N\) as the number of non-embedding parameters. For a standard Transformer with \(n_{\text{layer}}\) layers and model dimension \(d_{\text{model}}\), this is approximately:

\[N \approx 12 \, n_{\text{layer}} \, d_{\text{model}}^2\]

They exclude the embedding matrix (which maps vocabulary tokens to vectors) and positional embeddings from this count because including them obscures the scaling trends – a crucial methodological choice that makes the power laws much cleaner.

Training compute is estimated as \(C \approx 6NBS\) floating-point operations, where \(B\) is the batch size in tokens and \(S\) is the number of training steps. The factor of 6 accounts for both the forward pass (2 multiply-accumulate operations per parameter per token) and the backward pass (roughly twice the forward pass). Numerical values are quoted in PF-days (petaflop-days), where 1 PF-day equals \(8.64 \times 10^{19}\) floating-point operations.

The authors train hundreds of models spanning the range from 768 to 1.5 billion non-embedding parameters, on datasets from 22 million to 23 billion tokens. They vary depth, width, attention heads, and feed-forward dimensions independently while holding total parameter count fixed, demonstrating that performance depends on \(N\), not on how those parameters are distributed across the architecture. For instance, a model with shape \((n_{\text{layer}}, d_{\text{model}}) = (6, 4288)\) – very wide and shallow – reaches a loss within 3% of a \((48, 1600)\) model – much deeper and narrower – when both have the same total \(N\).

A key methodological tool is the concept of the critical batch size (denoted \(B_{\text{crit}}\)), borrowed from earlier work on gradient noise. The critical batch size is the point where increasing the batch size starts giving diminishing returns: below \(B_{\text{crit}}\), bigger batches use compute efficiently; above it, bigger batches waste compute but reduce wall-clock time. The authors find that \(B_{\text{crit}}\) depends only on the current loss value, not on the model size directly, and follows its own power law:

\[B_{\text{crit}}(L) = \frac{B_*}{L^{1/\alpha_B}}\]

where \(B_* \approx 2 \times 10^8\) tokens and \(\alpha_B \approx 0.21\).

Mathematical Foundations

Power Law for Model Size

\[L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}\]

where \(L\) is the cross-entropy loss in nats (natural units of information), \(N\) is the number of non-embedding parameters, \(N_c \sim 8.8 \times 10^{13}\) is a scale constant, and \(\alpha_N \sim 0.076\) is the power-law exponent.

In plain language: if you double the number of parameters, the loss decreases by a factor of \(2^{-0.076} \approx 0.95\), which means about a 5% reduction. The relationship is a straight line on a log-log plot. This holds across six orders of magnitude in \(N\) and applies when the model is trained to convergence on enough data that overfitting is not an issue.

This matters because it gives a precise, quantitative answer to “how much does making the model bigger help?” The answer is: predictably, but with diminishing returns (since \(\alpha_N\) is small).

For a worked example: consider a model with \(N = 10^7\) parameters (10 million). The loss is \(L = (8.8 \times 10^{13} / 10^7)^{0.076} = (8.8 \times 10^{6})^{0.076}\). Taking the log: \(0.076 \times \ln(8.8 \times 10^6) \approx 0.076 \times 16.0 \approx 1.22\), so \(L \approx e^{1.22} \approx 3.39\) nats. If we scale up to \(N = 10^9\) (1 billion parameters), we get \(L = (8.8 \times 10^{13} / 10^9)^{0.076} = (8.8 \times 10^4)^{0.076} \approx e^{0.076 \times 11.4} \approx e^{0.867} \approx 2.38\) nats. A 100x increase in parameters yields a loss reduction from 3.39 to 2.38 – meaningful but diminishing.

Power Law for Dataset Size

\[L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}\]

where \(D\) is the dataset size in tokens, \(D_c \sim 5.4 \times 10^{13}\) is a scale constant, and \(\alpha_D \sim 0.095\) is the power-law exponent.

In plain language: doubling the dataset reduces the loss by a factor of \(2^{-0.095} \approx 0.94\), about a 6% reduction. This holds when the model is large enough that it is not the bottleneck.

This matters because it quantifies the return on investment from collecting more data. Combined with the model-size law, it tells you that scaling up parameters gives slightly less return per doubling than scaling up data – but as we will see in the compute allocation result, parameters are still the better investment because of how the compute budget splits.

Combined Scaling Law for Model and Data

\[L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}\]

where all symbols are as defined above: \(N\) is the non-embedding parameter count, \(D\) is the dataset size in tokens, \(N_c\) and \(D_c\) are scale constants, and \(\alpha_N\) and \(\alpha_D\) are the respective power-law exponents.

In plain language: this single equation captures what happens when both model size and dataset size are finite simultaneously. When \(D\) is very large (effectively infinite), the \(D_c/D\) term vanishes and the equation reduces to \(L(N)\). When \(N\) is very large, the \((N_c/N)\) term vanishes and it reduces to \(L(D)\). In between, it captures overfitting: if you make the model much bigger without adding more data, performance stops improving.

A key implication is the overfitting threshold. To avoid meaningful overfitting, the dataset should scale as \(D \gtrsim 5000 \cdot N^{0.74}\). For example, a model with \(N = 10^9\) parameters (1 billion) needs roughly \(D \approx 5000 \times (10^9)^{0.74} \approx 5000 \times 1.3 \times 10^{6.66} \approx 2.3 \times 10^{10}\) tokens. The full WebText2 dataset at 22 billion tokens just barely suffices. Critically, data requirements grow sub-linearly with model size: doubling \(N\) requires only about \(2^{0.74} \approx 1.67\) times as much data, not twice as much.

Learning Curve Scaling Law

\[L(N, S_{\min}) = \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{S_c}{S_{\min}}\right)^{\alpha_S}\]

where \(S_{\min}\) is the minimum number of training steps (at optimal batch size) to reach a given loss, \(S_c \approx 2.1 \times 10^3\) is a scale constant, and \(\alpha_S \approx 0.76\) is the power-law exponent for training steps.

In plain language: the loss has two additive components – one from the model being too small (the \((N_c/N)\) term), and one from not training long enough (the \((S_c/S_{\min})\) term). Each is a power law. After an initial transient period at the start of training, this equation accurately fits the learning curves for all model sizes.

This matters because it separates the two sources of imperfect performance and lets you predict how long training will take to reach a target loss for a given model size. It also implies that for compute-efficient training, you should stop at the point where the loss is about \(\alpha_N / \alpha_S \approx 10\%\) above the fully converged loss – training further wastes compute that would be better spent on a bigger model.

Compute-Optimal Exponent

\[\alpha_C^{\min} = \frac{1}{1/\alpha_S + 1/\alpha_B + 1/\alpha_N}\]

where \(\alpha_S \approx 0.76\) is the training-step exponent, \(\alpha_B \approx 0.21\) is the batch-size exponent, and \(\alpha_N \approx 0.076\) is the model-size exponent. Plugging in: \(\alpha_C^{\min} \approx 1/(1.32 + 4.76 + 13.16) \approx 1/19.24 \approx 0.052\).

In plain language: this harmonic-mean-like combination of the three exponents determines how the optimal loss scales with total compute. Since \(\alpha_N\) is the smallest of the three, it dominates the denominator, which means the optimal model size scaling \(N \propto C^{\alpha_C^{\min}/\alpha_N} \propto C^{0.71}\) is steep – most of the compute should go into bigger models. Meanwhile, the number of training steps scales as \(S \propto C^{\alpha_C^{\min}/\alpha_S} \propto C^{0.07}\), barely growing at all.

This matters because it provides the recipe for compute-optimal training: given 10 times more compute, use about 5 times more parameters, roughly 1.8 times more data, and barely any more training steps. This is the paper’s most practically important result.

Results

The paper’s empirical results span a remarkable range. Consider a concrete allocation example. Suppose you have a compute budget of 1 PF-day (\(8.64 \times 10^{19}\) FLOPs). According to the fitted scaling laws, the compute-optimal choices would be approximately: model size \(N_{\text{opt}} \approx 1.3 \times 10^9 \times (1)^{0.73} = 1.3 \times 10^9\) parameters, dataset size \(D_{\text{opt}} \approx 2 \times 10^{10} \times (1)^{0.27} = 2 \times 10^{10}\) tokens, and training steps \(S_{\min} \approx 5.4 \times 10^3 \times (1)^{0.03} = 5400\) steps. Now suppose you get 10 times more compute (10 PF-days). The optimal model grows to \(N \approx 1.3 \times 10^9 \times 10^{0.73} \approx 7.0 \times 10^9\) (5.4x larger), while training steps barely change: \(S \approx 5400 \times 10^{0.03} \approx 5800\) (1.07x). Most of the new compute goes to model size.

Finding	Quantitative Result
Model size scaling	\(L \propto N^{-0.076}\), spanning 6 orders of magnitude in \(N\)
Dataset size scaling	\(L \propto D^{-0.095}\), spanning 2+ orders of magnitude in \(D\)
Compute-optimal scaling	\(L \propto C_{\min}^{-0.050}\), spanning 8 orders of magnitude in \(C\)
Optimal model size vs compute	\(N_{\text{opt}} \propto C_{\min}^{0.73}\)
Optimal training steps vs compute	\(S_{\min} \propto C_{\min}^{0.03}\) (nearly constant)
Data needed to avoid overfitting	\(D \gtrsim 5000 \cdot N^{0.74}\) (sub-linear in model size)
Transfer performance	Constant offset from training distribution, same scaling slope
Architecture shape dependence	Loss varies less than 3% across a 40x range of depth/width ratios

The transfer results are especially notable: models trained on WebText2 show the same power-law improvement on Books Corpus, Common Crawl, Wikipedia, and Internet Books, with only a constant additive offset in loss. This means the scaling laws predict generalization performance, not just training-distribution performance.

The comparison with LSTMs (Long Short-Term Memory networks, an older recurrent architecture) is also striking. LSTMs match Transformer performance on early tokens in the context but fall behind on later tokens, suggesting that the Transformer’s self-attention mechanism (which can directly access any previous token) is specifically better at using long-range context compared to the LSTM’s sequential hidden state.

Limitations

No theoretical explanation. The paper acknowledges that the power laws are purely empirical. There is no derivation from first principles explaining why these specific exponents arise. Without a theory, it is unclear when or how the scaling laws might break down.
Single training distribution. All experiments use WebText2 with BPE tokenization. The specific numerical constants (\(N_c\), \(D_c\), etc.) depend on the tokenization and data distribution, and the exponents might differ for fundamentally different domains (code, multilingual text, scientific text).
No regularization tuning. The authors used a fixed 10% dropout across all experiments and did not systematically vary regularization with model or dataset size. Better regularization could change the overfitting scaling.
Compute estimate ignores context-dependent terms. The \(C \approx 6NBS\) estimate drops the attention computation that scales with context length \(n_{\text{ctx}}\), which becomes significant when \(n_{\text{ctx}} \gtrsim 12 \, d_{\text{model}}\). For modern long-context models, this is a meaningful omission.
Exponents later revised by Chinchilla. The 2022 Chinchilla paper (Hoffmann et al.) found that compute-optimal training requires significantly more data than this paper suggests, with data and parameters scaling roughly equally with compute rather than parameters dominating. This indicates the scaling exponents here may be biased by the specific experimental setup (limited dataset size range, specific learning rate schedules, training at a fixed batch size rather than at the critical batch size).
Narrow task metric. Cross-entropy loss on next-token prediction is the only metric. The paper does not investigate how scaling affects downstream task performance, reasoning ability, or other capabilities that practitioners care about.
Small data regime poorly fit. The combined scaling law \(L(N, D)\) fits poorly when the dataset is reduced to about \(2 \times 10^7\) tokens, suggesting the scaling laws may not hold in low-data regimes.

Impact and Legacy

This paper transformed how the AI research community thinks about training large language models. Before it, scaling decisions were largely ad hoc. Afterward, labs began treating scaling as a predictable, optimizable process. The paper’s central message – that bigger models are more sample-efficient and that compute should primarily go toward model size – directly influenced the design of GPT-3 (175 billion parameters, released the same year by the same research group at OpenAI) and subsequent large language models.

The concept of “scaling laws” became a standard tool for AI labs planning training runs worth millions of dollars. Teams at Google DeepMind, Anthropic, Meta, and others adopted the methodology of fitting power laws to smaller runs and extrapolating to predict large-run performance. This enabled principled decisions about hardware purchases and training schedules months in advance.

However, the specific numerical recommendations were later revised. The 2022 Chinchilla paper by Hoffmann et al. at DeepMind found that compute-optimal training actually requires much more data relative to parameters than Kaplan et al. suggested. While Kaplan et al. predicted \(N \propto C^{0.73}\) and \(D \propto C^{0.27}\) (parameters should grow much faster than data), Chinchilla found roughly equal scaling for both, leading to smaller but better-trained models at the same compute budget. This revision did not invalidate the scaling law methodology – it refined the exponents, and demonstrated exactly the kind of distribution-dependent variation the original paper cautioned about. The broader insight that performance scales predictably with compute, and that you can plan large training runs from smaller experiments, remains foundational to the field.

Prerequisites

To follow this paper, a reader should understand:

Cross-entropy loss: the standard loss function for next-token prediction. It measures how surprised the model is by the actual next token, in nats (natural logarithm units). Lower is better.
Power laws: relationships of the form \(y = ax^b\), which appear as straight lines on log-log plots. The exponent \(b\) tells you how fast \(y\) changes with \(x\).
The Transformer architecture, including self-attention, feed-forward layers, and the distinction between encoder and decoder (see Attention Is All You Need).
Decoder-only language models that predict the next token given all previous tokens (see Improving Language Understanding by Generative Pre-Training).
Basic optimization concepts: training steps, batch size, learning rate schedules, convergence, and early stopping.

Connections

Transformer architecture (see Attention Is All You Need): This paper studies exactly that architecture in its decoder-only variant. It confirms that the Transformer’s performance scales predictably with size, and discovers that architectural details (depth vs. width, number of attention heads) matter far less than total parameter count.
GPT (see Improving Language Understanding by Generative Pre-Training): The models in this paper are essentially GPT-style decoder-only Transformers trained on WebText2, the successor to GPT’s WebText dataset. The scaling laws discovered here provided the quantitative justification for scaling up to GPT-3.
BERT (see BERT: Pre-training of Deep Bidirectional Transformers): While BERT uses a bidirectional encoder architecture and this paper studies unidirectional decoders, both demonstrate that larger pre-trained language models perform better. This paper goes further by precisely quantifying the relationship and discovering that model shape (the choice BERT makes to be an encoder) matters much less than model scale.
GANs (see Generative Adversarial Nets) and VAEs (see Auto-Encoding Variational Bayes): These are other generative modeling frameworks. The paper conjectures that similar scaling laws may hold for other generative models with maximum likelihood losses, potentially including VAEs. GANs, which use a different adversarial training objective, would be a less direct extension.