Lost in the Middle: How Language Models Use Long Contexts

Authors: Nelson F. Liu, Kevin Lin, John Hewitt, et al. · Year: 2023 · Source: arXiv:2307.03172

One-Sentence Summary

Language models perform best when the information they need appears at the very beginning or very end of their input, and they struggle significantly when critical information is buried in the middle – even when the model technically supports long contexts.

Problem Statement

By 2023, language models had gained the ability to accept increasingly long input contexts – 4K tokens, 16K tokens, even 100K tokens. Hardware improvements and algorithmic advances like FlashAttention (a memory-efficient attention algorithm that avoids materializing the full attention matrix) made it feasible to build models that could technically process all these tokens. But a critical question remained unanswered: just because a model can accept 100,000 tokens, does it actually use them effectively?

This question matters enormously for real-world applications. Consider a search-augmented chatbot (like Bing Chat) that retrieves 20 Wikipedia passages to answer a user’s question. If the model can only reliably use information from the first and last few passages, then most of those retrieved documents are wasted – or worse, the extra context might actually hurt performance by diluting the signal with noise.

Prior work had shown that smaller language models (LSTMs with under 1 billion parameters) tended to favor recent tokens, essentially ignoring the beginning of long inputs. But no one had systematically studied how modern large Transformer-based models – the ones powering production systems – handle information at different positions within their full context windows.

Key Innovation

Think of it like reading a long grocery list aloud once, then being asked to recall specific items. Psychologists have known since the 1960s that people reliably remember the first few items (primacy effect) and the last few items (recency effect), but struggle with items in the middle. This paper demonstrates that language models exhibit a strikingly similar pattern, which the authors call the “U-shaped performance curve.”

The key contribution is not a new model or technique, but rather a rigorous experimental methodology that reveals a fundamental weakness in how all tested language models process long contexts. The authors design two controlled tasks – multi-document question answering and synthetic key-value retrieval – where they can precisely control (1) the total length of the input context and (2) the exact position of the one piece of relevant information within that context. By systematically varying these two factors across multiple state-of-the-art models, they produce clear evidence that current language models do not uniformly attend to all positions in their input.

The paper also investigates why this happens, examining three factors: model architecture (decoder-only versus encoder-decoder), query-aware contextualization (placing the question before and after the documents), and instruction fine-tuning. These investigations reveal that the problem is deeply rooted in how models are trained and structured, not simply a side effect of post-training.

Architecture / Method

Figure 1 (the paper's central finding): the U-shaped performance curve. When the relevant document is placed at the beginning (position 1) or end (position 20) of the input context, model accuracy is highest. When it is placed in the middle, accuracy drops significantly, forming a U-shaped curve. This pattern holds across models and context lengths.

The paper’s experimental method is built around two tasks that serve as controlled testbeds for measuring how models use different positions in their input context.

Task 1: Multi-Document Question Answering. The model receives a question and \(k\) documents (Wikipedia passages of up to 100 tokens each). Exactly one document contains the answer; the remaining \(k - 1\) are “distractor” documents retrieved by Contriever (a dense retrieval system – one that encodes queries and documents as numerical vectors and finds matches by vector similarity – fine-tuned on MS-MARCO, a large-scale passage ranking dataset from Microsoft) that are topically relevant but do not contain the answer. The key experimental variable is the position of the answer-containing document within the list. By sliding the answer document from position 1 to position \(k\), the authors measure how accuracy changes as a function of where the model must look to find the answer. They test with 10, 20, and 30 total documents (roughly 2K, 4K, and 6K tokens). The dataset comes from NaturalQuestions-Open, using 2,655 queries where the annotated answer is a Wikipedia paragraph.
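The experimental variable can be made concrete with a short sketch. The helper below is hypothetical (not the authors' code, and the prompt wording is illustrative): it inserts the gold document at a chosen 1-indexed position among the distractors and formats the input the model sees.

```python
# Hypothetical sketch of the multi-document QA setup: slide the
# answer-containing document to position `gold_pos` among k - 1
# distractors. Prompt wording is illustrative, not the paper's template.
def build_qa_prompt(question, gold_doc, distractors, gold_pos):
    """gold_pos is 1-indexed; distractors has k - 1 entries."""
    docs = list(distractors)
    docs.insert(gold_pos - 1, gold_doc)  # place gold doc at gold_pos
    doc_block = "\n\n".join(
        f"Document [{i}] {doc}" for i, doc in enumerate(docs, start=1)
    )
    return (
        "Write a high-quality answer for the given question using only "
        "the provided search results.\n\n"
        f"{doc_block}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_qa_prompt(
    question="who got the first nobel prize in physics",
    gold_doc="(Title: Nobel Prize in Physics) ...",
    distractors=["(Title: Nobel Prize) ...", "(Title: Alfred Nobel) ..."],
    gold_pos=2,
)
```

Sweeping `gold_pos` from 1 to \(k\) while holding everything else fixed is exactly the controlled manipulation that produces the position-versus-accuracy curves.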

Task 2: Synthetic Key-Value Retrieval. To isolate the retrieval ability from language understanding, the authors create a minimal task: the model receives a JSON object containing \(k\) key-value pairs (all keys and values are random 128-bit UUIDs) and must return the value for a specified key. This task strips away all natural language semantics – the model just needs to find and copy a matching string. They test with 75, 140, and 300 key-value pairs (roughly 4K, 8K, and 16K tokens).
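A minimal reconstruction of this synthetic task (assumptions: the exact JSON serialization and prompt wording are illustrative, not the paper's template):

```python
import json
import random
import uuid

# Sketch of the synthetic key-value retrieval task: k random UUID
# key-value pairs serialized as JSON, plus one query key whose value
# the model must return. Reconstruction, not the authors' code.
def make_kv_example(k, seed=0):
    rng = random.Random(seed)
    pairs = {
        str(uuid.UUID(int=rng.getrandbits(128))):
        str(uuid.UUID(int=rng.getrandbits(128)))
        for _ in range(k)
    }
    query_key = rng.choice(list(pairs))
    prompt = (
        "Extract the value corresponding to the specified key from the "
        "JSON object below.\n\n"
        f"{json.dumps(pairs)}\n\nKey: {query_key}\nCorresponding value:"
    )
    return prompt, pairs[query_key]
```

Because every key and value is a random UUID, success requires only exact string matching and copying, with no linguistic reasoning.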

Evaluation. For both tasks, accuracy is measured by checking whether any of the correct answers appear in the model’s generated output. The authors evaluate six models: two open (MPT-30B-Instruct, LongChat-13B-16K) and four closed (GPT-3.5-Turbo, GPT-3.5-Turbo-16K, Claude-1.3, Claude-1.3-100K). They also study encoder-decoder models (Flan-T5-XXL, Flan-UL2) and varying sizes of Llama-2 (7B, 13B, 70B) in follow-up analyses.

Three investigation axes. To understand the source of the U-shaped curve, the authors run three ablations: (1) comparing decoder-only models against encoder-decoder models to test whether bidirectional attention helps, (2) placing the query both before and after the documents (query-aware contextualization) to simulate bidirectional attention in decoder-only models, and (3) comparing instruction-tuned models against their base counterparts to determine whether fine-tuning causes the positional bias.

Mathematical Foundations

This paper is a purely empirical study with no formal mathematical equations in its source. Its contribution lies in experimental design and quantitative analysis rather than new formulations. However, the experimental framework relies on several quantitative concepts worth defining precisely.

Accuracy metric. For both tasks, the paper defines accuracy as:

\[\text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[\text{answer}_i \in \text{output}_i]\]

where \(N\) is the number of evaluation examples (2,655 for multi-document QA, 500 for key-value retrieval), \(\text{answer}_i\) is any of the gold-standard correct answers for example \(i\), \(\text{output}_i\) is the model’s generated text, and \(\mathbf{1}[\cdot]\) is the indicator function that returns 1 if the condition is true and 0 otherwise. The \(\in\) check is a substring match – the answer must appear somewhere in the generated output. This metric captures whether the model successfully extracted and used the relevant information, regardless of how it phrased its response. This matters because it isolates the model’s ability to locate and use information from its ability to generate fluent text.
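In code, the metric is a few lines. One caveat: the case normalization below is an assumption for robustness, not something the formula above specifies.

```python
# Substring-match accuracy: an example counts as correct if any gold
# answer string appears in the model's output. Lowercasing is an
# assumed normalization, not part of the formal definition.
def accuracy(outputs, gold_answers):
    """outputs: list of generated strings; gold_answers: list of lists
    of acceptable answer strings, one list per example."""
    hits = sum(
        any(ans.lower() in out.lower() for ans in answers)
        for out, answers in zip(outputs, gold_answers)
    )
    return hits / len(outputs)
```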

Position-conditioned accuracy. The central analysis conditions accuracy on the position \(p\) of the relevant information:

\[\text{Accuracy}(p) = \frac{1}{|\{i : \text{pos}(r_i) = p\}|} \sum_{i : \text{pos}(r_i) = p} \mathbf{1}[\text{answer}_i \in \text{output}_i]\]

where \(\text{pos}(r_i)\) is the position of the relevant document (or key-value pair) in example \(i\)’s input context, and \(p\) ranges from 1 to \(k\) (the total number of documents or key-value pairs). If a model uses its entire context uniformly, \(\text{Accuracy}(p)\) should be roughly constant across all positions. The U-shaped finding is that \(\text{Accuracy}(p)\) is high when \(p\) is near 1 or \(k\), and significantly lower for intermediate values of \(p\). This is the paper’s central measurement – plotting \(\text{Accuracy}(p)\) against \(p\) for each model reveals the severity of positional bias.
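The position-conditioned version is just the same 0/1 outcomes grouped by where the relevant item was placed. A minimal sketch:

```python
from collections import defaultdict

# Accuracy(p): group per-example substring-match outcomes by the
# position of the relevant document, then average within each group.
def accuracy_by_position(records):
    """records: iterable of (position, correct) pairs, where `correct`
    is the 0/1 outcome for one example."""
    totals, hits = defaultdict(int), defaultdict(int)
    for pos, correct in records:
        totals[pos] += 1
        hits[pos] += int(correct)
    return {p: hits[p] / totals[p] for p in sorted(totals)}
```

Plotting the returned dictionary's values against its keys, for each model, is how the paper visualizes the severity of the U-shape.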

Retriever recall. In the open-domain QA case study, the paper tracks retriever recall alongside reader accuracy:

\[\text{Recall}@k = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left[\text{answer}_i \in \bigcup_{j=1}^{k} d_{ij}\right]\]

where \(d_{ij}\) is the \(j\)-th retrieved document for query \(i\), and the union checks whether the answer appears in any of the top-\(k\) retrieved documents. This matters because the gap between \(\text{Recall}@k\) and reader accuracy quantifies wasted retrieval effort. When the retriever recalls the answer but the reader fails to extract it, the bottleneck is the model’s context usage, not the retrieval system.

As a concrete example: with \(k = 20\) retrieved documents, Contriever achieves approximately 71% recall, and GPT-3.5-Turbo achieves approximately 63% reader accuracy. Increasing to \(k = 50\) pushes recall to about 78%, but reader accuracy only reaches approximately 64.5%. That 7-point recall gain translates to barely 1.5 points of reader accuracy – a direct consequence of the model’s inability to use information from across its full context.
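Recall@k can be computed under the same substring convention. This is a sketch consistent with the definition above, not the paper's evaluation script:

```python
# Recall@k: does any of the top-k retrieved documents for a query
# contain a gold answer string? Lowercasing is an assumed normalization.
def recall_at_k(retrieved_docs, gold_answers, k):
    """retrieved_docs: per-query lists of ranked document strings;
    gold_answers: per-query lists of acceptable answer strings."""
    hits = sum(
        any(ans.lower() in doc.lower()
            for doc in docs[:k] for ans in answers)
        for docs, answers in zip(retrieved_docs, gold_answers)
    )
    return hits / len(retrieved_docs)
```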

Results

The main result across all models and both tasks is the U-shaped performance curve. When the answer-containing document is placed at position 1 (beginning of context) or position \(k\) (end of context), models perform well. When it is placed in the middle, accuracy drops sharply.

The table below shows GPT-3.5-Turbo multi-document QA accuracy at selected positions (20 total documents, approximately 4K tokens):

| Setting | Accuracy |
| --- | --- |
| Answer at position 1 (beginning) | 75.8% |
| Answer at position 10 (middle) | 53.8% |
| Answer at position 20 (end) | 63.2% |
| Closed-book (no documents) | 56.1% |
| Oracle (answer document only) | 88.3% |

The middle-position accuracy (53.8%) is actually worse than the closed-book setting (56.1%), meaning the model would have performed better with no documents at all than with the answer buried in the middle of 20 documents. This demonstrates that adding more context can actively harm performance.

Extended-context models show no advantage. GPT-3.5-Turbo (4K context) and GPT-3.5-Turbo-16K perform nearly identically on inputs that fit within the smaller model’s context window. The same holds for Claude-1.3 (8K) and Claude-1.3 (100K). Simply expanding the context window does not improve a model’s ability to use information at different positions.

Encoder-decoder models resist the bias – within their training-time context length. Flan-UL2 shows only a 1.9% accuracy gap between best and worst positions on 10-document inputs (within its 2,048-token training window). But when pushed to longer sequences, it develops the same U-shaped curve as decoder-only models.

Query-aware contextualization fixes retrieval but not reasoning. Placing the query both before and after the documents enables near-perfect key-value retrieval: GPT-3.5-Turbo-16K rises from 45.6% at its worst position to 100% on 300 key-value pairs. But for multi-document QA, which requires reasoning beyond simple string matching, the improvement is negligible.
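The intervention itself is a trivial prompt change. The sketch below is illustrative (the wording is not the paper's exact template): repeating the query before the data lets a decoder-only model condition every data token on the query as it reads, rather than only at the end.

```python
# Query-aware contextualization, sketched for the key-value task:
# state the query both before and after the data. Wording is
# illustrative, not the paper's exact template.
def query_aware_prompt(query_key, kv_json):
    return (
        f"Extract the value for the key below.\nKey: {query_key}\n\n"
        f"{kv_json}\n\nKey: {query_key}\nCorresponding value:"
    )
```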

Model scale matters for primacy. Among Llama-2 models, the 7B variant shows only recency bias (favoring the end of context). The 13B and 70B variants develop the full U-shape with both primacy and recency bias. This suggests that primacy bias emerges with scale, possibly because larger models have more capacity to store patterns from diverse pretraining data (such as StackOverflow posts where important information appears at the beginning).

More retrieved documents plateau quickly. In the open-domain QA case study, retriever recall keeps climbing from 20 to 50 documents, but reader accuracy gains only approximately 1-1.5%. The extra documents cost more tokens (more latency, more cost) for almost no benefit.

Limitations

Impact and Legacy

This paper had an outsized impact on how practitioners think about retrieval-augmented generation (RAG) systems and long-context model deployment. The “lost in the middle” finding became one of the most widely cited results in the context engineering space, directly influencing how production systems order retrieved documents in prompts. The practical takeaway – place the most relevant information at the beginning or end of the context, not the middle – became standard advice in prompt engineering guides.

The paper catalyzed two lines of follow-up research. First, it motivated work on better document reranking strategies for RAG, where retrieved passages are reordered to push relevant information toward positions where models attend most effectively. Second, it spurred research into training methods that reduce positional bias, including modifications to positional encoding schemes, context-length-aware fine-tuning, and attention pattern regularization.

The experimental methodology itself proved influential. The controlled needle-in-a-haystack evaluation paradigm – systematically varying the position and context length to map out a model’s “attention landscape” – became a standard evaluation technique. Model developers now routinely report needle-in-a-haystack results as part of their long-context claims, and the U-shaped curve serves as a benchmark that newer models are expected to flatten.

Prerequisites

To understand this paper, you need:

Connections