Lost in the Middle: How Language Models Use Long Contexts

Authors: Nelson F. Liu, Kevin Lin, John Hewitt, et al. · Year: 2023 · Source: arXiv:2307.03172

One-Sentence Summary

Language models perform best when the information they need appears at the very beginning or very end of their input, and they struggle significantly when critical information is buried in the middle – even when the model technically supports long contexts.

Problem Statement

By 2023, language models had gained the ability to accept increasingly long input contexts – 4K tokens, 16K tokens, even 100K tokens. Hardware improvements and algorithmic advances like FlashAttention (a memory-efficient attention algorithm that avoids materializing the full attention matrix) made it feasible to build models that could technically process all these tokens. But a critical question remained unanswered: just because a model can accept 100,000 tokens, does it actually use them effectively?

This question matters enormously for real-world applications. Consider a search-augmented chatbot (like Bing Chat) that retrieves 20 Wikipedia passages to answer a user’s question. If the model can only reliably use information from the first and last few passages, then most of those retrieved documents are wasted – or worse, the extra context might actually hurt performance by diluting the signal with noise.

Prior work had shown that smaller language models (LSTMs with under 1 billion parameters) tended to favor recent tokens, essentially ignoring the beginning of long inputs. But no one had systematically studied how modern large Transformer-based models – the ones powering production systems – handle information at different positions within their full context windows.

Key Innovation

Think of it like reading a long grocery list aloud once, then being asked to recall specific items. Psychologists have known since the 1960s that people reliably remember the first few items (primacy effect) and the last few items (recency effect), but struggle with items in the middle. This paper demonstrates that language models exhibit a strikingly similar pattern, which the authors call the “U-shaped performance curve.”

The key contribution is not a new model or technique, but rather a rigorous experimental methodology that reveals a fundamental weakness in how all tested language models process long contexts. The authors design two controlled tasks – multi-document question answering and synthetic key-value retrieval – where they can precisely control (1) the total length of the input context and (2) the exact position of the one piece of relevant information within that context. By systematically varying these two factors across multiple state-of-the-art models, they produce clear evidence that current language models do not uniformly attend to all positions in their input.

The paper also investigates why this happens, examining three factors: model architecture (decoder-only versus encoder-decoder), query-aware contextualization (placing the question before and after the documents), and instruction fine-tuning. These investigations reveal that the problem is deeply rooted in how models are trained and structured, not simply a side effect of post-training.

Architecture / Method

Figure 1 (the paper's central finding): the U-shaped performance curve. When the relevant document is placed at the beginning (position 1) or end (position 20) of the input context, model accuracy is highest. When it is placed in the middle, accuracy drops significantly, forming a U-shaped curve. This pattern holds across models and context lengths.

The paper’s experimental method is built around two tasks that serve as controlled testbeds for measuring how models use different positions in their input context.

Task 1: Multi-Document Question Answering. The model receives a question and \(k\) documents (Wikipedia passages of up to 100 tokens each). Exactly one document contains the answer; the remaining \(k - 1\) are “distractor” documents retrieved by Contriever (a dense retrieval system – one that encodes queries and documents as numerical vectors and finds matches by vector similarity – fine-tuned on MS-MARCO, a large-scale passage ranking dataset from Microsoft) that are topically relevant but do not contain the answer. The key experimental variable is the position of the answer-containing document within the list. By sliding the answer document from position 1 to position \(k\), the authors measure how accuracy changes as a function of where the model must look to find the answer. They test with 10, 20, and 30 total documents (roughly 2K, 4K, and 6K tokens). The dataset comes from NaturalQuestions-Open, using 2,655 queries where the annotated answer is a Wikipedia paragraph.
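The experimental variable can be made concrete with a short sketch. The helper below is hypothetical (not the authors' code, and the prompt wording is illustrative): it inserts the gold document at a chosen 1-indexed position among the distractors and formats the input the model sees.

```python
# Hypothetical sketch of the multi-document QA setup: slide the
# answer-containing document to position `gold_pos` among k - 1
# distractors. Prompt wording is illustrative, not the paper's template.
def build_qa_prompt(question, gold_doc, distractors, gold_pos):
    """gold_pos is 1-indexed; distractors has k - 1 entries."""
    docs = list(distractors)
    docs.insert(gold_pos - 1, gold_doc)  # place gold doc at gold_pos
    doc_block = "\n\n".join(
        f"Document [{i}] {doc}" for i, doc in enumerate(docs, start=1)
    )
    return (
        "Write a high-quality answer for the given question using only "
        "the provided search results.\n\n"
        f"{doc_block}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_qa_prompt(
    question="who got the first nobel prize in physics",
    gold_doc="(Title: Nobel Prize in Physics) ...",
    distractors=["(Title: Nobel Prize) ...", "(Title: Alfred Nobel) ..."],
    gold_pos=2,
)
```

Sweeping `gold_pos` from 1 to \(k\) while holding everything else fixed is exactly the controlled manipulation that produces the position-versus-accuracy curves.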

Task 2: Synthetic Key-Value Retrieval. To isolate the retrieval ability from language understanding, the authors create a minimal task: the model receives a JSON object containing \(k\) key-value pairs (all keys and values are random 128-bit UUIDs) and must return the value for a specified key. This task strips away all natural language semantics – the model just needs to find and copy a matching string. They test with 75, 140, and 300 key-value pairs (roughly 4K, 8K, and 16K tokens).
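A minimal reconstruction of this synthetic task (assumptions: the exact JSON serialization and prompt wording are illustrative, not the paper's template):

```python
import json
import random
import uuid

# Sketch of the synthetic key-value retrieval task: k random UUID
# key-value pairs serialized as JSON, plus one query key whose value
# the model must return. Reconstruction, not the authors' code.
def make_kv_example(k, seed=0):
    rng = random.Random(seed)
    pairs = {
        str(uuid.UUID(int=rng.getrandbits(128))):
        str(uuid.UUID(int=rng.getrandbits(128)))
        for _ in range(k)
    }
    query_key = rng.choice(list(pairs))
    prompt = (
        "Extract the value corresponding to the specified key from the "
        "JSON object below.\n\n"
        f"{json.dumps(pairs)}\n\nKey: {query_key}\nCorresponding value:"
    )
    return prompt, pairs[query_key]
```

Because every key and value is a random UUID, success requires only exact string matching and copying, with no linguistic reasoning.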

Evaluation. For both tasks, accuracy is measured by checking whether any of the correct answers appear in the model’s generated output. The authors evaluate six models: two open (MPT-30B-Instruct, LongChat-13B-16K) and four closed (GPT-3.5-Turbo, GPT-3.5-Turbo-16K, Claude-1.3, Claude-1.3-100K). They also study encoder-decoder models (Flan-T5-XXL, Flan-UL2) and varying sizes of Llama-2 (7B, 13B, 70B) in follow-up analyses.

Three investigation axes. To understand the source of the U-shaped curve, the authors run three ablations: (1) comparing decoder-only models against encoder-decoder models to test whether bidirectional attention helps, (2) placing the query both before and after the documents (query-aware contextualization) to simulate bidirectional attention in decoder-only models, and (3) comparing instruction-tuned models against their base counterparts to determine whether fine-tuning causes the positional bias.

Mathematical Foundations

This paper is a purely empirical study with no formal mathematical equations in its source. Its contribution lies in experimental design and quantitative analysis rather than new formulations. However, the experimental framework relies on several quantitative concepts worth defining precisely.

Accuracy metric. For both tasks, the paper defines accuracy as:

\[\text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[\text{answer}_i \in \text{output}_i]\]

where \(N\) is the number of evaluation examples (2,655 for multi-document QA, 500 for key-value retrieval), \(\text{answer}_i\) is any of the gold-standard correct answers for example \(i\), \(\text{output}_i\) is the model’s generated text, and \(\mathbf{1}[\cdot]\) is the indicator function that returns 1 if the condition is true and 0 otherwise. The \(\in\) check is a substring match – the answer must appear somewhere in the generated output. This metric captures whether the model successfully extracted and used the relevant information, regardless of how it phrased its response. This matters because it isolates the model’s ability to locate and use information from its ability to generate fluent text.
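In code, the metric is a few lines. One caveat: the case normalization below is an assumption for robustness, not something the formula above specifies.

```python
# Substring-match accuracy: an example counts as correct if any gold
# answer string appears in the model's output. Lowercasing is an
# assumed normalization, not part of the formal definition.
def accuracy(outputs, gold_answers):
    """outputs: list of generated strings; gold_answers: list of lists
    of acceptable answer strings, one list per example."""
    hits = sum(
        any(ans.lower() in out.lower() for ans in answers)
        for out, answers in zip(outputs, gold_answers)
    )
    return hits / len(outputs)
```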

Position-conditioned accuracy. The central analysis conditions accuracy on the position \(p\) of the relevant information:

\[\text{Accuracy}(p) = \frac{1}{|\{i : \text{pos}(r_i) = p\}|} \sum_{i : \text{pos}(r_i) = p} \mathbf{1}[\text{answer}_i \in \text{output}_i]\]

where \(\text{pos}(r_i)\) is the position of the relevant document (or key-value pair) in example \(i\)’s input context, and \(p\) ranges from 1 to \(k\) (the total number of documents or key-value pairs). If a model uses its entire context uniformly, \(\text{Accuracy}(p)\) should be roughly constant across all positions. The U-shaped finding is that \(\text{Accuracy}(p)\) is high when \(p\) is near 1 or \(k\), and significantly lower for intermediate values of \(p\). This is the paper’s central measurement – plotting \(\text{Accuracy}(p)\) against \(p\) for each model reveals the severity of positional bias.
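The position-conditioned version is just the same 0/1 outcomes grouped by where the relevant item was placed. A minimal sketch:

```python
from collections import defaultdict

# Accuracy(p): group per-example substring-match outcomes by the
# position of the relevant document, then average within each group.
def accuracy_by_position(records):
    """records: iterable of (position, correct) pairs, where `correct`
    is the 0/1 outcome for one example."""
    totals, hits = defaultdict(int), defaultdict(int)
    for pos, correct in records:
        totals[pos] += 1
        hits[pos] += int(correct)
    return {p: hits[p] / totals[p] for p in sorted(totals)}
```

Plotting the returned dictionary's values against its keys, for each model, is how the paper visualizes the severity of the U-shape.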

Retriever recall. In the open-domain QA case study, the paper tracks retriever recall alongside reader accuracy:

\[\text{Recall}@k = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left[\text{answer}_i \in \bigcup_{j=1}^{k} d_{ij}\right]\]

where \(d_{ij}\) is the \(j\)-th retrieved document for query \(i\), and the union checks whether the answer appears in any of the top-\(k\) retrieved documents. This matters because the gap between \(\text{Recall}@k\) and reader accuracy quantifies wasted retrieval effort. When the retriever recalls the answer but the reader fails to extract it, the bottleneck is the model’s context usage, not the retrieval system.

As a concrete example: with \(k = 20\) retrieved documents, Contriever achieves approximately 71% recall, and GPT-3.5-Turbo achieves approximately 63% reader accuracy. Increasing to \(k = 50\) pushes recall to about 78%, but reader accuracy only reaches approximately 64.5%. That 7-point recall gain translates to barely 1.5 points of reader accuracy – a direct consequence of the model’s inability to use information from across its full context.
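Recall@k can be computed under the same substring convention. This is a sketch consistent with the definition above, not the paper's evaluation script:

```python
# Recall@k: does any of the top-k retrieved documents for a query
# contain a gold answer string? Lowercasing is an assumed normalization.
def recall_at_k(retrieved_docs, gold_answers, k):
    """retrieved_docs: per-query lists of ranked document strings;
    gold_answers: per-query lists of acceptable answer strings."""
    hits = sum(
        any(ans.lower() in doc.lower()
            for doc in docs[:k] for ans in answers)
        for docs, answers in zip(retrieved_docs, gold_answers)
    )
    return hits / len(retrieved_docs)
```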

Results

The main result across all models and both tasks is the U-shaped performance curve. When the answer-containing document is placed at position 1 (beginning of context) or position \(k\) (end of context), models perform well. When it is placed in the middle, accuracy drops sharply.

The table below shows GPT-3.5-Turbo multi-document QA accuracy at selected positions (20 total documents, approximately 4K tokens):

| Setting | Accuracy |
| --- | --- |
| Answer at position 1 (beginning) | 75.8% |
| Answer at position 10 (middle) | 53.8% |
| Answer at position 20 (end) | 63.2% |
| Closed-book (no documents) | 56.1% |
| Oracle (answer document only) | 88.3% |

The middle-position accuracy (53.8%) is actually worse than the closed-book setting (56.1%), meaning the model would have performed better with no documents at all than with the answer buried in the middle of 20 documents. This demonstrates that adding more context can actively harm performance.

Extended-context models show no advantage. GPT-3.5-Turbo (4K context) and GPT-3.5-Turbo-16K perform nearly identically on inputs that fit within the smaller model’s context window. The same holds for Claude-1.3 (8K) and Claude-1.3 (100K). Simply expanding the context window does not improve a model’s ability to use information at different positions.

Encoder-decoder models resist the bias – within their training-time context length. Flan-UL2 shows only a 1.9% accuracy gap between best and worst positions on 10-document inputs (within its 2,048-token training window). But when pushed to longer sequences, it develops the same U-shaped curve as decoder-only models.

Query-aware contextualization fixes retrieval but not reasoning. Placing the query both before and after the documents enables near-perfect key-value retrieval: GPT-3.5-Turbo-16K rises from 45.6% at its worst position to 100% on 300 key-value pairs. But for multi-document QA, which requires reasoning beyond simple string matching, the improvement is negligible.
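The intervention itself is a trivial prompt change. The sketch below is illustrative (the wording is not the paper's exact template): repeating the query before the data lets a decoder-only model condition every data token on the query as it reads, rather than only at the end.

```python
# Query-aware contextualization, sketched for the key-value task:
# state the query both before and after the data. Wording is
# illustrative, not the paper's exact template.
def query_aware_prompt(query_key, kv_json):
    return (
        f"Extract the value for the key below.\nKey: {query_key}\n\n"
        f"{kv_json}\n\nKey: {query_key}\nCorresponding value:"
    )
```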

Model scale matters for primacy. Among Llama-2 models, the 7B variant shows only recency bias (favoring the end of context). The 13B and 70B variants develop the full U-shape with both primacy and recency bias. This suggests that primacy bias emerges with scale, possibly because larger models have more capacity to store patterns from diverse pretraining data (such as StackOverflow posts where important information appears at the beginning).

More retrieved documents plateau quickly. In the open-domain QA case study, retriever recall keeps climbing from 20 to 50 documents, but reader accuracy gains only approximately 1-1.5%. The extra documents cost more tokens (more latency, more cost) for almost no benefit.

Limitations

Impact and Legacy

This paper had an outsized impact on how practitioners think about retrieval-augmented generation (RAG) systems and long-context model deployment. The “lost in the middle” finding became one of the most widely cited results in the context engineering space, directly influencing how production systems order retrieved documents in prompts. The practical takeaway – place the most relevant information at the beginning or end of the context, not the middle – became standard advice in prompt engineering guides.

The paper catalyzed two lines of follow-up research. First, it motivated work on better document reranking strategies for RAG, where retrieved passages are reordered to push relevant information toward positions where models attend most effectively. Second, it spurred research into training methods that reduce positional bias, including modifications to positional encoding schemes, context-length-aware fine-tuning, and attention pattern regularization.

The experimental methodology itself proved influential. The controlled needle-in-a-haystack evaluation paradigm – systematically varying the position and context length to map out a model’s “attention landscape” – became a standard evaluation technique. Model developers now routinely report needle-in-a-haystack results as part of their long-context claims, and the U-shaped curve serves as a benchmark that newer models are expected to flatten.

Prerequisites

To understand this paper, you need:

Connections