Authors: Patrick Lewis, Ethan Perez, Aleksandra Piktus et al. Year: 2020 Source: arXiv 2005.11401
RAG gives a text-generation model the ability to look up relevant Wikipedia passages before answering. By combining a search engine with a language generator, the system can produce responses grounded in real-world facts rather than relying solely on memorized knowledge.
By 2020, large pre-trained language models like GPT-2 and T5 had demonstrated an impressive ability to answer factual questions by drawing on knowledge absorbed during training. The model parameters themselves acted as a kind of implicit database: ask “Who wrote Hamlet?” and the model could produce “William Shakespeare” purely from patterns learned over billions of words. But this parametric knowledge had three serious problems.
First, the knowledge was frozen at training time. If a new president took office or a scientific discovery was published after training, the model had no way to learn about it without expensive retraining. Second, there was no way to inspect where an answer came from. When a model said “The capital of Australia is Sydney” (incorrect – it is Canberra), you could not point to the source it relied on, making it hard to debug or trust. Third, the model could “hallucinate” – confidently generate plausible-sounding but entirely fabricated facts, because its generation process had no mechanism to verify claims against any external source.
Meanwhile, a parallel line of work on retrieval-based models like REALM and ORQA had shown that pairing a language model with a document retriever could help with extractive question answering (where the answer is copied verbatim from a retrieved passage). But these systems were limited to extraction – they could highlight a span in a document but could not synthesize a free-form answer, verify a claim, or generate a Jeopardy question. No one had built a general-purpose system that combined retrieval with open-ended text generation across many different task types.
Think of the difference between a closed-book exam and an open-book exam. In a closed-book exam, you must answer every question from memory alone. If you misremember a fact, you give the wrong answer and have no way to check yourself. In an open-book exam, you can flip through a reference book to find relevant passages before writing your answer. You still need to understand the question and compose a coherent response, but you can ground your answer in actual source material.
RAG is an open-book exam for language models. Given a question (or any input text), RAG first searches a large corpus – in this case, all of Wikipedia split into 21 million 100-word passages – to find the most relevant documents. It then feeds both the original question and the retrieved documents to a text generator – BART, a seq2seq transformer (see Attention Is All You Need) – which produces the final answer. The retriever and generator are trained together end-to-end, meaning the system learns what to look up based on what helps it produce better answers.
The key technical insight is treating the retrieved documents as latent variables – hidden choices that the model considers but that are not directly supervised. The paper proposes two ways to combine evidence from multiple retrieved documents: RAG-Sequence, which picks one document and generates the entire answer from it (then averages across documents), and RAG-Token, which can consult a different document for each word it generates. This latent-variable formulation lets the system learn to retrieve useful documents without anyone having to label which document is “correct” for each question.
RAG has two main components that work together: a retriever that finds relevant documents and a generator that produces text conditioned on those documents.
Figure 1: Overview of the RAG approach. A pre-trained retriever (query encoder + document index) finds the top-K relevant documents using Maximum Inner Product Search (MIPS). These documents are concatenated with the input and passed to a pre-trained seq2seq generator (BART). The retrieved documents are treated as latent variables, marginalized over during generation.
The Retriever (Dense Passage Retriever / DPR). The retriever uses two separate BERT encoders (see BERT). One encoder converts the input query \(x\) into a vector \(q(x)\), and the other pre-computes a vector \(d(z)\) for every document \(z\) in the corpus. To find the most relevant documents for a query, the system computes the dot product between the query vector and every document vector, then selects the top-\(K\) documents with the highest scores. This is a Maximum Inner Product Search (MIPS) problem, solved efficiently using the FAISS library (Facebook AI Similarity Search, a library for fast nearest-neighbor lookup in high-dimensional vector spaces) with approximate nearest-neighbor indexing so that searching 21 million documents takes sub-linear time rather than requiring a brute-force scan.
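To make the retrieval step concrete, here is a minimal sketch of maximum inner product search over a toy index, using a brute-force dot-product scan in place of FAISS's approximate indexing (the vectors and the `top_k` helper below are illustrative, not DPR's actual embeddings, which are 768-dimensional BERT outputs):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def top_k(query_vec, doc_vecs, k):
    """Brute-force maximum inner product search over a toy index.
    DPR/FAISS replace this linear scan with approximate
    nearest-neighbor search so 21 million documents stay tractable."""
    scores = [(dot(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    scores.sort(reverse=True)
    return scores[:k]

# Toy 3-dimensional "embeddings" for one query and four documents.
q = [0.3, 0.8, 0.1]
docs = [[0.4, 0.7, 0.2],
        [0.9, 0.1, 0.0],
        [0.2, 0.8, 0.3],
        [0.0, 0.1, 0.9]]
print(top_k(q, docs, k=2))  # (score, doc_id) pairs, best first
```

Documents whose vectors align with the query direction score highest; here documents 2 and 0 win.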
The Generator (BART). The generator is BART-large, a 400-million-parameter encoder-decoder transformer. For each retrieved document \(z\), the model concatenates the original input \(x\) with the document text \(z\) and feeds this combined input to the encoder. The decoder then generates the output sequence token by token, conditioned on the encoded representation. BART was pre-trained with a denoising objective – learning to reconstruct text that has been corrupted with various noise functions – giving it strong general-purpose generation abilities.
Two Marginalization Strategies. Since the model retrieves \(K\) documents, it needs a way to combine the \(K\) different generation results into a single output distribution. RAG-Sequence treats each document as a complete “hypothesis source”: it generates the full output sequence conditioned on each document separately, then weights each sequence by the retriever’s probability for that document and sums. RAG-Token is more fine-grained: for each output token position, it sums the generation probabilities across all \(K\) documents (weighted by retriever scores), allowing a single output sentence to draw facts from multiple documents. For instance, when generating a Jeopardy clue about Hemingway, the model might consult one document for “The Sun Also Rises” and another for “A Farewell to Arms.”
Training. The training objective is straightforward: minimize the negative log-likelihood of the correct output \(y\) given the input \(x\), where the likelihood is computed by marginalizing over the top-\(K\) retrieved documents. Crucially, there is no supervision on which documents should be retrieved – the system learns this end-to-end through backpropagation. The document encoder and index are kept fixed during training (updating them would require periodically rebuilding the entire 21-million-document index). Only the query encoder and the BART generator parameters are fine-tuned.
Decoding. At test time, RAG-Token decoding is straightforward: the per-token marginalized probabilities can be plugged directly into standard beam search (a decoding strategy that tracks the top-\(b\) most probable partial sequences at each step, expanding all of them before pruning). RAG-Sequence decoding is more complex because the likelihood does not decompose into independent per-token terms. The paper proposes “Thorough Decoding” (run beam search per document, then re-score hypotheses across all documents) and “Fast Decoding” (approximate by assuming zero probability for hypotheses not generated from a given document).
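The "Fast Decoding" idea can be sketched with toy numbers. Assuming each document's beam search has already produced a few hypotheses with per-document probabilities, the sketch unions the hypotheses and scores each one, treating a hypothesis as having zero probability under any document whose beam did not produce it (the `fast_decode` helper and all values below are illustrative, not the paper's implementation):

```python
def fast_decode(doc_priors, beams):
    """RAG-Sequence 'Fast Decoding' sketch: union the per-document beam
    hypotheses, then score each candidate y as sum_z p(z|x) * p(y|x, z),
    where p(y|x, z) is taken to be 0 if document z's beam never produced y."""
    candidates = set()
    for beam in beams:
        candidates.update(beam)
    scored = {y: sum(p_z * beam.get(y, 0.0)
                     for p_z, beam in zip(doc_priors, beams))
              for y in candidates}
    best = max(scored, key=scored.get)
    return best, scored

# Two retrieved documents with retriever probabilities p(z|x),
# and each document's beam as a map: hypothesis -> p(y|x, z).
doc_priors = [0.6, 0.4]
beams = [{"canberra": 0.7, "sydney": 0.2},
         {"canberra": 0.5, "melbourne": 0.3}]
best, scores = fast_decode(doc_priors, beams)
print(best)  # "canberra": 0.6*0.7 + 0.4*0.5 = 0.62 beats both alternatives
```

"Thorough Decoding" would instead run an extra forward pass to compute the true \(p(y|x,z)\) for every candidate-document pair, at greater cost.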
\[p_{\text{RAG-Sequence}}(y|x) \approx \sum_{z \in \text{top-}k(p_\eta(\cdot|x))} p_\eta(z|x) \prod_{i=1}^{N} p_\theta(y_i|x, z, y_{1:i-1})\]
In plain language: for each of the top-\(K\) documents, the generator produces the entire output sequence, and we compute the probability of that sequence. We then take a weighted average across documents, where the weights come from the retriever. This is like asking \(K\) different “experts” (each reading a different document) to each write a complete answer, then blending their responses by how relevant their source document was.
This matters because it lets the model consider multiple possible evidence sources while still producing a single coherent answer. The approximation (using only top-\(K\) instead of all 21 million documents) makes the computation tractable.
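A toy numeric sketch of the RAG-Sequence marginalization (the priors and token probabilities below are made up for illustration, not values from the paper):

```python
import math

def rag_sequence_prob(doc_priors, token_probs):
    """p(y|x) ~ sum_z p(z|x) * prod_i p(y_i|x, z, y_<i): each document
    generates the whole sequence; the sequence-level probabilities are
    then averaged with the retriever's weights."""
    return sum(p_z * math.prod(probs)
               for p_z, probs in zip(doc_priors, token_probs))

doc_priors = [0.6, 0.4]              # p(z|x) for K=2 retrieved documents
token_probs = [[0.9, 0.8, 0.7],      # per-token p(y_i|x, z_1, y_<i)
               [0.5, 0.5, 0.9]]      # per-token p(y_i|x, z_2, y_<i)
print(rag_sequence_prob(doc_priors, token_probs))
# 0.6 * (0.9*0.8*0.7) + 0.4 * (0.5*0.5*0.9) = 0.3024 + 0.09 = 0.3924
```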
\[p_{\text{RAG-Token}}(y|x) \approx \prod_{i=1}^{N} \sum_{z \in \text{top-}k(p_\eta(\cdot|x))} p_\eta(z|x) p_\theta(y_i|x, z, y_{1:i-1})\]
The critical difference from RAG-Sequence is the order of the product and sum. Here, for each token position \(i\), we first compute a weighted average of the generation probabilities across all \(K\) documents, then multiply these per-token marginals together. This means that token \(y_1\) might be most influenced by document 3, while token \(y_5\) might draw primarily from document 1.
This matters because it allows a single generated sentence to synthesize information from multiple documents. For example, consider generating the sentence “The capital of France is Paris, located on the Seine river.” The word “Paris” might be most strongly supported by one Wikipedia article, while “Seine river” might be supported by a different article.
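The swapped order of sum and product can be seen in a toy numeric sketch (all values below are made up for illustration):

```python
def rag_token_prob(doc_priors, token_probs):
    """p(y|x) ~ prod_i sum_z p(z|x) * p(y_i|x, z, y_<i): marginalize over
    documents at every token position, then multiply the per-token marginals."""
    total = 1.0
    for i in range(len(token_probs[0])):
        total *= sum(p_z * probs[i]
                     for p_z, probs in zip(doc_priors, token_probs))
    return total

doc_priors = [0.6, 0.4]              # p(z|x) for K=2 retrieved documents
token_probs = [[0.9, 0.8, 0.7],      # per-token p(y_i|x, z_1, y_<i)
               [0.5, 0.5, 0.9]]      # per-token p(y_i|x, z_2, y_<i)
print(rag_token_prob(doc_priors, token_probs))
# per-token marginals 0.74, 0.68, 0.78; product = 0.392496
```

Note that the third token's marginal (0.78) exceeds what either document alone assigns under RAG-Sequence weighting – each position is free to lean on whichever document supports it best.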
\[p_\eta(z|x) \propto \exp\left(d(z)^\top q(x)\right)\]
\[d(z) = \text{BERT}_d(z), \quad q(x) = \text{BERT}_q(x)\]
In plain language: encode the question and document each as a vector in the same space, compute their dot product as a similarity score, then convert scores to probabilities using the softmax function (exponentiate and normalize). Documents whose vectors point in a similar direction to the query vector get high probability.
To make this concrete, suppose \(q(x) = [0.3, 0.8, 0.1]\) and \(d(z) = [0.4, 0.7, 0.2]\). The dot product is \(0.3 \times 0.4 + 0.8 \times 0.7 + 0.1 \times 0.2 = 0.12 + 0.56 + 0.02 = 0.70\). Then \(\exp(0.70) \approx 2.01\). If another document gives a dot product of 0.3, then \(\exp(0.3) \approx 1.35\), and the first document would receive probability \(2.01 / (2.01 + 1.35) \approx 0.60\).
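The worked example can be checked directly in code (the `softmax` helper is just the standard definition; the second document is represented only by its dot-product score of 0.3, as in the text):

```python
import math

def softmax(scores):
    """Exponentiate and normalize: p(z|x) proportional to exp(d(z)^T q(x))."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

q = [0.3, 0.8, 0.1]
d = [0.4, 0.7, 0.2]
score1 = sum(a * b for a, b in zip(q, d))  # 0.12 + 0.56 + 0.02 = 0.70
probs = softmax([score1, 0.30])            # second document's score from the text
print(round(probs[0], 2))  # exp(0.70) / (exp(0.70) + exp(0.30)) ≈ 0.6
```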
This matters because it makes retrieval differentiable – the query encoder can be updated through gradient descent to produce better queries, learning to retrieve documents that actually help the generator produce correct answers.
\[\mathcal{L} = \sum_j -\log p(y_j|x_j)\]
In plain language: for each training example, compute how much probability the model assigns to the correct answer (after marginalizing over retrieved documents), take the negative logarithm (so higher probability means lower loss), and sum across all examples. Minimize this sum using stochastic gradient descent with the Adam optimizer.
This matters because it trains the retriever and generator jointly. Gradients flow back through the generator into the query encoder: if retrieving different documents would improve the generator’s output, the query encoder updates to retrieve those documents next time.
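A toy illustration of why differentiable retrieval helps (all scores and probabilities below are made up): manually nudging the retrieval score of the document that best supports the gold answer lowers the marginalized loss, which is exactly the direction gradient descent pushes the query encoder during training.

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def nll(retrieval_scores, seq_probs):
    """-log p(y|x), with p(y|x) = sum_z softmax(scores)_z * p(y|x, z)."""
    priors = softmax(retrieval_scores)
    return -math.log(sum(p * q for p, q in zip(priors, seq_probs)))

# Document 1 supports the gold answer better than document 2.
seq_probs = [0.504, 0.225]        # p(y|x, z) from the generator
base = nll([0.70, 0.30], seq_probs)
# A query encoder that scores document 1 higher reduces the loss:
better = nll([0.80, 0.30], seq_probs)
print(better < base)  # True
```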
\[p'_\theta(y_i|x, y_{1:i-1}) = \sum_{z \in \text{top-}k(p_\eta(\cdot|x))} p_\eta(z|x) p_\theta(y_i|x, z, y_{1:i-1})\]
In plain language: this reformulates RAG-Token as a standard autoregressive model with a modified transition probability. Instead of the generator consulting a single context, it consults \(K\) documents and blends their predictions. Because this has the standard autoregressive form (each token’s probability depends only on previous tokens), standard beam search decoding works directly.
This matters because it shows RAG-Token is compatible with existing efficient decoding algorithms – no special decoding procedure is needed, unlike RAG-Sequence which requires running beam search per document.
RAG achieved state-of-the-art results on four open-domain question answering benchmarks at the time of publication. The table below shows Exact Match scores, where a prediction counts as correct only if it exactly matches the gold answer.
| Model | Natural Questions | TriviaQA | WebQuestions | CuratedTrec |
|---|---|---|---|---|
| T5-11B (closed-book) | 34.5 | 50.1 | 37.4 | - |
| T5-11B+SSM (closed-book) | 36.6 | 60.5 | 44.7 | - |
| REALM (open-book) | 40.4 | - | 40.7 | 46.8 |
| DPR (open-book) | 41.5 | 57.9 | 41.1 | 50.6 |
| RAG-Token | 44.1 | 55.2 | 45.5 | 50.0 |
| RAG-Sequence | 44.5 | 56.8 | 45.2 | 52.2 |
The numbers tell a striking story. RAG-Sequence beat the closed-book T5-11B model on Natural Questions by 10 points (44.5 vs 34.5) despite having only 626 million trainable parameters compared to T5’s 11 billion – roughly 18 times fewer parameters. This demonstrated that giving a model access to external knowledge through retrieval can be far more parameter-efficient than trying to memorize everything. RAG also outperformed DPR, a dedicated extractive QA system that used a BERT cross-encoder re-ranker (a model that reads the query and document together to judge relevance, more accurate but slower than the bi-encoder retriever) and an extractive reader, showing that a general-purpose generative approach could surpass specialized extraction pipelines. Notably, RAG generated correct answers 11.8% of the time on Natural Questions even when the correct answer did not appear verbatim in any retrieved document – something impossible for any extractive system.
For text generation tasks, RAG showed clear qualitative improvements over the BART baseline. On Jeopardy question generation, human evaluators found RAG more factual than BART in 42.7% of pairwise comparisons versus only 7.1% where BART was more factual. RAG also generated substantially more diverse text: the ratio of distinct trigrams to total trigrams was 53.8% for RAG-Sequence versus 32.4% for BART on Jeopardy generation, approaching the 90.0% diversity of human-written gold questions.
Figure 2: The annotation interface used for human evaluation of factuality. Evaluators compared pairs of generated sentences (one from BART, one from RAG) and judged which was more factually true, using the internet to verify claims. This pairwise evaluation revealed that RAG produced more factual outputs in 42.7% of comparisons versus 7.1% for BART.
On FEVER fact verification, RAG achieved 72.5% accuracy on 3-way classification (supports / refutes / not enough info) without any supervision on which Wikipedia passages constitute evidence. This was within 4.3% of state-of-the-art pipeline systems that used complex domain-specific architectures and intermediate retrieval supervision. The top retrieved document matched a gold evidence article 71% of the time, and a gold article appeared in the top 10 results 90% of the time, showing the retriever learned to find relevant evidence purely through end-to-end training.
Frozen document encoder and index. The document encoder is never updated during fine-tuning, meaning document representations are fixed. If the pre-trained DPR encoder poorly represents certain document types (e.g., highly technical or domain-specific text), the retriever cannot improve its understanding of those documents. Updating the document encoder would require periodically re-encoding all 21 million documents and rebuilding the FAISS index, which the authors found unnecessary for their benchmarks but acknowledged as costly.
Wikipedia-only knowledge source. All experiments used a single December 2018 Wikipedia dump. This means RAG cannot answer questions about events after December 2018, about topics not covered by Wikipedia, or about proprietary/specialized domains (medical records, legal documents, codebases) without building an entirely new index.
Retrieval collapse on certain tasks. The authors observed in preliminary experiments that for tasks like story generation, the retriever would “collapse” – learning to retrieve the same documents regardless of input. When this happened, the generator learned to ignore the documents entirely and RAG degraded to plain BART. This suggests RAG’s retrieval mechanism is not universally beneficial and may fail when tasks do not have an explicit need for factual knowledge.
Fixed top-K retrieval at query time. The same set of \(K\) documents is retrieved once for the entire generation. There is no mechanism for the model to issue follow-up queries as it generates, which would be valuable for multi-hop reasoning (e.g., “What year was the president of the country that won the 2018 World Cup born?” requires first finding the country, then the president, then the birth year).
Computational overhead not thoroughly analyzed. The paper does not provide detailed latency or throughput comparisons. Retrieving from a 21-million-document index and running generation conditioned on \(K\) documents adds significant overhead compared to a parametric-only model. The ~100 GB CPU memory requirement for the uncompressed index (later reduced to 36 GB with compression) limits deployment scenarios.
No evaluation on non-English languages. All experiments use English-language Wikipedia and English benchmarks. Whether the approach generalizes to other languages – particularly low-resource ones with smaller Wikipedia dumps – was not explored.
RAG became one of the most influential ideas in applied NLP and AI systems. The term “Retrieval-Augmented Generation” entered the mainstream vocabulary of the AI community and eventually the broader technology industry. The core insight – that language models perform better when they can look up information rather than relying purely on memorized knowledge – became a foundational design pattern for production AI systems.
The practical impact was enormous. When large language models like GPT-3, GPT-4, and Claude were deployed in real-world applications, RAG-style architectures became the standard way to make them knowledgeable about specific domains, private data, or recent events. Enterprise AI applications routinely use RAG to connect language models to company knowledge bases, product documentation, or legal corpora. The pattern of "retrieve then generate" has become so ubiquitous that RAG pipelines are a first-class feature in major AI platforms and frameworks, including LangChain, LlamaIndex, and the HuggingFace ecosystem (which open-sourced the original RAG implementation).
The paper also catalyzed research on the boundary between parametric and non-parametric knowledge. Follow-up work explored improvements to each component: better retrievers (ColBERT, Contriever), better ways to integrate retrieved context (Fusion-in-Decoder, RETRO), and better ways to decide when retrieval is needed (Self-RAG). The index hot-swapping experiment – showing that RAG’s knowledge could be updated by simply swapping the document index without retraining – was an early demonstration of a property that became increasingly important as people sought to keep AI systems current and factual.
To understand this paper, you need:
Builds on Transformers (see Attention Is All You Need): RAG’s generator (BART) is a transformer encoder-decoder, and both BERT encoders in the retriever use the transformer architecture. The entire system is built on the attention mechanism and the encoder-decoder framework introduced in the Transformers paper.
Builds on BERT (see BERT): The retriever uses two BERT-base models – one as the query encoder and one as the document encoder. The pre-trained bidirectional representations from BERT are what make dense passage retrieval possible, converting variable-length text into fixed-size vectors that can be compared with dot products.
Builds on GPT-style language models (see Improving Language Understanding by Generative Pre-Training): RAG was partly motivated by observations that GPT-style models store factual knowledge in their parameters but struggle with accuracy and updatability. RAG provides an alternative: instead of scaling up parametric memory (as in T5-11B), augment a smaller model with non-parametric retrieval.
Relates to scaling laws (see Scaling Laws for Neural Language Models): RAG demonstrated that retrieval augmentation could outperform a model with 18 times more parameters (626M vs 11B), suggesting that not all “intelligence” needs to come from scaling model size – some can come from better access to external information.
Relates to ViT (see An Image is Worth 16x16 Words): While ViT focuses on vision, both papers share the principle of adapting the transformer architecture to new domains. RAG adapts it for retrieval-augmented generation; ViT adapts it for image recognition. Both demonstrate the versatility of the transformer as a general-purpose computational primitive.
Later work builds on RAG: The RAG pattern became foundational for InstructGPT-style systems and modern chatbots that use retrieval to ground their responses. Techniques like RETRO, Fusion-in-Decoder, and Self-RAG all extend the core idea. Parameter-efficient fine-tuning methods (like those in the LoRA paper) are often combined with RAG in practice – you can use LoRA to cheaply adapt the generator while RAG provides the knowledge retrieval.