Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Authors: Patrick Lewis, Ethan Perez, Aleksandra Piktus, et al. · Year: 2020 · Source: arXiv:2005.11401

One-Sentence Summary

RAG gives a text-generation model the ability to look up relevant Wikipedia passages before answering, combining a search engine with a language generator so the system can produce responses grounded in real-world facts rather than relying solely on memorized knowledge.

Problem Statement

By 2020, large pre-trained language models like GPT-2 and T5 had demonstrated an impressive ability to answer factual questions by drawing on knowledge absorbed during training. The model parameters themselves acted as a kind of implicit database: ask “Who wrote Hamlet?” and the model could produce “William Shakespeare” purely from patterns learned over billions of words. But this parametric knowledge had three serious problems.

First, the knowledge was frozen at training time. If a new president took office or a scientific discovery was published after training, the model had no way to learn about it without expensive retraining. Second, there was no way to inspect where an answer came from. When a model said “The capital of Australia is Sydney” (incorrect – it is Canberra), you could not point to the source it relied on, making it hard to debug or trust. Third, the model could “hallucinate” – confidently generate plausible-sounding but entirely fabricated facts, because its generation process had no mechanism to verify claims against any external source.

Meanwhile, a parallel line of work on retrieval-based models like REALM and ORQA had shown that pairing a language model with a document retriever could help with extractive question answering (where the answer is copied verbatim from a retrieved passage). But these systems were limited to extraction – they could highlight a span in a document but could not synthesize a free-form answer, verify a claim, or generate a Jeopardy question. No one had built a general-purpose system that combined retrieval with open-ended text generation across many different task types.

Key Innovation

Think of the difference between a closed-book exam and an open-book exam. In a closed-book exam, you must answer every question from memory alone. If you misremember a fact, you give the wrong answer and have no way to check yourself. In an open-book exam, you can flip through a reference book to find relevant passages before writing your answer. You still need to understand the question and compose a coherent response, but you can ground your answer in actual source material.

RAG is an open-book exam for language models. Given a question (or any input text), RAG first searches a large corpus – in this case, all of Wikipedia split into 21 million 100-word passages – to find the most relevant documents. It then feeds both the original question and the retrieved documents to a text generator – BART, a sequence-to-sequence transformer (see Attention Is All You Need) – which produces the final answer. The retriever and generator are trained together end-to-end, meaning the system learns what to look up based on what helps it produce better answers.

The key technical insight is treating the retrieved documents as latent variables – hidden choices that the model considers but that are not directly supervised. The paper proposes two ways to combine evidence from multiple retrieved documents: RAG-Sequence, which picks one document and generates the entire answer from it (then averages across documents), and RAG-Token, which can consult a different document for each word it generates. This latent-variable formulation lets the system learn to retrieve useful documents without anyone having to label which document is “correct” for each question.
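The retrieve-then-generate loop can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the helper names (`encode_query`, `search_index`, `generate`) are hypothetical stand-ins for DPR's query encoder, the MIPS index, and BART.

```python
# Minimal sketch of the retrieve-then-generate loop. The helper names are
# hypothetical; in RAG they correspond to the DPR query encoder, the FAISS
# document index, and the BART generator, trained end-to-end.
def rag_answer(question, encode_query, search_index, generate, k=5):
    """Retrieve top-k passages, then generate one candidate per passage."""
    query_vec = encode_query(question)               # dense query embedding
    passages, scores = search_index(query_vec, k)    # MIPS over the corpus
    # The generator sees the question concatenated with each retrieved passage;
    # the retriever scores are later used to weight the candidates.
    candidates = [generate(question + " [SEP] " + p) for p in passages]
    return candidates, scores
```

How the per-passage candidates are blended into one output distribution is exactly the RAG-Sequence vs. RAG-Token question discussed below.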

Architecture / Method

RAG has two main components that work together: a retriever that finds relevant documents and a generator that produces text conditioned on those documents.

RAG architecture overview showing the query encoder and document index feeding retrieved documents to the generator

Figure 1: Overview of the RAG approach. A pre-trained retriever (query encoder + document index) finds the top-K relevant documents using Maximum Inner Product Search (MIPS). These documents are concatenated with the input and passed to a pre-trained seq2seq generator (BART). The retrieved documents are treated as latent variables, marginalized over during generation.

The Retriever (Dense Passage Retriever / DPR). The retriever uses two separate BERT encoders (see BERT). One encoder converts the input query \(x\) into a vector \(q(x)\), and the other pre-computes a vector \(d(z)\) for every document \(z\) in the corpus. To find the most relevant documents for a query, the system computes the dot product between the query vector and every document vector, then selects the top-\(K\) documents with the highest scores. This is a Maximum Inner Product Search (MIPS) problem, solved efficiently using the FAISS library (Facebook AI Similarity Search, a library for fast nearest-neighbor lookup in high-dimensional vector spaces) with approximate nearest-neighbor indexing so that searching 21 million documents takes sub-linear time rather than requiring a brute-force scan.
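The MIPS step is easy to see with a brute-force scan over a toy corpus (the vectors below are invented). FAISS's approximate indexes replace this exact scan so that searching 21 million passages stays sub-linear:

```python
import numpy as np

# Toy corpus of 5 document vectors and one query vector (dimension 3).
# In DPR these come from two BERT encoders; here they are made up.
doc_vecs = np.array([[0.4, 0.7, 0.2],
                     [0.9, 0.1, 0.0],
                     [0.3, 0.8, 0.1],
                     [0.0, 0.2, 0.9],
                     [0.5, 0.5, 0.5]])
query = np.array([0.3, 0.8, 0.1])

# Brute-force MIPS: dot product of the query with every document vector,
# then keep the top-K highest scores. FAISS performs this search with an
# approximate nearest-neighbor index instead of a full scan.
scores = doc_vecs @ query
top_k = np.argsort(-scores)[:2]   # indices of the 2 best-scoring documents
```

With these numbers, documents 2 and 0 score highest (0.74 and 0.70) and would be handed to the generator.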

The Generator (BART). The generator is BART-large, a 400-million-parameter encoder-decoder transformer. For each retrieved document \(z\), the model concatenates the original input \(x\) with the document text \(z\) and feeds this combined input to the encoder. The decoder then generates the output sequence token by token, conditioned on the encoded representation. BART was pre-trained with a denoising objective – learning to reconstruct text that has been corrupted with various noise functions – giving it strong general-purpose generation abilities.

Two Marginalization Strategies. Since the model retrieves \(K\) documents, it needs a way to combine the \(K\) different generation results into a single output distribution. RAG-Sequence treats each document as a complete “hypothesis source”: it generates the full output sequence conditioned on each document separately, then weights each sequence by the retriever’s probability for that document and sums. RAG-Token is more fine-grained: for each output token position, it sums the generation probabilities across all \(K\) documents (weighted by retriever scores), allowing a single output sentence to draw facts from multiple documents. For instance, when generating a Jeopardy clue about Hemingway, the model might consult one document for “The Sun Also Rises” and another for “A Farewell to Arms.”

Training. The training objective is straightforward: minimize the negative log-likelihood of the correct output \(y\) given the input \(x\), where the likelihood is computed by marginalizing over the top-\(K\) retrieved documents. Crucially, there is no supervision on which documents should be retrieved – the system learns this end-to-end through backpropagation. The document encoder and index are kept fixed during training (updating them would require periodically rebuilding the entire 21-million-document index). Only the query encoder and the BART generator parameters are fine-tuned.

Decoding. At test time, RAG-Token decoding is straightforward: the per-token marginalized probabilities can be plugged directly into standard beam search (a decoding strategy that tracks the top-\(b\) most probable partial sequences at each step, expanding all of them before pruning). RAG-Sequence decoding is more complex because the likelihood does not decompose into independent per-token terms. The paper proposes “Thorough Decoding” (run beam search per document, then re-score hypotheses across all documents) and “Fast Decoding” (approximate by assuming zero probability for hypotheses not generated from a given document).

Mathematical Foundations

1. RAG-Sequence: Per-Sequence Marginalization

\[p_{\text{RAG-Sequence}}(y|x) \approx \sum_{z \in \text{top-}k(p(\cdot|x))} p_\eta(z|x) \prod_{i=1}^{N} p_\theta(y_i|x, z, y_{1:i-1})\]

In plain language: for each of the top-\(K\) documents, the generator produces the entire output sequence, and we compute the probability of that sequence. We then take a weighted average across documents, where the weights come from the retriever. This is like asking \(K\) different “experts” (each reading a different document) to each write a complete answer, then blending their responses by how relevant their source document was.

This matters because it lets the model consider multiple possible evidence sources while still producing a single coherent answer. The approximation (using only top-\(K\) instead of all 21 million documents) makes the computation tractable.
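A tiny numeric example makes the order of operations concrete. All probabilities below are invented for illustration (K = 2 documents, a 3-token output):

```python
import numpy as np

# Toy check of the RAG-Sequence marginal. All numbers are invented.
p_retrieve = np.array([0.6, 0.4])        # p_eta(z|x) for the two documents
# p_theta(y_i | x, z, y_<i): rows = documents, columns = token positions
p_tokens = np.array([[0.9, 0.8, 0.7],
                     [0.5, 0.6, 0.4]])

# Product over tokens first (whole sequence per document),
# then a retriever-weighted sum across documents.
per_doc_seq = p_tokens.prod(axis=1)                # [0.504, 0.120]
p_rag_sequence = (p_retrieve * per_doc_seq).sum()  # 0.3504
```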

2. RAG-Token: Per-Token Marginalization

\[p_{\text{RAG-Token}}(y|x) \approx \prod_{i=1}^{N} \sum_{z \in \text{top-}k(p(\cdot|x))} p_\eta(z|x) p_\theta(y_i|x, z, y_{1:i-1})\]

The critical difference from RAG-Sequence is the order of the product and sum. Here, for each token position \(i\), we first compute a weighted average of the generation probabilities across all \(K\) documents, then multiply these per-token marginals together. This means that token \(y_1\) might be most influenced by document 3, while token \(y_5\) might draw primarily from document 1.

This matters because it allows a single generated sentence to synthesize information from multiple documents. For example, consider generating the sentence “The capital of France is Paris, located on the Seine river.” The word “Paris” might be most strongly supported by one Wikipedia article, while “Seine river” might be supported by a different article.
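Using the same invented numbers as a RAG-Sequence walk-through would (K = 2 documents, 3 tokens) shows that swapping the sum and product genuinely changes the result:

```python
import numpy as np

# Same toy numbers, RAG-Token order: sum across documents per token first,
# then product over token positions. All probabilities are invented.
p_retrieve = np.array([0.6, 0.4])
p_tokens = np.array([[0.9, 0.8, 0.7],
                     [0.5, 0.6, 0.4]])

per_token = p_retrieve @ p_tokens   # [0.74, 0.72, 0.58]
p_rag_token = per_token.prod()      # 0.309024
```

With these numbers the RAG-Token marginal (0.309) differs from the RAG-Sequence marginal on the same inputs (0.3504): the two strategies are distinct models, not two ways of computing one quantity.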

3. Retriever Probability (DPR Bi-Encoder)

\[p_\eta(z|x) \propto \exp\left(d(z)^\top q(x)\right)\]

\[d(z) = \text{BERT}_d(z), \quad q(x) = \text{BERT}_q(x)\]

In plain language: encode the question and document each as a vector in the same space, compute their dot product as a similarity score, then convert scores to probabilities using the softmax function (exponentiate and normalize). Documents whose vectors point in a similar direction to the query vector get high probability.

To make this concrete, suppose \(q(x) = [0.3, 0.8, 0.1]\) and \(d(z) = [0.4, 0.7, 0.2]\). The dot product is \(0.3 \times 0.4 + 0.8 \times 0.7 + 0.1 \times 0.2 = 0.12 + 0.56 + 0.02 = 0.70\). Then \(\exp(0.70) \approx 2.01\). If another document gives a dot product of 0.3, then \(\exp(0.3) \approx 1.35\), and the first document would receive probability \(2.01 / (2.01 + 1.35) \approx 0.60\).

This matters because it makes retrieval differentiable – the query encoder can be updated through gradient descent to produce better queries, learning to retrieve documents that actually help the generator produce correct answers.
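The worked example above can be checked in a couple of lines:

```python
import math

# Reproduce the worked example: two documents whose dot products with the
# query are 0.70 and 0.30, converted to probabilities via softmax.
s1, s2 = 0.70, 0.30
e1, e2 = math.exp(s1), math.exp(s2)   # approx. 2.01 and 1.35
p1 = e1 / (e1 + e2)                   # softmax probability of document 1
print(round(p1, 2))                   # 0.6
```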

4. Training Objective

\[\mathcal{L} = \sum_j -\log p(y_j|x_j)\]

In plain language: for each training example, compute how much probability the model assigns to the correct answer (after marginalizing over retrieved documents), take the negative logarithm (so higher probability means lower loss), and sum across all examples. Minimize this sum using stochastic gradient descent with the Adam optimizer.

This matters because it trains the retriever and generator jointly. Gradients flow back through the generator into the query encoder: if retrieving different documents would improve the generator’s output, the query encoder updates to retrieve those documents next time.
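In code, the objective is just a summed negative log-likelihood over the batch; the marginal probabilities below are invented stand-ins for the quantities the two equations above would produce:

```python
import math

# Toy batch of three examples. Each value is p(y_j|x_j) AFTER marginalizing
# over the top-K retrieved documents (numbers invented for illustration).
marginal_probs = [0.35, 0.80, 0.10]

# Negative log-likelihood, summed over the batch; in RAG this is minimized
# with Adam, with gradients flowing into BART and the query encoder only.
loss = sum(-math.log(p) for p in marginal_probs)
```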

5. RAG-Token Decoding Transition

\[p'_\theta(y_i|x, y_{1:i-1}) = \sum_{z \in \text{top-}k(p(\cdot|x))} p_\eta(z|x) p_\theta(y_i|x, z, y_{1:i-1})\]

In plain language: this reformulates RAG-Token as a standard autoregressive model with a modified transition probability. Instead of the generator consulting a single context, it consults \(K\) documents and blends their predictions. Because this has the standard autoregressive form (each token’s probability depends only on previous tokens), standard beam search decoding works directly.

This matters because it shows RAG-Token is compatible with existing efficient decoding algorithms – no special decoding procedure is needed, unlike RAG-Sequence which requires running beam search per document.
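One decoding step makes the point concrete. With invented per-document vocabulary distributions (K = 2 documents, vocabulary of 3 tokens), the marginalized transition is itself a valid distribution, so greedy or beam search can use it directly:

```python
import numpy as np

# One decoding step of RAG-Token. All probabilities are invented.
p_retrieve = np.array([0.6, 0.4])       # p_eta(z|x)
p_vocab = np.array([[0.7, 0.2, 0.1],    # p_theta(. | x, z=doc1, y_<i)
                    [0.1, 0.3, 0.6]])   # p_theta(. | x, z=doc2, y_<i)

# p'_theta(. | x, y_<i): a single well-formed distribution over the
# vocabulary, usable by any standard autoregressive decoder.
p_prime = p_retrieve @ p_vocab          # [0.46, 0.24, 0.30]
next_token = int(np.argmax(p_prime))    # greedy pick: token 0
```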

Results

RAG achieved state-of-the-art results on four open-domain question answering benchmarks at the time of publication. The table below shows Exact Match scores, where a prediction counts as correct only if it exactly matches the gold answer.

| Model | Natural Questions | TriviaQA | WebQuestions | CuratedTrec |
|---|---|---|---|---|
| T5-11B (closed-book) | 34.5 | 50.1 | 37.4 | – |
| T5-11B+SSM (closed-book) | 36.6 | 60.5 | 44.7 | – |
| REALM (open-book) | 40.4 | – | 40.7 | 46.8 |
| DPR (open-book) | 41.5 | 57.9 | 41.1 | 50.6 |
| RAG-Token | 44.1 | 55.2 | 45.5 | 50.0 |
| RAG-Sequence | 44.5 | 56.8 | 45.2 | 52.2 |

The numbers tell a striking story. RAG-Sequence beat the closed-book T5-11B model on Natural Questions by 10 points (44.5 vs 34.5) despite having only 626 million trainable parameters compared to T5’s 11 billion – roughly 18 times fewer parameters. This demonstrated that giving a model access to external knowledge through retrieval can be far more parameter-efficient than trying to memorize everything. RAG also outperformed DPR, a dedicated extractive QA system that used a BERT cross-encoder re-ranker (a model that reads the query and document together to judge relevance, more accurate but slower than the bi-encoder retriever) and an extractive reader, showing that a general-purpose generative approach could surpass specialized extraction pipelines. Notably, RAG generated correct answers 11.8% of the time on Natural Questions even when the correct answer did not appear verbatim in any retrieved document – something impossible for any extractive system.

For text generation tasks, RAG showed clear qualitative improvements over the BART baseline. On Jeopardy question generation, human evaluators found RAG more factual than BART in 42.7% of pairwise comparisons versus only 7.1% where BART was more factual. RAG also generated substantially more diverse text: the ratio of distinct trigrams to total trigrams was 53.8% for RAG-Sequence versus 32.4% for BART on Jeopardy generation, approaching the 90.0% diversity of human-written gold questions.

Human evaluation annotation interface for factuality assessment

Figure 2: The annotation interface used for human evaluation of factuality. Evaluators compared pairs of generated sentences (one from BART, one from RAG) and judged which was more factually true, using the internet to verify claims. This pairwise evaluation revealed that RAG produced more factual outputs in 42.7% of comparisons versus 7.1% for BART.

On FEVER fact verification, RAG achieved 72.5% accuracy on 3-way classification (supports / refutes / not enough info) without any supervision on which Wikipedia passages constitute evidence. This was within 4.3% of state-of-the-art pipeline systems that used complex domain-specific architectures and intermediate retrieval supervision. The top retrieved document matched a gold evidence article 71% of the time, and a gold article appeared in the top 10 results 90% of the time, showing the retriever learned to find relevant evidence purely through end-to-end training.

Limitations

Impact and Legacy

RAG became one of the most influential ideas in applied NLP and AI systems. The term “Retrieval-Augmented Generation” entered the mainstream vocabulary of the AI community and eventually the broader technology industry. The core insight – that language models perform better when they can look up information rather than relying purely on memorized knowledge – became a foundational design pattern for production AI systems.

The practical impact was enormous. When large language models like GPT-3, GPT-4, and Claude were deployed in real-world applications, RAG-style architectures became the standard way to make them knowledgeable about specific domains, private data, or recent events. Enterprise AI applications routinely use RAG to connect language models to company knowledge bases, product documentation, or legal corpora. The pattern of “retrieve then generate” has become so ubiquitous that RAG pipelines are a first-class feature in every major AI platform and framework, including LangChain, LlamaIndex, and the HuggingFace ecosystem (which open-sourced the original RAG implementation).

The paper also catalyzed research on the boundary between parametric and non-parametric knowledge. Follow-up work explored improvements to each component: better retrievers (ColBERT, Contriever), better ways to integrate retrieved context (Fusion-in-Decoder, RETRO), and better ways to decide when retrieval is needed (Self-RAG). The index hot-swapping experiment – showing that RAG’s knowledge could be updated by simply swapping the document index without retraining – was an early demonstration of a property that became increasingly important as people sought to keep AI systems current and factual.

Prerequisites

To understand this paper, you need:

Connections