MemGPT: Towards LLMs as Operating Systems

Authors: Charles Packer, Sarah Wooders, Kevin Lin et al. Year: 2023 Source: arXiv 2310.08560

One-Sentence Summary

MemGPT borrows the idea of virtual memory from operating systems – where a computer creates the illusion of having more RAM than it physically has by swapping data to disk – and applies it to large language models, letting them work with far more information than their fixed-size input window can hold at once.

Problem Statement

Large language models process text through a fixed-size input called a context window (the block of text the model can “see” at one time). In late 2023, the most widely used open-source models had context windows of just 2,000 to 8,000 tokens – roughly 20 to 140 back-and-forth messages. Even commercial models topped out around 128,000 tokens. This hard ceiling creates two concrete problems.

First, in document analysis, real-world documents regularly exceed these limits. A corporate annual report (SEC Form 10-K) can surpass one million tokens. An LLM simply cannot read such a document because it does not fit in the input window. Second, in long-running conversations, a chatbot forgets everything that has scrolled past its context boundary. After a few dozen messages, the model has no memory of what was discussed earlier. It cannot maintain consistency (“You told me last week you broke up with James”) or personalize responses based on long-term knowledge.

The obvious fix – making context windows larger – faces two barriers. Computationally, the self-attention mechanism in transformers (see Attention Is All You Need) scales quadratically with sequence length: doubling the context quadruples the memory and compute cost. Empirically, research by Liu et al. (2023) showed that even when models have large contexts, they struggle to use information placed in the middle of the window – a phenomenon known as “lost in the middle” (see Lost in the Middle). Simply scaling context is neither efficient nor effective.

Key Innovation

Think about how your computer handles running ten programs at once even though its physical RAM can only hold three of them. The operating system creates the illusion of unlimited memory through a technique called virtual memory: it keeps the active programs in RAM and quietly swaps the inactive ones out to the hard drive. When you switch to a background program, the OS “pages” its data back into RAM and pushes something else out. You never notice the swap – to you, it looks like every program has all the memory it needs.

MemGPT applies this same trick to LLM context windows. The LLM’s context window plays the role of RAM (main memory), and external databases play the role of the hard drive (disk storage). The system gives the LLM a set of function calls – analogous to OS system calls – that let it read data from external storage into its context, write important facts out to persistent storage before they get evicted, and search through its stored history. The LLM itself decides when and what to page in or out, much like an OS manages page faults automatically.

What makes this different from prior retrieval-augmented approaches (see Retrieval-Augmented Generation) is that retrieval in MemGPT is self-directed. In standard RAG, an external pipeline fetches documents and stuffs them into the prompt before the model runs. The model has no say in what gets retrieved or when. In MemGPT, the LLM is both the processor and the memory manager: it actively decides to search, page through results, save information, and update its working notes. This turns retrieval from a one-shot preprocessing step into an iterative, multi-step reasoning process.

Architecture / Method

MemGPT structures the LLM’s accessible information into a two-tier hierarchy, directly mirroring how operating systems separate fast main memory from slower disk storage.

Figure 3: The MemGPT architecture. Main context (inside the LLM’s token limit) is divided into system instructions, working context, and a FIFO message queue. The function executor handles tool calls, and the queue manager enforces memory pressure warnings and eviction. Archival storage and recall storage provide unbounded external context accessible through function calls.

Main context is everything inside the LLM’s prompt tokens – the data the model can directly “see” during a single inference pass. Main context is divided into three contiguous sections: (1) System instructions, a read-only block containing the rules for how MemGPT works and descriptions of available functions. (2) Working context, a fixed-size read/write scratchpad where the LLM stores distilled facts – user preferences, persona information, key notes. The LLM can only modify this via explicit function calls. (3) A FIFO queue (first-in, first-out), which holds the rolling conversation history including user messages, system alerts, and function call results. The very first entry in this queue always contains a recursive summary of all messages that have been evicted from the queue, so the model retains a compressed sense of the full history.
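The three-section layout can be sketched as a simple data structure. This is an illustrative sketch, not the paper's implementation; the class and field names are assumptions:

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class MainContext:
    """Sketch of MemGPT's three-section main context (names are illustrative)."""
    system_instructions: str                          # read-only rules + function descriptions
    working_context: str = ""                         # read/write scratchpad, edited only via function calls
    recursive_summary: str = ""                       # compressed summary of evicted messages
    fifo_queue: deque = field(default_factory=deque)  # rolling message history

    def compile_prompt(self) -> str:
        """Flatten the sections into the single prompt the LLM sees at inference time."""
        history = "\n".join(self.fifo_queue)
        return (f"{self.system_instructions}\n"
                f"[WORKING CONTEXT]\n{self.working_context}\n"
                f"[SUMMARY OF EVICTED MESSAGES]\n{self.recursive_summary}\n"
                f"[MESSAGES]\n{history}")

ctx = MainContext(system_instructions="You are MemGPT...")
ctx.fifo_queue.append("user: hello")
prompt = ctx.compile_prompt()
```

The key property this layout captures is that every inference pass sees all three sections concatenated: the static rules, the curated scratchpad, and the recent history with its summary prefix.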

External context lives outside the context window in two databases. Archival storage is a read/write database backed by vector search (using PostgreSQL with pgvector and HNSW indexing for sub-second approximate nearest-neighbor queries). The LLM can insert arbitrary text objects and later search them by semantic similarity. Recall storage is a complete log of every message ever exchanged – the LLM can search it by keyword, date, or content to bring old messages back into the FIFO queue.

The queue manager handles context overflow, playing the role of the OS page replacement policy. When the prompt tokens hit 70% of the context window capacity, the queue manager inserts a “memory pressure” warning – a system alert telling the LLM that eviction is imminent. This gives the LLM a chance to save important information from the FIFO queue into working context or archival storage. When tokens hit 100% capacity, the queue manager flushes the queue: it evicts roughly half the messages, generates a new recursive summary from the old summary plus the evicted messages, and stores the evicted messages in recall storage where they remain searchable.
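The queue manager's threshold logic can be sketched as follows. This is a minimal sketch using the paper's parameters (warn at 70% of capacity, evict roughly half at 100%); the crude character-based token counter and the class interface are assumptions, and a real system would call the LLM to produce the recursive summary:

```python
def token_count(text: str) -> int:
    # Crude stand-in for a real tokenizer: roughly 1 token per 4 characters.
    return max(1, len(text) // 4)

class QueueManager:
    """Sketch of the queue manager: memory-pressure warning at 70%, flush at 100%."""
    def __init__(self, capacity_tokens: int):
        self.capacity = capacity_tokens
        self.queue: list[str] = []
        self.summary = ""                  # recursive summary of evicted messages
        self.recall_storage: list[str] = []

    def used_tokens(self) -> int:
        return token_count(self.summary) + sum(token_count(m) for m in self.queue)

    def append(self, message: str) -> None:
        self.queue.append(message)
        if self.used_tokens() >= self.capacity:
            self.flush()
        elif self.used_tokens() >= 0.7 * self.capacity:
            # A real system would debounce this alert rather than repeat it.
            self.queue.append("[system alert] memory pressure: save important facts now")

    def flush(self) -> None:
        half = len(self.queue) // 2
        evicted, self.queue = self.queue[:half], self.queue[half:]
        self.recall_storage.extend(evicted)   # evicted messages remain searchable
        # In MemGPT this is a separate LLM call: summarize(old summary + evicted messages).
        self.summary = f"(summary of {len(self.recall_storage)} evicted messages)"
```

Appending messages past the capacity triggers an automatic flush, after which the evicted half is recoverable only through recall storage or the summary.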

The function executor parses the LLM’s output as function calls, executes them, and feeds the results back into the context. Available functions include working_context.append(), working_context.replace(), archival_storage.insert(), archival_storage.search(), recall_storage.search(), and a send_message() function to respond to the user. Function chaining allows multi-step operations: the LLM can include a request_heartbeat=true flag in any function call, which tells MemGPT to immediately run another inference cycle after the function completes (rather than waiting for the next user message). This lets the LLM chain together sequences like: search archival storage, page to the next result set, read a specific document, then respond – all within a single user turn.
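The executor-plus-heartbeat mechanism can be sketched like this. The JSON call format and the dispatch table are assumptions for illustration; only the function names follow the paper:

```python
import json

def make_executor(archival: list[str]):
    """Build a toy function executor over an in-memory archival store."""
    functions = {
        "archival_storage.insert": lambda args: archival.append(args["content"]) or "ok",
        "archival_storage.search": lambda args: [t for t in archival if args["query"] in t],
        "send_message":            lambda args: f'(to user) {args["message"]}',
    }

    def execute(llm_output: str) -> tuple[str, bool]:
        """Parse one LLM output as a JSON function call, run it, and report whether
        a heartbeat was requested (i.e. run another inference cycle immediately)."""
        call = json.loads(llm_output)
        result = functions[call["function"]](call.get("args", {}))
        heartbeat = bool(call.get("request_heartbeat", False))
        return f"[function result] {result}", heartbeat

    return execute
```

When the returned heartbeat flag is true, the outer loop feeds the function result back into context and immediately runs another inference pass, which is what lets a single user turn expand into a search-page-read-respond chain.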

Control flow is event-driven. Events that trigger inference include: user messages, system alerts (memory pressure warnings), user interactions (a document upload completing), and scheduled timers (allowing MemGPT to run autonomously without user input). Each event is converted to a plain text message, appended to the FIFO queue, and triggers an LLM inference cycle.
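The event-to-message conversion can be sketched as a small dispatch step; the event kinds follow the paper's list, but the dictionary format and message templates are assumptions:

```python
import queue

def to_message(event: dict) -> str:
    """Convert any event into the plain-text message appended to the FIFO queue."""
    kinds = {
        "user_message": lambda e: f'user: {e["text"]}',
        "system_alert": lambda e: f'[system alert] {e["text"]}',
        "timer":        lambda e: "[timer] scheduled wakeup",
        "upload_done":  lambda e: f'[system] document "{e["name"]}" finished uploading',
    }
    return kinds[event["kind"]](event)

events = queue.Queue()
events.put({"kind": "user_message", "text": "hi"})
events.put({"kind": "timer"})

fifo = []
while not events.empty():
    # Each appended event would trigger one LLM inference cycle.
    fifo.append(to_message(events.get()))
```

The timer event is what allows MemGPT to wake up and act without any user input.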

Mathematical Foundations

MemGPT is a systems architecture paper, not a mathematical one. It introduces no formal loss functions, optimization objectives, or novel mathematical formulations. Its contribution is entirely in system design – the memory hierarchy, control flow, and function-calling protocol. The “math” of the paper reduces to a handful of capacity parameters and threshold conditions that govern the queue manager’s behavior.

The context window capacity \(C\) is the maximum number of tokens the underlying LLM can accept as input. For example, GPT-4 has \(C = 8{,}192\) and GPT-4 Turbo has \(C = 128{,}000\).

The warning threshold is a fraction of capacity – typically \(0.7 \times C\) – at which the queue manager injects a memory pressure alert. For GPT-4, this fires at approximately \(5{,}734\) tokens.

The flush threshold is the hard limit, typically \(1.0 \times C\), at which the queue manager evicts messages. During a flush, approximately \(0.5 \times C\) tokens worth of messages are removed from the FIFO queue. These evicted messages are summarized via a recursive summarization call (a separate LLM inference that produces a compressed version) and stored in recall storage.

The approximate message capacity reported in the paper assumes a system prompt of 1,000 tokens and an average message length of 50 tokens (~250 characters). For a context window of \(C\) tokens:

\[\text{Messages} \approx \frac{C - 1000}{50}\]

where the numerator subtracts the system prompt overhead and the denominator is the average per-message token count. For example, with GPT-4’s \(C = 8{,}192\): Messages \(\approx (8192 - 1000) / 50 \approx 140\).
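The estimate is direct to compute; this helper just encodes the paper's assumptions (1,000-token system prompt, 50 tokens per message):

```python
def approx_message_capacity(context_tokens: int,
                            system_prompt_tokens: int = 1000,
                            tokens_per_message: int = 50) -> int:
    """Messages ≈ (C - system prompt overhead) / average message length."""
    return (context_tokens - system_prompt_tokens) // tokens_per_message

print(approx_message_capacity(2000))   # → 20   (small open-source models)
print(approx_message_capacity(8192))   # → 143  (GPT-4, i.e. roughly 140)
```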

For the cosine similarity search used in archival storage (vector search over embedded text passages), the retrieval mechanism uses:

\[\text{sim}(q, d) = \frac{q \cdot d}{\|q\| \, \|d\|}\]

where \(q\) is the query embedding vector, \(d\) is a document embedding vector, \(q \cdot d\) is the dot product, and \(\|q\|\) and \(\|d\|\) are the L2 norms. MemGPT uses OpenAI’s text-embedding-ada-002 model to produce these embeddings and PostgreSQL’s pgvector extension with an HNSW (Hierarchical Navigable Small World) index for approximate nearest-neighbor search. This is the same similarity metric used in standard RAG pipelines (see Retrieval-Augmented Generation), but in MemGPT the LLM triggers the search itself rather than an external pipeline doing it.
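The similarity metric itself is a few lines; the brute-force `top_k` below is an assumption standing in for what the HNSW index computes approximately without scanning every row:

```python
import math

def cosine_similarity(q: list[float], d: list[float]) -> float:
    """sim(q, d) = (q · d) / (||q|| ||d||)"""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d)

def top_k(query_emb: list[float], docs: list[tuple[str, list[float]]], k: int = 3):
    """Exact nearest neighbors by cosine similarity (an HNSW index approximates this)."""
    return sorted(docs, key=lambda item: cosine_similarity(query_emb, item[1]),
                  reverse=True)[:k]
```

In the real system, `query_emb` and each stored embedding come from text-embedding-ada-002 (1,536 dimensions), and the ranking is performed inside PostgreSQL rather than in Python.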

Results

The paper evaluates MemGPT on two domains: multi-session chat (conversational agents) and document analysis.

Deep Memory Retrieval (DMR): In this task, a conversational agent must answer a specific question about a topic discussed across five prior conversation sessions. The baseline models receive only a lossy summary of past conversations, while MemGPT has access to the full conversation history via paginated search. Results show dramatic improvements:

ROUGE-L measures overlap between the model’s answer and the gold answer by finding the longest common subsequence of words.

Model            Accuracy   ROUGE-L (R)
GPT-3.5 Turbo    38.7%      0.394
  + MemGPT       66.9%      0.629
GPT-4            32.1%      0.296
  + MemGPT       92.5%      0.814
GPT-4 Turbo      35.3%      0.359
  + MemGPT       93.4%      0.827

MemGPT with GPT-4 Turbo achieves 93.4% accuracy versus 35.3% for the baseline – a 58 percentage point improvement. The key insight is that MemGPT’s iterative search through recall storage lets it find specific facts that a compressed summary inevitably loses.

Conversation Openers: In this engagement task, the agent crafts an opening message for a new conversation session that demonstrates knowledge of the user from prior sessions. SIM-\(k\) measures the average cosine similarity between the agent’s message embedding and the \(k\) closest persona label embeddings – higher means the message better captures the user’s known traits – while SIM-H measures similarity to the human-written gold opener (hence the human baseline scores 1.000 by construction). MemGPT-generated openers matched or exceeded human-written openers on similarity to gold persona labels (SIM-1 = 0.868 for GPT-4 vs. 0.800 for humans), while being more verbose and covering more persona details.

Method           SIM-1   SIM-3   SIM-H
Human            0.800   0.800   1.000
GPT-3.5 Turbo    0.830   0.812   0.817
GPT-4            0.868   0.843   0.773
GPT-4 Turbo      0.857   0.828   0.767

Document QA: MemGPT was benchmarked on a retriever-reader task from Liu et al. (2023) using NaturalQuestions-Open (a benchmark of real Google search queries paired with Wikipedia answers) over Wikipedia. Fixed-context baselines are limited by the retriever’s top-\(K\) documents that fit in their context window. MemGPT can iteratively query archival storage and page through results, so its effective context is not bounded by the window size. MemGPT’s accuracy remained stable as the number of available documents grew, while baselines degraded when forced to truncate documents to fit more of them into context.

Nested Key-Value Retrieval: This synthetic task tests multi-hop reasoning: the agent is given a key whose value may itself be another key, requiring chained lookups (0 to 4 nesting levels deep, 140 UUID pairs, ~8k tokens). GPT-3.5 and GPT-4 baselines hit 0% accuracy by 1 and 3 nesting levels respectively. MemGPT with GPT-4 maintained near-perfect accuracy across all nesting levels because it could execute multiple sequential function calls to chain the lookups together.
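The task's multi-hop structure is easy to illustrate. The loop below sketches the chained lookup; note that in MemGPT each hop costs a separate function call (and an inference cycle via heartbeat), which is exactly what the baselines cannot do:

```python
def chained_lookup(kv: dict[str, str], key: str) -> str:
    """Follow values that are themselves keys until reaching a terminal value.
    In the benchmark, keys and values are UUIDs; short strings are used here."""
    while key in kv:
        key = kv[key]   # each hop corresponds to one key-value lookup call
    return key

# Two nesting levels: k1 -> k2 -> k3 -> terminal value
kv = {"k1": "k2", "k2": "k3", "k3": "final value"}
print(chained_lookup(kv, "k1"))   # → final value
```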

Limitations

Impact and Legacy

MemGPT introduced a conceptual framework – treating LLM context management as an operating systems problem – that became widely influential in the AI agent ecosystem. The idea that an LLM should manage its own memory through function calls rather than relying on external preprocessing pipelines anticipated the “agentic” paradigm that dominated AI development in 2024 and beyond. The MemGPT open-source project (later renamed Letta) became one of the most popular frameworks for building stateful AI agents with persistent memory.

The paper’s influence extends beyond its specific implementation. The concepts of working memory (a scratchpad the agent actively maintains), tiered storage (fast in-context vs. slow external databases), and self-directed retrieval (the agent decides when and what to fetch) became standard building blocks in agent frameworks. Systems like LangGraph, CrewAI, and AutoGen adopted variations of these patterns.

More broadly, MemGPT helped shift the field’s thinking about context limitations. Rather than viewing the fixed context window as a fundamental constraint requiring architectural changes to the transformer, MemGPT showed that system-level design around an existing model could achieve much of the same benefit. This “context engineering” approach – carefully managing what goes into the context window rather than simply making the window bigger – became a core practice in production LLM applications. The paper also highlighted the importance of the “lost in the middle” problem identified by Liu et al. (2023) by showing that even with large contexts, careful memory management outperforms naive context stuffing.

Prerequisites

To understand this paper, the reader needs:

Connections