MemGPT: Towards LLMs as Operating Systems

Authors: Charles Packer, Sarah Wooders, Kevin Lin et al. Year: 2023 Source: arXiv 2310.08560

One-Sentence Summary

MemGPT borrows the idea of virtual memory from operating systems – where a computer creates the illusion of having more RAM than it physically has by swapping data to disk – and applies it to large language models, letting them work with far more information than their fixed-size input window can hold at once.

Problem Statement

Large language models process text through a fixed-size input called a context window (the block of text the model can “see” at one time). In late 2023, the most widely used open-source models had context windows of just 2,000 to 8,000 tokens – roughly 20 to 140 back-and-forth messages. Even commercial models topped out around 128,000 tokens. This hard ceiling creates two concrete problems.

First, in document analysis, real-world documents regularly exceed these limits. A corporate annual report (SEC Form 10-K) can surpass one million tokens. An LLM simply cannot read such a document because it does not fit in the input window. Second, in long-running conversations, a chatbot forgets everything that has scrolled past its context boundary. After a few dozen messages, the model has no memory of what was discussed earlier. It cannot maintain consistency (“You told me last week you broke up with James”) or personalize responses based on long-term knowledge.

The obvious fix – making context windows larger – faces two barriers. Computationally, the self-attention mechanism in transformers (see Attention Is All You Need) scales quadratically with sequence length: doubling the context quadruples the memory and compute cost. Empirically, research by Liu et al. (2023) showed that even when models have large contexts, they struggle to use information placed in the middle of the window – a phenomenon known as “lost in the middle” (see Lost in the Middle). Simply scaling context is neither efficient nor effective.

Key Innovation

Think about how your computer handles running ten programs at once even though its physical RAM can only hold three of them. The operating system creates the illusion of unlimited memory through a technique called virtual memory: it keeps the active programs in RAM and quietly swaps the inactive ones out to the hard drive. When you switch to a background program, the OS “pages” its data back into RAM and pushes something else out. You never notice the swap – to you, it looks like every program has all the memory it needs.

MemGPT applies this same trick to LLM context windows. The LLM’s context window plays the role of RAM (main memory), and external databases play the role of the hard drive (disk storage). The system gives the LLM a set of function calls – analogous to OS system calls – that let it read data from external storage into its context, write important facts out to persistent storage before they get evicted, and search through its stored history. The LLM itself decides when and what to page in or out, much like an OS manages page faults automatically.

What makes this different from prior retrieval-augmented approaches (see Retrieval-Augmented Generation) is that retrieval in MemGPT is self-directed. In standard RAG, an external pipeline fetches documents and stuffs them into the prompt before the model runs. The model has no say in what gets retrieved or when. In MemGPT, the LLM is both the processor and the memory manager: it actively decides to search, page through results, save information, and update its working notes. This turns retrieval from a one-shot preprocessing step into an iterative, multi-step reasoning process.

Architecture / Method

MemGPT structures the LLM’s accessible information into a two-tier hierarchy, directly mirroring how operating systems separate fast main memory from slower disk storage.

Figure 3: The MemGPT architecture. Main context (inside the LLM’s token limit) is divided into system instructions, working context, and a FIFO message queue. The function executor handles tool calls, and the queue manager enforces memory pressure warnings and eviction. Archival storage and recall storage provide unbounded external context accessible through function calls.

Main context is everything inside the LLM’s prompt tokens – the data the model can directly “see” during a single inference pass. Main context is divided into three contiguous sections: (1) System instructions, a read-only block containing the rules for how MemGPT works and descriptions of available functions. (2) Working context, a fixed-size read/write scratchpad where the LLM stores distilled facts – user preferences, persona information, key notes. The LLM can only modify this via explicit function calls. (3) A FIFO queue (first-in, first-out), which holds the rolling conversation history including user messages, system alerts, and function call results. The very first entry in this queue always contains a recursive summary of all messages that have been evicted from the queue, so the model retains a compressed sense of the full history.
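The three-section layout can be sketched as a simple data structure. This is an illustrative sketch, not the paper's implementation; the class and field names are assumptions:

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class MainContext:
    """Sketch of MemGPT's three-section main context (names are illustrative)."""
    system_instructions: str                          # read-only rules + function descriptions
    working_context: str = ""                         # read/write scratchpad, edited only via function calls
    recursive_summary: str = ""                       # compressed summary of evicted messages
    fifo_queue: deque = field(default_factory=deque)  # rolling message history

    def compile_prompt(self) -> str:
        """Flatten the sections into the single prompt the LLM sees at inference time."""
        history = "\n".join(self.fifo_queue)
        return (f"{self.system_instructions}\n"
                f"[WORKING CONTEXT]\n{self.working_context}\n"
                f"[SUMMARY OF EVICTED MESSAGES]\n{self.recursive_summary}\n"
                f"[MESSAGES]\n{history}")

ctx = MainContext(system_instructions="You are MemGPT...")
ctx.fifo_queue.append("user: hello")
prompt = ctx.compile_prompt()
```

The key property this layout captures is that every inference pass sees all three sections concatenated: the static rules, the curated scratchpad, and the recent history with its summary prefix.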

External context lives outside the context window in two databases. Archival storage is a read/write database backed by vector search (using PostgreSQL with pgvector and HNSW indexing for sub-second approximate nearest-neighbor queries). The LLM can insert arbitrary text objects and later search them by semantic similarity. Recall storage is a complete log of every message ever exchanged – the LLM can search it by keyword, date, or content to bring old messages back into the FIFO queue.

The queue manager handles context overflow, playing the role of the OS page replacement policy. When the prompt tokens hit 70% of the context window capacity, the queue manager inserts a “memory pressure” warning – a system alert telling the LLM that eviction is imminent. This gives the LLM a chance to save important information from the FIFO queue into working context or archival storage. When tokens hit 100% capacity, the queue manager flushes the queue: it evicts roughly half the messages, generates a new recursive summary from the old summary plus the evicted messages, and stores the evicted messages in recall storage where they remain searchable.
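The queue manager's threshold logic can be sketched as follows. This is a minimal sketch using the paper's parameters (warn at 70% of capacity, evict roughly half at 100%); the crude character-based token counter and the class interface are assumptions, and a real system would call the LLM to produce the recursive summary:

```python
def token_count(text: str) -> int:
    # Crude stand-in for a real tokenizer: roughly 1 token per 4 characters.
    return max(1, len(text) // 4)

class QueueManager:
    """Sketch of the queue manager: memory-pressure warning at 70%, flush at 100%."""
    def __init__(self, capacity_tokens: int):
        self.capacity = capacity_tokens
        self.queue: list[str] = []
        self.summary = ""                  # recursive summary of evicted messages
        self.recall_storage: list[str] = []

    def used_tokens(self) -> int:
        return token_count(self.summary) + sum(token_count(m) for m in self.queue)

    def append(self, message: str) -> None:
        self.queue.append(message)
        if self.used_tokens() >= self.capacity:
            self.flush()
        elif self.used_tokens() >= 0.7 * self.capacity:
            # A real system would debounce this alert rather than repeat it.
            self.queue.append("[system alert] memory pressure: save important facts now")

    def flush(self) -> None:
        half = len(self.queue) // 2
        evicted, self.queue = self.queue[:half], self.queue[half:]
        self.recall_storage.extend(evicted)   # evicted messages remain searchable
        # In MemGPT this is a separate LLM call: summarize(old summary + evicted messages).
        self.summary = f"(summary of {len(self.recall_storage)} evicted messages)"
```

Appending messages past the capacity triggers an automatic flush, after which the evicted half is recoverable only through recall storage or the summary.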

The function executor parses the LLM’s output as function calls, executes them, and feeds the results back into the context. Available functions include working_context.append(), working_context.replace(), archival_storage.insert(), archival_storage.search(), recall_storage.search(), and a send_message() function to respond to the user. Function chaining allows multi-step operations: the LLM can include a request_heartbeat=true flag in any function call, which tells MemGPT to immediately run another inference cycle after the function completes (rather than waiting for the next user message). This lets the LLM chain together sequences like: search archival storage, page to the next result set, read a specific document, then respond – all within a single user turn.
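The executor-plus-heartbeat mechanism can be sketched like this. The JSON call format and the dispatch table are assumptions for illustration; only the function names follow the paper:

```python
import json

def make_executor(archival: list[str]):
    """Build a toy function executor over an in-memory archival store."""
    functions = {
        "archival_storage.insert": lambda args: archival.append(args["content"]) or "ok",
        "archival_storage.search": lambda args: [t for t in archival if args["query"] in t],
        "send_message":            lambda args: f'(to user) {args["message"]}',
    }

    def execute(llm_output: str) -> tuple[str, bool]:
        """Parse one LLM output as a JSON function call, run it, and report whether
        a heartbeat was requested (i.e. run another inference cycle immediately)."""
        call = json.loads(llm_output)
        result = functions[call["function"]](call.get("args", {}))
        heartbeat = bool(call.get("request_heartbeat", False))
        return f"[function result] {result}", heartbeat

    return execute
```

When the returned heartbeat flag is true, the outer loop feeds the function result back into context and immediately runs another inference pass, which is what lets a single user turn expand into a search-page-read-respond chain.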

Control flow is event-driven. Events that trigger inference include: user messages, system alerts (memory pressure warnings), user interactions (a document upload completing), and scheduled timers (allowing MemGPT to run autonomously without user input). Each event is converted to a plain text message, appended to the FIFO queue, and triggers an LLM inference cycle.
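The event-to-message conversion can be sketched as a small dispatch step; the event kinds follow the paper's list, but the dictionary format and message templates are assumptions:

```python
import queue

def to_message(event: dict) -> str:
    """Convert any event into the plain-text message appended to the FIFO queue."""
    kinds = {
        "user_message": lambda e: f'user: {e["text"]}',
        "system_alert": lambda e: f'[system alert] {e["text"]}',
        "timer":        lambda e: "[timer] scheduled wakeup",
        "upload_done":  lambda e: f'[system] document "{e["name"]}" finished uploading',
    }
    return kinds[event["kind"]](event)

events = queue.Queue()
events.put({"kind": "user_message", "text": "hi"})
events.put({"kind": "timer"})

fifo = []
while not events.empty():
    # Each appended event would trigger one LLM inference cycle.
    fifo.append(to_message(events.get()))
```

The timer event is what allows MemGPT to wake up and act without any user input.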

Mathematical Foundations

MemGPT is a systems architecture paper, not a mathematical one. It introduces no formal loss functions, optimization objectives, or novel mathematical formulations. Its contribution is entirely in system design – the memory hierarchy, control flow, and function-calling protocol. The “math” of the paper reduces to a handful of capacity parameters and threshold conditions that govern the queue manager’s behavior.

The context window capacity \(C\) is the maximum number of tokens the underlying LLM can accept as input. For example, GPT-4 has \(C = 8{,}192\) and GPT-4 Turbo has \(C = 128{,}000\).

The warning threshold is a fraction of capacity – typically \(0.7 \times C\) – at which the queue manager injects a memory pressure alert. For GPT-4, this fires at approximately \(5{,}734\) tokens.

The flush threshold is the hard limit, typically \(1.0 \times C\), at which the queue manager evicts messages. During a flush, approximately \(0.5 \times C\) tokens worth of messages are removed from the FIFO queue. These evicted messages are summarized via a recursive summarization call (a separate LLM inference that produces a compressed version) and stored in recall storage.

The approximate message capacity reported in the paper assumes a system prompt of 1,000 tokens and an average message length of 50 tokens (~250 characters). For a context window of \(C\) tokens:

\[\text{Messages} \approx \frac{C - 1000}{50}\]

where the numerator subtracts the system prompt overhead and the denominator is the average per-message token count. For example, with GPT-4’s \(C = 8{,}192\): Messages \(\approx (8192 - 1000) / 50 \approx 140\).
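The estimate is direct to compute; this helper just encodes the paper's assumptions (1,000-token system prompt, 50 tokens per message):

```python
def approx_message_capacity(context_tokens: int,
                            system_prompt_tokens: int = 1000,
                            tokens_per_message: int = 50) -> int:
    """Messages ≈ (C - system prompt overhead) / average message length."""
    return (context_tokens - system_prompt_tokens) // tokens_per_message

print(approx_message_capacity(2000))   # → 20   (small open-source models)
print(approx_message_capacity(8192))   # → 143  (GPT-4, i.e. roughly 140)
```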

For the cosine similarity search used in archival storage (vector search over embedded text passages), the retrieval mechanism uses:

\[\text{sim}(q, d) = \frac{q \cdot d}{\|q\| \, \|d\|}\]

where \(q\) is the query embedding vector, \(d\) is a document embedding vector, \(q \cdot d\) is the dot product, and \(\|q\|\) and \(\|d\|\) are the L2 norms. MemGPT uses OpenAI’s text-embedding-ada-002 model to produce these embeddings and PostgreSQL’s pgvector extension with an HNSW (Hierarchical Navigable Small World) index for approximate nearest-neighbor search. This is the same similarity metric used in standard RAG pipelines (see Retrieval-Augmented Generation), but in MemGPT the LLM triggers the search itself rather than an external pipeline doing it.
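The similarity metric itself is a few lines; the brute-force `top_k` below is an assumption standing in for what the HNSW index computes approximately without scanning every row:

```python
import math

def cosine_similarity(q: list[float], d: list[float]) -> float:
    """sim(q, d) = (q · d) / (||q|| ||d||)"""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d)

def top_k(query_emb: list[float], docs: list[tuple[str, list[float]]], k: int = 3):
    """Exact nearest neighbors by cosine similarity (an HNSW index approximates this)."""
    return sorted(docs, key=lambda item: cosine_similarity(query_emb, item[1]),
                  reverse=True)[:k]
```

In the real system, `query_emb` and each stored embedding come from text-embedding-ada-002 (1,536 dimensions), and the ranking is performed inside PostgreSQL rather than in Python.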

Results

The paper evaluates MemGPT on two domains: multi-session chat (conversational agents) and document analysis.

Deep Memory Retrieval (DMR): In this task, a conversational agent must answer a specific question about a topic discussed across five prior conversation sessions. The baseline models receive only a lossy summary of past conversations, while MemGPT has access to the full conversation history via paginated search. Results show dramatic improvements:

ROUGE-L measures overlap between the model’s answer and the gold answer by finding the longest common subsequence of words.

Model            Accuracy   ROUGE-L (R)
GPT-3.5 Turbo    38.7%      0.394
  + MemGPT       66.9%      0.629
GPT-4            32.1%      0.296
  + MemGPT       92.5%      0.814
GPT-4 Turbo      35.3%      0.359
  + MemGPT       93.4%      0.827

MemGPT with GPT-4 Turbo achieves 93.4% accuracy versus 35.3% for the baseline – a 58 percentage point improvement. The key insight is that MemGPT’s iterative search through recall storage lets it find specific facts that a compressed summary inevitably loses.

Conversation Openers: In this engagement task, the agent crafts an opening message for a new conversation session that demonstrates knowledge of the user from prior sessions. SIM-\(k\) measures the average cosine similarity between the agent’s message embedding and the \(k\) closest persona label embeddings – higher means the message better captures the user’s known traits – while SIM-H measures similarity to the human-written gold opener (hence the human baseline scores 1.000 by construction). MemGPT-generated openers matched or exceeded human-written openers on similarity to gold persona labels (SIM-1 = 0.868 for GPT-4 vs. 0.800 for humans), while being more verbose and covering more persona details.

Method           SIM-1   SIM-3   SIM-H
Human            0.800   0.800   1.000
GPT-3.5 Turbo    0.830   0.812   0.817
GPT-4            0.868   0.843   0.773
GPT-4 Turbo      0.857   0.828   0.767

Document QA: MemGPT was benchmarked on a retriever-reader task from Liu et al. (2023) using NaturalQuestions-Open (a benchmark of real Google search queries paired with Wikipedia answers) over Wikipedia. Fixed-context baselines are limited by the retriever’s top-\(K\) documents that fit in their context window. MemGPT can iteratively query archival storage and page through results, so its effective context is not bounded by the window size. MemGPT’s accuracy remained stable as the number of available documents grew, while baselines degraded when forced to truncate documents to fit more of them into context.

Nested Key-Value Retrieval: This synthetic task tests multi-hop reasoning: the agent is given a key whose value may itself be another key, requiring chained lookups (0 to 4 nesting levels deep, 140 UUID pairs, ~8k tokens). GPT-3.5 and GPT-4 baselines hit 0% accuracy by 1 and 3 nesting levels respectively. MemGPT with GPT-4 maintained near-perfect accuracy across all nesting levels because it could execute multiple sequential function calls to chain the lookups together.
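The task's multi-hop structure is easy to illustrate. The loop below sketches the chained lookup; note that in MemGPT each hop costs a separate function call (and an inference cycle via heartbeat), which is exactly what the baselines cannot do:

```python
def chained_lookup(kv: dict[str, str], key: str) -> str:
    """Follow values that are themselves keys until reaching a terminal value.
    In the benchmark, keys and values are UUIDs; short strings are used here."""
    while key in kv:
        key = kv[key]   # each hop corresponds to one key-value lookup call
    return key

# Two nesting levels: k1 -> k2 -> k3 -> terminal value
kv = {"k1": "k2", "k2": "k3", "k3": "final value"}
print(chained_lookup(kv, "k1"))   # → final value
```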

Limitations

Impact and Legacy

MemGPT introduced a conceptual framework – treating LLM context management as an operating systems problem – that became widely influential in the AI agent ecosystem. The idea that an LLM should manage its own memory through function calls rather than relying on external preprocessing pipelines anticipated the “agentic” paradigm that dominated AI development in 2024 and beyond. The MemGPT open-source project (later renamed Letta) became one of the most popular frameworks for building stateful AI agents with persistent memory.

The paper’s influence extends beyond its specific implementation. The concepts of working memory (a scratchpad the agent actively maintains), tiered storage (fast in-context vs. slow external databases), and self-directed retrieval (the agent decides when and what to fetch) became standard building blocks in agent frameworks. Systems like LangGraph, CrewAI, and AutoGen adopted variations of these patterns.

More broadly, MemGPT helped shift the field’s thinking about context limitations. Rather than viewing the fixed context window as a fundamental constraint requiring architectural changes to the transformer, MemGPT showed that system-level design around an existing model could achieve much of the same benefit. This “context engineering” approach – carefully managing what goes into the context window rather than simply making the window bigger – became a core practice in production LLM applications. The paper also highlighted the importance of the “lost in the middle” problem identified by Liu et al. (2023) by showing that even with large contexts, careful memory management outperforms naive context stuffing.

Prerequisites

To understand this paper, the reader needs:

Connections