
Letta Leaderboard: Benchmarking LLMs on Agentic Memory

Research
May 29, 2025

To see the full leaderboard results, check the live leaderboard page.

We're excited to announce the Letta Leaderboard, a comprehensive benchmark suite that evaluates how effectively LLMs manage agentic memory. While memory is essential for building stateful agents that can remember and learn over extended periods, existing LLM leaderboards focus almost exclusively on general question-answering and coding tasks.

Benchmarking Agentic Memory

First introduced by MemGPT, agentic memory management empowers agents to control their own memory through tool calling. This approach enables agents to overcome limited context sizes by providing tools to write to in-context memory (called memory blocks in Letta) and search external data sources through agentic RAG. With automatic memory and context management, agents can both personalize their responses based on past experiences and tackle increasingly complex tasks. A deep research agent, for instance, might save its current research plan and progress to a memory block to stay on track during a long-running research process (try it out here). For long-running agents, effective memory management is crucial—it enables them to solve complex tasks and adapt over time while avoiding derailment and memory loss. 
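
To make the mechanism concrete, here is a minimal sketch of the kind of tool an agent can call to edit its own in-context memory. This is an illustration of the idea, not Letta's actual implementation; the function name, block labels, and return message are all illustrative:

```python
# Minimal sketch of an agentic memory tool. The agent's in-context memory is
# a set of labeled blocks; the LLM edits them by calling a tool like this one.
# Names here are illustrative, not Letta's exact implementation.

memory_blocks: dict[str, str] = {
    "persona": "I am a helpful research agent.",
    "human": "",
}

def core_memory_append(label: str, content: str) -> str:
    """Append text to an in-context memory block; the updated block is
    re-rendered into the agent's context window on the next turn."""
    memory_blocks[label] = (memory_blocks[label] + "\n" + content).strip()
    return f"Appended to block '{label}'."

# Exposed to the LLM as a tool, the model decides when to call it, e.g.:
core_memory_append("human", "Current research plan: survey coral bleaching studies.")
```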

Letta is a framework for building stateful agents that persist across sessions, enabling long-running agent applications that learn over time. Since Letta agents depend heavily on agentic memory management for context management and for accessing external data, developers have discovered that model selection significantly impacts agent performance. Yet this performance often does not correlate with rankings on existing leaderboards, making model choice challenging for stateful agents. We're releasing the Letta Leaderboard to help developers understand the cost and performance tradeoffs of different models for agentic memory management.

The Letta Leaderboard

We measure memory management across three capabilities of a stateful agent: reading, writing, and updating. Letta's stateful agent memory hierarchy has two levels: core memory and archival memory. Core memory lives inside the agent's context window (organized as memory blocks), while archival memory stores information outside the context window. We evaluate capabilities for both memory components.

To evaluate the effectiveness of different memory operations, we generate groups of synthetic facts and questions. Every question is directly answerable from a subset of the facts, and both are entirely fictional, so the agent cannot answer from prior knowledge.

To score each task, we prompt GPT-4.1 to grade the agent's response against the ground-truth answer, following SimpleQA, and apply a penalty for extraneous memory operations.
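
A sketch of this grading step is below, using the OpenAI Python SDK. The judge prompt wording and the penalty weight are our assumptions for illustration; the leaderboard's exact prompt and penalty may differ:

```python
# Sketch of SimpleQA-style LLM grading with a penalty for extraneous
# memory operations. Prompt text and penalty weight are illustrative.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "Grade the predicted answer against the gold answer.\n"
    "Question: {q}\nGold answer: {gold}\nPredicted answer: {pred}\n"
    "Reply with exactly one word: CORRECT or INCORRECT."
)

def grade(q: str, gold: str, pred: str, extraneous_ops: int) -> float:
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(q=q, gold=gold, pred=pred)}],
    )
    verdict = resp.choices[0].message.content.strip().upper()
    score = 1.0 if verdict == "CORRECT" else 0.0
    return max(0.0, score - 0.1 * extraneous_ops)  # hypothetical penalty weight
```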

Memory Read Benchmark

To evaluate how well agents read from memory, we develop two tasks, one for core memory and one for archival memory. For core memory reads, we place facts in memory blocks in the agent's context window, then ask a question that can be answered from those facts. For archival memory reads, the same facts are stored in archival memory instead, with no relevant information in the agent's context window.

Core Memory Read Benchmark

When creating the evaluation agent, we add the information needed to answer the questions to a memory block called "Supporting Facts".

Then, we send the question as a message to the agent, for example: "Who is the documentary filmmaker that featured John Buckler in a film about coral bleaching?". Because the facts needed to answer this question are already in the agent's core memory, the correct behavior is to answer directly without calling search or other tools.

We use GPT-4.1 to evaluate whether the agent's response matches the ground-truth answer (in this case, Mark Smithe).
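
In code, the setup looks roughly like the sketch below, using the letta-client Python SDK. Exact parameter names may differ across SDK versions, and the fact text is a stand-in for the generated dataset:

```python
# Sketch of the core memory read setup with the Letta Python SDK
# (pip install letta-client). Parameter details may vary by version.
from letta_client import Letta

client = Letta(base_url="http://localhost:8283")

agent = client.agents.create(
    memory_blocks=[{
        "label": "supporting_facts",
        "value": "Mark Smithe directed a documentary about coral bleaching "
                 "featuring John Buckler.",
    }],
    model="openai/gpt-4.1",
    embedding="openai/text-embedding-3-small",
)

response = client.agents.messages.create(
    agent_id=agent.id,
    messages=[{"role": "user", "content":
        "Who is the documentary filmmaker that featured John Buckler "
        "in a film about coral bleaching?"}],
)
# Correct behavior: answer "Mark Smithe" directly, with no search tool calls.
```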

Archival Memory Read Benchmark

The archival memory read benchmark measures how well models recognize when to query for information beyond their immediate context.

To evaluate the agent's ability to retrieve missing information about the current topic, we use the same dataset, but store all relevant facts in the archival memory. This memory is hidden from the agent unless it uses the archival memory search tool. Since the questions are synthetic and involve fictional characters, the agent has no prior knowledge of them and must rely on archival memory to answer correctly.
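
The setup differs from the core read benchmark only in where the facts live. A sketch, continuing the snippet above; we assume the SDK's passages endpoint handles archival inserts, so check your letta-client version for the exact method:

```python
# Sketch: store the same facts in archival memory instead of a memory block.
# client.agents.passages.create as the archival insert is our assumption.
agent = client.agents.create(
    memory_blocks=[],  # nothing relevant in the context window
    model="openai/gpt-4.1",
    embedding="openai/text-embedding-3-small",
)

for fact in [
    "Mark Smithe directed a documentary about coral bleaching.",
    "The documentary featured John Buckler.",
]:
    client.agents.passages.create(agent_id=agent.id, text=fact)

# Correct behavior: the agent calls archival memory search before answering.
```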

Memory Write Benchmark

Instead of creating a memory block filled with supporting facts, we employ a simulated user to chat about all the relevant facts with the agent. The correct behavior is for the agent to write the important supporting facts to memory, either core or archival, and retrieve them later.

The relevant facts are sent as messages, as if we were chatting with the agent about "John Buckler" or "Mark Smithe". Then, we remove the chat history and ask the same question; the agent can only answer correctly if it invoked core memory append on the correct supporting facts.
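
A sketch of this flow, continuing the snippets above, with the caveat that the history-reset call is our assumption about the SDK surface:

```python
# Sketch of the memory write benchmark. The facts arrive as ordinary chat
# messages; a well-behaved agent writes them to memory as they come in.
for fact in [
    "Did you know John Buckler appeared in a film about coral bleaching?",
    "The film was made by Mark Smithe.",
]:
    client.agents.messages.create(
        agent_id=agent.id,
        messages=[{"role": "user", "content": fact}],
    )

# Drop the transcript but keep the agent's memory. We assume a reset
# endpoint like this exists; the exact call depends on your SDK version.
client.agents.messages.reset(agent_id=agent.id)

response = client.agents.messages.create(
    agent_id=agent.id,
    messages=[{"role": "user", "content":
        "Who is the documentary filmmaker that featured John Buckler "
        "in a film about coral bleaching?"}],
)
# Answerable only if the agent stored the facts in memory during the chat.
```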

Memory Update Benchmark

The memory update benchmark measures an agent's understanding of its own memory and how well it can revise that memory when the underlying facts change.

To simulate updating core memory, we generate facts that conflict with the previous group, along with corresponding questions and answers. Starting from the core memory read benchmark setup, all the supporting facts are stored inside a memory block. Then, we send the agent a message containing a contradicting fact.

With the updated fact, the agent is expected to update its own memory block in order to answer the same question correctly (the new answer is Jason Sandstorm).
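
Concretely, the update step looks like the sketch below, continuing the core read snippet; the phrasing of the contradicting fact is illustrative:

```python
# Sketch of the memory update benchmark: send a fact that contradicts the
# memory block, then re-ask the original question.
client.agents.messages.create(
    agent_id=agent.id,
    messages=[{"role": "user", "content":
        "Correction: the coral bleaching documentary featuring John Buckler "
        "was actually made by Jason Sandstorm, not Mark Smithe."}],
)
# Correct behavior: the agent calls a memory-edit tool (e.g. a
# core_memory_replace-style tool) to rewrite the stale fact, so re-asking
# the question now yields "Jason Sandstorm".
```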

Understanding the Results

We show the average score across the benchmarks above and the cost (in $) for the top 10 models. Full results can be found on the Letta Leaderboard page in our documentation.

Top-performing models, such as Claude 4 Sonnet (with and without Extended Thinking), GPT 4.1, and GPT 4o, consistently deliver high scores across core and archival memory tasks. 

For cost-sensitive deployments, Gemini 2.5 Flash and GPT 4o-mini are strong options. While not top-scoring, they maintain solid memory performance at a fraction of the cost—great for large-scale or resource-constrained applications.

Extending the Letta Leaderboard

We are actively updating the Letta Leaderboard as newer models are released, and will be adding more long-horizon tasks, external tool calls, and memory reorganization (via sleep-time compute). Of course, models vary along many axes, and selecting the most suitable model involves subjective judgments like style. We also aim to make the leaderboard extensible, making it easy to create evaluations tailored to specific use cases. We welcome community contributions to the leaderboard!

Key Takeaways

  • Anthropic Claude Sonnet 4 (with extended thinking budget) and OpenAI GPT 4.1 are recommended models for daily tasks. Google Gemini 2.5 Flash and OpenAI GPT 4o-mini are recommended as cost-effective alternatives.
  • Models that perform well on archival memory (e.g., Claude 3.5 Haiku) may overuse memory operations when they are unnecessary, receiving a lower core memory score due to penalties.

Check out the live leaderboard on our docs, or head to the GitHub repo to run the benchmark yourself!
