
RAG is not Agent Memory

February 13, 2025

RAG (“retrieval augmented generation”) is currently the go-to solution whenever we need to connect LLMs to external sources of information. RAG is even used as a form of memory for conversational agents, by retrieving old messages that are semantically similar to the most recent message.  

Although RAG provides a way to connect LLMs and agents to more data than can fit into a context window, it comes with real limitations and risks. RAG often places irrelevant data into the context window, polluting the context and degrading performance (especially for newer reasoning models). RAG is an important tool in the agents stack, but it is far from a complete solution.

Why connect LLMs to external data?

An LLM’s knowledge about the world is frozen. It doesn’t know the current weather in San Francisco or anything about your users unless you tell it. Even if you’re confident your LLM has the information users need in its weights, you may still want to connect it to external data to minimize the chances of hallucination.

For example, if you want your AI agent to reliably answer factual queries like, “How many calories are generally in an apple?” it should reference an external source like WebMD or a nutritional facts database rather than relying on the underlying LLM. Knowing what sources are in a model’s context window also allows you to double-check its work.

Connecting data (or memories) to agents with RAG

When people think about linking their LLMs to external data, they default to retrieval augmented generation (RAG). RAG rose in popularity because of its simplicity: you split a document into chunks and embed them, use a similarity search to find the top-K chunks most related to the query, and then deposit those snippets into the context window. This can also provide a rudimentary form of memory, by searching over old messages. But a simple process doesn’t always yield great results.
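
To make that simplicity concrete, here is a minimal sketch of the pipeline in Python. The `embed` function is a toy stand-in; a real system would call an embedding model:

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy stand-in for a real embedding model: hash words into a
    fixed-size bag-of-words vector, then L2-normalize it."""
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

def chunk(document: str, size: int = 100) -> list[str]:
    """Shred the document into fixed-size chunks by word count."""
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve_top_k(query: str, chunks: list[str], k: int = 10) -> list[str]:
    """Rank chunks by cosine similarity to the query; keep the top K."""
    q = embed(query)
    return sorted(chunks, key=lambda c: -float(np.dot(q, embed(c))))[:k]

def build_prompt(query: str, document: str) -> str:
    """Deposit the retrieved snippets into the context window."""
    snippets = retrieve_top_k(query, chunk(document))
    return "Context:\n" + "\n---\n".join(snippets) + f"\n\nQuestion: {query}"
```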

Limitations of RAG-based memory and context

1. RAG is single step

LLMs get “one shot” at retrieving the most relevant data and generating a response. While that can work, it often doesn’t. To illustrate this point, imagine, for a moment, that you are a teacher. You task your students with writing a book report on a novel they’ve never read before. The report needs to be two pages long, organized into three sections.

A RAG approach to helping students write this book report would look like this:

Step 1: Shred the book into pieces of paper, each a few sentences long.

Step 2: Collect the top 10 most relevant shreds according to your (very basic) report guidelines: two pages and three sections.

Step 3: Ask students to write a report based on the top 10 shreds.

This is not a good way to write a book report. The students have no context for those top 10 shreds, so it’s hard to form a summary of the book, let alone a central thesis. In desperation, they may even resort to making things up.

When working with LLM-driven agents, a scenario like this can yield similarly bad or even worse results. RAG typically ranks the shreds from Step 2 by cosine similarity, which measures surface-level semantic overlap rather than genuine relevance, and is notoriously bad at surfacing the snippets that actually matter. Putting 10 excerpts that are only possibly applicable into the context window will likely lead to an irrelevant response.
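
For reference, the relevance score behind Step 2 usually comes down to a single formula, the cosine of the angle between two embedding vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(a, b) = (a . b) / (|a| * |b|), ranging from -1 to 1.
    A high score means the texts overlap in embedding space, not that
    the snippet is actually useful for the task at hand."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```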

2. RAG is purely reactive

If a user says, “Today is my birthday,” a RAG agent will search the vector database for old messages similar to “birthday.” In other words, the agent relies on retrieval to mimic memory: it can only recall past messages that happen to resemble the current one.

But RAG won’t retrieve personal information that isn’t semantically similar (i.e., according to an embedding model) to the message it searched with.

So, even if a user mentioned their favorite color or movie in the past, the model sees “movie,” “color,” and “birthday” as completely unrelated words. It won’t combine them into a response like, “Are you going to make it Star Wars-themed like your last party?” or “You should get blue cake since that's your favorite color!” as a best friend — who has a functioning memory — would.

In fact, you’ll never get that level of personalization from embedding-based search (how RAG is usually implemented), because that personal information will never be retrieved.

If not RAG, then what?

Let’s play this out using the book report example. Say the book is so long that you can’t dump it into an LLM context window, and traditional RAG is off the table. What tools could you build to help students write their reports?

Ideally, you’d build something that helps them navigate to any page so they can read the book page-by-page, taking notes over time. And you’d create a text search tool, something similar to Google. That way, if the students identified key themes in the book, they could search for supporting examples.

Together, these tools would help students digest the content in the book, think through the arguments of their report, and find parts of the book that prove their points, ultimately developing a much more thorough and accurate report.
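
As a rough sketch, those two tools could be as simple as the following (the names and signatures are illustrative, not a real API; `book` is just a list of page strings):

```python
def read_pages(book: list[str], start: int, end: int) -> str:
    """Navigate to a page range and return it verbatim, so the reader
    can work through the book page-by-page, taking notes over time."""
    return "\n".join(book[start:end])

def search_text(book: list[str], query: str) -> list[int]:
    """Keyword search, like grep or a search engine: return the page
    numbers where the query appears, for finding supporting examples."""
    return [i for i, page in enumerate(book) if query.lower() in page.lower()]
```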

Agentic RAG

The approach described above, multi-step reasoning with tools, is “agentic RAG,” and it’s the underlying foundation of Letta. Letta’s design takes all the work that’s gone into the search and document retrieval tools we use today, puts those tools in the hands of LLMs, and prepares the context window by summarizing and organizing the agent’s “memory.”

With agentic RAG, LLMs can paginate through multiple pages of results, potentially even traversing an entire dataset, while also maintaining state. If you used agentic RAG to write a book report, the process would look like this:

Step 1: Read the first five pages.

Step 2: Write a summary of those five pages.

Step 3: Read the next five pages.

Step 4: Update the summary with the new information, then repeat until the book is finished (see the sketch below).

Agentic RAG’s iterative methodology revises the result each time it retrieves and reviews more information, generating a more holistic and accurate response than retrieving and generating only once (as in traditional RAG).
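
In code, that loop might look something like this (a sketch, assuming `llm` is any callable that maps a prompt string to a completion):

```python
def write_summary(book: list[str], llm, pages_per_step: int = 5) -> str:
    """Agentic-RAG-style loop: read the book a few pages at a time,
    carrying a running summary (the agent's state) forward and
    revising it after each chunk."""
    summary = "(nothing yet)"
    for start in range(0, len(book), pages_per_step):
        excerpt = "\n".join(book[start:start + pages_per_step])
        summary = llm(
            f"Current summary:\n{summary}\n\n"
            f"Next pages:\n{excerpt}\n\n"
            "Rewrite the summary to incorporate the new pages."
        )
    return summary
```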

Agentic RAG also solves the reactivity problem.

With agentic RAG, an AI agent isn’t doing a top-K match and dump. It has already distilled important data it received in the past (a customer’s favorite movie or favorite color) and organized it so that the model can proactively relate it to the user’s prompt and curate a response: “You should have a Star Wars-themed party!” or “You should get blue decorations!”
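
A sketch of the difference: instead of querying raw chat history at response time, the agent keeps distilled facts in context at all times (the structure below is purely illustrative, not Letta’s actual memory format):

```python
# Facts distilled from earlier conversations, carried in every prompt.
user_memory = {
    "favorite_movie": "Star Wars",
    "favorite_color": "blue",
}

def build_context(user_message: str) -> str:
    """Because the distilled facts always ride along in the prompt, the
    model can relate 'Today is my birthday' to information that shares
    no keywords with it, without any similarity search."""
    facts = "\n".join(f"- {key}: {value}" for key, value in user_memory.items())
    return f"What I know about the user:\n{facts}\n\nUser: {user_message}"
```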

Conclusion

For companies that want to build robust AI agents, traditional RAG is insufficient. The key to developing AI agents that are precise, interpretable, proactive, and deeply aligned with an organization’s unique goals and data environments is to give your LLMs memory.

If you’re curious about how to do that, check out the Letta quickstart guide, or enroll in our DeepLearning.AI course on agent memory. Or, if you’re ready to build your own sophisticated agents, request early access to the Letta Cloud platform.
