
RAG is not Agent Memory

February 13, 2025

RAG (“retrieval augmented generation”) is currently the go-to solution whenever we need to connect LLMs to external sources of information. RAG is even used as a form of memory for conversational agents, by retrieving old messages that are semantically similar to the most recent message.  

Although RAG provides a way to connect LLMs and agents to more data than can fit into context, it is not without limitations and risks. RAG often places irrelevant data into the context window, resulting in context pollution and degraded performance (especially for newer reasoning models). RAG is an important tool in the agent stack, but it is far from a complete solution.

Why connect LLMs to external data?

An LLM’s knowledge about the world is frozen. It doesn’t know the current weather in San Francisco or anything about your users unless you tell it. Even if you’re confident your LLM has the information users need in its weights, you may still want to connect it to external data to minimize the chances of hallucination.

For example, if you want your AI agent to reliably answer factual queries like, “How many calories are generally in an apple?” it should reference an external source like WebMD or a nutritional facts database rather than relying on the underlying LLM. Knowing what sources are in a model’s context window also allows you to double-check its work.

Connecting data (or memories) to agents with RAG

When people think about linking their LLMs to external data, they default to retrieval augmented generation (RAG). RAG rose in popularity because of its simplicity: you split a document into chunks, embed those chunks, use a similarity search to find the top-K snippets most related to the query, and then deposit those snippets into the context window. This can also provide a rudimentary form of memory, by searching over old messages instead of documents. But a simple process doesn’t always yield great results.
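To make that concrete, here is roughly what the pipeline looks like in code. This is a minimal sketch: the embed function stands in for whatever embedding model you call, and the chunk size and prompt format are illustrative assumptions, not any particular library’s API.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for a call to whatever embedding model you use."""
    raise NotImplementedError

def chunk(document: str, size: int = 300) -> list[str]:
    # Naive fixed-size chunking: shred the document into ~size-character snippets.
    return [document[i:i + size] for i in range(0, len(document), size)]

def retrieve_top_k(query: str, snippets: list[str], k: int = 10) -> list[str]:
    # Embed the query and every snippet, then rank snippets by cosine similarity.
    q = embed(query)
    scored = []
    for s in snippets:
        v = embed(s)
        sim = float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
        scored.append((sim, s))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:k]]

def build_prompt(query: str, document: str) -> str:
    # Deposit the top-K snippets into the context window, then append the query.
    context = "\n---\n".join(retrieve_top_k(query, chunk(document)))
    return f"Relevant excerpts:\n{context}\n\nUser question: {query}"
```

Notice that everything the model will ever see is decided in that single retrieval pass.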

Limitations of RAG-based memory and context

1. RAG is single step

LLMs get “one shot” at retrieving the most relevant data and generating a response. While that can work, it often doesn’t. To illustrate this point, imagine, for a moment, that you are a teacher. You task your students with writing a book report on a novel they’ve never read before. The report needs to be two pages and three sections.

A RAG approach to helping students write this book report would look like this:

Step 1: Shred the book into pieces of paper, each a few sentences long.

Step 2: Collect the top 10 most relevant shreds according to your (very basic) report guidelines: two pages and three sections.

Step 3: Ask students to write a report based on the top 10 shreds.

This is not a good way to write a book report. The students have no context for those top 10 shreds, so it’s hard to form a summary of the book, let alone a central thesis. In desperation, they may even resort to making things up.

When working with LLM-driven agents, a scenario like this could yield similarly bad or even worse results. Typically, RAG sorts the shreds from Step 2 by cosine similarity, a methodology rooted in correlation and notoriously bad at finding relevant snippets. Putting ten excerpts that are only possibly applicable into the context window will likely lead to irrelevant results.

2. RAG is purely reactive

If a user says, “Today is my birthday,” a RAG agent will search the vector database for messages similar to “birthday.” The problem is that the agent is relying on RAG to mimic memory by retrieving previous messages that happen to be semantically similar to the latest one.

But RAG isn't going to retrieve personalization information that isn’t semantically similar (i.e., according to an embedding model) to the searched word.

So, even if a user mentioned their favorite color or movie in the past, the model sees “movie,” “color,” and “birthday” as completely unrelated words. It won’t combine them into a response like, “Are you going to make it Star Wars-themed like your last party?” or “You should get blue cake since that's your favorite color!” as a best friend — who has a functioning memory — would.

In fact, you’ll never get that level of personalization from embedding search (how RAG is usually implemented), because that personal information will never be retrieved.
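To see why, here is a sketch of that recall step, reusing the hypothetical embed and retrieve_top_k helpers from the earlier sketch. The stored messages are made-up examples.

```python
# Hypothetical message history stored in a vector database.
history = [
    "My favorite color is blue.",
    "I loved the last Star Wars movie you recommended.",
    "Can you reschedule my dentist appointment?",
]

def recall(query: str, k: int = 2) -> list[str]:
    # Same cosine-similarity ranking as in the sketch above: only messages
    # that are semantically close to the query text come back.
    return retrieve_top_k(query, history, k)

# "Today is my birthday" is not semantically similar to the color or movie
# messages, so an embedding search will typically not surface them here,
# and the agent never gets the chance to personalize its reply.
memories = recall("Today is my birthday")
```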

If not RAG, then what?

Let’s play this out using the book report example. Say the book is so long that you can’t dump it into an LLM context window. If you can’t use traditional RAG, what tools could you make to help students write their reports?

Ideally, you’d build something that helps them navigate to any page so they can read the book page-by-page, taking notes over time. And you’d create a text search tool, something similar to Google. That way, if the students identified key themes in the book, they could search for supporting examples.

Together, these tools would help students digest the content in the book, think through the arguments of their report, and find parts of the book that prove their points, ultimately developing a much more thorough and accurate report.
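As agent tools, those two helpers might look something like this. This is a sketch under the assumption that the book has already been split into pages; the function names are hypothetical, not Letta’s built-in tools.

```python
BOOK_PAGES: list[str] = []  # the novel split into pages, assumed loaded elsewhere

def open_page(page_number: int) -> str:
    """Tool 1: let the agent navigate to any page and read it in full."""
    if not 1 <= page_number <= len(BOOK_PAGES):
        return f"Page {page_number} is out of range (1-{len(BOOK_PAGES)})."
    return BOOK_PAGES[page_number - 1]

def search_text(phrase: str) -> list[int]:
    """Tool 2: plain keyword search, returning the pages where a phrase appears."""
    phrase = phrase.lower()
    return [i + 1 for i, page in enumerate(BOOK_PAGES) if phrase in page.lower()]
```

An agent given these tools can skim the book page by page, notice a theme, and then call search_text to pull up supporting passages for its report.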

Agentic RAG

The approach described above, multi-step reasoning with tools, is “agentic RAG,” and it’s the underlying foundation of Letta. Letta’s design takes all the work that’s gone into developing the search and document retrieval tools we use today and hands those tools to LLMs, which prepare their own context windows by summarizing and organizing their “memory.”

With agentic RAG, LLMs can paginate through multiple pages of results, potentially even traversing an entire dataset, while also maintaining state. If you used agentic RAG to build a book report, the process would look like this:

Step 1: Read the first five pages.

Step 2: Write a summary of those five pages.

Step 3: Read the next five pages.

Step 4: Update the summary based on new information.

Agentic RAG’s iterative methodology updates results each time it retrieves and reviews more information, generating a more holistic and accurate response than if it retrieved information and generated responses only once (as in traditional RAG).
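Expressed as code, that loop might look like the following. It is a sketch that assumes a generic llm completion call and the open_page tool from the earlier sketch; the five-page window and prompt wording are illustrative.

```python
def llm(prompt: str) -> str:
    """Placeholder for a completion call to whatever model you use."""
    raise NotImplementedError

def summarize_book(total_pages: int, window: int = 5) -> str:
    summary = ""  # running state the agent carries between steps
    for start in range(1, total_pages + 1, window):
        # Read the next few pages (Steps 1 and 3).
        end = min(start + window, total_pages + 1)
        pages = "\n".join(open_page(p) for p in range(start, end))
        # Update the running summary with the new information (Steps 2 and 4).
        summary = llm(
            f"Current summary:\n{summary}\n\n"
            f"New pages:\n{pages}\n\n"
            "Rewrite the summary so it incorporates the new pages."
        )
    return summary
```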

Agentic RAG also solves the reactivity problem.

With agentic RAG, an AI agent isn’t doing a top-K match and dump. It’s already distilled important data it’s received in the past (a customer’s favorite movie or favorite color) and organized it in such a way that the model can proactively relate it to the user’s prompt and curate a response: “You should have a Star Wars-themed party!”, or “You should get blue decorations!”
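One way to picture this is an in-context memory block the agent keeps rewriting as it learns about the user, rather than a vector index it has to query. The block format and update tool below are assumptions for illustration, not a specific Letta API.

```python
# Distilled facts the agent keeps in its context window at all times.
user_memory = {
    "favorite_color": "blue",
    "favorite_movie": "Star Wars",
    "last_party_theme": "Star Wars",
}

def build_system_prompt() -> str:
    # The memory block is always in context, so the model can relate
    # "Today is my birthday" to these preferences without any retrieval step.
    facts = "\n".join(f"- {key}: {value}" for key, value in user_memory.items())
    return f"You are a personal assistant. What you know about the user:\n{facts}"

def update_memory(key: str, value: str) -> None:
    # Exposed to the agent as a tool: called whenever it learns something worth keeping.
    user_memory[key] = value
```

Because those distilled facts are always in the prompt, no retrieval step has to guess that “birthday” relates to “favorite color.”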

Conclusion

For companies that want to build robust AI agents, traditional RAG is insufficient. The key to developing AI agents that are precise, interpretable, proactive, and deeply aligned with an organization’s unique goals and data environments is to give your LLMs memory.

If you’re curious about how to do that, check out the Letta quickstart guide, or enroll in our Deep Learning AI course on agent memory. Or, if you’re ready to build your own sophisticated agents, request early access to the Letta Cloud platform.