Can Any Model Use Skills? Adding Skills to Context-Bench

November 7, 2025

To see how each model performs on skills, check the live Context-Bench leaderboard. To use skills with any model (Sonnet 4.5, GPT-5, GLM-4.6), install Letta Code (Research Preview).

As agents are deployed in the real world, it's impossible to endow them with everything they need to know ahead of time. Instead, agents must continually learn online, either by creating memories through experience, or by acquiring pre-made knowledge (or "skills") that extend their capabilities when needed. Skills enable agents to dynamically load specialized expertise, from statistical analysis methods to brand guidelines, without polluting their context with irrelevant information.

We recently released Context-Bench, which measures how well models can chain file operations and trace entity relationships to retrieve information. Today we're releasing Context-Bench Skills, a new evaluation suite inside of Context-Bench that measures how well models discover and load relevant skills from a library to complete tasks.

Our evaluation shows that many frontier models are quite capable of online skill acquisition with the right harness. As part of our evaluation, we've built skills into Letta Code, a model-agnostic harness that enables any LLM to use Anthropic's skill format. This means GPT-5, Gemini, and other models can now leverage the same skill library originally designed by Anthropic for Claude.

What are “Skills”? Enabling Agents to Acquire Context

Skills are mountable packages of specialized knowledge. Each skill is a directory containing a SKILL.md metadata file plus optional supplementary resources like datasets, scripts, and examples.

Importantly, an agent shouldn’t load every available skill at the start of a conversation - space in the context window is precious, so skills should be mounted only when they are needed and unmounted once they are not. An agent should know what skills are available, then retrieve only the ones relevant to the current task.

For example, an agent might have a skill containing a corporate style guide that it loads only when it needs to write marketing content, or a skill with census data schemas that it loads only when analyzing demographic information.

Skills Are a Form of Context Mounting

Skills are an implementation of context mounting, Letta's term for managing specialized knowledge across the context hierarchy. Instead of loading everything upfront or searching through unstructured documents, agents follow a four-step process:

  1. Selection: Choose from available skills based on their metadata
  2. Mounting: Load the skill directory into the context window
  3. Execution: Apply the skill's knowledge to complete the task
  4. Unmounting: Free up context space when finished

Skills give agents access to specialized expertise without overwhelming their context window. An agent might have dozens of skills available, but it only consumes tokens for the skills it actually mounts and uses.
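Concretely, this lifecycle can be pictured as a small loop around each task. The sketch below is illustrative only, with hypothetical helper names rather than Letta's actual API: the metadata dictionary stays in context permanently, while the full SKILL.md is pulled in just for the task that needs it and dropped afterwards.

```python
# Illustrative sketch of the select/mount/execute/unmount lifecycle.
# choose_skill and run_agent are hypothetical stand-ins (e.g. single LLM calls).
from pathlib import Path
from typing import Callable


def solve_with_skill(
    task: str,
    library: Path,                        # directory of skill subdirectories
    metadata: dict[str, str],             # skill name -> short description (always in context)
    choose_skill: Callable[[str, dict[str, str]], str],
    run_agent: Callable[[str, str], str],
) -> str:
    name = choose_skill(task, metadata)                      # 1. Selection
    skill_text = (library / name / "SKILL.md").read_text()   # 2. Mounting
    answer = run_agent(task, skill_text)                     # 3. Execution
    # 4. Unmounting: skill_text goes out of scope here, so the next task starts
    #    with only the lightweight metadata back in context.
    return answer
```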

Skill Structure

At their core, skills are just directories with files. Here's an example of a skill we created to help agents design effective Letta agents:
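The layout below is an illustrative sketch of that skill rather than its exact contents; everything other than the skill name and the SKILL.md file at the root is a placeholder.

```
letta-agent-designer/
├── SKILL.md       # required: metadata plus the skill's instructions
├── examples/      # optional supplementary resources
└── scripts/
```

SKILL.md itself opens with a short metadata header (the description text here is a placeholder):

```markdown
---
name: letta-agent-designer
description: Guidance for designing effective Letta agents, covering memory
  blocks, tools, and system prompts.
---

# Letta Agent Designer
(full instructions for designing effective Letta agents...)
```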

Agents using skills always have access to the skill's name (letta-agent-designer) and the description of what it does. Only when the agent is working on a task related to designing a Letta agent does it load the full contents of the skill file to learn how to do so effectively.

But do skills actually work? The pattern is conceptually elegant, but it has not yet been empirically measured. Anthropic released their skill library without accompanying evaluations showing whether it improves agent performance. Do agents successfully identify when skills are relevant? Do they load the right skills at the right time? And crucially, does skill access improve task completion?

Adding Skills To Context-Bench

Today we’re releasing a new evaluation suite in Context-Bench, Context-Bench Skills, which measures the ability of an agent to use an available skills library to complete tasks. We use Anthropic’s open-source skills library (e.g. creating Slack GIFs, algorithmic art, MCP servers) as a starting point to synthetically generate suitable tasks for each skill using a separate (LLM-based) task generator.

The key idea is that for a given task, there exists a relevant skill that will help the agent solve it - so if the agent is good at using skills, it will “acquire” the skill in order to complete the task. Specifically, given a randomly selected skill at task generation time, the task generator performs the following steps (a sketch follows the list):

  1. Loads the skill by reading its SKILL.md and scanning its file structure for additional resources
  2. Generates a unique task that requires only the selected skill to complete adequately
  3. Creates rubrics to evaluate task completion and appropriate skill use
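A hedged sketch of this generation loop is below; the prompts and helper signature are illustrative (llm stands in for whatever model backs the generator), not our exact implementation.

```python
# Illustrative task-generator sketch: one (task, rubric) pair per skill.
from pathlib import Path
from typing import Callable


def generate_eval_case(skill_dir: Path, llm: Callable[[str], str]) -> dict:
    # 1. Load the skill: SKILL.md plus the names of any supplementary files.
    skill_md = (skill_dir / "SKILL.md").read_text()
    resources = [str(p.relative_to(skill_dir)) for p in skill_dir.rglob("*") if p.is_file()]

    # 2. Generate a task that should require exactly this skill to solve well.
    task = llm(
        "Write a realistic task that can only be completed adequately with the "
        f"following skill.\n\nSKILL.md:\n{skill_md}\n\nSupporting files: {resources}"
    )

    # 3. Create a rubric covering both task completion and appropriate skill use.
    rubric = llm(
        f"Write a grading rubric for this task:\n{task}\n\n"
        "Include criteria for (a) task completion and (b) whether the agent "
        "loaded and applied the relevant skill."
    )
    return {"skill": skill_dir.name, "task": task, "rubric": rubric}
```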

At test time, we use a model-agnostic, skills-compatible agent harness (Letta Code) to evaluate the agent’s ability to use skills to complete tasks. We evaluate agents in three settings:

  1. Baseline: The agent doesn’t have access to any skills.
  2. Skill Use: The agent is provided with the required skill’s metadata, but must still load the relevant skill data (e.g. the skill body and files)
  3. Skills Selection & Use: The agent has access to all skills, so it must both identify the required skill and use it

We evaluate the agent’s performance on each task by measuring task completion and skill use (via an LLM-as-a-judge rubric), and aggregate performance across all tasks. Agents that are good at skill acquisition should be able to identify the correct skill from the library and dynamically load it into their context to complete the task. Agents that are bad at skill use will either fail to load the skill, or load too many skills at once, polluting the context window with irrelevant information.
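In harness terms, the three settings differ only in which skill metadata is initially exposed to the agent; outside the baseline, the agent must still read the full skill body itself. A minimal sketch, with illustrative function and setting names:

```python
# Illustrative sketch of what each evaluation setting exposes up front.
def visible_skill_metadata(
    setting: str,
    target_skill: str,
    library: dict[str, str],   # skill name -> short description
) -> dict[str, str]:
    if setting == "baseline":
        return {}                                      # no skills at all
    if setting == "skill_use":
        return {target_skill: library[target_skill]}   # only the relevant skill's metadata
    if setting == "skill_selection_and_use":
        return dict(library)                           # whole library; the agent must pick
    raise ValueError(f"unknown setting: {setting}")
```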

Results: How Useful Are Skills? Evaluating Claude’s Skill Library

Anthropic released their own library of skills without any quantitative evaluation showing they improve agent performance. Our results confirm that skills do work — for Claude models that are effective at skill use, providing the relevant skill improves task completion by an average of 14.1%.

We find that frontier models successfully use skills regardless of whether they were explicitly trained for it. Non-Anthropic models like GPT-5 and GLM-4.6 (open weights) achieve similar performance gains from skills, demonstrating that skill acquisition is a general capability rather than a Claude-specific feature.

The ability to select the correct skill from a library is harder than using a skill that's already been identified. Among the models that are good at using skills, performance drops by approximately 6.5% when the models need to find the right skill first compared to when the relevant skill is provided directly. This gap suggests room for improvement in how models discover and prioritize skills.

Skills Are a Model-Agnostic Primitive

Powerful models like GLM-4.6 (open weights) and GPT-5 (closed weights) effectively use and select skills without any special training, suggesting that general-purpose capabilities are sufficient for skill acquisition. However, weaker models like GPT-5 Mini and GPT-5 Nano show negligible improvements from skills. Even when provided with the skill metadata, these models fail to properly load and apply the skill content to complete tasks. This creates a clear capability threshold — models need sufficient reasoning ability before they can benefit from skills at all.

Skills Enable Continual Learning

Skills enable continual learning by decoupling knowledge creation from agent initialization. When one agent develops a solution or learns a new pattern, that knowledge can be packaged as a skill and made available to other agents. This creates a shared knowledge ecosystem where agents learn from collective experience rather than starting from scratch. Unlike fine-tuning or retraining, skills can be created, tested, and deployed immediately — agents acquire new capabilities in real-time as skills become available, and can “shed” skills when they are no longer needed for the task, freeing up space in their context window.

Letta Code: A Model-Agnostic Harness For Skills

Claude Code was the only agent harness that supported skills — until now. To evaluate skill use across different models, we built skill support into Letta Code, our CLI tool for interacting with stateful agents running on a remote Letta server directly from your terminal (currently available as an open-source research preview).

The implementation leverages Letta's existing agentic context engineering architecture. We added a dedicated skills memory block that maintains metadata about available skills. Since Letta's memory blocks are always visible to the agent, it can discover and load skills dynamically during execution. The agent sees the list of available skills in its memory, decides which ones to load based on the task, and uses standard file operations to read the skill contents.
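As a rough illustration of that block (a sketch, not Letta Code's actual rendering logic; it assumes each SKILL.md exposes a description: line in its frontmatter), the harness only needs to keep a compact listing like this pinned in memory:

```python
# Illustrative sketch: build the always-visible "skills" memory block text.
from pathlib import Path


def render_skills_block(library: Path) -> str:
    lines = ["Available skills (read <skill>/SKILL.md to load one):"]
    for skill_md in sorted(library.glob("*/SKILL.md")):
        description = ""
        for line in skill_md.read_text().splitlines():
            if line.startswith("description:"):        # assumed frontmatter field
                description = line.removeprefix("description:").strip()
                break
        lines.append(f"- {skill_md.parent.name}: {description}")
    return "\n".join(lines)
```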

This means any model — GPT-5, Gemini, GLM-4.6 — can now use the same skill libraries originally built for Claude. The skills themselves are just directories with markdown files and resources, making them completely portable across agent frameworks.

What’s Next

Our evaluation demonstrates that skill acquisition works today: frontier models can successfully identify, load, and use skills to complete tasks they couldn't solve otherwise. With Letta Code providing model-agnostic skill support, any LLM can now leverage the growing library of skills being developed by the community. As agents are deployed for longer-horizon real-world tasks, their ability to acquire knowledge online will determine whether they can adapt to new domains without constant retraining.

To learn more, check out the live Context-Bench leaderboard and Letta Code (Research Preview).
