What Is a Three-Tier Retriever in AI Agent Memory and How Does It Inject Context?

Jun 22, 2026

I built an AI agent that could search documentation, run database queries, and deploy code. It worked great for the first few steps. But after ten or twenty interactions, the context window filled up and the agent started forgetting what it had just done. Everything degraded.

The standard fix is to dump all historical data into the prompt. But that burns tokens and confuses the model — it can’t tell which past experience is relevant for the current step.

I found a better approach: a three-tier retriever that stops at the first match.

The Problem

An AI agent running a multi-step task needs relevant past experience to make good decisions. Without it, the agent repeats mistakes and wastes time.

But you can’t stuff everything into the context window. LLMs have a limited budget of attention — fill it with noise and the signal gets lost. I tried flat retrieval where I just searched everything at once, and the model started mixing up information from unrelated tasks.

The core issue is: how do you give the agent exactly the right context, at the right time, without waste?

The Three-Tier Solution

Agent needs context
        │
        ▼
┌─────────────────────────────────────────────┐
│  Tier 1: Skill                               │
│  Best case: callable capability by name       │
│  Cost: O(1) — just invocation                │
└─────────────────────────────────────────────┘
        │ fail
        ▼
┌─────────────────────────────────────────────┐
│  Tier 2: Trace / Episode                     │
│  Search step-level traces for similar past   │
│  Cost: search + inject relevant episodes     │
└─────────────────────────────────────────────┘
        │ fail
        ▼
┌─────────────────────────────────────────────┐
│  Tier 3: World Model                         │
│  Query compressed environmental cognition    │
│  Cost: decompress + generate context         │
└─────────────────────────────────────────────┘
        │
        ▼
Inject context → continue inference

The rule is simple: first matching tier wins. You check Skills first because they’re the fastest and most refined. If no Skill matches, you search past trace/episode data. If nothing there either, you fall back to the World Model for general principles.
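Here's a minimal sketch of that stopping rule in Python. The `retrieve` function and the three toy tier functions are names I made up for illustration, and the matching logic inside each tier is a placeholder, not how a real retriever would decide on a hit.

```python
from typing import Callable, Optional

# A tier is any function that takes a task and returns context, or None on a miss.
Tier = Callable[[str], Optional[str]]

def retrieve(task: str, tiers: list[tuple[str, Tier]]) -> tuple[str, Optional[str]]:
    """Return (tier_name, context) from the first tier that produces a hit."""
    for name, lookup in tiers:
        context = lookup(task)
        if context is not None:
            return name, context  # first match wins: later tiers are never queried
    return "none", None

# Toy stand-ins for the three tiers (hypothetical, for illustration only).
def skill_tier(task: str) -> Optional[str]:
    return "invoke error_analysis" if "error log" in task else None

def trace_tier(task: str) -> Optional[str]:
    return "episode: similar deploy failure" if "deploy" in task else None

def world_model_tier(task: str) -> Optional[str]:
    return "general principles about this environment"  # the fallback always answers

TIERS = [("skill", skill_tier), ("trace", trace_tier), ("world_model", world_model_tier)]
```

Calling `retrieve("Process this error log", TIERS)` stops at the skill tier, so the trace and World Model functions are never called for that task.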

Tier 1 — Skill

Skills are crystallized, callable capabilities the agent can invoke directly. Think of them as the agent’s proven toolbox — consistent, tested, and low-cost.

Agent receives task: "Process this error log"

Check Skills:
  "error_analysis" skill exists? → YES
  → Invoke directly, no context needed

The agent doesn’t search anything. It just calls the skill by name. This is the fastest path.
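A skill lookup can be as simple as a dictionary keyed by name. The sketch below assumes a hypothetical in-process registry (`SKILLS`, `skill`, `invoke_skill` are my own names), and the body of `error_analysis` is a placeholder. How a task gets mapped to a skill name is not covered here.

```python
from typing import Callable, Optional

# Hypothetical registry: skill name -> callable. Names are illustrative only.
SKILLS: dict[str, Callable[[str], str]] = {}

def skill(name: str):
    """Decorator that registers a function as a named skill."""
    def register(fn):
        SKILLS[name] = fn
        return fn
    return register

@skill("error_analysis")
def error_analysis(log: str) -> str:
    # Placeholder body: count lines that look like errors.
    errors = [line for line in log.splitlines() if "ERROR" in line]
    return f"{len(errors)} error line(s) found"

def invoke_skill(name: str, payload: str) -> Optional[str]:
    """Dictionary lookup by name; returns None when no such skill exists."""
    fn = SKILLS.get(name)
    return fn(payload) if fn is not None else None
```

A `None` result is the signal to fall through to the next tier.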

Tier 2 — Trace / Episode

When no Skill matches, the agent searches past step-level traces for similar episodes. These are raw records of what the agent did before, what it observed, and what it learned.

No matching skill found

Search L1 traces:
  Similar error logs found? → YES
  → Inject relevant past episodes into context

The key difference from flat retrieval: you only search traces when Skills fail. This keeps the search space small and the results relevant.
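To make the trace search concrete, here's a rough sketch that scores past traces by word overlap (Jaccard similarity) with the current task. This is a stand-in I chose to keep the example dependency-free; a real retriever would more likely use embeddings or a proper index. The `Trace` fields, `k`, and `min_score` are all assumptions.

```python
import re
from dataclasses import dataclass

@dataclass
class Trace:
    task: str         # what the agent was asked to do
    action: str       # what it did
    observation: str  # what it saw afterwards

def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b) if a | b else 0.0

def search_traces(query: str, traces: list[Trace], k: int = 2,
                  min_score: float = 0.3) -> list[Trace]:
    """Return up to k traces whose task text overlaps the query enough.

    An empty list means a miss, so the caller falls through to the next tier.
    """
    q = tokens(query)
    scored = [(jaccard(q, tokens(t.task)), t) for t in traces]
    hits = [(score, t) for score, t in scored if score >= min_score]
    hits.sort(key=lambda pair: pair[0], reverse=True)  # best match first
    return [t for _, t in hits[:k]]
```

The threshold matters: without `min_score`, the search would always return something and the World Model tier would never be reached.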

Tier 3 — World Model

The World Model is compressed environmental cognition — a high-level understanding of how the system works, derived from the aggregation of all past traces and learned policies.

No matching traces found

Query L3 World Model:
  → Inject compressed knowledge about error patterns

This is the fallback. It’s the most generic tier, but it prevents the agent from going in completely blind.
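As a toy illustration of "compressed" knowledge, the sketch below collapses many trace records into one summary line per topic and returns a generic default when the topic is unknown. The record shape, the success-rate heuristic, and the function names are all my assumptions; a real World Model would be much richer than this.

```python
from collections import defaultdict

Record = tuple[str, str, bool]  # (topic, action, succeeded): a hypothetical shape

def build_world_model(records: list[Record]) -> dict[str, str]:
    """Compress raw records into one summary line per topic (lossy on purpose)."""
    tally: dict[str, dict[str, list[int]]] = defaultdict(dict)
    for topic, action, ok in records:
        wins_tries = tally[topic].setdefault(action, [0, 0])
        wins_tries[0] += int(ok)  # successes
        wins_tries[1] += 1        # attempts
    model = {}
    for topic, actions in tally.items():
        # Keep only the action with the best success rate; everything else is dropped.
        best = max(actions, key=lambda a: actions[a][0] / actions[a][1])
        wins, tries = actions[best]
        model[topic] = f"For {topic}, '{best}' worked {wins}/{tries} times."
    return model

def query_world_model(model: dict[str, str], topic: str) -> str:
    """Always returns some context, so the agent never starts from nothing."""
    return model.get(topic, "No specific knowledge; proceed cautiously.")
```

Note how much detail is thrown away: individual observations and failed attempts do not survive the compression, which is why this tier comes last.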

Why the Priority Order Matters

I think the most important design decision is the stopping rule. Once a tier returns useful context, you stop. You don’t go deeper. This gives you:

Efficiency. Skills cost almost nothing. Traces cost more but are still fast. World Model queries are the most expensive. Because you check them in order, each request resolves at the cheapest tier that can answer it.

Precision. Each tier returns increasingly general context. Skills are precise. Traces are specific to similar past tasks. World Model is broad. You get the narrowest possible match.

Graceful degradation. If the agent has no relevant experience, it still gets something from the World Model. It never runs a task with zero context.

Privacy. This one comes from how the retriever is deployed rather than from the ordering: it runs local-first and file-backed, with no external API calls, so all data stays on-device.

Common Mistakes

I’ve seen a few pitfalls when people implement this pattern.

Flat retrieval. Querying all three tiers at once and concatenating the results. This defeats the purpose — you pay the cost of the most expensive tier every time, and you dump mixed-granularity context that confuses the model.

Not crystallizing Skills. Leaving proven patterns as raw traces instead of promoting them to Skills. The retriever never gets to use the fastest path.

Over-relying on the World Model. Using it as the primary source instead of the fallback. The World Model is compressed and lossy — it should be your last resort, not your first.
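For the second pitfall, one way to spot promotion candidates is to count how often the same step sequence has succeeded for the same kind of task. The sketch below is only an assumption about what "crystallizing" could look like: the trace shape, the `min_successes` threshold, and the function name are hypothetical.

```python
from collections import Counter

# (task_kind, steps, succeeded): a hypothetical trace shape for this example
TraceRecord = tuple[str, tuple[str, ...], bool]

def promotion_candidates(traces: list[TraceRecord],
                         min_successes: int = 3) -> dict[str, tuple[str, ...]]:
    """Map task kind -> the step sequence that succeeded at least min_successes times."""
    wins = Counter((kind, steps) for kind, steps, ok in traces if ok)
    candidates: dict[str, tuple[str, ...]] = {}
    # most_common() yields the highest counts first, so the first sequence
    # recorded for each kind is its most frequently successful one.
    for (kind, steps), count in wins.most_common():
        if count >= min_successes and kind not in candidates:
            candidates[kind] = steps
    return candidates
```

Each candidate is a pattern worth wrapping in a named Skill so the retriever can resolve it at Tier 1 next time.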

Summary

In this post, I explained how a three-tier retriever (Skill → Trace/Episode → World Model) injects context at inference time for AI agents. The key design choice is checking the fastest source first and stopping at the first match. Skills provide instant invocation, Traces give specific past experience, and the World Model offers compressed fallback knowledge. This keeps token usage low while ensuring the agent always has relevant context.

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to contact me, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 @memtensor/memos-local-plugin on GitHub
👨‍💻 Hermes Agent

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!