How to Implement Persistent Memory for AI Agents: A Practical Guide

Mar 21, 2026

My AI agent forgot everything again. After a 30-minute conversation where we debugged a complex authentication flow, I ended the session. The next day, I started a new session and asked about the solution we found.

Blank stare. “I don’t have any context about that authentication issue you’re referring to.”

I had to spend another 20 minutes and 30K+ tokens re-explaining the entire context. This wasn’t just frustrating—it was a fundamental problem with how AI agents handle memory.

The Problem: Session Amnesia

The core issue is simple: AI agents have no persistent memory between sessions. Every conversation starts from zero. This creates three major problems:

Context loss: All the context you built up is gone
Token waste: Re-explaining the same information costs money and time
Broken continuity: You can’t build on previous work

I tried several “solutions” that didn’t work. Let me walk through what I attempted and why each failed.

Failed Approach #1: The MEMORY.md File

My first attempt was straightforward: maintain a MEMORY.md file that stores important context.

# Project Context

## Authentication Flow (2026-03-15)
- Using JWT with refresh tokens
- Token expiry: 15 minutes
- Refresh token stored in HTTP-only cookie
- Issue: CORS problems with credentials

## Database Schema (2026-03-18)
- Users table with role-based access
- Posts table with soft deletes
- Need to add: audit log table

This worked for about a week. Then problems emerged:

The Overflow Problem: The file grew to 50KB. Loading it into context consumed 12K tokens every session.

The Pruning Problem: I had to manually decide what to keep. Delete too much? Lose valuable context. Keep too much? Context window explodes.

The Retrieval Problem: The agent had to read the entire file to find relevant information. No structure, no indexing.

I realized a flat file approach fundamentally doesn’t scale. The agent needs something smarter.
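To put the overflow problem in numbers, here is a rough sketch of the token overhead a flat memory file imposes. It assumes the common heuristic of about 4 characters per token for English text (an approximation, not a real tokenizer), and both helper names are hypothetical.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; ~4 chars/token is a heuristic, not a tokenizer."""
    return int(len(text) / chars_per_token)


def reload_overhead(file_chars: int, sessions: int,
                    chars_per_token: float = 4.0) -> int:
    """Tokens spent re-loading the same flat file at the start of each session."""
    return int(file_chars / chars_per_token) * sessions


print(reload_overhead(50_000, sessions=1))   # 12500, close to the ~12K above
print(reload_overhead(50_000, sessions=30))  # 375000 over a month of daily sessions
```

The overhead is paid whether or not the session touches anything in the file, which is why pruning alone doesn't fix it.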

Failed Approach #2: RAG / Vector Search

“Use RAG!” everyone said. “Vector search is the solution!”

I implemented a vector database with embeddings for all past conversations. Here’s the basic setup:

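# Legacy LangChain import paths; recent releases expose these from
# langchain_openai, langchain_community.vectorstores, and langchain_core.documents.
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.schema import Document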

class MemoryRAG:
    def __init__(self, persist_directory: str):
        self.embeddings = OpenAIEmbeddings()
        self.vectorstore = Chroma(
            embedding_function=self.embeddings,
            persist_directory=persist_directory
        )

    def store_conversation(self, conversation: str, metadata: dict):
        """Store a conversation chunk in vector database."""
        doc = Document(page_content=conversation, metadata=metadata)
        self.vectorstore.add_documents([doc])

    def recall(self, query: str, k: int = 5) -> list[Document]:
        """Retrieve relevant conversations."""
        return self.vectorstore.similarity_search(query, k=k)

This worked great for queries like:

“What was the authentication solution?”
“How did we fix the CORS error?”

But there was a critical flaw I discovered when I asked: “Do we have any context about the payment system integration?”

The agent responded: “Let me search for payment system information…”

It found nothing. But we HAD discussed payments three weeks ago. The problem? I didn’t know the right search terms. The agent didn’t know what it knew.

The Awareness Problem: RAG handles “find X” queries well, but it fails at “do I know anything about X?” awareness questions. The agent can’t tell “I know this” from “I’ve never seen this” without loading everything or searching for something specific.

Vector search solves retrieval, not awareness.

Failed Approach #3: Just Use Larger Context Windows

“GPT-4 Turbo has a 128K context window! Just load everything!”

I tried loading the entire conversation history into context. Here’s what happened:

Attention Degradation: Studies of long-context models (the “lost in the middle” effect) show that accuracy drops when the relevant information sits in the middle of a long prompt. In practice, the agent “forgets” earlier context.

Cost Explosion: Input tokens are billed on every request, so re-sending a 100K-token context on each turn adds up fast. Even with price drops, this isn’t sustainable for daily use.

Practical Limits: Even 128K isn’t infinite. Long-running projects exceed this quickly.

Large context windows help, but they’re not a memory solution—they’re a bandage.
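The cost argument is simple arithmetic, sketched below. The price is a placeholder I made up for illustration, not any provider's actual rate, and the calculation ignores caching discounts and output tokens.

```python
def context_cost(context_tokens: int, turns: int,
                 usd_per_million_input: float) -> float:
    """Input-token cost when the same context is re-sent on every turn."""
    return context_tokens * turns / 1_000_000 * usd_per_million_input


PRICE = 10.0  # hypothetical USD per 1M input tokens; substitute your provider's rate

full_history = context_cost(100_000, turns=20, usd_per_million_input=PRICE)
compact = context_cost(8_000, turns=20, usd_per_million_input=PRICE)

print(f"full history: ${full_history:.2f} per session")
print(f"compact:      ${compact:.2f} per session")
print(f"ratio:        {full_history / compact:.1f}x")
```

Whatever the real price is, the ratio between the two approaches stays the same, because it depends only on the token counts.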

The Real Solution: Hierarchical Compaction

After months of frustration, I found an approach that actually works: a compaction tree that maintains topic awareness without loading everything.

The Core Insight

The key insight from Reddit discussions and research papers: you need an index of what you know, not the actual content. The agent needs to answer “do I have context on X?” in milliseconds, not by searching through thousands of tokens.

The Compaction Tree Structure

Imagine conversation history flowing through a compression pipeline:

┌─────────────────────────────────────────────────────────┐
│                      ROOT.md                            │
│  (Topic Index - ~3K tokens, loaded at session start)    │
│  Contains: High-level summaries, topic keywords, dates  │
└───────────────────────┬─────────────────────────────────┘
                        │
        ┌───────────────┼───────────────┐
        ▼               ▼               ▼
   ┌─────────┐    ┌─────────┐    ┌─────────┐
   │Monthly 1│    │Monthly 2│    │Monthly 3│
   │Summary  │    │Summary  │    │Summary  │
   └────┬────┘    └────┬────┘    └────┬────┘
        │               │               │
   ┌────┴────┐    ┌────┴────┐    ┌────┴────┐
   ▼         ▼    ▼         ▼    ▼         ▼
 Weekly    Weekly  Weekly    Weekly  Weekly    Weekly
   │         │      │         │      │         │
   ▼         ▼      ▼         ▼      ▼         ▼
 Daily     Daily   Daily    Daily   Daily     Daily
   │         │      │         │      │         │
   ▼         ▼      ▼         ▼      ▼         ▼
  Raw       Raw    Raw      Raw    Raw       Raw
Conversation Chunks

Level Definitions

Level 0 - Raw: The actual conversation chunks, token-limited segments of the original dialogue.

Level 1 - Daily: A summary of all conversations from one day, highlighting key decisions, questions, and outcomes.

Level 2 - Weekly: A compressed summary of the week’s daily summaries, extracting patterns and major themes.

Level 3 - Monthly: High-level overview of the month, capturing major milestones and project evolution.

Level 4 - Root: The topic index loaded at session start. Contains just enough information to answer “do I know about X?”
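Each level needs a stable key so that a chunk's timestamp maps to exactly one daily, weekly, and monthly bucket. A minimal sketch, assuming ISO week numbering and the file naming that appears in the ROOT.md example (2026-03-15, 2026-W11, 2026-02); the function name is hypothetical.

```python
from datetime import datetime


def period_keys(ts: datetime) -> dict[str, str]:
    """Map a timestamp to the bucket key used at each tree level."""
    # ISO weeks start on Monday; the ISO year can differ from the calendar
    # year in the first and last days of a year.
    iso_year, iso_week, _ = ts.isocalendar()
    return {
        "daily": ts.strftime("%Y-%m-%d"),
        "weekly": f"{iso_year}-W{iso_week:02d}",
        "monthly": ts.strftime("%Y-%m"),
    }


print(period_keys(datetime(2026, 3, 15)))
# {'daily': '2026-03-15', 'weekly': '2026-W11', 'monthly': '2026-03'}
```

One consequence of ISO weeks is that a week can straddle two months, so a weekly summary may need to be assigned to a month by its start or end date. Either convention works as long as it is applied consistently.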

Implementation Strategy

Here’s how I implemented this in Python:

from datetime import datetime, timedelta
from pathlib import Path
from dataclasses import dataclass
from typing import Optional
import json

@dataclass
class ConversationChunk:
    """Raw conversation segment."""
    id: str
    timestamp: datetime
    content: str
    tokens: int
    topics: list[str]

@dataclass
class CompactedSummary:
    """Compressed summary at any level."""
    level: int  # 1=daily, 2=weekly, 3=monthly, 4=root
    period_start: datetime
    period_end: datetime
    summary: str
    topics: list[str]
    source_ids: list[str]  # IDs of summarized items
    token_count: int
    id: str = ""  # File stem used as identifier, e.g. "2026-03-15" or "2026-W11"

class CompactionTree:
    def __init__(self, base_path: Path):
        self.base_path = base_path
        self.raw_path = base_path / "raw"
        self.daily_path = base_path / "daily"
        self.weekly_path = base_path / "weekly"
        self.monthly_path = base_path / "monthly"
        self.root_path = base_path / "ROOT.md"

        # Create directories
        for path in [self.raw_path, self.daily_path,
                     self.weekly_path, self.monthly_path]:
            path.mkdir(parents=True, exist_ok=True)

    def add_conversation(self, chunk: ConversationChunk):
        """Add a new conversation chunk to the tree."""
        # Store raw chunk
        chunk_file = self.raw_path / f"{chunk.id}.json"
        chunk_file.write_text(json.dumps({
            "id": chunk.id,
            "timestamp": chunk.timestamp.isoformat(),
            "content": chunk.content,
            "tokens": chunk.tokens,
            "topics": chunk.topics
        }))

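        # Trigger compaction check. _check_compaction_needed() is not shown in
        # this article; implement it with the triggers from "When to Compact".
        self._check_compaction_needed()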

    def get_root_context(self) -> str:
        """Load the ROOT.md for session start."""
        if self.root_path.exists():
            return self.root_path.read_text()
        return "# No previous context\n\nThis is a fresh session."

    def find_relevant_context(self, topic: str) -> list[str]:
        """Find all context related to a topic."""
        # First check ROOT for awareness
        root = self.get_root_context()
        if topic.lower() not in root.lower():
            return []  # Agent knows it doesn't know this

        # Topic exists, now load relevant summaries
        results = []
        for monthly in self.monthly_path.glob("*.json"):
            data = json.loads(monthly.read_text())
            if topic.lower() in data["summary"].lower():
                results.append(data["summary"])

        return results
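One caveat with the substring check in find_relevant_context: a query like "payment system integration" does not appear verbatim in a ROOT.md that says "Payment Integration", so the check would report no knowledge. A hedged alternative is to compare keyword sets instead. The sketch below is my own illustration, with a hypothetical stopword list and function names, and it is still purely lexical (it would not connect "billing" to "payment").

```python
import re

# Small illustrative stopword list; extend as needed.
STOPWORDS = {"the", "and", "with", "for", "about", "any", "have", "what",
             "how", "did", "context", "see", "last"}


def keywords(text: str) -> set[str]:
    """Lowercased words of 3+ characters, minus stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if len(w) > 2 and w not in STOPWORDS}


def known_topics(root_md: str) -> dict[str, set[str]]:
    """Map each '### ' heading to keywords from the heading and its bullets."""
    topics: dict[str, set[str]] = {}
    current = None
    for line in root_md.splitlines():
        if line.startswith("### "):
            current = line[4:].split("(")[0].strip()
            topics[current] = keywords(current)
        elif current and line.startswith("- ") and not line.startswith("- See:"):
            topics[current] |= keywords(line)
    return topics


def match_topics(query: str, root_md: str) -> list[str]:
    """Return the topic headings that share at least one keyword with the query."""
    q = keywords(query)
    return [name for name, words in known_topics(root_md).items() if q & words]


ROOT = """### Authentication (Last: 2026-03-15)
- JWT with refresh tokens implemented
### Payment Integration (Last: 2026-02-28)
- Stripe checkout configured
"""
print(match_topics("Do we have any context about the payment system integration?", ROOT))
# ['Payment Integration']
```

An empty list is the "I have never seen this" answer, and it costs no LLM call or vector search.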

The ROOT.md Structure

The ROOT.md file is critical. Here’s an example:

# Project Context Index
Last Updated: 2026-03-21

## Active Topics

### Authentication (Last: 2026-03-15)
- JWT with refresh tokens implemented
- CORS issues resolved with credentials
- See: daily/2026-03-15.json

### Database Design (Last: 2026-03-18)
- Schema with RBAC complete
- Soft deletes implemented
- Audit log pending
- See: weekly/2026-W11.json

### Payment Integration (Last: 2026-02-28)
- Stripe checkout configured
- Webhook handling implemented
- See: monthly/2026-02.json

## Inactive Topics

### Logging System (Last: 2026-01-10)
- Basic logging complete
- Structured logging pending
- See: monthly/2026-01.json

This ~3K token file gives the agent instant awareness of everything it knows.
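Because the index follows a regular layout, it can also be parsed back into structured entries, for example to follow a "See:" pointer on demand. This is a sketch that assumes exactly the heading and pointer format shown in the example above; TopicEntry and parse_root are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class TopicEntry:
    name: str
    last_seen: str                 # ISO date string from the heading
    pointer: Optional[str] = None  # relative path from the "See:" bullet


HEADING = re.compile(r"^### (?P<name>.+?) \(Last: (?P<date>\d{4}-\d{2}-\d{2})\)\s*$")
POINTER = re.compile(r"^- See: (?P<path>\S+)\s*$")


def parse_root(root_md: str) -> list[TopicEntry]:
    """Extract topic name, last-seen date, and summary pointer per topic."""
    entries: list[TopicEntry] = []
    for line in root_md.splitlines():
        heading = HEADING.match(line)
        if heading:
            entries.append(TopicEntry(heading["name"], heading["date"]))
            continue
        pointer = POINTER.match(line)
        if pointer and entries:
            entries[-1].pointer = pointer["path"]
    return entries


SAMPLE = """## Active Topics

### Authentication (Last: 2026-03-15)
- JWT with refresh tokens implemented
- See: daily/2026-03-15.json
"""
for entry in parse_root(SAMPLE):
    print(entry.name, entry.last_seen, entry.pointer)
```

Lines that don't match either pattern are ignored, so free-form bullets in ROOT.md don't break the parser.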

Compaction Algorithm

The compaction process runs periodically:

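# Placeholder import: wire in your own LLM call here. extract_topics() and
# count_tokens() used below are likewise helpers you need to supply.
from llm import summarize_with_llm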

def compact_level(sources: list, target_level: int,
                  period: tuple[datetime, datetime]) -> CompactedSummary:
    """Compact multiple sources into a higher-level summary."""

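    # Combine source content. Raw chunks expose `timestamp`/`content`,
    # while summaries expose `period_start`/`summary`, so handle both.
    combined = "\n\n---\n\n".join(
        f"[{getattr(s, 'timestamp', None) or s.period_start}]\n"
        f"{getattr(s, 'summary', None) or s.content}"
        for s in sources
    )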

    # Use LLM to create hierarchical summary
    prompt = f"""Create a {['', 'daily', 'weekly', 'monthly'][target_level]}
    summary of these conversations from {period[0]} to {period[1]}.

    Focus on:
    1. Key decisions made
    2. Problems solved
    3. Pending items
    4. Topics covered (for indexing)

    Conversations:
    {combined}
    """

    summary_text = summarize_with_llm(prompt, max_tokens=1000)

    # Extract topics for indexing
    topics = extract_topics(summary_text)

    return CompactedSummary(
        level=target_level,
        period_start=period[0],
        period_end=period[1],
        summary=summary_text,
        topics=topics,
        source_ids=[s.id for s in sources],
        token_count=count_tokens(summary_text)
    )
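Before compact_level can produce a daily summary, the raw chunks have to be bucketed by calendar day. A minimal sketch of that step, using a stand-in Chunk class so the example runs on its own; group_by_day and day_period are hypothetical helper names, not part of the code above.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date, datetime


@dataclass
class Chunk:
    """Stand-in for ConversationChunk with only the fields needed here."""
    id: str
    timestamp: datetime
    content: str


def group_by_day(chunks: list[Chunk]) -> dict[date, list[Chunk]]:
    """Bucket chunks by calendar day, keeping chronological order within a day."""
    groups: dict[date, list[Chunk]] = defaultdict(list)
    for chunk in sorted(chunks, key=lambda c: c.timestamp):
        groups[chunk.timestamp.date()].append(chunk)
    return dict(groups)


def day_period(day: date) -> tuple[datetime, datetime]:
    """Start and end datetimes of a day, in the shape compact_level's `period` expects."""
    return (datetime.combine(day, datetime.min.time()),
            datetime.combine(day, datetime.max.time()))


chunks = [
    Chunk("c2", datetime(2026, 3, 15, 16, 0), "Fixed CORS"),
    Chunk("c1", datetime(2026, 3, 15, 9, 30), "JWT setup"),
    Chunk("c3", datetime(2026, 3, 18, 11, 0), "Schema review"),
]
for day, items in group_by_day(chunks).items():
    print(day, [c.id for c in items], day_period(day))
```

Each (day, items) pair can then be passed on as the sources and period for a level-1 summary.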

Session Start Protocol

When a new session begins, here’s the protocol:

def start_session(agent, tree: CompactionTree) -> dict:
    """Initialize a new session with persistent memory."""

    # 1. Load ROOT.md (~3K tokens)
    root_context = tree.get_root_context()

    # 2. Inject into system prompt
    system_prompt = f"""You are an AI assistant with persistent memory.

    CONTEXT INDEX (you have memory of these topics):
    {root_context}

    When the user asks about something:
    1. Check if it's in the context index
    2. If yes, say "I have context on this from [date]. Let me retrieve details."
    3. If no, say "I don't have previous context on this topic."

    Never claim ignorance for topics in the index.
    Never claim knowledge for topics not in the index.
    """

    # 3. Ready to receive queries
    return {
        "system_prompt": system_prompt,
        "root_context": root_context,
        "tree": tree  # For on-demand retrieval
    }

def retrieve_details(tree: CompactionTree, topic: str) -> str:
    """Retrieve detailed context when needed."""

    # Walk down the tree from monthly to daily
    context_parts = []

    # Load relevant monthly summary
    for monthly_file in tree.monthly_path.glob("*.json"):
        data = json.loads(monthly_file.read_text())
        if topic.lower() in str(data["topics"]).lower():
            context_parts.append(f"Monthly Summary:\n{data['summary']}")

            # Load relevant weekly summaries. The IDs a monthly summary was built
            # from live in its `source_ids` field (see CompactedSummary).
            for week_id in data.get("source_ids", []):
                week_file = tree.weekly_path / f"{week_id}.json"
                if week_file.exists():
                    week_data = json.loads(week_file.read_text())
                    if topic.lower() in str(week_data["topics"]).lower():
                        context_parts.append(f"\nWeekly Detail:\n{week_data['summary']}")

    return "\n\n".join(context_parts) if context_parts else "No detailed context found."

Why This Works

The compaction tree solves all three core problems:

Problem 1: Context Loss

Solution: ROOT.md provides instant awareness of all past topics
Agent knows “I have context on authentication” without loading everything

Problem 2: Token Waste

Solution: Only load what’s needed (ROOT at start, details on demand)
Typical session: 3K tokens for awareness + 5K for relevant details = 8K total
Compare to: 30K+ for re-explaining or 100K+ for full context

Problem 3: Broken Continuity

Solution: Hierarchical structure preserves relationships between topics
Weekly summaries maintain thread connections
Monthly summaries show project evolution
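The "5K for relevant details" figure under Problem 2 only holds if retrieval is capped. One way to enforce that, sketched under the assumption of a rough 4-characters-per-token estimate and a hypothetical assemble_context helper:

```python
def assemble_context(parts: list[str], budget_tokens: int = 5_000,
                     chars_per_token: float = 4.0) -> list[str]:
    """Keep parts in order until the next one would exceed the token budget."""
    selected: list[str] = []
    used = 0
    for part in parts:
        cost = int(len(part) / chars_per_token) + 1  # round up to be conservative
        if used + cost > budget_tokens:
            break
        selected.append(part)
        used += cost
    return selected


# Order parts from coarse to fine (monthly, weekly, daily) so the broad
# summaries survive when the budget runs out.
parts = ["monthly " * 500, "weekly " * 500, "daily " * 10_000]
kept = assemble_context(parts)
print(len(kept), "of", len(parts), "parts fit in the budget")
```

Stopping at the first part that doesn't fit, rather than skipping it and trying smaller ones, keeps the result a clean prefix of the coarse-to-fine ordering.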

Practical Considerations

When to Compact

Don’t compact immediately after every conversation. I use these triggers:

def should_compact(tree: CompactionTree) -> tuple[bool, str]:
    """Determine if compaction is needed."""

    # Daily: Compact at end of day or when >10 raw chunks exist
    raw_count = len(list(tree.raw_path.glob("*.json")))
    if raw_count > 10:
        return True, "daily"

    # Weekly: Compact when >7 daily summaries exist, or on Sunday if any exist
    daily_count = len(list(tree.daily_path.glob("*.json")))
    if daily_count > 7 or (daily_count > 0 and datetime.now().weekday() == 6):
        return True, "weekly"

    # Monthly: Compact when >4 weekly summaries exist, or on the 1st if any exist
    weekly_count = len(list(tree.weekly_path.glob("*.json")))
    if weekly_count > 4 or (weekly_count > 0 and datetime.now().day == 1):
        return True, "monthly"

    return False, ""

Handling Topic Drift

Topics evolve. The ROOT.md needs periodic rebuilding:

def rebuild_root(tree: CompactionTree):
    """Rebuild ROOT.md from monthly summaries."""

    topics = {}  # topic -> {last_seen, monthly_id, summary}

    # Scan all monthly summaries
    for monthly_file in sorted(tree.monthly_path.glob("*.json")):
        data = json.loads(monthly_file.read_text())

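        # JSON stores dates as ISO strings, so parse them back into datetimes
        # before doing date arithmetic below.
        period_end = datetime.fromisoformat(data["period_end"])

        for topic in data["topics"]:
            if topic not in topics:
                topics[topic] = {
                    "last_seen": period_end,
                    "monthly_id": monthly_file.stem,
                    "summary": data["summary"][:200]  # Preview
                }
            else:
                topics[topic]["last_seen"] = period_end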

    # Generate ROOT.md
    active = {k: v for k, v in topics.items()
              if (datetime.now() - v["last_seen"]).days < 30}
    inactive = {k: v for k, v in topics.items()
                if (datetime.now() - v["last_seen"]).days >= 30}

    root_content = generate_root_markdown(active, inactive)
    tree.root_path.write_text(root_content)
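rebuild_root relies on a generate_root_markdown helper that isn't shown. A possible version, written as a sketch that reproduces the ROOT.md layout from earlier; the exact bullet content and the optional `now` parameter are my assumptions.

```python
from datetime import datetime
from typing import Optional


def _fmt_date(value) -> str:
    """Accept a datetime or an ISO date string and return YYYY-MM-DD."""
    if isinstance(value, datetime):
        return value.strftime("%Y-%m-%d")
    return str(value)[:10]


def _section(title: str, topics: dict) -> list[str]:
    lines = [f"## {title}", ""]
    # Most recently discussed topics first
    ordered = sorted(topics.items(), key=lambda kv: kv[1]["last_seen"], reverse=True)
    for name, info in ordered:
        preview = " ".join(info["summary"].split())  # collapse newlines
        lines += [
            f"### {name} (Last: {_fmt_date(info['last_seen'])})",
            f"- {preview}",
            f"- See: monthly/{info['monthly_id']}.json",
            "",
        ]
    return lines


def generate_root_markdown(active: dict, inactive: dict,
                           now: Optional[datetime] = None) -> str:
    """Render the topic index in the ROOT.md layout."""
    now = now or datetime.now()
    lines = ["# Project Context Index", f"Last Updated: {now:%Y-%m-%d}", ""]
    lines += _section("Active Topics", active)
    lines += _section("Inactive Topics", inactive)
    return "\n".join(lines).rstrip() + "\n"


active = {"Authentication": {"last_seen": datetime(2026, 3, 15),
                             "monthly_id": "2026-03",
                             "summary": "JWT with refresh tokens implemented"}}
inactive = {"Logging System": {"last_seen": datetime(2026, 1, 10),
                               "monthly_id": "2026-01",
                               "summary": "Basic logging complete"}}
print(generate_root_markdown(active, inactive, now=datetime(2026, 3, 21)))
```

Passing `now` explicitly makes the output reproducible, which helps when diffing ROOT.md between rebuilds.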

Storage Requirements

Here’s what I’ve seen in practice:

Raw conversations:  ~500MB (50MB compressed)
Daily summaries:     ~50MB  (5MB compressed)
Weekly summaries:    ~5MB   (500KB compressed)
Monthly summaries:   ~500KB (50KB compressed)
ROOT.md:            ~15KB  (3K tokens)

Total:              ~555MB (55MB with compression)

Each level is roughly 10x smaller than the one below it, and only ROOT.md needs to be loaded at session start.

Integration with LangChain

Here’s how to integrate with LangChain’s memory system:

from langchain.schema import BaseMemory
from typing import Any, Dict, List

class CompactionTreeMemory(BaseMemory):
    """LangChain memory backed by compaction tree."""

    tree: CompactionTree
    session_context: str = ""

    @property
    def memory_variables(self) -> List[str]:
        return ["context_index", "relevant_context"]

    def load_memory_variables(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        """Load memory variables for prompt."""

        # Always provide the root index
        result = {"context_index": self.tree.get_root_context()}

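        # Check if user query relates to known topics. Match per extracted
        # topic: the raw sentence will almost never appear verbatim in ROOT.md.
        user_input = inputs.get("input", "")
        relevant = []
        for topic in extract_topics(user_input):
            relevant.extend(self.tree.find_relevant_context(topic))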

        if relevant:
            result["relevant_context"] = "\n\n".join(relevant)
        else:
            result["relevant_context"] = "No relevant previous context."

        return result

    def save_context(self, inputs: Dict[str, Any], outputs: Dict[str, Any]) -> None:
        """Save conversation to memory."""

        chunk = ConversationChunk(
            id=generate_id(),
            timestamp=datetime.now(),
            content=f"User: {inputs.get('input', '')}\nAI: {outputs.get('output', '')}",
            tokens=estimate_tokens(inputs.get('input', '') + outputs.get('output', '')),
            topics=extract_topics(inputs.get('input', ''))
        )

        self.tree.add_conversation(chunk)

    def clear(self) -> None:
        """Clear session context (not persistent memory)."""
        self.session_context = ""
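The memory class above calls generate_id and extract_topics without defining them. Below is one possible stand-in for each, offered as a sketch: the ID is a random UUID, and topic extraction is a naive frequency count over non-stopwords. In practice an LLM call or a proper keyword extractor would likely work better; the stopword list here is illustrative only.

```python
import re
import uuid
from collections import Counter

# Illustrative stopword list; tune it for your own conversations.
STOPWORDS = {"from", "with", "that", "this", "have", "what", "about",
             "when", "then", "does", "came", "were", "there", "their"}


def generate_id() -> str:
    """Random 32-character hex ID for a conversation chunk."""
    return uuid.uuid4().hex


def extract_topics(text: str, max_topics: int = 5) -> list[str]:
    """Most frequent words of 4+ characters, excluding stopwords."""
    words = re.findall(r"[a-z][a-z0-9_-]+", text.lower())
    candidates = [w for w in words if len(w) > 3 and w not in STOPWORDS]
    return [word for word, _ in Counter(candidates).most_common(max_topics)]


print(generate_id())
print(extract_topics("How did we fix the CORS error? The CORS error came from credentials."))
```

Lowercasing keeps the topics consistent with the case-insensitive matching used elsewhere in the tree.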

Limitations and Trade-offs

This approach isn’t perfect:

Information Loss: Each compaction level loses detail. Important specifics might be compressed away. Mitigate by keeping raw logs for manual review.

Recency Bias: The topic extraction might miss older but relevant context. Mitigate by occasionally scanning ROOT.md for connections.

LLM Dependency: Compaction requires LLM calls. Budget for this. I spend about $2/month on compaction for daily use.

Implementation Complexity: More complex than MEMORY.md or basic RAG. But the complexity buys you real awareness.

Comparison to Hosted Solutions

You might ask: “Why not use hosted memory services?”

Hosted services like Letta (formerly MemGPT) provide similar functionality, but consider:

Dependencies: Another API key, another service to monitor
Costs: Subscription fees on top of LLM costs
Data Control: Your conversation history on someone else’s servers
Customization: Limited ability to tune compaction strategies

If these concerns matter to you, a self-hosted compaction tree is worth the implementation effort.

Summary

The compaction tree approach works because it separates awareness from retrieval:

ROOT.md provides instant topic awareness at session start
Hierarchical compression maintains context relationships
On-demand retrieval loads only relevant details
Token costs stay manageable (~8K per session vs 100K+ for alternatives)

The key insight: agents don’t need perfect memory—they need to know what they know, and be able to retrieve it when needed.

This approach transformed my AI agent workflow from “explain everything every session” to “continue from where we left off.” The time and token savings are substantial, but more importantly, the collaboration feels natural and continuous.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.
