
How to Implement Persistent Memory for AI Agents: A Practical Guide

My AI agent forgot everything again. After a 30-minute conversation where we debugged a complex authentication flow, I ended the session. The next day, I started a new session and asked about the solution we found.

Blank stare. “I don’t have any context about that authentication issue you’re referring to.”

I had to spend another 20 minutes and 30K+ tokens re-explaining the entire context. This wasn’t just frustrating—it was a fundamental problem with how AI agents handle memory.

The Problem: Session Amnesia

The core issue is simple: AI agents have no persistent memory between sessions. Every conversation starts from zero. This creates three major problems:

  1. Context loss: All the context you built up is gone
  2. Token waste: Re-explaining the same information costs money and time
  3. Broken continuity: You can’t build on previous work

I tried several “solutions” that didn’t work. Let me walk through what I attempted and why each failed.

Failed Approach #1: The MEMORY.md File

My first attempt was straightforward: maintain a MEMORY.md file that stores important context.

MEMORY.md
# Project Context
## Authentication Flow (2026-03-15)
- Using JWT with refresh tokens
- Token expiry: 15 minutes
- Refresh token stored in HTTP-only cookie
- Issue: CORS problems with credentials
## Database Schema (2026-03-18)
- Users table with role-based access
- Posts table with soft deletes
- Need to add: audit log table

This worked for about a week. Then problems emerged:

The Overflow Problem: The file grew to 50KB. Loading it into context consumed 12K tokens every session.

The Pruning Problem: I had to manually decide what to keep. Delete too much? Lose valuable context. Keep too much? Context window explodes.

The Retrieval Problem: The agent had to read the entire file to find relevant information. No structure, no indexing.

I realized a flat file approach fundamentally doesn’t scale. The agent needs something smarter.
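To see why the flat file stops scaling, it helps to put a number on what it costs to reload every session. The sketch below is only an estimate: it assumes the common rule of thumb of roughly 4 characters per token for English text (real tokenizers vary), and `estimate_tokens` and `session_overhead` are names invented for this example.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; assumes ~4 characters per token (heuristic only)."""
    return int(len(text) / chars_per_token)


def session_overhead(memory_text: str, sessions_per_day: int, days: int) -> int:
    """Total tokens spent just re-loading the memory file over a period."""
    return estimate_tokens(memory_text) * sessions_per_day * days


memory = "x" * 50_000  # stand-in for a 50KB MEMORY.md
print(estimate_tokens(memory))
print(session_overhead(memory, sessions_per_day=3, days=30))
```

Under that assumption a 50KB file lands in the same range as the 12K tokens I measured, and the cost repeats on every session whether or not the content is relevant.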

Failed Approach #2: Vector Search (RAG)

“Use RAG!” everyone said. “Vector search is the solution!”

I implemented a vector database with embeddings for all past conversations. Here’s the basic setup:

memory_rag.py
# Import paths from older LangChain releases; newer versions move these
# to langchain_openai, langchain_community and langchain_core.
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.schema import Document


class MemoryRAG:
    def __init__(self, persist_directory: str):
        self.embeddings = OpenAIEmbeddings()
        self.vectorstore = Chroma(
            embedding_function=self.embeddings,
            persist_directory=persist_directory
        )

    def store_conversation(self, conversation: str, metadata: dict):
        """Store a conversation chunk in vector database."""
        doc = Document(page_content=conversation, metadata=metadata)
        self.vectorstore.add_documents([doc])

    def recall(self, query: str, k: int = 5) -> list[Document]:
        """Retrieve relevant conversations."""
        return self.vectorstore.similarity_search(query, k=k)

This worked great for queries like:

  • “What was the authentication solution?”
  • “How did we fix the CORS error?”

But there was a critical flaw I discovered when I asked: “Do we have any context about the payment system integration?”

The agent responded: “Let me search for payment system information…”

It found nothing. But we HAD discussed payments three weeks ago. The problem? I didn’t know the right search terms. The agent didn’t know what it knew.

The Awareness Problem: RAG answers “find X” queries well. But it fails at “do I know anything about X?” awareness questions. The agent can’t tell “I know this” from “I’ve never seen this” without loading everything or searching for something specific.

Vector search solves retrieval, not awareness.

Failed Approach #3: Just Use Larger Context Windows

“GPT-4 Turbo has a 128K context window! Just load everything!”

I tried loading the entire conversation history into context. Here’s what happened:

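Attention Degradation: Research on long-context models (the “lost in the middle” effect) shows that accuracy drops noticeably when the relevant information sits in the middle of a long context rather than near the start or end. In practice, the agent “forgets” details buried deep in the history.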

Cost Explosion: Input tokens are billed on every request, so sending a 100K-token context with each message adds up fast. Even with price drops, this isn’t sustainable for daily use.

Practical Limits: Even 128K isn’t infinite. Long-running projects exceed this quickly.

Large context windows help, but they’re not a memory solution—they’re a bandage.

The Real Solution: Hierarchical Compaction

After months of frustration, I found an approach that actually works: a compaction tree that maintains topic awareness without loading everything.

The Core Insight

The key insight from Reddit discussions and research papers: you need an index of what you know, not the actual content. The agent needs to answer “do I have context on X?” in milliseconds, not by searching through thousands of tokens.
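A minimal sketch of what "an index of what you know" can look like. This is an illustration, not the final design: `TopicIndex` and `TopicEntry` are hypothetical names, and matching is a plain substring check on topic keywords. The point is that the awareness question becomes a dictionary scan rather than a search over stored conversations.

```python
from dataclasses import dataclass


@dataclass
class TopicEntry:
    last_seen: str  # ISO date of the last conversation on this topic
    pointer: str    # where the detailed summary lives


class TopicIndex:
    """Tiny in-memory index: topic keyword -> where the details are stored."""

    def __init__(self) -> None:
        self._entries: dict[str, TopicEntry] = {}

    def add(self, topic: str, last_seen: str, pointer: str) -> None:
        self._entries[topic.lower()] = TopicEntry(last_seen, pointer)

    def lookup(self, query: str) -> list[tuple[str, TopicEntry]]:
        """Return every indexed topic mentioned in the query (substring match)."""
        q = query.lower()
        return [(t, e) for t, e in self._entries.items() if t in q]


index = TopicIndex()
index.add("authentication", "2026-03-15", "daily/2026-03-15.json")
index.add("payment", "2026-02-28", "monthly/2026-02.json")

# Known topic: the index says where to look, without loading the content
print(index.lookup("Do we have any context about the payment system integration?"))
# Unknown topic: an empty result is itself the answer
print(index.lookup("What about the email templates?"))
```

An empty result is meaningful here: the agent can say "I have no previous context on this" with some confidence, which is exactly what vector search could not give me.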

The Compaction Tree Structure

Imagine conversation history flowing through a compression pipeline:

Compaction Tree Overview
┌─────────────────────────────────────────────────────────┐
│                         ROOT.md                         │
│   (Topic Index - ~3K tokens, loaded at session start)   │
│  Contains: High-level summaries, topic keywords, dates  │
└───────────────────────┬─────────────────────────────────┘
        ┌───────────────┼───────────────┐
        ▼               ▼               ▼
   ┌─────────┐     ┌─────────┐     ┌─────────┐
   │Monthly 1│     │Monthly 2│     │Monthly 3│
   │Summary  │     │Summary  │     │Summary  │
   └────┬────┘     └────┬────┘     └────┬────┘
        │               │               │
   ┌────┴────┐     ┌────┴────┐     ┌────┴────┐
   ▼         ▼     ▼         ▼     ▼         ▼
 Weekly   Weekly Weekly   Weekly Weekly   Weekly
   │         │     │         │     │         │
   ▼         ▼     ▼         ▼     ▼         ▼
 Daily     Daily Daily     Daily Daily     Daily
   │         │     │         │     │         │
   ▼         ▼     ▼         ▼     ▼         ▼
  Raw       Raw   Raw       Raw   Raw       Raw
               Conversation Chunks

Level Definitions

Level 0 - Raw: The actual conversation chunks, token-limited segments of the original dialogue.

Level 1 - Daily: A summary of all conversations from one day, highlighting key decisions, questions, and outcomes.

Level 2 - Weekly: A compressed summary of the week’s daily summaries, extracting patterns and major themes.

Level 3 - Monthly: High-level overview of the month, capturing major milestones and project evolution.

Level 4 - Root: The topic index loaded at session start. Contains just enough information to answer “do I know about X?”
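One way to make the levels concrete is to map a timestamp to the bucket it belongs to at each level. The sketch below is an assumption about naming, not something the tree requires: it uses ISO week numbers for the weekly level, and `Level` and `bucket_key` are names made up for this example.

```python
from datetime import datetime
from enum import IntEnum


class Level(IntEnum):
    RAW = 0
    DAILY = 1
    WEEKLY = 2
    MONTHLY = 3
    ROOT = 4


def bucket_key(ts: datetime, level: Level) -> str:
    """Name of the summary bucket that a timestamp falls into at a given level."""
    if level == Level.DAILY:
        return ts.strftime("%Y-%m-%d")
    if level == Level.WEEKLY:
        # Assumption: ISO 8601 week numbering (weeks start on Monday)
        iso_year, iso_week, _ = ts.isocalendar()
        return f"{iso_year}-W{iso_week:02d}"
    if level == Level.MONTHLY:
        return ts.strftime("%Y-%m")
    if level == Level.ROOT:
        return "ROOT"
    raise ValueError("raw chunks are stored by id, not by period")


ts = datetime(2026, 3, 15, 14, 30)
for level in (Level.DAILY, Level.WEEKLY, Level.MONTHLY):
    print(level.name, bucket_key(ts, level))
```

Keys like these double as file names, so every raw chunk has exactly one parent at each level and a summary can always be traced back to the period it covers.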

Implementation Strategy

Here’s how I implemented this in Python:

compaction_tree.py
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
import json


@dataclass
class ConversationChunk:
    """Raw conversation segment."""
    id: str
    timestamp: datetime
    content: str
    tokens: int
    topics: list[str]


@dataclass
class CompactedSummary:
    """Compressed summary at any level."""
    level: int  # 1=daily, 2=weekly, 3=monthly, 4=root
    period_start: datetime
    period_end: datetime
    summary: str
    topics: list[str]
    source_ids: list[str]  # IDs of summarized items
    token_count: int


class CompactionTree:
    def __init__(self, base_path: Path):
        self.base_path = base_path
        self.raw_path = base_path / "raw"
        self.daily_path = base_path / "daily"
        self.weekly_path = base_path / "weekly"
        self.monthly_path = base_path / "monthly"
        self.root_path = base_path / "ROOT.md"
        # Create directories
        for path in [self.raw_path, self.daily_path,
                     self.weekly_path, self.monthly_path]:
            path.mkdir(parents=True, exist_ok=True)

    def add_conversation(self, chunk: ConversationChunk):
        """Add a new conversation chunk to the tree."""
        # Store raw chunk
        chunk_file = self.raw_path / f"{chunk.id}.json"
        chunk_file.write_text(json.dumps({
            "id": chunk.id,
            "timestamp": chunk.timestamp.isoformat(),
            "content": chunk.content,
            "tokens": chunk.tokens,
            "topics": chunk.topics
        }))
        # Trigger compaction check
        self._check_compaction_needed()

    def _check_compaction_needed(self):
        """Hook for running compaction; wire this up to should_compact().

        Left as a no-op here so the class works on its own.
        """

    def get_root_context(self) -> str:
        """Load the ROOT.md for session start."""
        if self.root_path.exists():
            return self.root_path.read_text()
        return "# No previous context\n\nThis is a fresh session."

    def find_relevant_context(self, topic: str) -> list[str]:
        """Find all context related to a topic keyword (substring match)."""
        needle = topic.lower()
        # First check ROOT for awareness
        root = self.get_root_context()
        if needle not in root.lower():
            return []  # Agent knows it doesn't know this
        # Topic exists, now load relevant summaries
        results = []
        for monthly in self.monthly_path.glob("*.json"):
            data = json.loads(monthly.read_text())
            if needle in data["summary"].lower():
                results.append(data["summary"])
        return results

The ROOT.md Structure

The ROOT.md file is critical. Here’s an example:

ROOT.md
# Project Context Index
Last Updated: 2026-03-21
## Active Topics
### Authentication (Last: 2026-03-15)
- JWT with refresh tokens implemented
- CORS issues resolved with credentials
- See: daily/2026-03-15.json
### Database Design (Last: 2026-03-18)
- Schema with RBAC complete
- Soft deletes implemented
- Audit log pending
- See: weekly/2026-W11.json
### Payment Integration (Last: 2026-02-28)
- Stripe checkout configured
- Webhook handling implemented
- See: monthly/2026-02.json
## Inactive Topics
### Logging System (Last: 2026-01-10)
- Basic logging complete
- Structured logging pending
- See: monthly/2026-01.json

This ~3K token file gives the agent instant awareness of everything it knows.
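Because ROOT.md follows a regular layout, it can also be read back into a structure the agent code can query. The parser below is a sketch that assumes exactly the heading and "See:" conventions shown in the example above; `parse_root` is a hypothetical helper, not part of the tree class.

```python
import re

# Patterns assume the "### Topic (Last: YYYY-MM-DD)" and "- See: path" layout
HEADING = re.compile(r"^### (?P<topic>.+?) \(Last: (?P<date>\d{4}-\d{2}-\d{2})\)$")
POINTER = re.compile(r"^- See: (?P<path>\S+)$")


def parse_root(markdown: str) -> dict[str, dict[str, str]]:
    """Map each topic heading to its last-seen date and detail pointer."""
    topics: dict[str, dict[str, str]] = {}
    current = None
    for line in markdown.splitlines():
        line = line.strip()
        heading = HEADING.match(line)
        if heading:
            current = heading["topic"]
            topics[current] = {"last_seen": heading["date"], "see": ""}
            continue
        pointer = POINTER.match(line)
        if pointer and current is not None:
            topics[current]["see"] = pointer["path"]
    return topics


sample = """\
## Active Topics
### Authentication (Last: 2026-03-15)
- JWT with refresh tokens implemented
- See: daily/2026-03-15.json
### Payment Integration (Last: 2026-02-28)
- See: monthly/2026-02.json
"""

for topic, info in parse_root(sample).items():
    print(topic, info)
```

With the index parsed, the "See:" pointer tells the retrieval step which single file to open instead of scanning every summary.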

Compaction Algorithm

The compaction process runs periodically:

compactor.py
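from datetime import datetime

from compaction_tree import CompactedSummary
from llm import summarize_with_llm  # Your LLM call

# extract_topics() and count_tokens() are helpers you need to supply.


def compact_level(sources: list, target_level: int,
                  period: tuple[datetime, datetime]) -> CompactedSummary:
    """Compact multiple sources into a higher-level summary."""
    # Sources are raw chunks (timestamp/content/id) for the daily level and
    # lower-level summaries (period_start/summary, no id) above that.
    def stamp(s):
        return s.timestamp if hasattr(s, "timestamp") else s.period_start

    def text(s):
        return s.content if hasattr(s, "content") else s.summary

    def source_id(s):
        # Summaries have no id field; this derives one from level and period
        # start. Adjust it to match the file names you store summaries under.
        return s.id if hasattr(s, "id") else f"L{s.level}-{s.period_start:%Y-%m-%d}"

    # Combine source content
    combined = "\n\n---\n\n".join(
        f"[{stamp(s)}]\n{text(s)}" for s in sources
    )
    # Use LLM to create hierarchical summary
    prompt = f"""Create a {['', 'daily', 'weekly', 'monthly'][target_level]}
summary of these conversations from {period[0]} to {period[1]}.
Focus on:
1. Key decisions made
2. Problems solved
3. Pending items
4. Topics covered (for indexing)
Conversations:
{combined}
"""
    summary_text = summarize_with_llm(prompt, max_tokens=1000)
    # Extract topics for indexing
    topics = extract_topics(summary_text)
    return CompactedSummary(
        level=target_level,
        period_start=period[0],
        period_end=period[1],
        summary=summary_text,
        topics=topics,
        source_ids=[source_id(s) for s in sources],
        token_count=count_tokens(summary_text)
    )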
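The `extract_topics` helper is not shown above. One possible sketch follows, under a loud assumption: the summarization prompt is extended to ask the model to end its answer with a line of the form `Topics: a, b, c`. If your prompt does not request that line, this returns an empty list rather than guessing.

```python
import re


def extract_topics(summary_text: str) -> list[str]:
    """Pull topic keywords from a 'Topics: a, b, c' line, if one exists.

    Assumes the summarization prompt asks for such a line (an assumption
    of this sketch). Topics are lowercased and de-duplicated in order.
    """
    match = re.search(r"^topics(?: covered)?:[ \t]*(.+)$", summary_text,
                      flags=re.IGNORECASE | re.MULTILINE)
    if not match:
        return []
    seen: list[str] = []
    for raw in match.group(1).split(","):
        topic = raw.strip().lower()
        if topic and topic not in seen:
            seen.append(topic)
    return seen


summary = """Key decisions: JWT with refresh tokens.
Pending: audit log table.
Topics: Authentication, CORS, authentication, Database Schema
"""
print(extract_topics(summary))
```

Parsing an explicit line keeps topic extraction deterministic and cheap; a second LLM call or keyword extraction would also work, at the cost of more variance in the index.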

Session Start Protocol

When a new session begins, here’s the protocol:

session.py
import json

from compaction_tree import CompactionTree


def start_session(agent, tree: CompactionTree) -> dict:
    """Initialize a new session with persistent memory."""
    # 1. Load ROOT.md (~3K tokens)
    root_context = tree.get_root_context()
    # 2. Inject into system prompt
    system_prompt = f"""You are an AI assistant with persistent memory.
CONTEXT INDEX (you have memory of these topics):
{root_context}
When the user asks about something:
1. Check if it's in the context index
2. If yes, say "I have context on this from [date]. Let me retrieve details."
3. If no, say "I don't have previous context on this topic."
Never claim ignorance for topics in the index.
Never claim knowledge for topics not in the index.
"""
    # 3. Ready to receive queries
    return {
        "system_prompt": system_prompt,
        "root_context": root_context,
        "tree": tree  # For on-demand retrieval
    }


def retrieve_details(tree: CompactionTree, topic: str) -> str:
    """Retrieve detailed context when needed."""
    # Walk down the tree from monthly to weekly summaries
    needle = topic.lower()
    context_parts = []
    for monthly_file in tree.monthly_path.glob("*.json"):
        data = json.loads(monthly_file.read_text())
        if needle not in str(data["topics"]).lower():
            continue
        context_parts.append(f"Monthly Summary:\n{data['summary']}")
        # Load relevant weekly summaries. Assumes the monthly file's
        # source_ids are the file stems of the weekly summaries it covers.
        for week_id in data.get("source_ids", []):
            week_file = tree.weekly_path / f"{week_id}.json"
            if not week_file.exists():
                continue
            week_data = json.loads(week_file.read_text())
            if needle in str(week_data["topics"]).lower():
                context_parts.append(f"\nWeekly Detail:\n{week_data['summary']}")
    return "\n\n".join(context_parts) if context_parts else "No detailed context found."

Why This Works

The compaction tree solves all three core problems:

Problem 1: Context Loss

  • Solution: ROOT.md provides instant awareness of all past topics
  • Agent knows “I have context on authentication” without loading everything

Problem 2: Token Waste

  • Solution: Only load what’s needed (ROOT at start, details on demand)
  • Typical session: 3K tokens for awareness + 5K for relevant details = 8K total
  • Compare to: 30K+ for re-explaining or 100K+ for full context

Problem 3: Broken Continuity

  • Solution: Hierarchical structure preserves relationships between topics
  • Weekly summaries maintain thread connections
  • Monthly summaries show project evolution
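The token comparison above can be checked with a few lines of arithmetic. The figures are the rough ones quoted in this article, not measurements from a benchmark, and `reduction` is a helper made up for this example.

```python
ROOT_TOKENS = 3_000            # ROOT.md loaded at session start
DETAIL_TOKENS = 5_000          # on-demand summaries for the topics in play
REEXPLAIN_TOKENS = 30_000      # re-explaining context by hand
FULL_HISTORY_TOKENS = 100_000  # loading the whole history


def reduction(baseline: int, actual: int) -> float:
    """Percentage of tokens saved relative to a baseline."""
    return round(100 * (1 - actual / baseline), 1)


per_session = ROOT_TOKENS + DETAIL_TOKENS
print(per_session)
print(reduction(REEXPLAIN_TOKENS, per_session))
print(reduction(FULL_HISTORY_TOKENS, per_session))
```

With these numbers a session costs 8K tokens, roughly 73% less than re-explaining and 92% less than loading the full history.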

Practical Considerations

When to Compact

Don’t compact immediately after every conversation. I use these triggers:

compaction_triggers.py
from datetime import datetime

from compaction_tree import CompactionTree


def should_compact(tree: CompactionTree) -> tuple[bool, str]:
    """Determine if compaction is needed."""
    # Daily: compact when >10 raw chunks exist (an end-of-day run can be
    # scheduled separately). This counts every file in raw/, so move or
    # mark chunks once compacted or the trigger keeps firing.
    raw_count = len(list(tree.raw_path.glob("*.json")))
    if raw_count > 10:
        return True, "daily"
    # Weekly: compact on Sunday or when >7 daily summaries exist
    daily_count = len(list(tree.daily_path.glob("*.json")))
    if daily_count > 7 or datetime.now().weekday() == 6:
        return True, "weekly"
    # Monthly: compact on the 1st or when >4 weekly summaries exist
    weekly_count = len(list(tree.weekly_path.glob("*.json")))
    if weekly_count > 4 or datetime.now().day == 1:
        return True, "monthly"
    return False, ""
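The trigger logic is easier to test when it does not touch the file system or the system clock. Below is a sketch of the same thresholds as a pure function. It is a variation, not the code above: it takes counts of items that are still pending compaction, accepts the current time as an argument, and only fires the calendar triggers when there is something to compact. `next_compaction` is a hypothetical name.

```python
from datetime import datetime
from typing import Optional


def next_compaction(raw_pending: int, daily_pending: int, weekly_pending: int,
                    now: datetime) -> Optional[str]:
    """Return the level to compact next, or None if nothing is due."""
    if raw_pending > 10:
        return "daily"
    # Sunday is weekday() == 6
    if daily_pending > 7 or (now.weekday() == 6 and daily_pending > 0):
        return "weekly"
    if weekly_pending > 4 or (now.day == 1 and weekly_pending > 0):
        return "monthly"
    return None


sunday = datetime(2026, 3, 15)
print(next_compaction(11, 0, 0, sunday))
print(next_compaction(3, 2, 0, sunday))
print(next_compaction(0, 0, 0, sunday))
```

Passing `now` in explicitly means a unit test can simulate a Sunday or the first of the month without patching `datetime`.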

Handling Topic Drift

Topics evolve. The ROOT.md needs periodic rebuilding:

root_rebuilder.py
import json
from datetime import datetime

from compaction_tree import CompactionTree

# generate_root_markdown() is a helper you need to supply.


def rebuild_root(tree: CompactionTree):
    """Rebuild ROOT.md from monthly summaries."""
    topics = {}  # topic -> {last_seen, monthly_id, summary}
    # Scan all monthly summaries, oldest first
    for monthly_file in sorted(tree.monthly_path.glob("*.json")):
        data = json.loads(monthly_file.read_text())
        # period_end comes back from JSON as a string; this assumes it was
        # written with datetime.isoformat()
        period_end = datetime.fromisoformat(data["period_end"])
        for topic in data["topics"]:
            # Later files overwrite earlier ones, so each topic ends up
            # pointing at the most recent month that mentions it
            topics[topic] = {
                "last_seen": period_end,
                "monthly_id": monthly_file.stem,
                "summary": data["summary"][:200]  # Preview
            }
    # Generate ROOT.md: topics seen in the last 30 days count as active
    now = datetime.now()
    active = {k: v for k, v in topics.items()
              if (now - v["last_seen"]).days < 30}
    inactive = {k: v for k, v in topics.items()
                if (now - v["last_seen"]).days >= 30}
    root_content = generate_root_markdown(active, inactive)
    tree.root_path.write_text(root_content)
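`generate_root_markdown` is not shown above. A possible sketch, assuming each topic entry carries `last_seen` as a `datetime`, plus `monthly_id` and `summary` as in `rebuild_root`, and that the output should follow the ROOT.md layout from earlier. The `_render_section` helper is a name invented here.

```python
from datetime import datetime


def _render_section(title: str, topics: dict) -> list[str]:
    """Render one '## Section' with its topics, most recent first."""
    lines = [f"## {title}"]
    ordered = sorted(topics.items(),
                     key=lambda kv: kv[1]["last_seen"], reverse=True)
    for name, info in ordered:
        lines.append(f"### {name} (Last: {info['last_seen']:%Y-%m-%d})")
        lines.append(f"- {info['summary']}")
        lines.append(f"- See: monthly/{info['monthly_id']}.json")
    return lines


def generate_root_markdown(active: dict, inactive: dict) -> str:
    """Render the topic index as markdown in the ROOT.md layout."""
    lines = ["# Project Context Index",
             f"Last Updated: {datetime.now():%Y-%m-%d}"]
    lines += _render_section("Active Topics", active)
    lines += _render_section("Inactive Topics", inactive)
    return "\n".join(lines) + "\n"


active = {"Authentication": {"last_seen": datetime(2026, 3, 15),
                             "monthly_id": "2026-03",
                             "summary": "JWT with refresh tokens implemented"}}
inactive = {"Logging System": {"last_seen": datetime(2026, 1, 10),
                               "monthly_id": "2026-01",
                               "summary": "Basic logging complete"}}
print(generate_root_markdown(active, inactive))
```

Sorting by recency keeps the topics most likely to come up near the top of the file, which also helps if the index ever has to be truncated to fit a token budget.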

Storage Requirements

Here’s what I’ve seen in practice:

Storage Metrics (6 months of daily use)
Raw conversations: ~500MB (50MB compressed)
Daily summaries: ~50MB (5MB compressed)
Weekly summaries: ~5MB (500KB compressed)
Monthly summaries: ~500KB (50KB compressed)
ROOT.md: ~15KB (3K tokens)
Total: ~555MB (55MB with compression)

Each level is roughly 10× smaller than the one below it, and only ROOT.md needs to be loaded at session start.

Integration with LangChain

Here’s how to integrate with LangChain’s memory system:

langchain_memory.py
from datetime import datetime
from typing import Any, Dict, List

from langchain.schema import BaseMemory

from compaction_tree import CompactionTree, ConversationChunk

# generate_id(), estimate_tokens() and extract_topics() are helpers you
# need to supply.


class CompactionTreeMemory(BaseMemory):
    """LangChain memory backed by compaction tree."""
    tree: CompactionTree
    session_context: str = ""

    @property
    def memory_variables(self) -> List[str]:
        return ["context_index", "relevant_context"]

    def load_memory_variables(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        """Load memory variables for prompt."""
        # Always provide the root index
        result = {"context_index": self.tree.get_root_context()}
        # Check if user query relates to known topics. find_relevant_context()
        # does a substring match, so pass it topic keywords rather than the
        # whole sentence.
        user_input = inputs.get("input", "")
        relevant = []
        for topic in extract_topics(user_input):
            relevant.extend(self.tree.find_relevant_context(topic))
        relevant = list(dict.fromkeys(relevant))  # drop duplicates, keep order
        if relevant:
            result["relevant_context"] = "\n\n".join(relevant)
        else:
            result["relevant_context"] = "No relevant previous context."
        return result

    def save_context(self, inputs: Dict[str, Any], outputs: Dict[str, Any]) -> None:
        """Save conversation to memory."""
        user_text = inputs.get("input", "")
        ai_text = outputs.get("output", "")
        chunk = ConversationChunk(
            id=generate_id(),
            timestamp=datetime.now(),
            content=f"User: {user_text}\nAI: {ai_text}",
            tokens=estimate_tokens(user_text + ai_text),
            topics=extract_topics(user_text)
        )
        self.tree.add_conversation(chunk)

    def clear(self) -> None:
        """Clear session context (not persistent memory)."""
        self.session_context = ""

Limitations and Trade-offs

This approach isn’t perfect:

Information Loss: Each compaction level loses detail. Important specifics might be compressed away. Mitigate by keeping raw logs for manual review.

Recency Bias: The topic extraction might miss older but relevant context. Mitigate by occasionally scanning ROOT.md for connections.

LLM Dependency: Compaction requires LLM calls. Budget for this. I spend about $2/month on compaction for daily use.

Implementation Complexity: More complex than MEMORY.md or basic RAG. But the complexity buys you real awareness.

Comparison to Hosted Solutions

You might ask: “Why not use hosted memory services?”

Hosted services like Letta (formerly MemGPT) provide similar functionality, but consider:

  • Dependencies: Another API key, another service to monitor
  • Costs: Subscription fees on top of LLM costs
  • Data Control: Your conversation history on someone else’s servers
  • Customization: Limited ability to tune compaction strategies

If these concerns matter to you, a self-hosted compaction tree is worth the implementation effort.

Summary

The compaction tree approach works because it separates awareness from retrieval:

  1. ROOT.md provides instant topic awareness at session start
  2. Hierarchical compression maintains context relationships
  3. On-demand retrieval loads only relevant details
  4. Token costs stay manageable (~8K per session vs 100K+ for alternatives)

The key insight: agents don’t need perfect memory—they need to know what they know, and be able to retrieve it when needed.

This approach transformed my AI agent workflow from “explain everything every session” to “continue from where we left off.” The time and token savings are substantial, but more importantly, the collaboration feels natural and continuous.

Final Words + More Resources

My intention with this article was to share my knowledge and experience in a way that helps others. If you want to contact me, you can reach me by email.
