How to Implement Persistent Memory for AI Agents: A Practical Guide
My AI agent forgot everything again. After a 30-minute conversation where we debugged a complex authentication flow, I ended the session. The next day, I started a new session and asked about the solution we found.
Blank stare. “I don’t have any context about that authentication issue you’re referring to.”
I had to spend another 20 minutes and 30K+ tokens re-explaining the entire context. This wasn’t just frustrating—it was a fundamental problem with how AI agents handle memory.
The Problem: Session Amnesia
The core issue is simple: AI agents have no persistent memory between sessions. Every conversation starts from zero. This creates three major problems:
- Context loss: All the context you built up is gone
- Token waste: Re-explaining the same information costs money and time
- Broken continuity: You can’t build on previous work
I tried several “solutions” that didn’t work. Let me walk through what I attempted and why each failed.
Failed Approach #1: The MEMORY.md File
My first attempt was straightforward: maintain a MEMORY.md file that stores important context.
# Project Context

## Authentication Flow (2026-03-15)
- Using JWT with refresh tokens
- Token expiry: 15 minutes
- Refresh token stored in HTTP-only cookie
- Issue: CORS problems with credentials

## Database Schema (2026-03-18)
- Users table with role-based access
- Posts table with soft deletes
- Need to add: audit log table

This worked for about a week. Then problems emerged:
The Overflow Problem: The file grew to 50KB. Loading it into context consumed 12K tokens every session.
The Pruning Problem: I had to manually decide what to keep. Delete too much? Lose valuable context. Keep too much? Context window explodes.
The Retrieval Problem: The agent had to read the entire file to find relevant information. No structure, no indexing.
I realized a flat file approach fundamentally doesn’t scale. The agent needs something smarter.
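The overflow numbers are easy to sanity-check. The sketch below assumes the common rule of thumb of roughly four characters per token for English text (real tokenizers vary by model), and the `estimate_tokens` helper is a hypothetical name used only for illustration.

```python
def estimate_tokens(num_bytes: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate for plain ASCII text (1 byte per character).

    chars_per_token=4.0 is a rule-of-thumb assumption, not a real tokenizer.
    """
    return round(num_bytes / chars_per_token)


memory_file_bytes = 50 * 1000  # the ~50KB MEMORY.md described above
tokens_per_session = estimate_tokens(memory_file_bytes)
print(tokens_per_session)  # 12500, in line with the ~12K observed
```

Every session pays that cost before any real work starts, whether or not the loaded notes are relevant to the task at hand.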
Failed Approach #2: RAG / Vector Search
“Use RAG!” everyone said. “Vector search is the solution!”
I implemented a vector database with embeddings for all past conversations. Here’s the basic setup:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.schema import Document

class MemoryRAG:
    def __init__(self, persist_directory: str):
        self.embeddings = OpenAIEmbeddings()
        self.vectorstore = Chroma(
            embedding_function=self.embeddings,
            persist_directory=persist_directory
        )

    def store_conversation(self, conversation: str, metadata: dict):
        """Store a conversation chunk in vector database."""
        doc = Document(page_content=conversation, metadata=metadata)
        self.vectorstore.add_documents([doc])

    def recall(self, query: str, k: int = 5) -> list[Document]:
        """Retrieve relevant conversations."""
        return self.vectorstore.similarity_search(query, k=k)

This worked great for queries like:
- “What was the authentication solution?”
- “How did we fix the CORS error?”
But there was a critical flaw I discovered when I asked: “Do we have any context about the payment system integration?”
The agent responded: “Let me search for payment system information…”
It found nothing. But we HAD discussed payments three weeks ago. The problem? I didn’t know the right search terms. The agent didn’t know what it knew.
The Awareness Problem: RAG answers “find X” queries perfectly. But it fails at “do I know anything about X?” awareness questions. The agent can’t tell “I know this” from “I’ve never seen this” without loading everything or searching for something specific.
Vector search solves retrieval, not awareness.
Failed Approach #3: Just Use Larger Context Windows
“GPT-4 Turbo has a 128K context window! Just load everything!”
I tried loading the entire conversation history into context. Here’s what happened:
Attention Degradation: Long-context evaluations (the “lost in the middle” effect) show that models use information less reliably when it is buried in the middle of a long prompt. In practice, the agent “forgets” earlier context.
Cost Explosion: A 100K token context window costs significant money. Even with price drops, this isn’t sustainable for daily use.
Practical Limits: Even 128K isn’t infinite. Long-running projects exceed this quickly.
Large context windows help, but they’re not a memory solution—they’re a bandage.
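To put a rough number on the cost problem, here is a back-of-the-envelope sketch. The session count and the price per million input tokens are hypothetical placeholders, not quotes from any provider, and `monthly_context_cost` is an illustrative helper name.

```python
def monthly_context_cost(tokens_per_session: int,
                         sessions_per_month: int,
                         usd_per_million_input_tokens: float) -> float:
    """Input-token cost of reloading the same context every session."""
    total_tokens = tokens_per_session * sessions_per_month
    return total_tokens / 1_000_000 * usd_per_million_input_tokens


# Hypothetical inputs: 100K tokens of history, 5 sessions a day for 30 days,
# at an assumed $10 per million input tokens.
cost = monthly_context_cost(100_000, 5 * 30, 10.0)
print(f"${cost:.2f} per month")  # $150.00 per month
```

Plug in your own provider's pricing; the point is that the cost scales linearly with both context size and session count, and it is paid before the model produces a single useful token.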
The Real Solution: Hierarchical Compaction
After months of frustration, I found an approach that actually works: a compaction tree that maintains topic awareness without loading everything.
The Core Insight
The key insight from Reddit discussions and research papers: you need an index of what you know, not the actual content. The agent needs to answer “do I have context on X?” in milliseconds, not by searching through thousands of tokens.
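One way to picture that index is a small map from topic name to a pointer at where the details live. This is a minimal sketch: the `TopicIndex` class, its method names, and the sample entries are all hypothetical, chosen only to show an awareness check that never touches the underlying content.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TopicIndex:
    """Maps a normalized topic name to the file holding its details."""
    entries: dict[str, str] = field(default_factory=dict)

    def add(self, topic: str, location: str) -> None:
        self.entries[topic.strip().lower()] = location

    def knows(self, topic: str) -> bool:
        # Awareness check: a dictionary lookup, no content is loaded.
        return topic.strip().lower() in self.entries

    def locate(self, topic: str) -> Optional[str]:
        # Retrieval pointer: where to look if the details are needed.
        return self.entries.get(topic.strip().lower())


index = TopicIndex()
index.add("Authentication", "daily/2026-03-15.json")
index.add("Payment Integration", "monthly/2026-02.json")

print(index.knows("payment integration"))  # True
print(index.knows("Kubernetes"))           # False
print(index.locate("authentication"))      # daily/2026-03-15.json
```

The exact-match lookup here is deliberately naive. A real index needs fuzzier matching, but the separation is the point: `knows` answers the awareness question, and `locate` only says where retrieval should start.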
The Compaction Tree Structure
Imagine conversation history flowing through a compression pipeline:
┌─────────────────────────────────────────────────────────┐
│                         ROOT.md                         │
│  (Topic Index - ~3K tokens, loaded at session start)    │
│  Contains: High-level summaries, topic keywords, dates  │
└───────────────────────┬─────────────────────────────────┘
                        │
        ┌───────────────┼───────────────┐
        ▼               ▼               ▼
   ┌─────────┐     ┌─────────┐     ┌─────────┐
   │Monthly 1│     │Monthly 2│     │Monthly 3│
   │Summary  │     │Summary  │     │Summary  │
   └────┬────┘     └────┬────┘     └────┬────┘
        │               │               │
   ┌────┴────┐     ┌────┴────┐     ┌────┴────┐
   ▼         ▼     ▼         ▼     ▼         ▼
 Weekly   Weekly Weekly   Weekly Weekly   Weekly
   │         │     │         │     │         │
   ▼         ▼     ▼         ▼     ▼         ▼
 Daily     Daily Daily     Daily Daily     Daily
   │         │     │         │     │         │
   ▼         ▼     ▼         ▼     ▼         ▼
  Raw       Raw   Raw       Raw   Raw       Raw
               Conversation Chunks

Level Definitions
Level 0 - Raw: The actual conversation chunks, token-limited segments of the original dialogue.
Level 1 - Daily: A summary of all conversations from one day, highlighting key decisions, questions, and outcomes.
Level 2 - Weekly: A compressed summary of the week’s daily summaries, extracting patterns and major themes.
Level 3 - Monthly: High-level overview of the month, capturing major milestones and project evolution.
Level 4 - Root: The topic index loaded at session start. Contains just enough information to answer “do I know about X?”
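To roll raw chunks up through these levels, each timestamp has to map to a daily, a weekly, and a monthly bucket. A minimal sketch, assuming ISO-8601 week numbering and key formats like `2026-03-15`, `2026-W11`, and `2026-03` (the `bucket_keys` helper is hypothetical):

```python
from datetime import datetime


def bucket_keys(ts: datetime) -> dict[str, str]:
    """Return the daily, weekly, and monthly bucket key for a timestamp."""
    iso_year, iso_week, _ = ts.isocalendar()
    return {
        "daily": ts.strftime("%Y-%m-%d"),
        # ISO weeks run Monday to Sunday; the ISO year can differ from the
        # calendar year around New Year, so use iso_year rather than ts.year.
        "weekly": f"{iso_year}-W{iso_week:02d}",
        "monthly": ts.strftime("%Y-%m"),
    }


print(bucket_keys(datetime(2026, 3, 15, 14, 30)))
# {'daily': '2026-03-15', 'weekly': '2026-W11', 'monthly': '2026-03'}
```

Grouping raw chunks by the `daily` key gives the inputs for each Level 1 summary, grouping daily summaries by the `weekly` key gives Level 2, and so on up the tree.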
Implementation Strategy
Here’s how I implemented this in Python:
from datetime import datetime, timedelta
from pathlib import Path
from dataclasses import dataclass
from typing import Optional
import json

@dataclass
class ConversationChunk:
    """Raw conversation segment."""
    id: str
    timestamp: datetime
    content: str
    tokens: int
    topics: list[str]

@dataclass
class CompactedSummary:
    """Compressed summary at any level."""
    level: int  # 1=daily, 2=weekly, 3=monthly, 4=root
    period_start: datetime
    period_end: datetime
    summary: str
    topics: list[str]
    source_ids: list[str]  # IDs of summarized items
    token_count: int

class CompactionTree:
    def __init__(self, base_path: Path):
        self.base_path = base_path
        self.raw_path = base_path / "raw"
        self.daily_path = base_path / "daily"
        self.weekly_path = base_path / "weekly"
        self.monthly_path = base_path / "monthly"
        self.root_path = base_path / "ROOT.md"

        # Create directories
        for path in [self.raw_path, self.daily_path, self.weekly_path, self.monthly_path]:
            path.mkdir(parents=True, exist_ok=True)

    def add_conversation(self, chunk: ConversationChunk):
        """Add a new conversation chunk to the tree."""
        # Store raw chunk
        chunk_file = self.raw_path / f"{chunk.id}.json"
        chunk_file.write_text(json.dumps({
            "id": chunk.id,
            "timestamp": chunk.timestamp.isoformat(),
            "content": chunk.content,
            "tokens": chunk.tokens,
            "topics": chunk.topics
        }))

        # Trigger compaction check
        self._check_compaction_needed()

    def _check_compaction_needed(self):
        """Placeholder hook: run the triggers from "When to Compact" here."""
        pass

    def get_root_context(self) -> str:
        """Load the ROOT.md for session start."""
        if self.root_path.exists():
            return self.root_path.read_text()
        return "# No previous context\n\nThis is a fresh session."

    def find_relevant_context(self, topic: str) -> list[str]:
        """Find all context related to a topic."""
        # First check ROOT for awareness (case-insensitive substring match)
        root = self.get_root_context()
        if topic.lower() not in root.lower():
            return []  # Agent knows it doesn't know this

        # Topic exists, now load relevant summaries
        results = []
        for monthly in self.monthly_path.glob("*.json"):
            data = json.loads(monthly.read_text())
            if topic.lower() in data["summary"].lower():
                results.append(data["summary"])

        return results

The ROOT.md Structure
The ROOT.md file is critical. Here’s an example:
# Project Context Index
Last Updated: 2026-03-21

## Active Topics

### Authentication (Last: 2026-03-15)
- JWT with refresh tokens implemented
- CORS issues resolved with credentials
- See: daily/2026-03-15.json

### Database Design (Last: 2026-03-18)
- Schema with RBAC complete
- Soft deletes implemented
- Audit log pending
- See: weekly/2026-W11.json

### Payment Integration (Last: 2026-02-28)
- Stripe checkout configured
- Webhook handling implemented
- See: monthly/2026-02.json

## Inactive Topics

### Logging System (Last: 2026-01-10)
- Basic logging complete
- Structured logging pending
- See: monthly/2026-01.json

This ~3K token file gives the agent instant awareness of everything it knows.
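A file in this shape can be generated mechanically from a topic map. The sketch below is one hypothetical way to do it: the `TopicEntry` fields, the `render_root` function, and the 30-day active window are assumptions for illustration, not a definition of the format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class TopicEntry:
    name: str
    last_seen: date
    notes: list[str]
    source: str  # path of the summary file holding the details


def render_section(title: str, entries: list[TopicEntry]) -> list[str]:
    lines = [f"## {title}", ""]
    # Most recently discussed topics first.
    for e in sorted(entries, key=lambda e: e.last_seen, reverse=True):
        lines.append(f"### {e.name} (Last: {e.last_seen.isoformat()})")
        lines.extend(f"- {note}" for note in e.notes)
        lines.append(f"- See: {e.source}")
        lines.append("")
    return lines


def render_root(entries: list[TopicEntry], today: date, active_days: int = 30) -> str:
    """Render ROOT.md text, splitting topics into active and inactive."""
    active = [e for e in entries if (today - e.last_seen).days < active_days]
    inactive = [e for e in entries if (today - e.last_seen).days >= active_days]
    lines = ["# Project Context Index", f"Last Updated: {today.isoformat()}", ""]
    lines += render_section("Active Topics", active)
    lines += render_section("Inactive Topics", inactive)
    return "\n".join(lines).rstrip() + "\n"


entries = [
    TopicEntry("Authentication", date(2026, 3, 15),
               ["JWT with refresh tokens implemented"], "daily/2026-03-15.json"),
    TopicEntry("Logging System", date(2026, 1, 10),
               ["Basic logging complete"], "monthly/2026-01.json"),
]
print(render_root(entries, today=date(2026, 3, 21)))
```

Keeping each topic to a heading, a few bullets, and a single pointer is what keeps the file small enough to load on every session.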
Compaction Algorithm
The compaction process runs periodically:
from llm import summarize_with_llm  # Placeholder for your own LLM call

def compact_level(sources: list, target_level: int, period: tuple[datetime, datetime]) -> CompactedSummary:
    """Compact multiple sources into a higher-level summary."""

    # Combine source content. Raw chunks expose `timestamp` and `content`,
    # while summaries expose `period_start` and `summary`, so read whichever exists.
    combined = "\n\n---\n\n".join([
        f"[{getattr(s, 'timestamp', None) or s.period_start}]\n"
        f"{getattr(s, 'summary', None) or s.content}"
        for s in sources
    ])

    # Use LLM to create hierarchical summary
    level_name = ['', 'daily', 'weekly', 'monthly'][target_level]
    prompt = f"""Create a {level_name} summary of these conversations from {period[0]} to {period[1]}.

Focus on:
1. Key decisions made
2. Problems solved
3. Pending items
4. Topics covered (for indexing)

Conversations:
{combined}
"""

    summary_text = summarize_with_llm(prompt, max_tokens=1000)

    # Extract topics for indexing (extract_topics and count_tokens are helpers you supply)
    topics = extract_topics(summary_text)

    return CompactedSummary(
        level=target_level,
        period_start=period[0],
        period_end=period[1],
        summary=summary_text,
        topics=topics,
        # Assumes every source carries an `id`; CompactedSummary would need that field added
        source_ids=[s.id for s in sources],
        token_count=count_tokens(summary_text)
    )

Session Start Protocol
When a new session begins, here’s the protocol:
def start_session(agent, tree: CompactionTree) -> dict:
    """Initialize a new session with persistent memory."""

    # 1. Load ROOT.md (~3K tokens)
    root_context = tree.get_root_context()

    # 2. Inject into system prompt
    system_prompt = f"""You are an AI assistant with persistent memory.

CONTEXT INDEX (you have memory of these topics):
{root_context}

When the user asks about something:
1. Check if it's in the context index
2. If yes, say "I have context on this from [date]. Let me retrieve details."
3. If no, say "I don't have previous context on this topic."

Never claim ignorance for topics in the index.
Never claim knowledge for topics not in the index.
"""

    # 3. Ready to receive queries
    return {
        "system_prompt": system_prompt,
        "root_context": root_context,
        "tree": tree  # For on-demand retrieval
    }

def retrieve_details(tree: CompactionTree, topic: str) -> str:
    """Retrieve detailed context when needed."""

    # Walk down the tree from monthly to weekly
    context_parts = []

    # Load relevant monthly summary
    for monthly_file in tree.monthly_path.glob("*.json"):
        data = json.loads(monthly_file.read_text())
        if topic.lower() in str(data["topics"]).lower():
            context_parts.append(f"Monthly Summary:\n{data['summary']}")

            # Load relevant weekly summaries. A monthly summary's source_ids
            # are the IDs of the weekly summaries it was built from.
            for week_id in data.get("source_ids", []):
                week_file = tree.weekly_path / f"{week_id}.json"
                if week_file.exists():
                    week_data = json.loads(week_file.read_text())
                    if topic.lower() in str(week_data["topics"]).lower():
                        context_parts.append(f"\nWeekly Detail:\n{week_data['summary']}")

    return "\n\n".join(context_parts) if context_parts else "No detailed context found."

Why This Works
The compaction tree solves all three core problems:
Problem 1: Context Loss
- Solution: ROOT.md provides instant awareness of all past topics
- Agent knows “I have context on authentication” without loading everything
Problem 2: Token Waste
- Solution: Only load what’s needed (ROOT at start, details on demand)
- Typical session: 3K tokens for awareness + 5K for relevant details = 8K total
- Compare to: 30K+ for re-explaining or 100K+ for full context
Problem 3: Broken Continuity
- Solution: Hierarchical structure preserves relationships between topics
- Weekly summaries maintain thread connections
- Monthly summaries show project evolution
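The "details on demand" half of that budget needs some rule for deciding what fits. One simple option, sketched here under the assumption that candidate snippets arrive already ordered by priority and with known token counts (the `select_within_budget` helper and the sample sizes are hypothetical):

```python
def select_within_budget(snippets: list[tuple[str, int]], budget: int) -> list[str]:
    """Greedily keep snippets, in the given order, while they fit the budget.

    Each snippet is (text, token_count); order is assumed to reflect priority.
    """
    chosen, used = [], 0
    for text, tokens in snippets:
        if used + tokens > budget:
            continue  # would overflow; skip it and try the smaller ones after it
        chosen.append(text)
        used += tokens
    return chosen


details = [("monthly summary", 1200), ("weekly summary", 2500),
           ("daily summary", 1800), ("raw chunk", 900)]
print(select_within_budget(details, budget=5000))
# ['monthly summary', 'weekly summary', 'raw chunk']
```

A greedy pass like this is not optimal, but it is predictable, and it guarantees the detail portion of the prompt never exceeds the cap.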
Practical Considerations
When to Compact
Don’t compact immediately after every conversation. I use these triggers:
def should_compact(tree: CompactionTree) -> tuple[bool, str]:
    """Determine if compaction is needed."""

    # Daily: compact when >10 raw chunks exist
    # (an end-of-day run has to be scheduled separately)
    raw_count = len(list(tree.raw_path.glob("*.json")))
    if raw_count > 10:
        return True, "daily"

    # Weekly: compact on Sunday (weekday() == 6) or when >7 daily summaries exist
    daily_count = len(list(tree.daily_path.glob("*.json")))
    if daily_count > 7 or datetime.now().weekday() == 6:
        return True, "weekly"

    # Monthly: compact on the 1st or when >4 weekly summaries exist
    weekly_count = len(list(tree.weekly_path.glob("*.json")))
    if weekly_count > 4 or datetime.now().day == 1:
        return True, "monthly"

    return False, ""

Handling Topic Drift
Topics evolve. The ROOT.md needs periodic rebuilding:
def rebuild_root(tree: CompactionTree):
    """Rebuild ROOT.md from monthly summaries."""

    topics = {}  # topic -> {last_seen, monthly_id, summary}

    # Scan all monthly summaries (filenames like 2026-02.json sort chronologically)
    for monthly_file in sorted(tree.monthly_path.glob("*.json")):
        data = json.loads(monthly_file.read_text())
        # Dates are stored as ISO strings in JSON; parse before doing date math
        period_end = datetime.fromisoformat(data["period_end"])

        for topic in data["topics"]:
            if topic not in topics:
                topics[topic] = {
                    "last_seen": period_end,
                    "monthly_id": monthly_file.stem,
                    "summary": data["summary"][:200]  # Preview
                }
            else:
                topics[topic]["last_seen"] = period_end

    # Generate ROOT.md: topics seen in the last 30 days count as active
    now = datetime.now()
    active = {k: v for k, v in topics.items() if (now - v["last_seen"]).days < 30}
    inactive = {k: v for k, v in topics.items() if (now - v["last_seen"]).days >= 30}

    # generate_root_markdown is a helper you supply
    root_content = generate_root_markdown(active, inactive)
    tree.root_path.write_text(root_content)

Storage Requirements
Here’s what I’ve seen in practice:
Raw conversations: ~500MB (50MB compressed)
Daily summaries: ~50MB (5MB compressed)
Weekly summaries: ~5MB (500KB compressed)
Monthly summaries: ~500KB (50KB compressed)
ROOT.md: ~15KB (3K tokens)
Total: ~555MB (55MB with compression)
Each level is roughly a tenth the size of the one below it, and only ROOT.md needs to be in memory at session start.
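The figures above shrink by about a factor of ten per level, so the total is dominated by the raw logs. A quick check, assuming a constant 10x ratio between levels (the `level_sizes` helper is hypothetical):

```python
def level_sizes(raw_mb: float, ratio: float = 10.0, levels: int = 4) -> list[float]:
    """Size in MB of raw, daily, weekly, and monthly data for a fixed shrink ratio."""
    return [raw_mb / ratio**i for i in range(levels)]


sizes = level_sizes(500.0)
print(sizes)       # [500.0, 50.0, 5.0, 0.5]
print(sum(sizes))  # 555.5
```

All the summary levels together add only about 11% on top of the raw logs, so the hierarchy is cheap to keep on disk.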
Integration with LangChain
Here’s how to integrate with LangChain’s memory system:
from langchain.schema import BaseMemory
from typing import Any, Dict, List

class CompactionTreeMemory(BaseMemory):
    """LangChain memory backed by compaction tree."""

    tree: CompactionTree
    session_context: str = ""

    @property
    def memory_variables(self) -> List[str]:
        return ["context_index", "relevant_context"]

    def load_memory_variables(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        """Load memory variables for prompt."""

        # Always provide the root index
        result = {"context_index": self.tree.get_root_context()}

        # Check if user query relates to known topics.
        # find_relevant_context matches by substring, so this works best
        # when the input is a short topic phrase rather than a full sentence.
        user_input = inputs.get("input", "")
        relevant = self.tree.find_relevant_context(user_input)

        if relevant:
            result["relevant_context"] = "\n\n".join(relevant)
        else:
            result["relevant_context"] = "No relevant previous context."

        return result

    def save_context(self, inputs: Dict[str, Any], outputs: Dict[str, Any]) -> None:
        """Save conversation to memory."""

        # generate_id, estimate_tokens, and extract_topics are helpers you supply
        chunk = ConversationChunk(
            id=generate_id(),
            timestamp=datetime.now(),
            content=f"User: {inputs.get('input', '')}\nAI: {outputs.get('output', '')}",
            tokens=estimate_tokens(inputs.get('input', '') + outputs.get('output', '')),
            topics=extract_topics(inputs.get('input', ''))
        )

        self.tree.add_conversation(chunk)

    def clear(self) -> None:
        """Clear session context (not persistent memory)."""
        self.session_context = ""

Limitations and Trade-offs
This approach isn’t perfect:
Information Loss: Each compaction level loses detail. Important specifics might be compressed away. Mitigate by keeping raw logs for manual review.
Recency Bias: The topic extraction might miss older but relevant context. Mitigate by occasionally scanning ROOT.md for connections.
LLM Dependency: Compaction requires LLM calls. Budget for this. I spend about $2/month on compaction for daily use.
Implementation Complexity: More complex than MEMORY.md or basic RAG. But the complexity buys you real awareness.
Comparison to Hosted Solutions
You might ask: “Why not use hosted memory services?”
Hosted services like Letta (formerly MemGPT) provide similar functionality, but consider:
- Dependencies: Another API key, another service to monitor
- Costs: Subscription fees on top of LLM costs
- Data Control: Your conversation history on someone else’s servers
- Customization: Limited ability to tune compaction strategies
If these concerns matter to you, a self-hosted compaction tree is worth the implementation effort.
Summary
The compaction tree approach works because it separates awareness from retrieval:
- ROOT.md provides instant topic awareness at session start
- Hierarchical compression maintains context relationships
- On-demand retrieval loads only relevant details
- Token costs stay manageable (~8K per session vs 100K+ for alternatives)
The key insight: agents don’t need perfect memory—they need to know what they know, and be able to retrieve it when needed.
This approach transformed my AI agent workflow from “explain everything every session” to “continue from where we left off.” The time and token savings are substantial, but more importantly, the collaboration feels natural and continuous.