How to Build a Persistent Memory System for Claude and LLM Agents

I kept losing context. Every new Claude session started fresh, forgetting everything we discussed yesterday. Project details, preferences, decisions made—all gone. After weeks of re-explaining my work context, I decided to build a proper memory system.

The Problem with LLM Memory

Large language models have no persistent memory. Once a session ends, the context vanishes. This becomes a real problem when you’re using Claude or ChatGPT as a daily assistant.

I tried several approaches:

  1. Copy-pasting old conversations: too messy, and it quickly exceeded the token limit
  2. Keeping a running notes file: grew to 50K+ tokens and was still unorganized
  3. Manual summarization: time-consuming, and I kept forgetting to do it

What I needed was an automated system that:

  • Remembers important information across sessions
  • Doesn’t blow up the token budget
  • Recalls relevant memories when needed
  • Requires minimal manual maintenance

The Solution: A Compaction Tree

After experimenting, I settled on a hierarchical “compaction tree” architecture. The idea is simple: organize memories into levels of increasing compression, where each level summarizes the one below it.

Memory Tree Structure
ROOT.md (~3K tokens)
|
+-- monthly/ (key themes, major decisions)
    |
    +-- weekly/ (aggregated highlights)
        |
        +-- daily/ (summarized sessions)
            |
            +-- raw/ (full transcripts)

Nothing is ever lost, just compressed. The raw transcripts live at the bottom, but you don’t load them all at once. Instead, you start with the ROOT index and drill down as needed.

Implementation: File-Based Memory

I chose a file-based approach for simplicity. No database to manage, easy to version control, and portable across systems.

Directory Structure

Memory Directory Layout
memory/
├── ROOT.md                 # Topic index (~3K tokens)
├── monthly/
│   ├── 2024-01.md
│   ├── 2024-02.md
│   └── 2024-03.md
├── weekly/
│   ├── 2024-W01.md
│   ├── 2024-W02.md
│   └── 2024-W03.md
├── daily/
│   ├── 2024-01-15.md
│   ├── 2024-01-16.md
│   └── 2024-01-17.md
└── raw/
    ├── 2024-01-15-session-1.md
    ├── 2024-01-15-session-2.md
    └── 2024-01-16-session-1.md
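
To make the naming scheme concrete, here is a minimal sketch that maps a date to the files covering it at each level. The function names are hypothetical, and the ISO week numbering is an assumption on my part; the layout above only shows names like 2024-W03.md without saying which week convention produced them.

```python
from datetime import date
from pathlib import Path


def memory_paths(day: date, root: Path = Path("memory")) -> dict[str, Path]:
    """Return the daily, weekly, and monthly files that cover a given day."""
    # ISO weeks can belong to a different year than the calendar date
    # (2024-12-30 falls in 2025-W01), so the week file uses the ISO year.
    iso_year, iso_week, _ = day.isocalendar()
    return {
        "daily": root / "daily" / f"{day.isoformat()}.md",
        "weekly": root / "weekly" / f"{iso_year}-W{iso_week:02d}.md",
        "monthly": root / "monthly" / f"{day:%Y-%m}.md",
    }


def raw_session_path(day: date, session: int, root: Path = Path("memory")) -> Path:
    """Return the raw transcript path for the n-th session of a day."""
    return root / "raw" / f"{day.isoformat()}-session-{session}.md"


if __name__ == "__main__":
    for level, path in memory_paths(date(2024, 1, 15)).items():
        print(level, path.as_posix())
```

Keeping this logic in one place means the capture and compaction steps agree on where a given day's memories live.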

The ROOT.md Index

The key innovation is the ROOT.md file. It’s a compact index of all topics, typically around 3,000 tokens, that loads at the start of every session.

ROOT.md
# Memory Index
## Topics (70+ categories)
### Daily Briefing
- Morning routine optimization
- Task prioritization patterns
### Investing
- Portfolio allocation strategy
- REIT analysis notes
### Projects
- Side project architecture decisions
- Tech stack choices
### Personal CRM
- Contact metadata
- Meeting summaries
### Calendar
- Recurring events
- Important dates

This gives the LLM immediate awareness of everything it “knows” without overwhelming the context window.
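
If you want to work with the index programmatically, a small parser is enough. The sketch below assumes the layout shown in the sample (topics as `###` headings, entries as `-` bullets); the function names are mine, and the four-characters-per-token figure is only a rough rule of thumb for English text, not a real tokenizer.

```python
def parse_root_index(text: str) -> dict[str, list[str]]:
    """Map each '### Topic' heading in ROOT.md to the bullet entries under it."""
    index: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("### "):
            current = line[4:].strip()
            index[current] = []
        elif line.startswith("- ") and current is not None:
            index[current].append(line[2:].strip())
    return index


def estimate_tokens(text: str) -> int:
    """Very rough estimate: about four characters per token (an assumption)."""
    return len(text) // 4


if __name__ == "__main__":
    sample = "# Memory Index\n### Investing\n- Portfolio allocation strategy\n- REIT analysis notes\n"
    print(parse_root_index(sample))
    print(estimate_tokens(sample), "tokens (approx.)")
```

A check like `estimate_tokens(root_text) > 3500` could warn you when the index starts drifting past its budget.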

How Memory Recall Works

When you start a new session, here’s the flow:

Session Startup Flow
1. Load ROOT.md (~3K tokens)
2. User asks a question
3. Check if relevant memories exist in topic index
4. If yes, load specific daily/weekly files
5. Inject memories into context
6. Generate response

The magic is in steps 3 through 5. You don’t load everything, only what’s relevant.
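
Here is a minimal sketch of that selective loading, using plain keyword overlap in place of real retrieval. Everything in it is illustrative: `match_topics`, `build_context`, and the `topic_files` mapping from topic to file paths are hypothetical names, and the 2,000-token default simply mirrors the ~1-2K overhead I see in practice.

```python
from typing import Callable


def match_topics(question: str, index: dict[str, list[str]]) -> list[str]:
    """Step 3: return topics whose name or entries share a word with the question."""
    cleaned = (w.strip(".,?!").lower() for w in question.split())
    words = {w for w in cleaned if len(w) > 3}  # skip very short filler words
    matched = []
    for topic, entries in index.items():
        vocabulary = set()
        for phrase in [topic, *entries]:
            vocabulary.update(w.lower() for w in phrase.split())
        if words & vocabulary:
            matched.append(topic)
    return matched


def build_context(
    question: str,
    root_text: str,
    index: dict[str, list[str]],
    topic_files: dict[str, list[str]],
    read_file: Callable[[str], str],
    budget_tokens: int = 2000,
) -> str:
    """Steps 3-5: pick matching topics, load their files, stop at the budget."""
    parts = [root_text]
    used = 0
    for topic in match_topics(question, index):
        for path in topic_files.get(topic, []):
            text = read_file(path)
            cost = len(text) // 4  # rough token estimate, not a real tokenizer
            if used + cost > budget_tokens:
                return "\n\n".join(parts)
            parts.append(text)
            used += cost
    return "\n\n".join(parts)
```

Passing `read_file` in as a parameter keeps the sketch testable without touching the disk; in real use it could be as simple as `lambda p: Path(p).read_text()`.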

Semantic Search for Recall

For a file-based system, I implemented a simple semantic search using embeddings. Each topic and memory chunk gets embedded, and queries match against these embeddings.

memory_search.py
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def get_embedding(text):
    """Return the embedding vector for a piece of text."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def search_memories(query, memory_index, top_k=5):
    """Return the top_k (similarity, memory) pairs for the query."""
    query_embedding = get_embedding(query)
    results = []
    for memory in memory_index:
        similarity = cosine_similarity(query_embedding, memory["embedding"])
        results.append((similarity, memory))
    # Sort on the score only; the memory dicts themselves are not comparable.
    results.sort(reverse=True, key=lambda x: x[0])
    return results[:top_k]

This approach finds memories by meaning, not just keyword matching.
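
The search function expects a `memory_index` whose entries carry an "embedding" key, so something has to build that index. One possible sketch is below: it caches vectors on disk, keyed by a hash of the chunk text, so unchanged chunks are not re-embedded on every run. The `build_index` name, the JSON cache file, and the injected `embed` callable are all assumptions for illustration; in practice `embed` could be the `get_embedding` function above.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def build_index(
    chunks: list[str],
    embed: Callable[[str], list[float]],
    cache_path: Path,
) -> list[dict]:
    """Embed each chunk once, reusing vectors cached under the chunk's SHA-256."""
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    index = []
    for text in chunks:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = embed(text)  # only new or changed chunks call the API
        index.append({"text": text, "embedding": cache[key]})
    cache_path.write_text(json.dumps(cache))
    return index
```

Because the cache is a plain JSON file, it can live next to the memory files and be version controlled or deleted without side effects.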

Auto-Capture: Storing New Memories

The other half of the system is automatically capturing information from conversations. I use a pattern where the LLM extracts important information at the end of each session.

memory_capture.py
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def extract_memories(conversation_transcript):
    """Ask the model for a JSON summary of what is worth remembering."""
    prompt = f"""
Extract important information from this conversation that should be remembered.
Format as JSON:
{{
  "topics": ["topic1", "topic2"],
  "decisions": ["decision made"],
  "facts": ["fact learned"],
  "preferences": ["user preference noted"]
}}
Conversation:
{conversation_transcript}
"""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    # json.loads raises if the reply is not pure JSON (for example, wrapped in prose).
    return json.loads(response.choices[0].message.content)

This runs automatically when a session ends, categorizing and storing new memories in the appropriate daily file.
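
The storage half of that step might look something like the sketch below. The helper names and the Markdown layout of the daily file (one `##` section per category) are assumptions, since the exact file format is up to you. The parser also tolerates a Markdown code fence around the JSON, which models sometimes add.

```python
import json
from datetime import date
from pathlib import Path

FENCE = "`" * 3  # Markdown code fence marker


def parse_memory_json(reply: str) -> dict:
    """Parse the model's reply, tolerating a code fence around the JSON."""
    text = reply.strip()
    if text.startswith(FENCE):
        first_newline = text.find("\n")
        text = text[first_newline + 1:] if first_newline != -1 else ""
        text = text.rsplit(FENCE, 1)[0]  # drop the closing fence
    return json.loads(text)


def append_to_daily(memories: dict, day: date, root: Path = Path("memory")) -> Path:
    """Append extracted items to daily/YYYY-MM-DD.md, one section per category."""
    path = root / "daily" / f"{day.isoformat()}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    lines = []
    for category in ("topics", "decisions", "facts", "preferences"):
        items = memories.get(category, [])
        if items:
            lines.append(f"## {category.title()}")
            lines.extend(f"- {item}" for item in items)
            lines.append("")
    if lines:
        with path.open("a", encoding="utf-8") as f:
            f.write("\n".join(lines) + "\n")
    return path
```

Appending rather than overwriting lets several sessions on the same day accumulate in one file until the daily compaction runs.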

The Compaction Process

Raw transcripts pile up fast. That’s where compaction comes in. At the end of each day, a job runs whichever of these compaction steps are due:

  1. Daily compaction: Summarize raw transcripts into a daily file
  2. Weekly compaction: Summarize 7 daily files into a weekly summary
  3. Monthly compaction: Summarize 4 weekly files into a monthly summary

compaction.py
def compact_memories(source_files, output_file, level="daily"):
    """Summarize source_files into output_file (both are open file objects)."""
    combined = "\n\n".join(f.read() for f in source_files)
    prompt = f"""
Summarize these {level} memories into a concise format.
Keep:
- Key decisions made
- Important facts learned
- Recurring themes
- Action items status
Remove:
- Redundant information
- Temporary context
- Small talk
Memories:
{combined}
"""
    # llm_generate is not defined in this snippet; it stands for whatever
    # function sends a prompt to your model and returns the generated text.
    summary = llm_generate(prompt)
    output_file.write(summary)

Each level keeps less detail but preserves the essence. The ROOT.md index gets updated to reflect new topics.
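
Before calling the summarizer, the job has to decide which files belong together. A rough sketch of the weekly grouping is below; the function names are hypothetical, and it assumes ISO week numbering and daily files named YYYY-MM-DD.md as in the directory layout.

```python
from collections import defaultdict
from datetime import date
from pathlib import Path


def group_daily_by_week(daily_files: list[Path]) -> dict[str, list[Path]]:
    """Group daily files under a week key such as '2024-W03'."""
    groups = defaultdict(list)
    for path in sorted(daily_files):
        day = date.fromisoformat(path.stem)  # stem is "YYYY-MM-DD"
        iso_year, iso_week, _ = day.isocalendar()
        groups[f"{iso_year}-W{iso_week:02d}"].append(path)
    return dict(groups)


def weeks_ready_to_compact(daily_files: list[Path], today: date) -> dict[str, list[Path]]:
    """Return only the weeks that ended before the week containing `today`."""
    iso_year, iso_week, _ = today.isocalendar()
    current_key = f"{iso_year}-W{iso_week:02d}"
    # Zero-padded keys sort chronologically, so string comparison is enough.
    return {
        key: files
        for key, files in group_daily_by_week(daily_files).items()
        if key < current_key
    }
```

Each returned group can then be opened and passed to the compaction function with `level="weekly"`; the same idea applies one level up for months.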

Common Mistakes I Made

Storing Everything

My first attempt stored every message. The signal-to-noise ratio was terrible. Memories should be curated, not dumped.

Bad approach:

bad_memory.py
# WRONG: Storing every message
for message in conversation:
    memory.append(message)  # Too much noise

Better approach:

good_memory.py
# RIGHT: Extract only what matters
important_facts = extract_key_information(conversation)
memory.extend(important_facts) # Curated and useful

No Categorization

A flat list of memories becomes unsearchable. The topic index is essential for efficient retrieval.

No Scoring

Initial versions returned memories in random order. Adding relevance scoring dramatically improved recall quality.

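scoring.py
from datetime import datetime


def score_memory_relevance(memory, query, context):
    """Blend similarity (40%), recency (30%), and topic match (30%)."""
    score = 0
    # Semantic similarity (cosine_similarity as defined in memory_search.py)
    score += cosine_similarity(memory.embedding, query.embedding) * 0.4
    # Recency boost: full weight today, half after one day, and so on
    # (memory.timestamp is expected to be a naive datetime)
    days_old = (datetime.now() - memory.timestamp).days
    score += (1 / (days_old + 1)) * 0.3
    # Topic match
    if memory.topic in context.topics:
        score += 0.3
    return score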
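
To get a feel for how the three weights interact, it helps to plug in numbers. The sketch below restates the same formula as a pure function (the name `combined_score` is mine) so it can be evaluated without embeddings or timestamps.

```python
def combined_score(similarity: float, days_old: int, topic_match: bool) -> float:
    """Same weighting as above: 0.4 similarity, 0.3 recency, 0.3 topic match."""
    recency = 1 / (days_old + 1)
    return similarity * 0.4 + recency * 0.3 + (0.3 if topic_match else 0.0)


if __name__ == "__main__":
    # Same similarity and topic, different ages.
    fresh = combined_score(similarity=0.8, days_old=0, topic_match=True)
    week_old = combined_score(similarity=0.8, days_old=7, topic_match=True)
    print(round(fresh, 4))     # 0.32 + 0.30 + 0.30 = 0.92
    print(round(week_old, 4))  # 0.32 + 0.0375 + 0.30 = 0.6575
```

The recency term drops off quickly: it contributes 0.30 on the day itself, 0.15 after one day, and under 0.04 after a week, so for older memories the ranking is driven almost entirely by similarity and topic match.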

Manual Memory Management

I tried manually adding memories. It lasted three days before I forgot. Auto-capture and auto-compact are non-negotiable features.

Results After Two Months

After running this system for ~60 days of daily sessions:

  • ROOT.md: 3,247 tokens, covers 70+ topics across 15 categories
  • Recall accuracy: The right memory surfaces in ~90% of relevant queries
  • Session startup: ~5 seconds to load context
  • Token overhead: Only 3K tokens at session start, plus ~1-2K for specific memories

The categories that emerged naturally:

Topic Categories
- daily-briefing
- investing
- macro
- travel
- email
- finance
- personal-crm
- calendar
- housing
- language
- health
- tax
- shopping
- side-project
- insurance

Integration with Claude

To use this with Claude specifically, I structure the system prompt to include the memory context:

claude_integration.py
import anthropic


def build_system_prompt(root_memory, relevant_memories):
    return f"""
You have access to persistent memory from previous sessions.
MEMORY INDEX:
{root_memory}
RELEVANT MEMORIES FOR THIS SESSION:
{relevant_memories}
Use this context to maintain continuity across sessions.
When you learn new important information, note it for storage.
"""


def chat_with_memory(user_message, memory_system):
    root = memory_system.load_root()
    relevant = memory_system.search(user_message)
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=4096,
        system=build_system_prompt(root, relevant),
        messages=[{"role": "user", "content": user_message}],
    )
    # response.content is a list of content blocks; return the first block's text.
    return response.content[0].text
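
The `memory_system` object only needs two methods, `load_root()` and `search()`. As a rough sketch of that interface, here is a hypothetical file-backed class that uses simple keyword counting as a stand-in for the embedding search described earlier. The class name, the `top_k` parameter, and the output format are all assumptions.

```python
from pathlib import Path


class FileMemorySystem:
    """Hypothetical file-backed object exposing load_root() and search()."""

    def __init__(self, root: Path, top_k: int = 3):
        self.root = Path(root)
        self.top_k = top_k

    def load_root(self) -> str:
        """Return the ROOT.md index as text."""
        return (self.root / "ROOT.md").read_text(encoding="utf-8")

    def search(self, query: str) -> str:
        """Rank daily files by how many query words they contain."""
        cleaned = (w.strip(".,?!").lower() for w in query.split())
        words = {w for w in cleaned if len(w) > 3}  # skip short filler words
        scored = []
        for path in sorted((self.root / "daily").glob("*.md")):
            text = path.read_text(encoding="utf-8")
            hits = sum(1 for w in words if w in text.lower())
            if hits:
                scored.append((hits, path.name, text))
        # Most hits first; file name breaks ties so the order is stable.
        scored.sort(key=lambda item: (-item[0], item[1]))
        return "\n\n".join(f"[{name}]\n{text}" for _, name, text in scored[: self.top_k])
```

With this in place, `chat_with_memory("What did we decide?", FileMemorySystem(Path("memory")))` would run end to end, and the keyword ranking can later be swapped for the embedding search without changing the caller.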

Alternatives Considered

LangChain Memory: Good for chatbots, but I wanted more control over the compression strategy.

Vector databases (Pinecone, Weaviate): Overkill for personal use, adds infrastructure complexity.

Mem0 and similar services: Works, but I prefer owning my data and understanding the system end-to-end.

The file-based approach hits the sweet spot: simple, transparent, and fully controllable.

Final Words + More Resources

My intention with this article was to share my knowledge and experience in the hope that it helps others. If you want to get in touch, you can reach me by email.

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
