How to Build a Persistent Memory System for Claude and LLM Agents
I kept losing context. Every new Claude session started fresh, forgetting everything we discussed yesterday. Project details, preferences, decisions made—all gone. After weeks of re-explaining my work context, I decided to build a proper memory system.
The Problem with LLM Memory
Large language models have no persistent memory. Once a session ends, the context vanishes. This becomes a real problem when you’re using Claude or ChatGPT as a daily assistant.
I tried several approaches:
- Copy-pasting old conversations: too messy, and I hit the token limit quickly
- Keeping a running notes file: grew to 50K+ tokens and was still unorganized
- Manual summarization: time-consuming, and I kept forgetting to do it
What I needed was an automated system that:
- Remembers important information across sessions
- Doesn’t blow up the token budget
- Recalls relevant memories when needed
- Requires minimal manual maintenance
The Solution: A Compaction Tree
After experimenting, I settled on a hierarchical “compaction tree” architecture. The idea is simple: organize memories into levels of increasing compression, where each level summarizes the one below it.
ROOT.md (~3K tokens)
|
+-- monthly/ (key themes, major decisions)
    |
    +-- weekly/ (aggregated highlights)
        |
        +-- daily/ (summarized sessions)
            |
            +-- raw/ (full transcripts)

Nothing is ever lost, just compressed. The raw transcripts live at the bottom, but you don’t load them all at once. Instead, you start with the ROOT index and drill down as needed.
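To make the drill-down concrete, here is a minimal sketch that maps a calendar date to the summary file at each level, ordered from most to least compressed. The function name `memory_paths` and the `memory/` base directory are illustrative assumptions, and the weekly file name assumes ISO week numbering, which the article does not specify.

```python
from datetime import date
from pathlib import Path


def memory_paths(day, base=Path("memory")):
    """Return the summary files covering `day`, coarsest level first.

    Hypothetical helper: file naming follows a YYYY-MM / YYYY-Www /
    YYYY-MM-DD pattern, and ISO week numbering is an assumption.
    """
    iso_year, iso_week, _ = day.isocalendar()
    return {
        "monthly": base / "monthly" / f"{day:%Y-%m}.md",
        "weekly": base / "weekly" / f"{iso_year}-W{iso_week:02d}.md",
        "daily": base / "daily" / f"{day.isoformat()}.md",
    }


if __name__ == "__main__":
    # Walk from the most compressed summary down to the daily file.
    for level, path in memory_paths(date(2024, 1, 15)).items():
        print(level, path.as_posix())
```

Raw transcripts are left out of the mapping because a day can have several sessions; a glob over the raw directory for that date would be one way to find them.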
Implementation: File-Based Memory
I chose a file-based approach for simplicity. No database to manage, easy to version control, and portable across systems.
Directory Structure
memory/
├── ROOT.md                 # Topic index (~3K tokens)
├── monthly/
│   ├── 2024-01.md
│   ├── 2024-02.md
│   └── 2024-03.md
├── weekly/
│   ├── 2024-W01.md
│   ├── 2024-W02.md
│   └── 2024-W03.md
├── daily/
│   ├── 2024-01-15.md
│   ├── 2024-01-16.md
│   └── 2024-01-17.md
└── raw/
    ├── 2024-01-15-session-1.md
    ├── 2024-01-15-session-2.md
    └── 2024-01-16-session-1.md

The ROOT.md Index
The key innovation is the ROOT.md file. It’s a compact index of all topics, typically around 3,000 tokens, that loads at the start of every session.
# Memory Index
## Topics (70+ categories)
### Daily Briefing
- Morning routine optimization
- Task prioritization patterns

### Investing
- Portfolio allocation strategy
- REIT analysis notes

### Projects
- Side project architecture decisions
- Tech stack choices

### Personal CRM
- Contact metadata
- Meeting summaries

### Calendar
- Recurring events
- Important dates

This gives the LLM immediate awareness of everything it “knows” without overwhelming the context window.
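If code also needs to consult the index (for example, to decide which files to open), the same file can be parsed into a dictionary. The sketch below assumes the layout shown above, with `###` headings for categories and `- ` bullets for entries; `parse_topic_index` is a hypothetical helper, not part of the original system.

```python
def parse_topic_index(markdown):
    """Parse a ROOT.md-style index into {category: [entries]}.

    Assumes categories are '### ' headings and entries are '- ' bullets;
    any other line (title, '## Topics', blank lines) is ignored.
    """
    index = {}
    current = None
    for raw_line in markdown.splitlines():
        line = raw_line.strip()
        if line.startswith("### "):
            current = line[4:].strip()
            index[current] = []
        elif line.startswith("- ") and current is not None:
            index[current].append(line[2:].strip())
    return index


if __name__ == "__main__":
    sample = """# Memory Index
## Topics

### Investing
- Portfolio allocation strategy
- REIT analysis notes

### Calendar
- Recurring events
"""
    for category, entries in parse_topic_index(sample).items():
        print(category, entries)
```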
How Memory Recall Works
When you start a new session, here’s the flow:
1. Load ROOT.md (~3K tokens)
2. User asks a question
3. Check if relevant memories exist in topic index
4. If yes, load specific daily/weekly files
5. Inject memories into context
6. Generate response

The magic is in steps 3 to 5. You don’t load everything, only what’s relevant.
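One way to keep steps 4 and 5 within a token budget is to always include the index and then add the best-scoring memories until the budget runs out. This is a sketch under assumptions: `assemble_context` and its `(score, text)` input shape are hypothetical, and the token count is a crude estimate of about four characters per token rather than a real tokenizer.

```python
def estimate_tokens(text):
    """Rough token estimate (assumption: ~4 characters per token)."""
    return max(1, len(text) // 4)


def assemble_context(root_text, candidates, budget=5000):
    """Build the context from ROOT plus the best memories that fit.

    `candidates` is a list of (score, text) pairs. Memories are tried in
    descending score order; one that would exceed the budget is skipped.
    Returns the joined context and the estimated tokens used.
    """
    parts = [root_text]
    used = estimate_tokens(root_text)
    for _score, text in sorted(candidates, key=lambda c: c[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost > budget:
            continue
        parts.append(text)
        used += cost
    return "\n\n".join(parts), used


if __name__ == "__main__":
    context, used = assemble_context(
        "ROOT index text",
        [(0.9, "Decision: use SQLite for the side project"),
         (0.4, "Note: quarterly tax reminder")],
        budget=100,
    )
    print(used)
    print(context)
```

Skipping an oversized memory instead of stopping lets smaller, lower-ranked memories still use the remaining budget; stopping at the first miss would be a reasonable alternative.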
Semantic Search for Recall
For a file-based system, I implemented a simple semantic search using embeddings. Each topic and memory chunk gets embedded, and queries match against these embeddings.
import os
from openai import OpenAI
import numpy as np

client = OpenAI()

def get_embedding(text):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def search_memories(query, memory_index, top_k=5):
    query_embedding = get_embedding(query)

    results = []
    for memory in memory_index:
        similarity = cosine_similarity(query_embedding, memory["embedding"])
        results.append((similarity, memory))

    results.sort(reverse=True, key=lambda x: x[0])
    return results[:top_k]

This approach finds memories by meaning, not just keyword matching.
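The search function expects a `memory_index` whose entries carry an `"embedding"` key, but the article does not show how that index is built. A minimal sketch, assuming each chunk is a plain string and that embeddings are cached in a JSON file so unchanged chunks are not re-embedded: `build_memory_index`, the cache file, and the injected `embed_fn` are all hypothetical. In practice `embed_fn` could be the `get_embedding` function above.

```python
import hashlib
import json
from pathlib import Path


def build_memory_index(chunks, embed_fn, cache_path):
    """Return [{'text': ..., 'embedding': [...]}] for each chunk.

    Embeddings are cached on disk keyed by a SHA-256 hash of the text,
    so `embed_fn` is only called for chunks not seen before.
    """
    cache_file = Path(cache_path)
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}

    index = []
    for text in chunks:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = list(embed_fn(text))
        index.append({"text": text, "embedding": cache[key]})

    cache_file.write_text(json.dumps(cache))
    return index


if __name__ == "__main__":
    # Toy embedding for demonstration only; swap in a real embedding call.
    toy_embed = lambda text: [float(len(text)), 1.0]
    print(build_memory_index(["REIT analysis notes"], toy_embed, "embeddings.json"))
```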
Auto-Capture: Storing New Memories
The other half of the system is automatically capturing information from conversations. I use a pattern where the LLM extracts important information at the end of each session.
import json

def extract_memories(conversation_transcript):
    prompt = f"""
    Extract important information from this conversation that should be remembered.

    Format as JSON:
    {{
        "topics": ["topic1", "topic2"],
        "decisions": ["decision made"],
        "facts": ["fact learned"],
        "preferences": ["user preference noted"]
    }}

    Conversation:
    {conversation_transcript}
    """

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )

    # json.loads raises if the model wraps the JSON in extra text
    return json.loads(response.choices[0].message.content)

This runs automatically when a session ends, categorizing and storing new memories in the appropriate daily file.
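The extraction step depends on the model returning parseable JSON, and models sometimes wrap the object in a sentence or two. A forgiving parser is one way to handle that. The sketch below is an assumption about how such a helper might look (`parse_json_reply` is a hypothetical name): it pulls out the outermost `{...}` span and falls back to empty lists when parsing fails.

```python
import json

EXPECTED_KEYS = ("topics", "decisions", "facts", "preferences")


def parse_json_reply(reply):
    """Extract the memory object from an LLM reply, tolerating extra text.

    Returns a dict with every expected key mapped to a list; missing or
    malformed fields become empty lists instead of raising.
    """
    empty = {key: [] for key in EXPECTED_KEYS}
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return empty
    try:
        data = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return empty
    if not isinstance(data, dict):
        return empty
    return {
        key: data[key] if isinstance(data.get(key), list) else []
        for key in EXPECTED_KEYS
    }


if __name__ == "__main__":
    print(parse_json_reply('Sure, here it is: {"topics": ["tax"]} Hope that helps.'))
```

Returning empty lists keeps the end-of-session job from crashing on one bad reply; logging the raw reply in that case would make failures easier to spot.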
The Compaction Process
Raw transcripts pile up fast. That’s where compaction comes in. At the end of each day, a job runs to:
- Daily compaction: Summarize raw transcripts into a daily file
- Weekly compaction: Summarize 7 daily files into a weekly summary
- Monthly compaction: Summarize 4 weekly files into a monthly summary
def compact_memories(source_files, output_file, level="daily"):
    # source_files and output_file are open file objects
    combined = ""
    for f in source_files:
        combined += f.read() + "\n\n"

    prompt = f"""
    Summarize these {level} memories into a concise format.
    Keep:
    - Key decisions made
    - Important facts learned
    - Recurring themes
    - Action items status

    Remove:
    - Redundant information
    - Temporary context
    - Small talk

    Memories:
    {combined}
    """

    # llm_generate stands in for whichever LLM call you use
    summary = llm_generate(prompt)
    output_file.write(summary)

Each level keeps less detail but preserves the essence. The ROOT.md index gets updated to reflect new topics.
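Before weekly compaction can run, the daily files have to be grouped by week. A minimal sketch of that grouping, assuming daily files are named `YYYY-MM-DD.md` and that weekly files use ISO week numbers (the article does not say which week convention it uses); `group_daily_by_week` is a hypothetical helper.

```python
from collections import defaultdict
from datetime import date
from pathlib import Path


def group_daily_by_week(filenames):
    """Map each weekly output file name to the daily files it covers.

    Assumes daily files are named YYYY-MM-DD.md and that weeks follow
    ISO numbering (weeks start on Monday).
    """
    groups = defaultdict(list)
    for name in sorted(filenames):
        day = date.fromisoformat(Path(name).stem)
        iso_year, iso_week, _ = day.isocalendar()
        groups[f"{iso_year}-W{iso_week:02d}.md"].append(name)
    return dict(groups)


if __name__ == "__main__":
    daily = ["2024-01-14.md", "2024-01-15.md", "2024-01-16.md"]
    for weekly_name, members in group_daily_by_week(daily).items():
        print(weekly_name, members)
```

With ISO numbering, the last days of December can belong to week 1 of the following year, so the weekly file name uses the ISO year, not the calendar year.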
Common Mistakes I Made
Storing Everything
My first attempt stored every message. The signal-to-noise ratio was terrible. Memories should be curated, not dumped.
Bad approach:
# WRONG: Storing every message
for message in conversation:
    memory.append(message)  # Too much noise

Better approach:

# RIGHT: Extract only what matters
important_facts = extract_key_information(conversation)
memory.extend(important_facts)  # Curated and useful

No Categorization
A flat list of memories becomes unsearchable. The topic index is essential for efficient retrieval.
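As a small illustration of the difference, a flat list can be turned into a topic-keyed lookup in a few lines. The record shape here (a dict with `"text"` and `"topics"` keys) and the `"uncategorized"` fallback are assumptions made for the sketch, not the article's storage format.

```python
from collections import defaultdict


def build_topic_index(memories):
    """Group memory texts by topic so retrieval can start from a topic.

    Each memory is assumed to be a dict with 'text' and an optional
    'topics' list; memories without topics go under 'uncategorized'.
    """
    index = defaultdict(list)
    for memory in memories:
        for topic in memory.get("topics") or ["uncategorized"]:
            index[topic].append(memory["text"])
    return dict(index)


if __name__ == "__main__":
    flat = [
        {"text": "Rebalance portfolio quarterly", "topics": ["investing"]},
        {"text": "Prefers morning meetings", "topics": ["calendar", "personal-crm"]},
        {"text": "Random note"},
    ]
    print(build_topic_index(flat))
```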
No Scoring
Initial versions returned memories in random order. Adding relevance scoring dramatically improved recall quality.
from datetime import datetime

def score_memory_relevance(memory, query, context):
    score = 0

    # Semantic similarity
    score += cosine_similarity(memory.embedding, query.embedding) * 0.4

    # Recency boost: newer memories score higher
    days_old = (datetime.now() - memory.timestamp).days
    score += (1 / (days_old + 1)) * 0.3

    # Topic match
    if memory.topic in context.topics:
        score += 0.3

    return score

Manual Memory Management
I tried manually adding memories. It lasted three days before I forgot. Auto-capture and auto-compact are non-negotiable features.
Results After Two Months
After running this system for ~60 days of daily sessions:
- ROOT.md: 3,247 tokens, covers 70+ topics across 15 categories
- Recall accuracy: The right memory surfaces in ~90% of relevant queries
- Session startup: ~5 seconds to load context
- Token overhead: Only 3K tokens at session start, plus ~1-2K for specific memories
The categories that emerged naturally:
- daily-briefing
- investing
- macro
- travel
- email
- finance
- personal-crm
- calendar
- housing
- language
- health
- tax
- shopping
- side-project
- insurance

Integration with Claude
To use this with Claude specifically, I structure the system prompt to include the memory context:
import anthropic

def build_system_prompt(root_memory, relevant_memories):
    return f"""
    You have access to persistent memory from previous sessions.

    MEMORY INDEX:
    {root_memory}

    RELEVANT MEMORIES FOR THIS SESSION:
    {relevant_memories}

    Use this context to maintain continuity across sessions.
    When you learn new important information, note it for storage.
    """

def chat_with_memory(user_message, memory_system):
    root = memory_system.load_root()
    relevant = memory_system.search(user_message)

    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=4096,
        system=build_system_prompt(root, relevant),
        messages=[{"role": "user", "content": user_message}]
    )

    # response.content is a list of content blocks; return the text of the first
    return response.content[0].text

Alternatives Considered
LangChain Memory: Good for chatbots, but I wanted more control over the compression strategy.
Vector databases (Pinecone, Weaviate): Overkill for personal use, adds infrastructure complexity.
Mem0 and similar services: Works, but I prefer owning my data and understanding the system end-to-end.
The file-based approach hits the sweet spot: simple, transparent, and fully controllable.
Final Words + More Resources
My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can contact me by email.
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!