Why AI Agents Forget Everything Between Sessions (And How to Fix It)
I spent 20 minutes explaining my project architecture to Claude yesterday. Today, it asked me what framework I’m using. Again.
“You tell your agent to restructure your portfolio on Monday, explain your risk tolerance, walk through the rationale. Wednesday it asks your risk tolerance again from scratch. 20 minutes and 30K+ tokens gone on something you already discussed.”
This isn’t a bug. It’s the fundamental design of how LLMs work. And until I understood why this happens, I kept trying solutions that were doomed from the start.
The Problem: Stateless by Design
Here’s what I tried first. I created a MEMORY.md file:
```markdown
# Project Context

## User Preferences
- Prefers TypeScript over JavaScript
- Uses functional programming style
- Wants detailed comments in code

## Current Project
- E-commerce platform
- Tech stack: Next.js, PostgreSQL, Prisma
- Authentication: Clerk
```

Then I instructed my agent to “always read MEMORY.md before responding.”
This worked for about a week. Then the file grew to 15KB. My agent started skipping sections. Then it started ignoring the file entirely because “it’s too long to read every time.”
The real issue? I was treating a symptom, not the disease.
Why LLMs Can’t Remember
LLMs are fundamentally stateless. Let me show you what actually happens in each session:
```
┌─────────────────────────────────────────────────────────┐
│ Input: [System Prompt] + [Your Message] + [Context]     │
│                          ↓                              │
│                    LLM Processing                       │
│                          ↓                              │
│ Output: [Response]                                      │
│                                                         │
│ Memory State: NONE (everything discarded after output)  │
└─────────────────────────────────────────────────────────┘
```

When you send a message to Claude or GPT:
- Your message is appended to the conversation history held by the client (trimmed to the token limit)
- The API receives that full history in a single request
- The model processes it and generates a response
- The model keeps nothing once the response is sent
There’s no hidden database storing your preferences. No secret learning happening in the background. Each API call is isolated.
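To make the isolation concrete, here is a minimal sketch that uses a toy stand-in for the model instead of a real API. The function `stateless_model` and its string matching are purely illustrative assumptions; the point is only that the answer depends on nothing except the messages passed into that one call.

```python
def stateless_model(messages: list[dict]) -> str:
    """Toy stand-in for an LLM API call: it sees only `messages`, nothing else."""
    for m in messages:
        if m["role"] == "user" and "framework is" in m["content"]:
            framework = m["content"].split("framework is")[-1].strip(" .")
            return f"You are using {framework}."
    return "Which framework are you using?"


history = [{"role": "user", "content": "My framework is Next.js."}]
question = {"role": "user", "content": "What am I building with?"}

# Call 1: the client resends the earlier turn, so the "model" can answer
with_history = stateless_model(history + [question])

# Call 2: a fresh session sends only the new question, so the fact is gone
without_history = stateless_model([question])

print(with_history)
print(without_history)
```

Any "memory" in the first call comes from the client resending the history, not from anything the model kept.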
The Token Limit Trap
I thought larger context windows would solve this. GPT-4 Turbo has 128K tokens. Claude has 200K tokens. That’s a lot of room, right?
Wrong. Here’s what I learned the hard way:
```
┌──────────────────────────────────────────────────────┐
│ Total Context Window: 128K tokens                    │
│ ─────────────────────────────────────────────────────│
│ System Prompt:           ~2K tokens                  │
│ Conversation History:    ~50K tokens (past sessions) │
│ Current Message:         ~1K tokens                  │
│ ─────────────────────────────────────────────────────│
│ Remaining for Response:  ~75K tokens                 │
│                                                      │
│ Problem: Attention quality DEGRADES with length      │
│ Cost: Every token costs money, even old context      │
└──────────────────────────────────────────────────────┘
```

I ran up a $400 API bill in one month re-sending the same context. The model’s attention also degraded: it would “forget” instructions buried in long conversation histories.
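The arithmetic behind that budget, and behind a bill like mine, can be sketched in a few lines. The call count and the per-token price below are illustrative assumptions chosen to show how quickly re-sent context adds up; they are not my actual usage figures or any provider's current pricing.

```python
def remaining_tokens(window: int, system: int, history: int, message: int) -> int:
    """Tokens left for the response after the fixed parts of the prompt."""
    return window - system - history - message


def resend_cost(context_tokens: int, calls: int, usd_per_million_input: float) -> float:
    """Input-token cost of re-sending the same context on every call."""
    return context_tokens * calls * usd_per_million_input / 1_000_000


# Budget from the diagram above: 128K window, 2K system, 50K history, 1K message
remaining = remaining_tokens(128_000, 2_000, 50_000, 1_000)

# Hypothetical month: 800 calls, each re-sending 50K tokens, at an assumed
# price of $10 per million input tokens
monthly = resend_cost(context_tokens=50_000, calls=800, usd_per_million_input=10.0)

print(remaining, monthly)
```

The cost scales with context size times number of calls, so trimming what is re-sent matters more than the size of the window.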
Failed Solutions I Tried
Attempt 1: RAG (Retrieval-Augmented Generation)
I built a RAG system to retrieve relevant context:
```python
from langchain.vectorstores import Pinecone
from langchain.embeddings import OpenAIEmbeddings


def get_relevant_context(query: str) -> list[str]:
    """Retrieve relevant documents for context."""
    embeddings = OpenAIEmbeddings()
    vectorstore = Pinecone.from_existing_index(
        index_name="project-memory",
        embedding=embeddings
    )

    results = vectorstore.similarity_search(
        query=query,
        k=5  # top 5 relevant chunks
    )

    return [doc.page_content for doc in results]
```

The problem? The agent doesn’t know what to retrieve. I still had to manually specify “search for project architecture” or “look up user preferences.” The awareness problem remained.
Attempt 2: Summary Chain
I tried summarizing each conversation and storing it:
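```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def format_messages(messages: list[dict]) -> str:
    """Render messages as 'role: content' lines.

    Helper not shown in the original; this is a minimal assumed version.
    """
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)


def summarize_conversation(messages: list[dict]) -> str:
    """Summarize conversation for persistent storage."""
    summary_prompt = f"""
    Summarize the key decisions, preferences, and context from this conversation:
    {format_messages(messages)}

    Focus on:
    - User preferences stated
    - Technical decisions made
    - Constraints established
    - Questions left unresolved
    """

    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1000,
        messages=[{"role": "user", "content": summary_prompt}]
    )

    return response.content[0].text
```

This helped, but summaries grow stale. “User prefers dark mode” is useful. “User is debugging the auth flow in login.tsx line 47” is obsolete after 2 days.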
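One way to picture the staleness problem is to give each summarized fact a time-to-live. The sketch below is an illustration of that idea only, not something I had in place at this stage; `SummaryFact`, `fresh_facts`, and the chosen lifetimes are hypothetical names and values.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class SummaryFact:
    text: str
    created_at: datetime
    ttl: Optional[timedelta]  # None means the fact never expires


def fresh_facts(facts: list[SummaryFact], now: datetime) -> list[SummaryFact]:
    """Keep facts that have no expiry or are still within their lifetime."""
    return [f for f in facts if f.ttl is None or now - f.created_at <= f.ttl]


now = datetime(2024, 6, 10, tzinfo=timezone.utc)
facts = [
    # Durable preference: no expiry
    SummaryFact("User prefers dark mode", now - timedelta(days=30), ttl=None),
    # Short-lived working state: assumed 2-day lifetime, already 3 days old
    SummaryFact(
        "User is debugging the auth flow in login.tsx line 47",
        now - timedelta(days=3),
        ttl=timedelta(days=2),
    ),
]

kept = [f.text for f in fresh_facts(facts, now)]
print(kept)
```

A flat summary has no such distinction: every sentence in it ages at the same rate, which is why it kept surfacing obsolete details.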
The Solution: Structured Persistent Memory
After months of frustration, I studied how production AI agent systems handle this. They use a three-component architecture:
```
+------------------+     +-------------------+     +------------------+
|  Memory Extract  | --> |   Memory Store    | --> |  Memory Inject   |
|  (What to save)  |     |  (Where to save)  |     |  (When to use)   |
+------------------+     +-------------------+     +------------------+
         │                         │                        │
         ↓                         ↓                        ↓
   Classification           Database Query          Context Assembly
     Extraction             CRUD Operations         Token Management
```

Component 1: Memory Extraction
First, I needed to identify what to remember. Not everything matters:
```python
import uuid
from datetime import datetime, timezone
from enum import Enum

from pydantic import BaseModel


class MemoryCategory(str, Enum):
    PROFILE = "profile"          # User info: name, role, company
    PREFERENCES = "preferences"  # Aggregated by topic
    ENTITIES = "entities"        # Projects, people, concepts
    EVENTS = "events"            # Decisions, milestones
    CASES = "cases"              # Specific problems + solutions
    PATTERNS = "patterns"        # Reusable processes/methods


class Memory(BaseModel):
    id: str
    category: MemoryCategory
    content: str
    metadata: dict
    created_at: str
    last_accessed: str
    access_count: int = 0


def generate_id() -> str:
    """Return a unique memory ID (helper not shown originally; assumed UUID4)."""
    return str(uuid.uuid4())


def call_extraction_model(conversation: list[dict], prompt: str) -> list[dict]:
    """Placeholder for the LLM call, which is not shown here.

    Assumed to return dicts with "category", "content",
    "confidence" (0.0-1.0) and "importance" (1-10) keys.
    """
    raise NotImplementedError("plug in your LLM client here")


def extract_memories(conversation: list[dict]) -> list[Memory]:
    """Extract structured memories from conversation."""
    extraction_prompt = """
    Analyze this conversation and extract structured memories.

    Categories:
    - PROFILE: User identity information
    - PREFERENCES: Stated preferences (coding style, tools, etc.)
    - ENTITIES: Named items (projects, APIs, libraries)
    - EVENTS: Decisions made, milestones reached
    - CASES: Problem-solution pairs
    - PATTERNS: Reusable workflows discovered

    For each memory, provide:
    1. Category
    2. Content (concise, factual)
    3. Confidence (0.0-1.0, how certain is this memory)
    4. Importance (1-10 scale)
    """

    # Call LLM to extract memories
    extracted = call_extraction_model(conversation, extraction_prompt)

    # Filter by confidence and importance
    now = datetime.now(timezone.utc).isoformat()
    return [
        Memory(
            id=generate_id(),
            # Enum values are lowercase; normalize whatever case the model returns
            category=MemoryCategory(m["category"].lower()),
            content=m["content"],
            metadata={"confidence": m["confidence"]},
            created_at=now,
            last_accessed=now
        )
        for m in extracted
        if m["confidence"] > 0.7 and m["importance"] > 3
    ]
```

Component 2: Memory Storage
Next, I needed a database that supports efficient retrieval:
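```sql
-- The vector column type requires the pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE memories (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    category VARCHAR(50) NOT NULL,
    content TEXT NOT NULL,
    embedding vector(1536),  -- OpenAI embedding dimension
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    last_accessed TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    access_count INTEGER DEFAULT 0,
    importance INTEGER DEFAULT 5
);

-- Index for fast category filtering
CREATE INDEX idx_memories_category ON memories(category);

-- Index for vector similarity search
CREATE INDEX idx_memories_embedding ON memories
USING ivfflat (embedding vector_cosine_ops);

-- Index for recency-based queries
CREATE INDEX idx_memories_accessed ON memories(last_accessed DESC);
```

```python
import json
from typing import Optional

import asyncpg
from openai import OpenAI
from pgvector.asyncpg import register_vector  # assumes the pgvector Python package


class MemoryStore:
    def __init__(self, db_url: str):
        self.db_url = db_url
        self.client = OpenAI()

    async def _connect(self) -> asyncpg.Connection:
        # asyncpg.connect() must be awaited; it is not an async context manager
        conn = await asyncpg.connect(self.db_url)
        await register_vector(conn)  # lets asyncpg send and receive vector values
        return conn

    async def store(self, memory: Memory) -> None:
        """Store a memory with its embedding."""
        embedding = self._get_embedding(memory.content)

        conn = await self._connect()
        try:
            await conn.execute(
                """
                INSERT INTO memories (category, content, embedding, metadata)
                VALUES ($1, $2, $3, $4)
                """,
                memory.category.value,
                memory.content,
                embedding,
                json.dumps(memory.metadata),  # asyncpg expects JSONB as a string
            )
        finally:
            await conn.close()

    async def retrieve_relevant(
        self,
        query: str,
        categories: Optional[list[MemoryCategory]] = None,
        limit: int = 10
    ) -> list[Memory]:
        """Retrieve memories relevant to a query."""
        query_embedding = self._get_embedding(query)
        category_values = [c.value for c in categories] if categories else None

        conn = await self._connect()
        try:
            # <=> is cosine distance; ordering by it directly lets the
            # ivfflat index be used (same order as similarity DESC)
            sql = """
                SELECT id, category, content, metadata, created_at,
                       last_accessed, access_count,
                       1 - (embedding <=> $1) AS similarity
                FROM memories
                WHERE ($2::varchar[] IS NULL OR category = ANY($2))
                ORDER BY embedding <=> $1
                LIMIT $3
            """
            rows = await conn.fetch(sql, query_embedding, category_values, limit)

            # Update access statistics
            for row in rows:
                await conn.execute(
                    """
                    UPDATE memories
                    SET last_accessed = NOW(), access_count = access_count + 1
                    WHERE id = $1
                    """,
                    row["id"],
                )
        finally:
            await conn.close()

        # Convert database types (UUID, datetime, JSON text) to the Memory model
        return [
            Memory(
                id=str(row["id"]),
                category=row["category"],
                content=row["content"],
                metadata=json.loads(row["metadata"]),
                created_at=row["created_at"].isoformat(),
                last_accessed=row["last_accessed"].isoformat(),
                access_count=row["access_count"],
            )
            for row in rows
        ]

    def _get_embedding(self, text: str) -> list[float]:
        response = self.client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding
```

Component 3: Memory Injection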
Finally, I needed to inject relevant memories at the right time:
```python
from typing import Optional


class MemoryInjector:
    def __init__(self, memory_store: MemoryStore, token_limit: int = 4000):
        self.store = memory_store
        self.token_limit = token_limit

    async def inject_context(
        self,
        system_prompt: str,
        user_message: str,
        conversation_history: Optional[list[dict]] = None
    ) -> str:
        """Build augmented system prompt with relevant memories."""

        # Analyze what memories are needed
        memory_query = await self._determine_memory_needs(
            user_message, conversation_history
        )

        # Retrieve relevant memories
        memories = await self.store.retrieve_relevant(
            query=memory_query,
            limit=20  # Limit to avoid token overflow
        )

        # Format memories for injection
        memory_context = self._format_memories(memories)

        # Calculate token usage
        memory_tokens = self._estimate_tokens(memory_context)

        if memory_tokens > self.token_limit:
            # Truncate if necessary
            memory_context = self._truncate_context(
                memory_context, self.token_limit
            )

        # Inject into system prompt
        augmented_prompt = f"""{system_prompt}

## Persistent Memory

{memory_context}

## End Persistent Memory"""
        return augmented_prompt

    async def _determine_memory_needs(
        self,
        message: str,
        history: Optional[list[dict]] = None
    ) -> str:
        """Determine what memories to retrieve based on context."""
        # Combine current message with recent history
        context = message
        if history:
            recent = history[-3:]  # Last 3 messages
            context = " ".join([m["content"] for m in recent]) + " " + message

        # Use LLM to identify memory categories needed
        categorization = await self._categorize_needs(context)

        return categorization

    async def _categorize_needs(self, context: str) -> str:
        # Not shown originally. Placeholder assumption: use the raw context
        # as the retrieval query; swap in an LLM call to refine it.
        return context

    def _estimate_tokens(self, text: str) -> int:
        # Not shown originally. Rough heuristic (about 4 characters per
        # token for English); use a real tokenizer for accuracy.
        return len(text) // 4

    def _truncate_context(self, text: str, token_limit: int) -> str:
        # Not shown originally. Assumed behavior: drop trailing lines
        # until the estimate fits the limit.
        lines = text.split("\n")
        while lines and self._estimate_tokens("\n".join(lines)) > token_limit:
            lines.pop()
        return "\n".join(lines)

    def _format_memories(self, memories: list[Memory]) -> str:
        """Format memories for injection into prompt."""
        sections = {}

        for memory in memories:
            if memory.category not in sections:
                sections[memory.category] = []
            sections[memory.category].append(memory.content)

        output = []
        for category, items in sections.items():
            output.append(f"### {category.value.upper()}")
            for item in items:
                output.append(f"- {item}")
            output.append("")

        return "\n".join(output)
```

Putting It All Together
Here’s the complete flow I now use:
```python
import os

from langchain_anthropic import ChatAnthropic

# Assumption: the connection string comes from an environment variable
DB_URL = os.environ["DATABASE_URL"]


class StatefulAgent:
    def __init__(self):
        self.store = MemoryStore(DB_URL)
        self.injector = MemoryInjector(self.store)
        self.llm = ChatAnthropic(model="claude-3-sonnet-20240229")

    async def chat(self, user_id: str, message: str) -> str:
        """Process a message with persistent memory."""

        # 1. Retrieve relevant memories
        system_prompt = await self.injector.inject_context(
            system_prompt="You are a helpful coding assistant.",
            user_message=message
        )

        # 2. Get conversation history (if any)
        # get_conversation_history and append_history are assumed MemoryStore
        # methods backed by a per-user history table; they are not shown above.
        history = await self.store.get_conversation_history(user_id)

        # 3. Generate response
        response = await self.llm.ainvoke([
            {"role": "system", "content": system_prompt},
            *history,
            {"role": "user", "content": message}
        ])

        # 4. Extract and store new memories
        new_memories = extract_memories([
            {"role": "user", "content": message},
            {"role": "assistant", "content": response.content}
        ])

        for memory in new_memories:
            # MemoryStore.store() as defined above takes only the memory;
            # per-user isolation would need a user_id column on the table.
            await self.store.store(memory)

        # 5. Update conversation history
        await self.store.append_history(
            user_id,
            [
                {"role": "user", "content": message},
                {"role": "assistant", "content": response.content}
            ]
        )

        return response.content
```

Key Lessons Learned
After implementing this system, my API costs dropped by 60% and my agent actually remembers things between sessions. Here’s what I learned:
1. Categorize memories - don’t dump everything into one pile. PROFILE memories rarely change. PREFERENCES need updates. EVENTS become stale quickly. Different categories need different retention policies.
2. Use embeddings for semantic retrieval. Keyword search fails when users express the same concept differently. “I hate verbose code” and “keep functions short” should retrieve the same preference.
3. Token budgets are real. Even with 200K context windows, you can’t stuff everything. I limit memory injection to 4K tokens and use recency + importance scoring to prioritize.
4. Memory extraction is lossy. The LLM won’t capture everything important. I keep raw conversation logs and re-extract memories periodically.
5. Privacy matters. Memory persistence means sensitive data persists too. I added user-level isolation and memory expiration policies.
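To illustrate the recency + importance prioritization from lesson 3, here is a minimal sketch. The scoring formula (importance scaled by an exponential recency decay), the 30-day half-life, the 4-characters-per-token estimate, and the names `ScoredMemory`, `score`, and `select_within_budget` are all assumptions made for this example, not the exact logic of my system.

```python
from dataclasses import dataclass


@dataclass
class ScoredMemory:
    content: str
    importance: int   # 1-10 scale
    age_days: float   # days since last access


def score(m: ScoredMemory, half_life_days: float = 30.0) -> float:
    """Importance weighted by recency; recency halves every `half_life_days`."""
    recency = 0.5 ** (m.age_days / half_life_days)
    return (m.importance / 10) * recency


def estimate_tokens(text: str) -> int:
    """Rough heuristic: about 4 characters per token."""
    return max(1, len(text) // 4)


def select_within_budget(
    memories: list[ScoredMemory], token_budget: int
) -> list[ScoredMemory]:
    """Greedily take the highest-scoring memories that still fit the budget."""
    chosen, used = [], 0
    for m in sorted(memories, key=score, reverse=True):
        cost = estimate_tokens(m.content)
        if used + cost <= token_budget:
            chosen.append(m)
            used += cost
    return chosen


memories = [
    ScoredMemory("Prefers TypeScript over JavaScript", importance=9, age_days=60),
    ScoredMemory("Debugging auth flow in login.tsx", importance=4, age_days=1),
    ScoredMemory("Chose Prisma as the ORM", importance=8, age_days=10),
]

selected = [m.content for m in select_within_budget(memories, token_budget=14)]
print(selected)
```

With a tiny budget the old but important preference loses out to fresher items, which is a reason to give categories such as PROFILE a much longer half-life, or none, under a scheme like this.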
Final Words + More Resources
My intention with this article was to share my knowledge and experience with others.