
Why AI Agents Forget Everything Between Sessions (And How to Fix It)

I spent 20 minutes explaining my project architecture to Claude yesterday. Today, it asked me what framework I’m using. Again.

“You tell your agent to restructure your portfolio on Monday, explain your risk tolerance, walk through the rationale. Wednesday it asks your risk tolerance again from scratch. 20 minutes and 30K+ tokens gone on something you already discussed.”

This isn’t a bug. It’s the fundamental design of how LLMs work. And until I understood why this happens, I kept trying solutions that were doomed from the start.

The Problem: Stateless by Design

Here’s what I tried first. I created a MEMORY.md file:

MEMORY.md
# Project Context
## User Preferences
- Prefers TypeScript over JavaScript
- Uses functional programming style
- Wants detailed comments in code
## Current Project
- E-commerce platform
- Tech stack: Next.js, PostgreSQL, Prisma
- Authentication: Clerk

Then I instructed my agent to “always read MEMORY.md before responding.”

This worked for about a week. Then the file grew to 15KB. My agent started skipping sections. Then it started ignoring the file entirely because “it’s too long to read every time.”

The real issue? I was treating a symptom, not the disease.

Why LLMs Can’t Remember

LLMs are fundamentally stateless. Let me show you what actually happens in each session:

Session Anatomy
┌──────────────────────────────────────────────────────────┐
│  Input: [System Prompt] + [Your Message] + [Context]     │
│                            ↓                             │
│                      LLM Processing                      │
│                            ↓                             │
│                    Output: [Response]                    │
│                                                          │
│  Memory State: NONE (everything discarded after output)  │
└──────────────────────────────────────────────────────────┘

When you send a message to Claude or GPT through the API:

  1. The API receives your request
  2. The request carries the conversation history so far, resent by the client (up to the token limit)
  3. The model processes that input and generates a response
  4. The model keeps nothing once the response is sent

There’s no hidden database storing your preferences. No secret learning happening in the background. Each API call is isolated.
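To make the isolation concrete, here is a minimal sketch that uses a stand-in function instead of a real API. `fake_model` and `ask` are hypothetical names I made up for illustration. The only "memory" is the list of messages the client chooses to resend.

```python
from typing import Callable

Message = dict[str, str]


def fake_model(messages: list[Message]) -> str:
    """Stand-in for an LLM call: it can only see what is in `messages`."""
    seen = " ".join(m["content"] for m in messages)
    return "Next.js" if "Next.js" in seen else "What framework are you using?"


def ask(
    history: list[Message],
    user_text: str,
    model: Callable[[list[Message]], str] = fake_model,
) -> str:
    """Send the full history plus the new message, then record both turns."""
    history.append({"role": "user", "content": user_text})
    reply = model(history)
    history.append({"role": "assistant", "content": reply})
    return reply


# Session 1: the framework is mentioned, so it is part of every later request.
session_one: list[Message] = []
ask(session_one, "My project uses Next.js.")
same_session = ask(session_one, "Which framework do I use?")

# Session 2: a fresh history list, so nothing from session 1 is visible.
session_two: list[Message] = []
new_session = ask(session_two, "Which framework do I use?")
```

In the first session the answer is available only because the client resent the earlier message. The second session starts from an empty list, so the model has nothing to go on.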

The Token Limit Trap

I thought larger context windows would solve this. GPT-4 Turbo has 128K tokens. Claude has 200K tokens. That’s a lot of room, right?

Wrong. Here’s what I learned the hard way:

Context Window Reality
┌────────────────────────────────────────────────────────┐
│  Total Context Window:    128K tokens                  │
├────────────────────────────────────────────────────────┤
│  System Prompt:           ~2K tokens                   │
│  Conversation History:    ~50K tokens (past sessions)  │
│  Current Message:         ~1K tokens                   │
├────────────────────────────────────────────────────────┤
│  Remaining for Response:  ~75K tokens                  │
│                                                        │
│  Problem: Attention quality DEGRADES with length       │
│  Cost: Every token costs money, even old context       │
└────────────────────────────────────────────────────────┘

I ran up a $400 API bill in one month re-sending the same context. The model’s attention also degraded: it would “forget” instructions buried in long conversation histories.
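The arithmetic behind that bill is easy to sketch. The function names and the price below are assumptions for illustration, not my actual billing data. The point is that resending a growing history makes input tokens add up much faster than the conversation itself.

```python
def resent_input_tokens(
    context_tokens: int, message_tokens: int, reply_tokens: int, turns: int
) -> int:
    """Total input tokens billed when the full history is resent every turn."""
    total = 0
    history = context_tokens
    for _ in range(turns):
        request = history + message_tokens  # whole history plus the new message
        total += request
        history = request + reply_tokens    # the reply joins the history too
    return total


def input_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Convert a token count to dollars at a given (hypothetical) price."""
    return tokens / 1_000_000 * usd_per_million_tokens


# 50K tokens of carried-over context, 1K per message and reply, 20 turns.
total = resent_input_tokens(
    context_tokens=50_000, message_tokens=1_000, reply_tokens=1_000, turns=20
)
cost = input_cost_usd(total, usd_per_million_tokens=10.0)  # price is made up
```

With these numbers, a 20-turn session sends 1.4 million input tokens, even though only 40K tokens of new text were exchanged.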

Failed Solutions I Tried

Attempt 1: RAG (Retrieval-Augmented Generation)

I built a RAG system to retrieve relevant context:

rag_retrieval.py
from langchain.vectorstores import Pinecone
from langchain.embeddings import OpenAIEmbeddings


def get_relevant_context(query: str) -> list[str]:
    """Retrieve relevant documents for context."""
    embeddings = OpenAIEmbeddings()
    vectorstore = Pinecone.from_existing_index(
        index_name="project-memory",
        embedding=embeddings
    )
    results = vectorstore.similarity_search(
        query=query,
        k=5  # top 5 relevant chunks
    )
    return [doc.page_content for doc in results]

The problem? The agent doesn’t know what to retrieve. I still had to manually specify “search for project architecture” or “look up user preferences.” The awareness problem remained.

Attempt 2: Summary Chain

I tried summarizing each conversation and storing it:

summarize_session.py
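import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def format_messages(messages: list[dict]) -> str:
    """Render messages as 'role: content' lines.

    This helper is not shown in the original code; a minimal version is assumed.
    """
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)


def summarize_conversation(messages: list[dict]) -> str:
    """Summarize conversation for persistent storage."""
    summary_prompt = f"""
Summarize the key decisions, preferences, and context from this conversation:
{format_messages(messages)}
Focus on:
- User preferences stated
- Technical decisions made
- Constraints established
- Questions left unresolved
"""
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1000,
        messages=[{"role": "user", "content": summary_prompt}]
    )
    return response.content[0].text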

This helped, but summaries grow stale. “User prefers dark mode” is useful. “User is debugging the auth flow in login.tsx line 47” is obsolete after 2 days.
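One way to picture the staleness problem is to give each summary item an optional time-to-live. This is a sketch under my own assumptions: `SummaryItem`, `drop_stale`, and the two-day TTL are hypothetical, not part of the summary chain above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class SummaryItem:
    text: str
    created_at: datetime
    ttl: Optional[timedelta] = None  # None means "keep until replaced"

    def is_stale(self, now: datetime) -> bool:
        """True once the item has outlived its time-to-live."""
        return self.ttl is not None and now - self.created_at > self.ttl


def drop_stale(items: list[SummaryItem], now: datetime) -> list[SummaryItem]:
    """Keep only the items that are still within their time-to-live."""
    return [item for item in items if not item.is_stale(now)]


# Hypothetical data mirroring the two examples above.
monday = datetime(2024, 1, 1, tzinfo=timezone.utc)
items = [
    SummaryItem("User prefers dark mode", created_at=monday),
    SummaryItem(
        "User is debugging the auth flow in login.tsx line 47",
        created_at=monday,
        ttl=timedelta(days=2),
    ),
]
fresh = drop_stale(items, now=monday + timedelta(days=3))
```

Three days later only the dark-mode preference survives. A flat text summary cannot make that distinction, because it has no notion of which facts are durable.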

The Solution: Structured Persistent Memory

After months of frustration, I studied how production AI agent systems handle this. They use a three-component architecture:

Memory Architecture
+------------------+     +-------------------+     +------------------+
|  Memory Extract  | --> |   Memory Store    | --> |  Memory Inject   |
|  (What to save)  |     |  (Where to save)  |     |  (When to use)   |
+------------------+     +-------------------+     +------------------+
         │                         │                        │
         ↓                         ↓                        ↓
   Classification           Database Query           Context Assembly
     Extraction             CRUD Operations          Token Management

Component 1: Memory Extraction

First, I needed to identify what to remember. Not everything matters:

memory_extractor.py
import uuid
from datetime import datetime, timezone
from enum import Enum

from pydantic import BaseModel


class MemoryCategory(str, Enum):
    PROFILE = "profile"          # User info: name, role, company
    PREFERENCES = "preferences"  # Aggregated by topic
    ENTITIES = "entities"        # Projects, people, concepts
    EVENTS = "events"            # Decisions, milestones
    CASES = "cases"              # Specific problems + solutions
    PATTERNS = "patterns"        # Reusable processes/methods


class Memory(BaseModel):
    id: str
    category: MemoryCategory
    content: str
    metadata: dict
    created_at: str
    last_accessed: str
    access_count: int = 0


def extract_memories(conversation: list[dict]) -> list[Memory]:
    """Extract structured memories from conversation."""
    extraction_prompt = """
Analyze this conversation and extract structured memories.
Categories:
- PROFILE: User identity information
- PREFERENCES: Stated preferences (coding style, tools, etc.)
- ENTITIES: Named items (projects, APIs, libraries)
- EVENTS: Decisions made, milestones reached
- CASES: Problem-solution pairs
- PATTERNS: Reusable workflows discovered
For each memory, provide:
1. Category
2. Content (concise, factual)
3. Confidence (0.0-1.0, how certain is this memory)
4. Importance (1-10 scale)
"""
    # Call LLM to extract memories. call_extraction_model is not shown here;
    # it is assumed to return a list of dicts with the keys used below.
    extracted = call_extraction_model(conversation, extraction_prompt)
    now = datetime.now(timezone.utc).isoformat()
    # Filter by confidence and importance
    return [
        Memory(
            id=str(uuid.uuid4()),
            # The prompt uses upper-case names; the enum values are lower-case
            category=MemoryCategory(m["category"].lower()),
            content=m["content"],
            metadata={"confidence": m["confidence"]},
            created_at=now,
            last_accessed=now
        )
        for m in extracted
        if m["confidence"] > 0.7 and m["importance"] > 3
    ]
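The extraction step depends on turning the model’s free-form reply into clean dicts before filtering. Here is a minimal sketch of that parsing, assuming the model is asked to answer with a JSON array. `parse_extraction_output` is a hypothetical helper, not part of the code above.

```python
import json

REQUIRED_KEYS = {"category", "content", "confidence", "importance"}
VALID_CATEGORIES = {
    "profile", "preferences", "entities", "events", "cases", "patterns",
}


def parse_extraction_output(raw: str) -> list[dict]:
    """Parse a JSON array of candidate memories, skipping malformed entries."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    parsed = []
    for item in data:
        if not isinstance(item, dict) or not REQUIRED_KEYS.issubset(item):
            continue  # not an object, or a required field is missing
        category = str(item["category"]).lower()
        if category not in VALID_CATEGORIES:
            continue  # the model invented a category
        try:
            confidence = float(item["confidence"])
            importance = int(item["importance"])
        except (TypeError, ValueError):
            continue  # non-numeric scores
        parsed.append({
            "category": category,
            "content": str(item["content"]).strip(),
            "confidence": confidence,
            "importance": importance,
        })
    return parsed


# Hypothetical model output: one valid entry, one bad category, one incomplete.
sample = """[
  {"category": "PREFERENCES", "content": "Prefers TypeScript",
   "confidence": 0.9, "importance": 7},
  {"category": "MOOD", "content": "Seems tired",
   "confidence": 0.8, "importance": 2},
  {"content": "Missing fields"}
]"""
candidates = parse_extraction_output(sample)
```

Dropping malformed entries rather than raising keeps one bad line in the model’s output from losing the whole batch.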

Component 2: Memory Storage

Next, I needed a database that supports efficient retrieval:

schema.sql
-- The vector column type comes from the pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE memories (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    category VARCHAR(50) NOT NULL,
    content TEXT NOT NULL,
    embedding vector(1536),  -- OpenAI embedding dimension
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    last_accessed TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    access_count INTEGER DEFAULT 0,
    importance INTEGER DEFAULT 5
);

-- Index for fast category filtering
CREATE INDEX idx_memories_category ON memories(category);

-- Index for vector similarity search
CREATE INDEX idx_memories_embedding ON memories
    USING ivfflat (embedding vector_cosine_ops);

-- Index for recency-based queries
CREATE INDEX idx_memories_accessed ON memories(last_accessed DESC);

memory_store.py
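import json
from typing import Optional

import asyncpg
from openai import OpenAI
from pgvector.asyncpg import register_vector  # from the pgvector Python package

from memory_extractor import Memory, MemoryCategory


class MemoryStore:
    def __init__(self, db_url: str):
        self.db_url = db_url
        self.client = OpenAI()

    async def _connect(self) -> asyncpg.Connection:
        """Open a connection that can send and receive vector values."""
        conn = await asyncpg.connect(self.db_url)
        await register_vector(conn)
        return conn

    async def store(self, memory: Memory) -> None:
        """Store a memory with its embedding."""
        embedding = self._get_embedding(memory.content)
        conn = await self._connect()
        try:
            # asyncpg expects JSONB parameters as JSON text by default
            await conn.execute("""
                INSERT INTO memories (category, content, embedding, metadata)
                VALUES ($1, $2, $3, $4)
            """, memory.category.value, memory.content, embedding,
                json.dumps(memory.metadata))
        finally:
            await conn.close()

    async def retrieve_relevant(
        self,
        query: str,
        categories: Optional[list[MemoryCategory]] = None,
        limit: int = 10
    ) -> list[Memory]:
        """Retrieve memories relevant to a query."""
        query_embedding = self._get_embedding(query)
        category_filter = [c.value for c in categories] if categories else None
        # Ordering by the distance operator itself lets the ivfflat index be used
        sql = """
            SELECT id, category, content, metadata, created_at,
                   last_accessed, access_count,
                   1 - (embedding <=> $1) AS similarity
            FROM memories
            WHERE ($2::varchar[] IS NULL OR category = ANY($2))
            ORDER BY embedding <=> $1
            LIMIT $3
        """
        conn = await self._connect()
        try:
            rows = await conn.fetch(sql, query_embedding, category_filter, limit)
            # Update access statistics in a single statement
            await conn.execute("""
                UPDATE memories
                SET last_accessed = NOW(),
                    access_count = access_count + 1
                WHERE id = ANY($1::uuid[])
            """, [row["id"] for row in rows])
        finally:
            await conn.close()
        # Convert database types (UUID, datetime, JSON text) to the Memory model
        return [
            Memory(
                id=str(row["id"]),
                category=row["category"],
                content=row["content"],
                metadata=json.loads(row["metadata"]),
                created_at=row["created_at"].isoformat(),
                last_accessed=row["last_accessed"].isoformat(),
                access_count=row["access_count"],
            )
            for row in rows
        ]

    def _get_embedding(self, text: str) -> list[float]:
        # Synchronous call: this blocks the event loop while the request runs
        response = self.client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding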
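The `<=>` operator in the query above is pgvector’s cosine distance, and `1 - distance` gives cosine similarity. Here is a small pure-Python sketch of the same ranking, using toy 3-dimensional vectors in place of real embeddings. The function names and the data are made up for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def nearest(
    query: list[float], stored: dict[str, list[float]], limit: int
) -> list[tuple[str, float]]:
    """Rank stored vectors by similarity to the query, most similar first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Toy vectors standing in for real 1536-dimensional embeddings.
stored = {
    "prefers short functions": [0.9, 0.1, 0.0],
    "uses PostgreSQL": [0.0, 0.2, 0.9],
    "dislikes verbose code": [0.8, 0.3, 0.1],
}
results = nearest([1.0, 0.0, 0.0], stored, limit=2)
```

The two style-related entries rank above the database entry because their vectors point in nearly the same direction as the query, regardless of the words used.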

Component 3: Memory Injection

Finally, I needed to inject relevant memories at the right time:

memory_injector.py
from typing import Optional

from memory_extractor import Memory
from memory_store import MemoryStore


class MemoryInjector:
    # _categorize_needs, _estimate_tokens, and _truncate_context are helper
    # methods that are not shown in this article.

    def __init__(self, memory_store: MemoryStore, token_limit: int = 4000):
        self.store = memory_store
        self.token_limit = token_limit

    async def inject_context(
        self,
        system_prompt: str,
        user_message: str,
        conversation_history: Optional[list[dict]] = None
    ) -> str:
        """Build augmented system prompt with relevant memories."""
        # Analyze what memories are needed
        memory_query = await self._determine_memory_needs(
            user_message,
            conversation_history
        )
        # Retrieve relevant memories
        memories = await self.store.retrieve_relevant(
            query=memory_query,
            limit=20  # Limit to avoid token overflow
        )
        # Format memories for injection
        memory_context = self._format_memories(memories)
        # Calculate token usage
        memory_tokens = self._estimate_tokens(memory_context)
        if memory_tokens > self.token_limit:
            # Truncate if necessary
            memory_context = self._truncate_context(
                memory_context,
                self.token_limit
            )
        # Inject into system prompt
        augmented_prompt = (
            f"{system_prompt}\n\n"
            "## Persistent Memory\n"
            f"{memory_context}\n"
            "## End Persistent Memory\n"
        )
        return augmented_prompt

    async def _determine_memory_needs(
        self,
        message: str,
        history: Optional[list[dict]] = None
    ) -> str:
        """Determine what memories to retrieve based on context."""
        # Combine current message with recent history
        context = message
        if history:
            recent = history[-3:]  # Last 3 messages
            context = " ".join([m["content"] for m in recent]) + " " + message
        # Use LLM to identify memory categories needed
        categorization = await self._categorize_needs(context)
        return categorization

    def _format_memories(self, memories: list[Memory]) -> str:
        """Format memories for injection into prompt."""
        # Group memory contents under one heading per category
        sections: dict[str, list[str]] = {}
        for memory in memories:
            if memory.category not in sections:
                sections[memory.category] = []
            sections[memory.category].append(memory.content)
        output = []
        for category, items in sections.items():
            output.append(f"### {category.upper()}")
            for item in items:
                output.append(f"- {item}")
            output.append("")
        return "\n".join(output)
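The token estimate and truncation steps could look something like the sketch below. Both functions are hypothetical stand-ins for `_estimate_tokens` and `_truncate_context`. The 4-characters-per-token ratio is a rough assumption for English text, not a real tokenizer. Truncation cuts on line boundaries so a memory is never split mid-sentence.

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token estimate: character count divided by a fixed ratio, rounded up."""
    return (len(text) + chars_per_token - 1) // chars_per_token


def truncate_context(text: str, token_limit: int) -> str:
    """Keep whole lines from the top until the estimated budget is used up."""
    kept: list[str] = []
    used = 0
    for line in text.splitlines():
        cost = estimate_tokens(line + "\n")
        if used + cost > token_limit:
            break
        kept.append(line)
        used += cost
    return "\n".join(kept)


memory_context = (
    "### PREFERENCES\n"
    "- Prefers TypeScript\n"
    "- Uses functional style"
)
trimmed = truncate_context(memory_context, token_limit=10)
```

Because lines are kept from the top, this only works well if the memories are already ordered by priority before formatting. For exact budgets, a real tokenizer for the target model would replace the character heuristic.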

Putting It All Together

Here’s the complete flow I now use:

agent_with_memory.py
import os

from langchain_anthropic import ChatAnthropic

from memory_extractor import extract_memories
from memory_injector import MemoryInjector
from memory_store import MemoryStore

DB_URL = os.environ["DATABASE_URL"]  # assumed environment variable name


class StatefulAgent:
    def __init__(self):
        self.store = MemoryStore(DB_URL)
        self.injector = MemoryInjector(self.store)
        self.llm = ChatAnthropic(model="claude-3-sonnet-20240229")

    async def chat(self, user_id: str, message: str) -> str:
        """Process a message with persistent memory."""
        # 1. Get conversation history (if any).
        # get_conversation_history and append_history are not part of the
        # MemoryStore shown earlier; they are assumed to exist.
        history = await self.store.get_conversation_history(user_id)
        # 2. Retrieve relevant memories, using recent history as extra context
        system_prompt = await self.injector.inject_context(
            system_prompt="You are a helpful coding assistant.",
            user_message=message,
            conversation_history=history
        )
        # 3. Generate response
        response = await self.llm.ainvoke([
            {"role": "system", "content": system_prompt},
            *history,
            {"role": "user", "content": message}
        ])
        # 4. Extract and store new memories
        new_memories = extract_memories([
            {"role": "user", "content": message},
            {"role": "assistant", "content": response.content}
        ])
        for memory in new_memories:
            # store() as shown earlier takes only the memory; scoping rows by
            # user_id would need a user_id column in the schema.
            await self.store.store(memory)
        # 5. Update conversation history
        await self.store.append_history(
            user_id,
            [
                {"role": "user", "content": message},
                {"role": "assistant", "content": response.content}
            ]
        )
        return response.content

Key Lessons Learned

After implementing this system, my API costs dropped by 60% and my agent actually remembers things between sessions. Here’s what I learned:

1. Categorize memories. Don’t dump everything into one pile. PROFILE memories rarely change. PREFERENCES need updates. EVENTS become stale quickly. Different categories need different retention policies.

2. Use embeddings for semantic retrieval. Keyword search fails when users express the same concept differently. “I hate verbose code” and “keep functions short” should retrieve the same preference.

3. Token budgets are real. Even with 200K context windows, you can’t stuff everything. I limit memory injection to 4K tokens and use recency + importance scoring to prioritize.

4. Memory extraction is lossy. The LLM won’t capture everything important. I keep raw conversation logs and re-extract memories periodically.

5. Privacy matters. Memory persistence means sensitive data persists too. I added user-level isolation and memory expiration policies.
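The recency + importance scoring from lesson 3 can be sketched as a single priority function. This is an illustration under my own assumptions: the exponential decay, the 30-day half-life, and the names `ScoredMemory`, `priority`, and `select_for_injection` are hypothetical, not the exact formula I run in production.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class ScoredMemory:
    content: str
    importance: int          # 1-10, as assigned at extraction time
    last_accessed: datetime


def priority(memory: ScoredMemory, now: datetime, half_life_days: float = 30.0) -> float:
    """Importance scaled by a recency factor that halves every half_life_days."""
    age_days = max((now - memory.last_accessed).total_seconds() / 86_400, 0.0)
    recency = 0.5 ** (age_days / half_life_days)
    return (memory.importance / 10) * recency


def select_for_injection(
    memories: list[ScoredMemory], now: datetime, max_items: int
) -> list[ScoredMemory]:
    """Return the highest-priority memories, best first."""
    ranked = sorted(memories, key=lambda m: priority(m, now), reverse=True)
    return ranked[:max_items]


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
memories = [
    ScoredMemory("Old but critical", importance=10, last_accessed=now - timedelta(days=30)),
    ScoredMemory("Fresh but minor", importance=4, last_accessed=now),
    ScoredMemory("Old and moderate", importance=8, last_accessed=now - timedelta(days=60)),
]
chosen = select_for_injection(memories, now, max_items=2)
```

A shorter half-life for EVENTS than for PROFILE would be one way to express the per-category retention policies from lesson 1.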

Final Words + More Resources

My intention with this article was to share my knowledge and experience in a way that helps others. If you want to get in touch, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
