Why AI Agents Forget Everything Between Sessions (And How to Fix It)

Mar 21, 2026

I spent 20 minutes explaining my project architecture to Claude yesterday. Today, it asked me what framework I’m using. Again.

“You tell your agent to restructure your portfolio on Monday, explain your risk tolerance, walk through the rationale. Wednesday it asks your risk tolerance again from scratch. 20 minutes and 30K+ tokens gone on something you already discussed.”

This isn’t a bug. It’s the fundamental design of how LLMs work. And until I understood why this happens, I kept trying solutions that were doomed from the start.

The Problem: Stateless by Design

Here’s what I tried first. I created a MEMORY.md file:

# Project Context

## User Preferences
- Prefers TypeScript over JavaScript
- Uses functional programming style
- Wants detailed comments in code

## Current Project
- E-commerce platform
- Tech stack: Next.js, PostgreSQL, Prisma
- Authentication: Clerk

Then I instructed my agent to “always read MEMORY.md before responding.”

This worked for about a week. Then the file grew to 15KB. My agent started skipping sections. Then it started ignoring the file entirely because “it’s too long to read every time.”

The real issue? I was treating a symptom, not the disease.

Why LLMs Can’t Remember

LLMs are fundamentally stateless. Let me show you what actually happens in each session:

┌─────────────────────────────────────────────────────────┐
│  Input: [System Prompt] + [Your Message] + [Context]    │
│                          ↓                              │
│                    LLM Processing                        │
│                          ↓                              │
│  Output: [Response]                                     │
│                                                         │
│  Memory State: NONE (everything discarded after output) │
└─────────────────────────────────────────────────────────┘

When you send a message to Claude or GPT:

1. Your client appends the message to the conversation history it keeps (trimmed to the token limit) and sends the whole thing
2. The API receives that payload
3. The model processes it and generates a response
4. Everything is discarded on the model side after the response is sent

There’s no hidden database storing your preferences. No secret learning happening in the background. Each API call is isolated.
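
To make the isolation concrete, here is a minimal sketch with a stand-in model. `fake_model` and `ChatClient` are hypothetical names invented for illustration, not any real SDK; the only point is that the "model" can use nothing beyond the messages passed into a single call.

```python
from typing import Callable

Message = dict[str, str]


def fake_model(messages: list[Message]) -> str:
    """Stand-in for an LLM call: it can only use what is in `messages`."""
    seen = " ".join(m["content"] for m in messages)
    if "Next.js" in seen:
        return "You are using Next.js."
    return "Which framework are you using?"


class ChatClient:
    """The client, not the model, owns the conversation history."""

    def __init__(self, model: Callable[[list[Message]], str]):
        self.model = model
        self.history: list[Message] = []

    def send(self, text: str) -> str:
        self.history.append({"role": "user", "content": text})
        reply = self.model(self.history)  # full history is re-sent on every call
        self.history.append({"role": "assistant", "content": reply})
        return reply


monday = ChatClient(fake_model)
monday.send("My project uses Next.js with Prisma.")
same_session = monday.send("What framework am I using?")

wednesday = ChatClient(fake_model)  # new session, empty history
new_session = wednesday.send("What framework am I using?")

print(same_session)
print(new_session)
```

Within one session the answer comes back because the client re-sent the earlier message. The new session starts with an empty history, so the same question gets a question in return.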

The Token Limit Trap

I thought larger context windows would solve this. GPT-4 Turbo has a 128K-token context window. Claude has a 200K-token one. That’s a lot of room, right?

Wrong. Here’s what I learned the hard way:

┌──────────────────────────────────────────────────────┐
│  Total Context Window: 128K tokens                   │
│  ─────────────────────────────────────────────────────│
│  System Prompt:           ~2K tokens                  │
│  Conversation History:    ~50K tokens (past sessions)│
│  Current Message:         ~1K tokens                  │
│  ─────────────────────────────────────────────────────│
│  Remaining for Response: ~75K tokens                 │
│                                                       │
│  Problem: Attention quality DEGRADES with length     │
│  Cost: Every token costs money, even old context     │
└──────────────────────────────────────────────────────┘

I ran up a $400 API bill in one month re-sending the same context. The model’s attention also degraded - it would “forget” instructions buried in long conversation histories.
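
A rough way to see how re-sent context adds up: every turn bills the whole accumulated context as input again, so total input tokens grow much faster than the conversation itself. The sketch below is back-of-the-envelope arithmetic only. The function names are made up for illustration, and the price per million tokens is a placeholder assumption, not a quoted rate.

```python
def cumulative_input_tokens(base_context: int, tokens_per_turn: int, turns: int) -> int:
    """Total input tokens billed when every turn re-sends all prior context."""
    total = 0
    context = base_context
    for _ in range(turns):
        context += tokens_per_turn  # the new exchange joins the history
        total += context            # the whole context is billed as input again
    return total


def input_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Convert a token count to dollars at an assumed per-million price."""
    return tokens / 1_000_000 * usd_per_million


# 2K system prompt + 50K history from the diagram, ~1K tokens added per turn
tokens = cumulative_input_tokens(base_context=52_000, tokens_per_turn=1_000, turns=20)
cost = input_cost_usd(tokens, usd_per_million=3.0)  # hypothetical price
print(tokens, cost)
```

Twenty turns on top of a 52K starting context already bill 1.25 million input tokens, almost all of it context the model has seen before.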

Failed Solutions I Tried

Attempt 1: RAG (Retrieval-Augmented Generation)

I built a RAG system to retrieve relevant context:

from langchain.vectorstores import Pinecone
from langchain.embeddings import OpenAIEmbeddings

def get_relevant_context(query: str) -> list[str]:
    """Retrieve relevant documents for context."""
    embeddings = OpenAIEmbeddings()
    vectorstore = Pinecone.from_existing_index(
        index_name="project-memory",
        embedding=embeddings
    )

    results = vectorstore.similarity_search(
        query=query,
        k=5  # top 5 relevant chunks
    )

    return [doc.page_content for doc in results]

The problem? The agent doesn’t know what to retrieve. I still had to manually specify “search for project architecture” or “look up user preferences.” The awareness problem remained.

Attempt 2: Summary Chain

I tried summarizing each conversation and storing it:

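import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def format_messages(messages: list[dict]) -> str:
    """Helper the snippet relies on; one plausible implementation."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

def summarize_conversation(messages: list[dict]) -> str: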
    """Summarize conversation for persistent storage."""
    summary_prompt = f"""
    Summarize the key decisions, preferences, and context from this conversation:
    {format_messages(messages)}

    Focus on:
    - User preferences stated
    - Technical decisions made
    - Constraints established
    - Questions left unresolved
    """

    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1000,
        messages=[{"role": "user", "content": summary_prompt}]
    )

    return response.content[0].text

This helped, but summaries grow stale. “User prefers dark mode” is useful. “User is debugging the auth flow in login.tsx line 47” is obsolete after 2 days.
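
One way to picture the staleness problem is to give each summarized fact an optional time-to-live. This is a sketch under my own assumptions: `SummaryFact`, `fresh_facts`, and the two-day TTL are illustrative names and values, not part of the summary chain above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class SummaryFact:
    text: str
    created_at: datetime
    ttl: Optional[timedelta] = None  # None means the fact does not expire

    def is_stale(self, now: datetime) -> bool:
        return self.ttl is not None and now - self.created_at > self.ttl


def fresh_facts(facts: list[SummaryFact], now: datetime) -> list[SummaryFact]:
    """Drop facts whose time-to-live has elapsed."""
    return [f for f in facts if not f.is_stale(now)]


saved = datetime(2026, 3, 16, tzinfo=timezone.utc)
facts = [
    SummaryFact("User prefers dark mode", saved),
    SummaryFact(
        "User is debugging the auth flow in login.tsx line 47",
        saved,
        ttl=timedelta(days=2),
    ),
]

kept = fresh_facts(facts, now=saved + timedelta(days=3))
print([f.text for f in kept])
```

A flat summary string has no place to record which facts are durable and which are transient, which is the gap the structured approach in the next section tries to close.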

The Solution: Structured Persistent Memory

After months of frustration, I studied how production AI agent systems handle this. They use a three-component architecture:

+------------------+     +-------------------+     +------------------+
|   Memory Extract | --> |   Memory Store    | --> |  Memory Inject   |
|   (What to save) |     |   (Where to save) |     |  (When to use)   |
+------------------+     +-------------------+     +------------------+
        │                        │                        │
        ↓                        ↓                        ↓
   Classification           Database Query          Context Assembly
   Extraction              CRUD Operations         Token Management

Component 1: Memory Extraction

First, I needed to identify what to remember. Not everything matters:

from datetime import datetime
from enum import Enum
from pydantic import BaseModel

class MemoryCategory(str, Enum):
    PROFILE = "profile"         # User info: name, role, company
    PREFERENCES = "preferences" # Aggregated by topic
    ENTITIES = "entities"       # Projects, people, concepts
    EVENTS = "events"           # Decisions, milestones
    CASES = "cases"             # Specific problems + solutions
    PATTERNS = "patterns"       # Reusable processes/methods

class Memory(BaseModel):
    id: str
    category: MemoryCategory
    content: str
    metadata: dict
    created_at: str
    last_accessed: str
    access_count: int = 0

def extract_memories(conversation: list[dict]) -> list[Memory]:
    """Extract structured memories from conversation."""
    extraction_prompt = """
    Analyze this conversation and extract structured memories.

    Categories:
    - PROFILE: User identity information
    - PREFERENCES: Stated preferences (coding style, tools, etc.)
    - ENTITIES: Named items (projects, APIs, libraries)
    - EVENTS: Decisions made, milestones reached
    - CASES: Problem-solution pairs
    - PATTERNS: Reusable workflows discovered

    For each memory, provide:
    1. Category
    2. Content (concise, factual)
    3. Confidence (0.0-1.0: how certain is this memory)
    4. Importance (1-10 scale)
    """

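    # Call LLM to extract memories. call_extraction_model is a helper (not shown)
    # that returns a list of dicts with category, content, confidence, importance.
    extracted = call_extraction_model(conversation, extraction_prompt)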

    # Filter by confidence and importance
    return [
        Memory(
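            id=generate_id(),  # helper not shown; e.g. str(uuid.uuid4())
            category=m["category"].lower(),  # enum values are lowercase
            content=m["content"],
            metadata={"confidence": m["confidence"], "importance": m["importance"]},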
            created_at=datetime.utcnow().isoformat(),
            last_accessed=datetime.utcnow().isoformat()
        )
        for m in extracted
        if m["confidence"] > 0.7 and m["importance"] > 3
    ]
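
The extraction step depends on the model returning something machine-readable. As a sketch, assuming the extraction model is asked to reply with a JSON array of objects, a defensive parser might look like the following. `parse_extraction` is a hypothetical helper, and the JSON shape is my assumption rather than something the prompt above enforces.

```python
import json

VALID_CATEGORIES = {"profile", "preferences", "entities", "events", "cases", "patterns"}


def parse_extraction(raw: str, min_confidence: float = 0.7, min_importance: int = 3) -> list[dict]:
    """Parse the extraction model's JSON reply and keep only usable items."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []  # malformed reply: store nothing rather than guess
    if not isinstance(items, list):
        return []

    kept = []
    for item in items:
        if not isinstance(item, dict):
            continue
        category = str(item.get("category", "")).lower()
        content = str(item.get("content", "")).strip()
        try:
            confidence = float(item.get("confidence", 0))
            importance = int(item.get("importance", 0))
        except (TypeError, ValueError):
            continue  # non-numeric scores: skip the item
        if category not in VALID_CATEGORIES or not content:
            continue
        if confidence > min_confidence and importance > min_importance:
            kept.append({
                "category": category,
                "content": content,
                "confidence": confidence,
                "importance": importance,
            })
    return kept


reply = json.dumps([
    {"category": "PREFERENCES", "content": "Prefers TypeScript", "confidence": 0.9, "importance": 7},
    {"category": "EVENTS", "content": "Mentioned lunch", "confidence": 0.4, "importance": 2},
    {"category": "MOOD", "content": "Seems tired", "confidence": 0.95, "importance": 8},
])
print(parse_extraction(reply))
```

Only the first item survives: the second fails the confidence and importance thresholds, and the third uses a category that does not exist in the schema.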

Component 2: Memory Storage

Next, I needed a database that supports efficient retrieval:

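-- pgvector must be enabled before the vector column type can be used
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE memories (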
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    category VARCHAR(50) NOT NULL,
    content TEXT NOT NULL,
    embedding vector(1536),  -- OpenAI embedding dimension
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    last_accessed TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    access_count INTEGER DEFAULT 0,
    importance INTEGER DEFAULT 5
);

-- Index for fast category filtering
CREATE INDEX idx_memories_category ON memories(category);

-- Index for vector similarity search
CREATE INDEX idx_memories_embedding ON memories
USING ivfflat (embedding vector_cosine_ops);

-- Index for recency-based queries
CREATE INDEX idx_memories_accessed ON memories(last_accessed DESC);

And the Python wrapper around that table:

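import json
from typing import Optional

import asyncpg
from openai import AsyncOpenAI
from pgvector.asyncpg import register_vector  # pip install pgvector

class MemoryStore:
    def __init__(self, db_url: str):
        self.db_url = db_url
        self.client = AsyncOpenAI()

    async def _connect(self) -> asyncpg.Connection:
        # asyncpg.connect() is awaited, not used as a context manager,
        # so each caller closes the connection in a finally block.
        conn = await asyncpg.connect(self.db_url)
        await register_vector(conn)  # teach asyncpg the pgvector type
        return conn

    async def store(self, memory: Memory) -> None:
        """Store a memory with its embedding."""
        embedding = await self._get_embedding(memory.content)

        conn = await self._connect()
        try:
            await conn.execute("""
                INSERT INTO memories (category, content, embedding, metadata)
                VALUES ($1, $2, $3, $4)
            """, memory.category.value, memory.content, embedding,
                json.dumps(memory.metadata))  # asyncpg expects JSONB as a string
        finally:
            await conn.close()

    async def retrieve_relevant(
        self,
        query: str,
        categories: Optional[list[MemoryCategory]] = None,
        limit: int = 10
    ) -> list[Memory]:
        """Retrieve memories relevant to a query."""
        query_embedding = await self._get_embedding(query)
        category_values = [c.value for c in categories] if categories else None

        conn = await self._connect()
        try:
            # Ordering by distance (ascending) is equivalent to ordering by
            # similarity (descending) and lets the ivfflat index be used.
            rows = await conn.fetch("""
                SELECT id, category, content, metadata, created_at,
                       last_accessed, access_count,
                       1 - (embedding <=> $1) AS similarity
                FROM memories
                WHERE ($2::varchar[] IS NULL OR category = ANY($2))
                ORDER BY embedding <=> $1
                LIMIT $3
            """, query_embedding, category_values, limit)

            # Update access statistics in one statement
            await conn.execute("""
                UPDATE memories
                SET last_accessed = NOW(),
                    access_count = access_count + 1
                WHERE id = ANY($1::uuid[])
            """, [row["id"] for row in rows])
        finally:
            await conn.close()

        # Convert database types to the string fields Memory declares
        return [
            Memory(
                id=str(row["id"]),
                category=row["category"],
                content=row["content"],
                metadata=json.loads(row["metadata"]),
                created_at=row["created_at"].isoformat(),
                last_accessed=row["last_accessed"].isoformat(),
                access_count=row["access_count"],
            )
            for row in rows
        ]

    async def _get_embedding(self, text: str) -> list[float]:
        response = await self.client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding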
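
For readers who have not used pgvector: the `<=>` operator in the query above is cosine distance, which is why the query computes similarity as one minus that value. Here is a plain-Python sketch of the same arithmetic, with tiny hand-made vectors standing in for real 1536-dimension embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(a, b) = (a . b) / (|a| * |b|); 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a: list[float], b: list[float]) -> float:
    """The quantity pgvector's <=> operator returns: 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))   # opposite direction
```

Because only direction matters, vector length has no effect on the result, so two texts of very different size can still score as close.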

Component 3: Memory Injection

Finally, I needed to inject relevant memories at the right time:

from typing import Optional

class MemoryInjector:
    def __init__(self, memory_store: MemoryStore, token_limit: int = 4000):
        self.store = memory_store
        self.token_limit = token_limit

    async def inject_context(
        self,
        system_prompt: str,
        user_message: str,
        conversation_history: Optional[list[dict]] = None
    ) -> str:
        """Build augmented system prompt with relevant memories."""

        # Analyze what memories are needed
        memory_query = await self._determine_memory_needs(
            user_message,
            conversation_history
        )

        # Retrieve relevant memories
        memories = await self.store.retrieve_relevant(
            query=memory_query,
            limit=20  # Limit to avoid token overflow
        )

        # Format memories for injection
        memory_context = self._format_memories(memories)

        # Calculate token usage
        memory_tokens = self._estimate_tokens(memory_context)

        if memory_tokens > self.token_limit:
            # Truncate if necessary
            memory_context = self._truncate_context(
                memory_context,
                self.token_limit
            )

        # Inject into system prompt
        augmented_prompt = f"""
{system_prompt}

## Persistent Memory

{memory_context}

## End Persistent Memory
"""
        return augmented_prompt

    async def _determine_memory_needs(
        self,
        message: str,
        history: Optional[list[dict]] = None
    ) -> str:
        """Determine what memories to retrieve based on context."""
        # Combine current message with recent history
        context = message
        if history:
            recent = history[-3:]  # Last 3 messages
            context = " ".join([m["content"] for m in recent]) + " " + message

        # Use LLM to turn the context into a retrieval query
        # (_categorize_needs is a helper not shown here)
        categorization = await self._categorize_needs(context)

        return categorization

    def _format_memories(self, memories: list[Memory]) -> str:
        """Format memories for injection into prompt."""
        sections = {}

        for memory in memories:
            if memory.category not in sections:
                sections[memory.category] = []
            sections[memory.category].append(memory.content)

        output = []
        for category, items in sections.items():
            output.append(f"### {category.upper()}")
            for item in items:
                output.append(f"- {item}")
            output.append("")

        return "\n".join(output)
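
The injector calls `_estimate_tokens` and `_truncate_context` without showing them. Below is one possible sketch of both, written as standalone functions. It assumes a rough heuristic of about four characters per token for English text (a real tokenizer would be more accurate) and truncates on line boundaries so no memory bullet is cut in half.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token, rounded up."""
    return (len(text) + 3) // 4


def truncate_context(memory_context: str, token_limit: int) -> str:
    """Keep whole lines from the top until the token budget is used up."""
    kept: list[str] = []
    used = 0
    for line in memory_context.split("\n"):
        cost = estimate_tokens(line + "\n")
        if used + cost > token_limit:
            break
        kept.append(line)
        used += cost
    return "\n".join(kept)


context = "### PREFERENCES\n- Prefers TypeScript over JavaScript\n- Uses functional programming style"
print(estimate_tokens(context))
print(truncate_context(context, token_limit=15))
```

Truncating from the bottom only makes sense if the memories are already ordered by priority, so a sketch like this would be paired with some ranking step before formatting.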

Putting It All Together

Here’s the complete flow I now use:

import os

from langchain_anthropic import ChatAnthropic

DB_URL = os.environ["DATABASE_URL"]  # assumed environment variable name

class StatefulAgent:
    def __init__(self):
        self.store = MemoryStore(DB_URL)
        self.injector = MemoryInjector(self.store)
        self.llm = ChatAnthropic(model="claude-3-sonnet-20240229")

    async def chat(self, user_id: str, message: str) -> str:
        """Process a message with persistent memory."""

        # 1. Get conversation history (if any).
        # get_conversation_history and append_history are MemoryStore
        # methods not shown above.
        history = await self.store.get_conversation_history(user_id)

        # 2. Retrieve relevant memories and build the system prompt
        system_prompt = await self.injector.inject_context(
            system_prompt="You are a helpful coding assistant.",
            user_message=message,
            conversation_history=history
        )

        # 3. Generate response
        response = await self.llm.ainvoke([
            {"role": "system", "content": system_prompt},
            *history,
            {"role": "user", "content": message}
        ])

        # 4. Extract and store new memories
        new_turns = [
            {"role": "user", "content": message},
            {"role": "assistant", "content": response.content}
        ]
        # store() as defined above takes only the memory; scoping memories
        # by user_id would need an extra column in the table.
        for memory in extract_memories(new_turns):
            await self.store.store(memory)

        # 5. Update conversation history
        await self.store.append_history(user_id, new_turns)

        return response.content

Key Lessons Learned

After implementing this system, my API costs dropped by 60% and my agent actually remembers things between sessions. Here’s what I learned:

1. Categorize memories - don’t dump everything into one pile. PROFILE memories rarely change. PREFERENCES need updates. EVENTS become stale quickly. Different categories need different retention policies.

2. Use embeddings for semantic retrieval. Keyword search fails when users express the same concept differently. “I hate verbose code” and “keep functions short” should retrieve the same preference.

3. Token budgets are real. Even with 200K context windows, you can’t stuff everything. I limit memory injection to 4K tokens and use recency + importance scoring to prioritize.

4. Memory extraction is lossy. The LLM won’t capture everything important. I keep raw conversation logs and re-extract memories periodically.

5. Privacy matters. Memory persistence means sensitive data persists too. I added user-level isolation and memory expiration policies.
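
Lesson 3 mentions recency + importance scoring without showing it. Here is a minimal sketch of one way to combine the two. The multiplicative formula, the 30-day half-life, and the names `ScoredMemory`, `priority`, and `top_memories` are all illustrative assumptions, not the exact scoring I run in production.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class ScoredMemory:
    content: str
    importance: int          # 1-10, as in the extraction prompt
    last_accessed: datetime


def priority(memory: ScoredMemory, now: datetime, half_life_days: float = 30.0) -> float:
    """Importance (scaled to 0-1) times an exponential recency decay."""
    age_days = max((now - memory.last_accessed).total_seconds() / 86_400, 0.0)
    recency = 0.5 ** (age_days / half_life_days)  # halves every half_life_days
    return (memory.importance / 10) * recency


def top_memories(memories: list[ScoredMemory], now: datetime, k: int) -> list[ScoredMemory]:
    """Return the k highest-priority memories, best first."""
    return sorted(memories, key=lambda m: priority(m, now), reverse=True)[:k]


now = datetime(2026, 3, 21, tzinfo=timezone.utc)
memories = [
    ScoredMemory("Chose Clerk for authentication", 10, now - timedelta(days=30)),
    ScoredMemory("Prefers TypeScript over JavaScript", 6, now),
    ScoredMemory("Fixed a Prisma migration issue", 10, now - timedelta(days=90)),
]

for m in top_memories(memories, now, k=2):
    print(m.content, round(priority(m, now), 3))
```

With these numbers a moderately important memory touched today outranks a critical one that has sat unused for a month, and the 90-day-old item drops out of the top two.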

Final Words + More Resources

My intention with this article was to share my knowledge and experience in the hope that it helps others. If you want to get in touch, you can contact me by email.
