Gmail Thread Token Bloat: How I Cut 4x Token Costs When Building AI Email Agents

Mar 17, 2026

Problem

I was building an AI agent to process Gmail threads and summarize conversations. After deploying to production, I noticed something strange: my API costs were way higher than expected.

A thread with ~11K tokens of unique content was consuming ~47K tokens per request.

Thread ID: 18a7b3c2d1e4f5g6
Unique content: 11,234 tokens
Raw API input: 47,891 tokens
Cost multiplier: 4.26x

I was paying more than 4x what I should have. What was going on?

What Happened

I checked my Gmail API response and realized the problem immediately.

Message 1 (original):
"Hey, can we schedule a meeting for next Tuesday?"

Message 2 (reply):
"Sure, Tuesday works for me.

On Mon, Mar 10, 2026 at 9:00 AM, John wrote:
> Hey, can we schedule a meeting for next Tuesday?"

Message 3 (reply):
"Great, let's do 2pm.

On Mon, Mar 10, 2026 at 9:15 AM, Jane wrote:
> Sure, Tuesday works for me.
>
> On Mon, Mar 10, 2026 at 9:00 AM, John wrote:
> > Hey, can we schedule a meeting for next Tuesday?"

Every reply includes the full quoted history. Gmail’s API returns raw message bodies with all that quoted text.

For a 20-message thread, the math is brutal:

Message 1:  1 copy of original
Message 2:  1 copy of reply + 1 copy of message 1
Message 3:  1 copy of reply + 1 copy of message 2 (includes msg 1)
...
Message 20: 1 copy of reply + 19 copies of previous messages

That’s 20 copies of message 1, 19 copies of message 2, 18 copies of message 3, and so on: 210 message copies in total instead of 20. The token count grows quadratically with thread length rather than linearly.

Why This Happens

Gmail’s design makes sense from a user’s perspective: quoted replies give readers context without scrolling. For AI agents, though, it is a disaster. Here is the naive code I started with:

from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build

def get_thread_messages(thread_id: str, creds: Credentials) -> list[str]:
    """Fetch all messages in a Gmail thread."""
    service = build('gmail', 'v1', credentials=creds)
    thread = service.users().threads().get(
        userId='me',
        id=thread_id
    ).execute()

    messages = []
    for msg in thread['messages']:
        # Get message body - includes ALL quoted history
        payload = msg['payload']
        body = extract_body(payload)  # Returns raw body with quotes
        messages.append(body)

    return messages  # Every message has full history attached!

def process_with_agent(messages: list[str]):
    """Feed raw messages to AI agent."""
    full_context = '\n\n'.join(messages)  # 4x token bloat here!
    response = llm.invoke(f"Summarize this thread:\n{full_context}")
    return response

I was feeding the agent a concatenated string in which the same content appeared over and over, up to 20 times for the first message of a 20-message thread.
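The snippets in this post call `extract_body` without showing it. The following is a minimal sketch of what such a helper might look like, not the exact one used here. It assumes the standard Gmail API payload shape (`mimeType`, `body.data` as URL-safe base64, nested `parts`) and assumes UTF-8 text, which will not hold for every message.

```python
import base64

def extract_body(payload: dict) -> str:
    """Return the first text/plain body found in a Gmail message payload."""
    # Single-part case: the encoded text sits directly on this payload.
    if payload.get("mimeType") == "text/plain":
        data = payload.get("body", {}).get("data", "")
        # Body data is URL-safe base64; restore padding in case it was stripped.
        padded = data + "=" * (-len(data) % 4)
        # Assumption: the text is UTF-8. Undecodable bytes are replaced.
        return base64.urlsafe_b64decode(padded).decode("utf-8", errors="replace")

    # Multipart case: search nested parts depth-first for plain text.
    for part in payload.get("parts", []):
        text = extract_body(part)
        if text:
            return text
    return ""
```

Preferring `text/plain` over `text/html` keeps markup out of the prompt, at the cost of returning an empty string for HTML-only messages.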

The cost impact was significant:

Thread: 20 messages, 200 tokens unique content each
Naive approach: 20 + 19 + 18 + ... + 1 = 210 copies = 42,000 tokens
Optimized approach: 20 * 200 = 4,000 tokens
Savings: 38,000 tokens (90% reduction)

At $10 per million tokens, that’s $0.38 wasted per thread. Across 10,000 threads per month, it adds up to $3,800 in avoidable API spend.
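The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the idealized model (every message the same size, every reply quoting the full history); the function names are illustrative and real threads will deviate from it.

```python
def naive_copies(n_messages: int) -> int:
    """Total message copies when every reply quotes the full history."""
    # Message k carries k copies (itself plus k-1 quoted), so the sum is n(n+1)/2.
    return n_messages * (n_messages + 1) // 2

def thread_cost(n_messages: int, tokens_per_message: int, usd_per_million: float) -> dict:
    """Compare naive and deduplicated token counts for one thread."""
    naive_tokens = naive_copies(n_messages) * tokens_per_message
    deduped_tokens = n_messages * tokens_per_message
    wasted_tokens = naive_tokens - deduped_tokens
    return {
        "naive_tokens": naive_tokens,
        "deduped_tokens": deduped_tokens,
        "wasted_tokens": wasted_tokens,
        "wasted_usd": wasted_tokens * usd_per_million / 1_000_000,
    }

print(thread_cost(20, 200, 10.0))
```

Because the naive total follows n(n+1)/2, doubling the thread length roughly quadruples the input size.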

The Solution

I needed to strip quoted content before feeding messages to the agent. Here’s my approach:

Step 1: Detect Quote Patterns

Different email clients format quotes differently:

import re

QUOTE_PATTERNS = [
    # Gmail-style: "On [date], [name] wrote:"
    r'^On.*wrote:.*$',
    # Standard quote prefix
    r'^>\s*.*$',
    # Outlook-style: "-----Original Message-----"
    r'^\-{4,}.*Original Message.*\-{4,}$',
    # Forward header
    r'^From:.*$',
    r'^Sent:.*$',
    r'^To:.*$',
    r'^Subject:.*$',
    # Apple Mail style
    r'^On [A-Z][a-z]+ \d+, \d+, at \d+:\d+.*,',
]

def is_quote_line(line: str) -> bool:
    """Check if a line is part of quoted history."""
    stripped = line.strip()
    if not stripped:
        return False

    for pattern in QUOTE_PATTERNS:
        if re.match(pattern, stripped, re.IGNORECASE):
            return True
    return False
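Before relying on these patterns, it helps to run them against a handful of sample lines. The sketch below folds a trimmed copy of the patterns into one precompiled regex (the name `is_quote_line_fast` is made up for this example) and shows that header-style patterns such as `To:` can also match text the sender actually wrote.

```python
import re

# Trimmed copy of the patterns above, merged into a single precompiled regex.
QUOTE_RE = re.compile(
    r"^(?:On.*wrote:.*"
    r"|>.*"
    r"|-{4,}.*Original Message.*-{4,}"
    r"|(?:From|Sent|To|Subject):.*)$",
    re.IGNORECASE,
)

def is_quote_line_fast(line: str) -> bool:
    """Same idea as is_quote_line, but with one compiled pattern."""
    stripped = line.strip()
    return bool(stripped) and QUOTE_RE.match(stripped) is not None

samples = [
    "On Mon, Mar 10, 2026 at 9:00 AM, John wrote:",
    "> Hey, can we schedule a meeting for next Tuesday?",
    "-----Original Message-----",
    "Sure, Tuesday works for me.",
    "To: be clear, Tuesday is fine.",  # New content, yet it matches the To: pattern
]
for sample in samples:
    print(is_quote_line_fast(sample), repr(sample))
```

The last sample is a false positive, which is a reason to log what gets stripped while tuning the patterns.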

Step 2: Extract Unique Content

Now I can strip quoted sections:

import re
from typing import List

def extract_unique_content(message_body: str) -> str:
    """
    Extract only new content from a message, removing quoted history.
    Returns the actual content the sender wrote.
    """
    lines = message_body.split('\n')
    unique_lines = []
    in_quote_block = False

    for line in lines:
        stripped = line.strip()

        # A '>' line, an "On ... wrote:" attribution, or a forward header
        # starts (or continues) a quote block. is_quote_line() covers all of these.
        if is_quote_line(line):
            in_quote_block = True
            continue

        # Leave the quote block only on non-trivial text, so blank lines and
        # short fragments between quoted lines are not mistaken for new content.
        # Trade-off: a short genuine reply after a quote (e.g. "Thanks!") is dropped.
        if in_quote_block and len(stripped) > 20:
            in_quote_block = False

        if not in_quote_block:
            unique_lines.append(line)

    return '\n'.join(unique_lines).strip()
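A simpler alternative, shown here only as a sketch and not as the approach used in this post, is to cut the body at the first attribution or `>` line. It assumes top-posted replies, where new text always sits above the quoted history; inline replies written between quoted lines would be lost. The names `ATTRIBUTION_RE` and `cut_at_first_quote` are illustrative.

```python
import re

# Matches attribution lines such as "On Mon, Mar 10, 2026 at 9:15 AM, Jane wrote:"
ATTRIBUTION_RE = re.compile(r"^On .+ wrote:\s*$", re.IGNORECASE)

def cut_at_first_quote(body: str) -> str:
    """Keep only the text above the first attribution or '>' line."""
    kept = []
    for line in body.splitlines():
        stripped = line.strip()
        if ATTRIBUTION_RE.match(stripped) or stripped.startswith(">"):
            break  # Everything below is assumed to be quoted history.
        kept.append(line)
    return "\n".join(kept).strip()

message_3 = (
    "Great, let's do 2pm.\n"
    "\n"
    "On Mon, Mar 10, 2026 at 9:15 AM, Jane wrote:\n"
    "> Sure, Tuesday works for me.\n"
    ">\n"
    "> On Mon, Mar 10, 2026 at 9:00 AM, John wrote:\n"
    "> > Hey, can we schedule a meeting for next Tuesday?"
)
print(cut_at_first_quote(message_3))
```

For the three-message example from earlier, this keeps only the line the sender wrote in message 3.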

Step 3: Process Thread for Agent

Now I can process an entire thread:

from dataclasses import dataclass

@dataclass
class ProcessedMessage:
    sender: str
    timestamp: str
    content: str  # Deduplicated content only

def process_thread_for_agent(messages: list[dict]) -> str:
    """
    Process Gmail thread into deduplicated format for AI agents.
    Returns clean chronological content with ~4x token reduction.
    """
    processed: list[ProcessedMessage] = []

    for msg in messages:
        # Extract message metadata
        headers = {h['name']: h['value'] for h in msg['payload']['headers']}
        sender = headers.get('From', 'Unknown')
        timestamp = headers.get('Date', '')

        # Get and deduplicate body
        body = extract_body(msg['payload'])
        unique_content = extract_unique_content(body)

        if unique_content:  # Only include if there's actual content
            processed.append(ProcessedMessage(
                sender=sender,
                timestamp=timestamp,
                content=unique_content
            ))

    return format_for_context(processed)

def format_for_context(messages: list[ProcessedMessage]) -> str:
    """Format deduplicated messages for optimal AI agent context."""
    formatted = []
    for msg in messages:
        formatted.append(
            f"[{msg.timestamp}]\n"
            f"From: {msg.sender}\n"
            f"{msg.content}"
        )

    return '\n\n---\n\n'.join(formatted)

Step 4: Use Message Headers for Structure (Better Approach)

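Quote detection alone cannot tell me who replied to whom. For a more reliable thread structure, I use the standard email threading headers (In-Reply-To and References) that the Gmail API returns with each message: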

def build_thread_graph(messages: list[dict]) -> dict:
    """
    Use Gmail headers to build proper thread structure.
    More reliable than quote detection.
    """
    thread_map = {}

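    for msg in messages:
        # Header names are case-insensitive, so normalize them before lookup
        headers = {h['name'].lower(): h['value'] for h in msg['payload']['headers']}
        # In-Reply-To and References hold RFC 5322 Message-IDs, not Gmail API ids,
        # so key the map by the Message-ID header (falling back to the API id)
        msg_id = headers.get('message-id', msg['id']).strip()
        in_reply_to = headers.get('in-reply-to', '').strip()
        references = headers.get('references', '').split()

        thread_map[msg_id] = {
            'content': extract_unique_content(extract_body(msg['payload'])),
            'sender': headers.get('from', 'Unknown'),
            'date': headers.get('date', ''),
            'parent_id': in_reply_to,
            'references': references,
            'children': []
        }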

    # Build parent-child relationships
    for msg_id, data in thread_map.items():
        parent_id = data['parent_id']
        if parent_id and parent_id in thread_map:
            thread_map[parent_id]['children'].append(msg_id)

    return thread_map

def reconstruct_thread(thread_map: dict) -> list[dict]:
    """Flatten the thread depth-first so each reply follows its parent."""
    # Find root messages (no parent)
    roots = [msg_id for msg_id, data in thread_map.items()
             if not data['parent_id'] or data['parent_id'] not in thread_map]

    result = []
    for root_id in roots:
        result.extend(traverse_thread(root_id, thread_map))

    return result

def traverse_thread(msg_id: str, thread_map: dict) -> list[dict]:
    """Traverse thread tree depth-first."""
    data = thread_map[msg_id]
    result = [{
        'sender': data['sender'],
        'date': data['date'],
        'content': data['content']
    }]

    for child_id in data['children']:
        result.extend(traverse_thread(child_id, thread_map))

    return result
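Depth-first traversal yields reply order, which can differ from send order when a thread branches. If the agent needs strict chronology, the flattened list can be sorted by its Date header. The sketch below uses the standard library's `email.utils.parsedate_to_datetime`; the function name `sort_by_date` and the choice to push unparseable dates to the end are assumptions made for this example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Sentinel for messages whose Date header is missing or malformed.
_UNKNOWN_DATE = datetime(9999, 1, 1, tzinfo=timezone.utc)

def sort_by_date(flat_messages: list[dict]) -> list[dict]:
    """Sort traversal output by the RFC 2822 date stored under 'date'."""
    def sort_key(msg: dict) -> datetime:
        try:
            parsed = parsedate_to_datetime(msg.get("date", ""))
        except (TypeError, ValueError):
            return _UNKNOWN_DATE  # Unparseable dates sort last
        if parsed.tzinfo is None:
            # Treat zone-less timestamps as UTC so all keys are comparable.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed

    return sorted(flat_messages, key=sort_key)

flat = [
    {"sender": "Jane", "date": "Tue, 10 Mar 2026 09:15:00 -0700", "content": "Sure, Tuesday works."},
    {"sender": "John", "date": "Tue, 10 Mar 2026 09:00:00 -0700", "content": "Can we meet?"},
    {"sender": "Ann", "date": "", "content": "No date header"},
]
for msg in sort_by_date(flat):
    print(msg["sender"], msg["date"])
```

`sorted` is stable, so messages with identical or unknown dates keep their traversal order.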

Results

After implementing deduplication:

Before:
- 20-message thread: 47,891 tokens
- Cost per thread: $0.48
- Monthly cost (10k threads): $4,800

After:
- 20-message thread: 11,234 tokens
- Cost per thread: $0.11
- Monthly cost (10k threads): $1,100

Savings: $3,700/month (77% reduction)
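To check numbers like these on your own threads, compare the raw bodies with the deduplicated ones. The sketch below uses a crude estimate of about four characters per token, which is only a rule of thumb for English text and an assumption here; swap in your model's tokenizer for real figures. Both function names are illustrative.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate (assumption: about 4 characters per token)."""
    return len(text) // 4

def reduction_report(raw_bodies: list[str], deduped_bodies: list[str]) -> dict:
    """Summarize how much deduplication shrank a thread."""
    raw = sum(estimate_tokens(body) for body in raw_bodies)
    deduped = sum(estimate_tokens(body) for body in deduped_bodies)
    if raw == 0 or deduped == 0:
        # Nothing meaningful to compare.
        return {"raw_tokens": raw, "deduped_tokens": deduped,
                "multiplier": None, "reduction_pct": None}
    return {
        "raw_tokens": raw,
        "deduped_tokens": deduped,
        "multiplier": round(raw / deduped, 2),
        "reduction_pct": round(100 * (raw - deduped) / raw, 1),
    }

print(reduction_report(["a" * 400, "a" * 800], ["a" * 400, "a" * 400]))
```

Tracking the multiplier per thread also makes regressions visible when a new mail client format slips past the quote patterns.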

Handling Forwarded Messages

Forwarded chains are trickier. They collapse multiple messages into a single body, and the only boundaries are the text delimiters each mail client inserts.

def handle_forwarded_chains(body: str) -> list[str]:
    """
    Handle forwarded messages that embed entire conversations.
    Returns list of individual message bodies.
    """
    # Common forward delimiters
    forward_markers = [
        '---------- Forwarded message ----------',
        'Begin forwarded message:',
        '-----Original Message-----',
    ]

    segments = [body]

    for marker in forward_markers:
        new_segments = []
        for segment in segments:
            parts = segment.split(marker)
            new_segments.extend(parts)
        segments = new_segments

    return [s.strip() for s in segments if s.strip()]
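Exact string matching breaks as soon as a client uses a different number of dashes or different spacing around the marker text. A regex-based variant is more forgiving. This is a sketch under the assumption that markers sit on their own line with at least five dashes on each side; `FORWARD_RE` and `split_forwarded` are names invented for the example.

```python
import re

# One pattern for the three delimiters above, tolerant of dash count,
# spacing, and letter case. Markers must occupy a whole line.
FORWARD_RE = re.compile(
    r"^[ \t]*(?:-{5,}\s*Forwarded message\s*-{5,}"
    r"|Begin forwarded message:"
    r"|-{5,}\s*Original Message\s*-{5,})\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def split_forwarded(body: str) -> list[str]:
    """Split a body into segments at forward delimiters, dropping empty ones."""
    return [seg.strip() for seg in FORWARD_RE.split(body) if seg.strip()]

body = (
    "FYI, see below.\n"
    "\n"
    "---------- Forwarded message ---------\n"
    "From: John\n"
    "\n"
    "Hey, can we meet?\n"
    "\n"
    "-----Original Message-----\n"
    "Older text"
)
print(split_forwarded(body))
```

The pattern uses only non-capturing groups, so `re.split` returns the segments without the delimiters themselves.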

Common Pitfalls

When implementing this, I hit several issues:

Pitfall 1: Over-aggressive quote stripping

# BAD: Strips too much
if '>' in line:  # Too simple!
    continue

# GOOD: Context-aware stripping
if is_quote_line(line) and not is_code_block(line):
    continue
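The `is_code_block` call above is not defined in this post, and a single line cannot reveal on its own whether it sits inside a code block. One way to approach it, sketched here under the assumption that code in the email is wrapped in triple-backtick fences (many emails are not), is to track fence state while iterating. The function name is hypothetical.

```python
def strip_quotes_keep_code(body: str) -> str:
    """Drop '>' quote lines, except inside triple-backtick code fences."""
    kept = []
    in_code = False
    for line in body.splitlines():
        if line.strip().startswith("```"):
            in_code = not in_code  # Entering or leaving a fenced block
            kept.append(line)
            continue
        # Only a leading '>' counts as a quote, and only outside code.
        if not in_code and line.lstrip().startswith(">"):
            continue
        kept.append(line)
    return "\n".join(kept)

email_body = (
    "Use this:\n"
    "```\n"
    "> echo hi\n"
    "if a > b:\n"
    "```\n"
    "> old quoted text\n"
    "Thanks"
)
print(strip_quotes_keep_code(email_body))
```

Checking only the start of the line already avoids the `'>' in line` mistake, since comparisons such as `a > b` never begin with the marker.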

Pitfall 2: Missing client-specific formats

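# Add more patterns as you encounter them.
# The date part is matched loosely because attribution lines usually put
# a time (and sometimes a weekday) between the date and the verb.
QUOTE_PATTERNS.extend([
    r'^Am .* schrieb .*:',  # German, e.g. "Am 10.03.2026 um 09:00 schrieb John:"
    r'^Le .* a [ée]crit\s*:',  # French, with or without the accent and the space before ":"
    r'^Il .* ha scritto\s*:',  # Italian
])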

Pitfall 3: Not preserving thread structure

The deduplication must preserve the conversational flow. I learned to include metadata:

# Include who said what
formatted.append(
    f"[{msg.timestamp}] {msg.sender}:\n{msg.content}"
)

Summary

Gmail thread token bloat is a hidden cost multiplier. By implementing a preprocessing layer that strips quoted history before feeding messages to your AI agent, you can:

- Reduce token usage by 4x
- Lower API costs by 77%
- Improve agent response quality (less noise in context)
- Stay under rate limits more easily

The key is to treat Gmail’s raw output as pre-processing input, not as ready-to-consume agent context. Your token budget will thank you.
