Gmail Thread Token Bloat: How I Cut 4x Token Costs When Building AI Email Agents
Problem
I was building an AI agent to process Gmail threads and summarize conversations. After deploying to production, I noticed something strange: my API costs were way higher than expected.
A thread with ~11K tokens of unique content was consuming ~47K tokens per request.
    Thread ID: 18a7b3c2d1e4f5g6
    Unique content: 11,234 tokens
    Raw API input: 47,891 tokens
    Cost multiplier: 4.26x

I was paying more than 4x what I should have. What was going on?
What Happened
I checked my Gmail API response and realized the problem immediately.
Message 1 (original):

    Hey, can we schedule a meeting for next Tuesday?

Message 2 (reply):

    Sure, Tuesday works for me.

    On Mon, Mar 10, 2026 at 9:00 AM, John wrote:
    > Hey, can we schedule a meeting for next Tuesday?

Message 3 (reply):

    Great, let's do 2pm.

    On Mon, Mar 10, 2026 at 9:15 AM, Jane wrote:
    > Sure, Tuesday works for me.
    >
    > On Mon, Mar 10, 2026 at 9:00 AM, John wrote:
    > > Hey, can we schedule a meeting for next Tuesday?

Every reply includes the full quoted history. The Gmail API returns raw message bodies with all of that quoted text.
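The duplication is easy to confirm by counting how often each sentence shows up once the raw bodies are concatenated. This is a minimal sketch that hard-codes the three example messages above as plain strings; the variable names are illustrative and not part of the Gmail API.

```python
# The three raw bodies from the example thread, quoted history included.
original = "Hey, can we schedule a meeting for next Tuesday?"
first_reply = "Sure, Tuesday works for me."

messages = [
    original,
    f"{first_reply}\n\n"
    "On Mon, Mar 10, 2026 at 9:00 AM, John wrote:\n"
    f"> {original}",
    "Great, let's do 2pm.\n\n"
    "On Mon, Mar 10, 2026 at 9:15 AM, Jane wrote:\n"
    f"> {first_reply}\n"
    ">\n"
    "> On Mon, Mar 10, 2026 at 9:00 AM, John wrote:\n"
    f"> > {original}",
]

# This is what a naive pipeline would hand to the model.
full_context = "\n\n".join(messages)

copies_of_original = full_context.count(original)
copies_of_first_reply = full_context.count(first_reply)
print(copies_of_original, copies_of_first_reply)
```

In a three-message thread the first message already appears three times and the first reply twice.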
For a 20-message thread, the math is brutal:
    Message 1: 1 copy of the original
    Message 2: 1 copy of the reply + 1 copy of message 1
    Message 3: 1 copy of the reply + 1 copy of message 2 (which includes message 1)
    ...
    Message 20: 1 copy of the reply + 1 copy of each of the 19 previous messages

That’s 20 copies of message 1, 19 copies of message 2, 18 copies of message 3, and so on. The total grows quadratically with thread length, so long threads multiply token costs fast.
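Counting the copies explicitly makes the growth rate clear. Assuming every reply quotes the entire history before it (the worst case described above), message k carries k message-sized chunks, so a thread of n messages contains:

```latex
\text{copies}(n) = \sum_{k=1}^{n} k = \frac{n(n+1)}{2},
\qquad
\text{copies}(20) = \frac{20 \cdot 21}{2} = 210
```

Deduplicated, the same thread needs only n chunks, so the overhead factor is (n + 1) / 2, which is 10.5 for n = 20 in this worst case. Real threads usually land lower because not every reply quotes everything.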
Why This Happens
Gmail’s design makes sense from a user perspective. Quoted replies help people understand context without scrolling. But for AI agents, this is a disaster.
    from google.oauth2.credentials import Credentials
    from googleapiclient.discovery import build

    def get_thread_messages(thread_id: str, creds: Credentials) -> list[str]:
        """Fetch all messages in a Gmail thread."""
        service = build('gmail', 'v1', credentials=creds)
        thread = service.users().threads().get(
            userId='me',
            id=thread_id
        ).execute()

        messages = []
        for msg in thread['messages']:
            # Get message body - includes ALL quoted history
            payload = msg['payload']
            body = extract_body(payload)  # Returns raw body with quotes
            messages.append(body)

        return messages  # Every message has full history attached!

    def process_with_agent(messages: list[str]):
        """Feed raw messages to AI agent."""
        full_context = '\n\n'.join(messages)  # 4x token bloat here!
        response = llm.invoke(f"Summarize this thread:\n{full_context}")
        return response

I was feeding the agent a concatenated string where the same content appeared dozens of times.
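The code above calls `extract_body` without showing it. The following is a minimal sketch of what such a helper could look like, not necessarily the version I ran. It assumes the message was fetched in the default full format, where the Gmail API puts base64url-encoded content in `payload['body']['data']` and nests multipart messages under `payload['parts']`. It prefers the first `text/plain` part and ignores HTML-only messages.

```python
import base64


def extract_body(payload: dict) -> str:
    """Return the first text/plain body found in a Gmail message payload.

    Sketch only: HTML-only messages yield an empty string here.
    """
    data = payload.get("body", {}).get("data")
    if payload.get("mimeType") == "text/plain" and data:
        # Gmail uses URL-safe base64; re-add padding in case it was stripped.
        padded = data + "=" * (-len(data) % 4)
        return base64.urlsafe_b64decode(padded).decode("utf-8", errors="replace")

    # Multipart messages nest their parts, so search them recursively.
    for part in payload.get("parts", []):
        text = extract_body(part)
        if text:
            return text
    return ""


# Usage with a hand-built payload shaped like a multipart/alternative message.
encoded = base64.urlsafe_b64encode(b"Sure, Tuesday works for me.").decode().rstrip("=")
sample_payload = {
    "mimeType": "multipart/alternative",
    "body": {"size": 0},
    "parts": [
        {"mimeType": "text/plain", "body": {"data": encoded}},
        {"mimeType": "text/html", "body": {"data": encoded}},
    ],
}
print(extract_body(sample_payload))
```

Sticking to the plain-text part keeps the quote markers (`>` prefixes and "On ... wrote:" lines) intact, which is what the stripping logic later in this article relies on.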
The cost impact was significant:
    Thread: 20 messages, 200 tokens of unique content each
    Naive approach: 20 + 19 + 18 + ... + 1 = 210 copies = 42,000 tokens
    Optimized approach: 20 * 200 = 4,000 tokens
    Savings: 38,000 tokens (90% reduction)

At $10 per million tokens, that’s $0.38 saved per thread. Across 10,000 threads per month, that’s $3,800 in wasted API spend.
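The same arithmetic as a small script, so the numbers can be re-run for other thread lengths or prices. This is a sketch under the worst-case assumption that every reply quotes the full history; the function names are made up for illustration.

```python
def naive_copies(n_messages: int) -> int:
    """Message-sized chunks sent when every reply quotes everything before it."""
    return n_messages * (n_messages + 1) // 2


def cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Input cost for a given token count."""
    return tokens * usd_per_million_tokens / 1_000_000


n_messages = 20
tokens_per_message = 200
price = 10.0  # USD per million tokens, as in the example above

naive_tokens = naive_copies(n_messages) * tokens_per_message
deduped_tokens = n_messages * tokens_per_message
saved_tokens = naive_tokens - deduped_tokens

saved_per_thread = cost_usd(saved_tokens, price)
saved_per_month = saved_per_thread * 10_000  # 10,000 threads per month

print(naive_tokens, deduped_tokens, saved_tokens)
print(f"${saved_per_thread:.2f} per thread, ${saved_per_month:,.0f} per month")
```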
The Solution
I needed to strip quoted content before feeding messages to the agent. Here’s my approach:
Step 1: Detect Quote Patterns
Different email clients format quotes differently:
    import re

    QUOTE_PATTERNS = [
        # Gmail-style: "On [date], [name] wrote:"
        r'^On.*wrote:.*$',
        # Standard quote prefix
        r'^>\s*.*$',
        # Outlook-style: "-----Original Message-----"
        r'^\-{4,}.*Original Message.*\-{4,}$',
        # Forward header
        r'^From:.*$',
        r'^Sent:.*$',
        r'^To:.*$',
        r'^Subject:.*$',
        # Apple Mail style
        r'^On [A-Z][a-z]+ \d+, \d+, at \d+:\d+.*,',
    ]

    def is_quote_line(line: str) -> bool:
        """Check if a line is part of quoted history."""
        stripped = line.strip()
        if not stripped:
            return False

        for pattern in QUOTE_PATTERNS:
            if re.match(pattern, stripped, re.IGNORECASE):
                return True
        return False

Step 2: Extract Unique Content
Now I can strip quoted sections:
    import re

    def extract_unique_content(message_body: str) -> str:
        """
        Extract only new content from a message, removing quoted history.
        Returns the actual content the sender wrote.
        """
        lines = message_body.split('\n')
        unique_lines = []
        in_quote_block = False

        for line in lines:
            # Quote lines (including "On ... wrote:" attributions) start or
            # continue a quote block and are always dropped.
            if is_quote_line(line):
                in_quote_block = True
                continue

            # Leave the quote block only when we hit non-trivial new text.
            # Trade-off: lines of 20 characters or fewer that follow a quote
            # are treated as part of it and dropped.
            if in_quote_block and len(line.strip()) > 20:
                in_quote_block = False

            if not in_quote_block:
                unique_lines.append(line)

        return '\n'.join(unique_lines).strip()

Step 3: Process Thread for Agent
Now I can process an entire thread:
    from dataclasses import dataclass

    @dataclass
    class ProcessedMessage:
        sender: str
        timestamp: str
        content: str  # Deduplicated content only

    def process_thread_for_agent(messages: list[dict]) -> str:
        """
        Process Gmail thread into deduplicated format for AI agents.
        Returns clean chronological content with ~4x token reduction.
        """
        processed: list[ProcessedMessage] = []

        for msg in messages:
            # Extract message metadata
            headers = {h['name']: h['value'] for h in msg['payload']['headers']}
            sender = headers.get('From', 'Unknown')
            timestamp = headers.get('Date', '')

            # Get and deduplicate body
            body = extract_body(msg['payload'])
            unique_content = extract_unique_content(body)

            if unique_content:  # Only include if there's actual content
                processed.append(ProcessedMessage(
                    sender=sender,
                    timestamp=timestamp,
                    content=unique_content
                ))

        return format_for_context(processed)

    def format_for_context(messages: list[ProcessedMessage]) -> str:
        """Format deduplicated messages for optimal AI agent context."""
        formatted = []
        for msg in messages:
            formatted.append(
                f"[{msg.timestamp}]\n"
                f"From: {msg.sender}\n"
                f"{msg.content}"
            )

        return '\n\n---\n\n'.join(formatted)

Step 4: Use Message Headers for Structure (Better Approach)
For more reliable deduplication, I use Gmail’s threading headers:
    def build_thread_graph(messages: list[dict]) -> dict:
        """
        Use Gmail headers to build proper thread structure.
        More reliable than quote detection.
        """
        thread_map = {}

        for msg in messages:
            # Header name casing can vary between senders (Message-ID vs
            # Message-Id), so normalize names before looking them up.
            headers = {h['name'].lower(): h['value']
                       for h in msg['payload']['headers']}
            # In-Reply-To and References hold RFC 5322 Message-IDs, not Gmail
            # API ids, so the map must be keyed by the Message-ID header.
            # Fall back to the API id if the header is missing.
            msg_id = headers.get('message-id', msg['id']).strip()
            in_reply_to = headers.get('in-reply-to', '').strip()
            references = headers.get('references', '').split()

            thread_map[msg_id] = {
                'content': extract_unique_content(extract_body(msg['payload'])),
                'sender': headers.get('from', 'Unknown'),
                'date': headers.get('date', ''),
                'parent_id': in_reply_to,
                'references': references,
                'children': []
            }

        # Build parent-child relationships
        for msg_id, data in thread_map.items():
            parent_id = data['parent_id']
            if parent_id and parent_id in thread_map:
                thread_map[parent_id]['children'].append(msg_id)

        return thread_map

    def reconstruct_thread(thread_map: dict) -> list[dict]:
        """Reconstruct thread in reply order (depth-first from each root)."""
        # Find root messages (no parent, or parent not in this thread)
        roots = [msg_id for msg_id, data in thread_map.items()
                 if not data['parent_id'] or data['parent_id'] not in thread_map]

        result = []
        for root_id in roots:
            result.extend(traverse_thread(root_id, thread_map))

        return result

    def traverse_thread(msg_id: str, thread_map: dict) -> list[dict]:
        """Traverse thread tree depth-first."""
        data = thread_map[msg_id]
        result = [{
            'sender': data['sender'],
            'date': data['date'],
            'content': data['content']
        }]

        for child_id in data['children']:
            result.extend(traverse_thread(child_id, thread_map))

        return result

Results
After implementing deduplication:
    Before:
    - 20-message thread: 47,891 tokens
    - Cost per thread: $0.48
    - Monthly cost (10k threads): $4,800

    After:
    - 20-message thread: 11,234 tokens
    - Cost per thread: $0.11
    - Monthly cost (10k threads): $1,100

    Savings: $3,700/month (77% reduction)

Handling Forwarded Messages
Forwarded chains are trickier. They collapse multiple conversations into one message body, so there are no per-message headers to rely on, only textual delimiters inside the body.
    def handle_forwarded_chains(body: str) -> list[str]:
        """
        Handle forwarded messages that embed entire conversations.
        Returns list of individual message bodies.
        """
        # Common forward delimiters
        forward_markers = [
            '---------- Forwarded message ----------',
            'Begin forwarded message:',
            '-----Original Message-----',
        ]

        segments = [body]

        for marker in forward_markers:
            new_segments = []
            for segment in segments:
                parts = segment.split(marker)
                new_segments.extend(parts)
            segments = new_segments

        return [s.strip() for s in segments if s.strip()]

Common Pitfalls
When implementing this, I hit several issues:
Pitfall 1: Over-aggressive quote stripping
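    # BAD: Strips too much
    if '>' in line:  # Too simple! Also matches code, math, and arrows like "->"
        continue

    # GOOD: Context-aware stripping
    # (is_code_block is a helper you need to supply; it is not defined above)
    if is_quote_line(line) and not is_code_block(line):
        continue

Pitfall 2: Missing client-specific formats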
    # Add more patterns as you encounter them
    QUOTE_PATTERNS.extend([
        r'^Am \d+\.\d+\.\d+ schrieb .*:',  # German Outlook
        r'^Le \d+ .* a [ée]crit\s*:',      # French (often written "a écrit :")
        r'^Il \d+ .* ha scritto:',         # Italian
    ])

Pitfall 3: Not preserving thread structure
The deduplication must preserve the conversational flow. I learned to include metadata:
    # Include who said what
    formatted.append(
        f"[{msg.timestamp}] {msg.sender}:\n{msg.content}"
    )

Summary
Gmail thread token bloat is a hidden cost multiplier. By implementing a preprocessing layer that strips quoted history before feeding messages to your AI agent, you can:
- Reduce token usage by roughly 4x (47,891 down to 11,234 tokens in the thread above)
- Lower API costs by about 77% for that workload
- Improve agent response quality (less noise in context)
- Stay under rate limits more easily
The key is to treat Gmail’s raw output as pre-processing input, not as ready-to-consume agent context. Your token budget will thank you.
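To see the effect end to end without touching the Gmail API, here is a small self-contained sketch. Everything in it is an assumption made for illustration: the thread is synthetic (each reply quotes the whole previous body), `strip_quoted` is a simplified stand-in for the fuller quote detection above, and `rough_tokens` counts whitespace-separated words rather than real model tokens. The exact ratio on real mail will differ.

```python
import re

# Gmail-style attribution line, e.g. "On Mon, Mar 10, 2026 at 9:00 AM, John wrote:"
ATTRIBUTION = re.compile(r"^On .* wrote:$")


def strip_quoted(body: str) -> str:
    """Keep only the text above the first attribution line, minus '>' lines."""
    kept = []
    for line in body.splitlines():
        stripped = line.strip()
        if ATTRIBUTION.match(stripped):
            break  # everything below is quoted history
        if stripped.startswith(">"):
            continue
        kept.append(line)
    return "\n".join(kept).strip()


def rough_tokens(text: str) -> int:
    """Crude proxy for token count: whitespace-separated words."""
    return len(text.split())


def build_thread(n_messages: int) -> list[str]:
    """Synthetic thread where every reply quotes the full previous body."""
    bodies: list[str] = []
    for i in range(1, n_messages + 1):
        new_text = f"Reply number {i} with some new content."
        if bodies:
            quoted = "\n".join(f"> {line}" for line in bodies[-1].splitlines())
            body = (
                f"{new_text}\n\n"
                f"On Mon, Mar 10, 2026 at 9:00 AM, Sender {i - 1} wrote:\n"
                f"{quoted}"
            )
        else:
            body = new_text
        bodies.append(body)
    return bodies


thread = build_thread(20)
raw = "\n\n".join(thread)
clean = "\n\n".join(strip_quoted(body) for body in thread)
ratio = rough_tokens(raw) / rough_tokens(clean)
print(f"raw={rough_tokens(raw)} clean={rough_tokens(clean)} ratio={ratio:.1f}x")
```

Running a check like this against a handful of real threads before and after preprocessing is a cheap way to confirm the stripping logic removes history without eating new content.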
Final Words + More Resources
My intention with this article was to share my knowledge and experience to help others. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 Gmail API Reference
- 👨💻 Goodhart's Law
- 👨💻 Token Usage in LLM APIs
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!