How Can You Reduce Token Usage When Web Scraping with AI Agents?

Problem

I built an AI agent to scrape product information from 50 e-commerce sites. My first version worked—but it cost $47 in tokens for a single run.

When I looked at where the tokens went, I found the problem:

Token Breakdown (Naive Approach):
- Raw HTML content: 847,000 tokens (89%)
- Agent reasoning: 52,000 tokens (5%)
- Planning and extraction: 53,000 tokens (6%)
Total: 952,000 tokens (~$47 at GPT-4 rates)

89% of my token budget went to raw HTML. The AI was processing navigation menus, footers, scripts, and ads just to extract product names and prices.

I needed a better approach.

The Naive Approach (Expensive)

My original architecture sent entire web pages to the AI:

naive-scraper.py
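import json

import httpx  # stand-in for the generic fetch() helper
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def scrape_product(url: str) -> dict:
    # Fetch raw HTML
    async with httpx.AsyncClient(follow_redirects=True) as http:
        response = await http.get(url)
    html = response.text

    # Send entire page to AI (this is the expensive part)
    prompt = f"""
    Extract product information from this HTML:
    {html}
    Return JSON with: name, price, description, availability
    """
    result = await client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    # Assumes the model replies with bare JSON; anything else raises here
    return json.loads(result.choices[0].message.content)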

This worked but was expensive for several reasons:

  1. Large context: Raw HTML includes scripts, styles, navigation
  2. Duplicate processing: Same elements processed repeatedly
  3. No caching: Every request starts from scratch
  4. Token-heavy prompts: Sending entire pages each time
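
To see how much of a page is overhead before paying for it, a rough local check can help. This is a minimal sketch using only Python's standard library; the `VisibleTextExtractor` class, the `SKIPPED_TAGS` set, and the sample page are illustrative assumptions rather than part of my scraper, and character counts are only a rough proxy for token counts.

```python
from html.parser import HTMLParser

# Tags whose contents rarely matter for product extraction (assumed list)
SKIPPED_TAGS = {"script", "style", "nav", "footer", "aside"}


class VisibleTextExtractor(HTMLParser):
    """Collects text that is not nested inside any skipped tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def main_text(html: str) -> str:
    """Return the text left after dropping skipped tags, one chunk per line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Made-up page for illustration
sample_page = """
<html><head><style>body { margin: 0 }</style>
<script>window.analytics = {track: function () {}};</script></head>
<body><nav><a href="/">Home</a><a href="/sale">Sale</a></nav>
<main><h1>Trail Shoe</h1><p>Price: $89.00</p><p>In stock</p></main>
<footer>Copyright Example Shop</footer></body></html>
"""

cleaned = main_text(sample_page)
overhead = 1 - len(cleaned) / len(sample_page)
print(cleaned)
print(f"Characters removed: {overhead:.0%}")
```

On real product pages the ratio will vary, but running a check like this on a handful of URLs gives a quick estimate of how much a cleaning step could save.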

The Hybrid Approach (Cheap)

I redesigned the system to separate concerns:

  1. Firecrawl handles extraction, cleaning, and structuring
  2. AI Agent handles planning, reasoning, and validation

Here’s the new architecture:

Hybrid Architecture:
URL → Firecrawl (clean HTML, extract structure)
→ Structured Data (markdown, minimal)
→ AI Agent (plan extraction strategy)
→ Extracted Fields

The implementation:

hybrid-scraper.py
from firecrawl import FirecrawlApp
from openai import AsyncOpenAI
import asyncio
import json

class HybridWebScraper:
    def __init__(self, firecrawl_api_key: str, openai_api_key: str):
        self.firecrawl = FirecrawlApp(api_key=firecrawl_api_key)
        self.openai = AsyncOpenAI(api_key=openai_api_key)

    async def scrape_product(self, url: str) -> dict:
        # Step 1: Firecrawl extracts and cleans
        # Returns structured markdown, not raw HTML.
        # scrape_url is a blocking call, so it runs in a worker thread.
        # The params dict follows the older firecrawl-py call style; newer
        # releases may expect keyword arguments, so check your installed version.
        firecrawl_result = await asyncio.to_thread(
            self.firecrawl.scrape_url,
            url,
            params={
                'formats': ['markdown'],
                'onlyMainContent': True,  # Skip nav, footer, ads
                'excludeTags': ['nav', 'footer', 'aside', 'script', 'style']
            }
        )
        markdown = firecrawl_result['markdown']
        # Step 2: AI plans extraction strategy
        plan = await self.plan_extraction(markdown[:5000])  # Use first 5K chars
        # Step 3: AI extracts based on plan
        product = await self.extract_fields(markdown, plan)
        return product

    async def plan_extraction(self, sample_content: str) -> dict:
        """AI plans how to extract data - small context"""
        prompt = f"""
        Analyze this product page sample and plan extraction:
        {sample_content}
        Return JSON with extraction strategy:
        {{
            "price_pattern": "regex or selector hint",
            "name_location": "likely location in content",
            "description_section": "marker to find description"
        }}
        """
        result = await self.openai.chat.completions.create(
            model="gpt-4o-mini",  # Cheaper model for planning
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"}
        )
        return json.loads(result.choices[0].message.content)

    async def extract_fields(self, content: str, plan: dict) -> dict:
        """Extract fields using plan - targeted extraction"""
        prompt = f"""
        Extract product info using this strategy:
        {json.dumps(plan, indent=2)}
        Content:
        {content[:10000]}
        Return JSON:
        {{ "name": "", "price": "", "description": "", "availability": "" }}
        """
        result = await self.openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"}
        )
        return json.loads(result.choices[0].message.content)
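
The architecture assigns validation to the AI agent, but a cheap local check on the returned JSON can catch many failures without spending more tokens. The sketch below is one possible shape for it. `REQUIRED_FIELDS` mirrors the four fields in the extraction prompt, while `validate_product` and its rules (non-empty strings, at least one digit in the price) are my assumptions for illustration.

```python
REQUIRED_FIELDS = ("name", "price", "description", "availability")


def validate_product(product: dict) -> list[str]:
    """Return a list of problems; an empty list means the record looks usable."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = product.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    price = product.get("price")
    # Assumed rule: a usable price string contains at least one digit
    if isinstance(price, str) and price.strip() and not any(c.isdigit() for c in price):
        problems.append("price contains no digits")
    return problems


# Made-up records for illustration
good = {"name": "Trail Shoe", "price": "$89.00",
        "description": "Lightweight trail runner", "availability": "In stock"}
bad = {"name": "Trail Shoe", "price": "Call us", "description": ""}

print(validate_product(good))
print(validate_product(bad))
```

Records that fail the check can be retried or flagged for review, so only the problem pages cost extra tokens.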

Cost Comparison

I ran both approaches on the same 50 URLs:

Naive Approach:
- Raw HTML per page: ~17,000 tokens average
- Total per page, including reasoning and extraction: ~19,000 tokens
- Total: ~952,000 tokens
- Cost: $47.00

Hybrid Approach:
- Firecrawl processing: ~$0.02 per page = $1.00
- Cleaned markdown per page: ~3,000 tokens average
- AI planning: ~500 tokens per page
- AI extraction: ~2,500 tokens per page
- Total AI tokens: 150,000 tokens
- AI cost: $7.50
- Total cost: $8.50
Savings: 82%
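
The arithmetic behind these figures can be reproduced in a few lines. The sketch below uses only numbers already quoted in this post (the 952,000-token naive total comes from the token breakdown at the top); `blended_rate` is a helper name I introduce here. It also shows that the $7.50 AI cost works out to about $0.05 per 1K tokens, close to the blended rate implied by the naive run, which suggests the saving in this comparison comes from sending fewer tokens rather than from cheaper per-token pricing.

```python
def blended_rate(cost_usd: float, tokens: int) -> float:
    """Implied price in USD per 1,000 tokens."""
    return cost_usd / tokens * 1000


pages = 50
naive_cost = 47.00
naive_tokens = 952_000

hybrid_tokens = pages * (500 + 2_500)   # planning + extraction per page
hybrid_ai_cost = 7.50
firecrawl_cost = pages * 0.02
hybrid_total = hybrid_ai_cost + firecrawl_cost

savings = 1 - hybrid_total / naive_cost

print(f"Hybrid AI tokens: {hybrid_tokens:,}")
print(f"Naive rate:  ${blended_rate(naive_cost, naive_tokens):.3f} per 1K tokens")
print(f"Hybrid rate: ${blended_rate(hybrid_ai_cost, hybrid_tokens):.3f} per 1K tokens")
print(f"Hybrid total: ${hybrid_total:.2f}, savings: {savings:.0%}")
```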

The token reduction came from:

Token Reduction Breakdown:
- Removing scripts/styles: -40% tokens
- Removing nav/footer/ads: -25% tokens
- Converting HTML to markdown: -15% tokens
- Using smaller model for planning: -60% cost on that step
- Targeted extraction vs full parsing: -20% tokens

Advanced Optimization: Caching Plans

I noticed that pages on the same e-commerce site usually share a layout. Instead of planning extraction for every page, I cache one plan per domain, using the domain as a stand-in for the site template:

cached-scraper.py
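import asyncio
from urllib.parse import urlparse

# Assumes HybridWebScraper from hybrid-scraper.py is in scope;
# plan_extraction() and extract_fields() are inherited from it.
class CachedHybridScraper(HybridWebScraper):
    def __init__(self, firecrawl_api_key: str, openai_api_key: str):
        super().__init__(firecrawl_api_key, openai_api_key)
        self.plan_cache = {}  # Domain -> extraction plan

    async def scrape_with_cache(self, url: str) -> dict:
        domain = urlparse(url).netloc
        # Fetch the cleaned page once and reuse it for planning and extraction,
        # so a new domain does not cost two Firecrawl requests
        content = await self.get_clean_content(url)
        # Check cache for extraction plan
        if domain in self.plan_cache:
            plan = self.plan_cache[domain]
        else:
            # First time seeing this domain: plan from a minimal sample
            plan = await self.plan_extraction(content[:2000])  # Just first 2K chars
            self.plan_cache[domain] = plan
        # Extract using cached plan
        return await self.extract_fields(content, plan)

    async def get_clean_content(self, url: str) -> str:
        """Get cleaned markdown via Firecrawl"""
        # scrape_url is a blocking call, so it runs in a worker thread
        result = await asyncio.to_thread(
            self.firecrawl.scrape_url,
            url,
            params={'formats': ['markdown'], 'onlyMainContent': True}
        )
        return result['markdown']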
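
One detail worth checking when caching by domain: a raw `netloc` from `urllib.parse.urlparse` keeps the `www.` prefix, letter case, and any port, so `www.shop.example` and `shop.example` would end up with separate plans. A possible normalization is sketched below. `plan_cache_key` is a hypothetical helper of mine, and stripping `www.` is an assumption that holds for most shops but not all.

```python
from urllib.parse import urlparse


def plan_cache_key(url: str) -> str:
    """Normalize a URL to a cache key: lowercase host, no port, no leading 'www.'."""
    host = urlparse(url).hostname or ""  # .hostname is lowercased and drops the port
    return host.removeprefix("www.")     # requires Python 3.9+


# Made-up URLs for illustration
urls = [
    "https://www.shop.example/products/42",
    "https://Shop.Example:443/products/43?ref=home",
    "https://shop.example/sale",
]
keys = {plan_cache_key(u) for u in urls}
print(keys)
```

All three URLs map to the same key, so they share one cached plan.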

With caching, I reduced costs further:

Cached Hybrid Approach (50 URLs, 5 unique sites):
- Firecrawl: $1.00 (same)
- AI planning: 5 plans x 500 tokens = 2,500 tokens
- AI extraction: 50 pages x 2,500 tokens = 125,000 tokens
- Total AI tokens: 127,500 tokens
- AI cost: $6.37
- Total cost: $7.37
Additional savings: 13%
Overall savings vs naive: 84%
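
More generally, with plan caching the AI token count is roughly (unique sites × planning tokens) + (pages × extraction tokens). A small sketch of that formula follows, using the per-step averages from my run as defaults; the function name `ai_tokens` is illustrative.

```python
def ai_tokens(pages: int, unique_sites: int, cached: bool,
              plan_tokens: int = 500, extract_tokens: int = 2_500) -> int:
    """Estimate AI tokens for a run, with or without per-domain plan caching."""
    plans_needed = unique_sites if cached else pages
    return plans_needed * plan_tokens + pages * extract_tokens


uncached = ai_tokens(pages=50, unique_sites=5, cached=False)
cached = ai_tokens(pages=50, unique_sites=5, cached=True)
print(uncached, cached)
print(f"Planning tokens avoided: {uncached - cached:,}")
```

The saving grows with the number of pages per site. With one page per site, caching avoids nothing.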

When to Use This Approach

The hybrid approach works best when:

Ideal Use Cases:
- Structured data extraction (products, articles, listings)
- Multiple pages from same sites (caching helps)
- High-volume scraping (cost savings multiply)
- Regular monitoring tasks (weekly price checks, etc.)
Less Suitable:
- One-off pages (overhead doesn't pay off)
- Unstructured content (creative writing, opinions)
- Sites that block scrapers (Firecrawl handles some, not all)

Complete Implementation

Here’s the full production-ready scraper:

production-scraper.py
import asyncio
import json
from dataclasses import dataclass
from urllib.parse import urlparse

from firecrawl import FirecrawlApp
from openai import AsyncOpenAI


@dataclass
class ScraperConfig:
    firecrawl_api_key: str
    openai_api_key: str
    max_retries: int = 3
    cache_enabled: bool = True
    planning_model: str = "gpt-4o-mini"
    extraction_model: str = "gpt-4o-mini"


class ProductionWebScraper:
    def __init__(self, config: ScraperConfig):
        self.config = config
        self.firecrawl = FirecrawlApp(api_key=config.firecrawl_api_key)
        self.openai = AsyncOpenAI(api_key=config.openai_api_key)
        self.plan_cache: dict[str, dict] = {}

    async def scrape_batch(self, urls: list[str]) -> list[dict]:
        """Scrape multiple URLs with rate limiting"""
        results = []
        for i, url in enumerate(urls):
            try:
                result = await self.scrape_single(url)
                results.append({"url": url, "data": result, "success": True})
            except Exception as e:
                results.append({"url": url, "error": str(e), "success": False})
            # Rate limiting
            if i < len(urls) - 1:
                await asyncio.sleep(1)
        return results

    async def scrape_single(self, url: str) -> dict:
        """Scrape single URL with hybrid approach, retrying up to max_retries times"""
        retries = max(1, self.config.max_retries)
        for attempt in range(retries):
            try:
                # Step 1: Get cleaned content via Firecrawl
                content = await self._fetch_clean_content(url)
                # Step 2: Get or create extraction plan
                plan = await self._get_plan(url, content[:2000])
                # Step 3: Extract data
                return await self._extract(content, plan)
            except Exception:
                if attempt == retries - 1:
                    raise
                await asyncio.sleep(2 ** attempt)  # Simple exponential backoff

    async def _fetch_clean_content(self, url: str) -> str:
        """Fetch and clean content via Firecrawl"""
        # scrape_url is a blocking call, so it runs in a worker thread.
        # The params dict follows the older firecrawl-py call style; newer
        # releases may expect keyword arguments, so check your installed version.
        result = await asyncio.to_thread(
            self.firecrawl.scrape_url,
            url,
            params={
                'formats': ['markdown'],
                'onlyMainContent': True,
                'excludeTags': ['nav', 'footer', 'aside', 'script', 'style', 'header']
            }
        )
        return result['markdown']

    async def _get_plan(self, url: str, sample: str) -> dict:
        """Get cached plan or create new one"""
        domain = urlparse(url).netloc
        if self.config.cache_enabled and domain in self.plan_cache:
            return self.plan_cache[domain]
        plan = await self._create_plan(sample)
        if self.config.cache_enabled:
            self.plan_cache[domain] = plan
        return plan

    async def _create_plan(self, sample: str) -> dict:
        """Create extraction plan using AI"""
        prompt = f"""
        Analyze this content sample and create an extraction plan:
        {sample}
        Return JSON with:
        - content_type: "product" | "article" | "listing" | "other"
        - primary_fields: list of fields to extract
        - field_hints: location hints for each field
        """
        result = await self.openai.chat.completions.create(
            model=self.config.planning_model,
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"}
        )
        return json.loads(result.choices[0].message.content)

    async def _extract(self, content: str, plan: dict) -> dict:
        """Extract fields using plan"""
        prompt = f"""
        Extract data using this plan:
        {json.dumps(plan, indent=2)}
        Content:
        {content[:15000]}
        Return extracted data as JSON.
        """
        result = await self.openai.chat.completions.create(
            model=self.config.extraction_model,
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"}
        )
        return json.loads(result.choices[0].message.content)
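
After a batch run it helps to see which URLs failed and why. The sketch below assumes the result shape returned by `scrape_batch` (dicts with `url`, `success`, and either `data` or `error`). `summarize_results` is a hypothetical helper, and the sample results are made up for illustration.

```python
from collections import Counter


def summarize_results(results: list[dict]) -> dict:
    """Count successes and group failures by error message."""
    failures = [r for r in results if not r["success"]]
    return {
        "total": len(results),
        "succeeded": len(results) - len(failures),
        "failed_urls": [r["url"] for r in failures],
        "errors": dict(Counter(r["error"] for r in failures)),
    }


# Made-up results in the same shape scrape_batch returns
sample_results = [
    {"url": "https://shop.example/a", "data": {"name": "A"}, "success": True},
    {"url": "https://shop.example/b", "error": "timeout", "success": False},
    {"url": "https://shop.example/c", "error": "timeout", "success": False},
]
summary = summarize_results(sample_results)
print(summary)
```

Grouping by error message makes it easy to tell a single blocked site apart from a systematic problem such as an expired API key.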

Summary

In this post, I showed how a hybrid approach cut my AI agent web scraping costs by 82-84% on a 50-URL run. The key strategies:

  • Use Firecrawl for cleaning: Remove scripts, styles, navigation before AI sees it
  • Separate planning from execution: Plan on a short sample first, then extract with the plan as a guide
  • Cache extraction plans: Similar pages don’t need re-planning
  • Use smaller models: gpt-4o-mini is far cheaper per token than gpt-4 for extraction tasks (check current pricing for the exact ratio)

The hybrid approach separates concerns: Firecrawl handles the heavy lifting of parsing and cleaning HTML, while AI handles the intelligent work of understanding structure and extracting data. This division of labor dramatically reduces token consumption while maintaining extraction quality.

Final Words + More Resources

My intention with this article was to share my knowledge and experience in the hope that it helps others. If you want to get in touch, you can reach me by email.

