How Can You Reduce Token Usage When Web Scraping with AI Agents?
Problem
I built an AI agent to scrape product information from 50 e-commerce sites. My first version worked—but it cost $47 in tokens for a single run.
When I looked at where the tokens went, I found the problem:
Token Breakdown (Naive Approach):
- Raw HTML content: 847,000 tokens (89%)
- Agent reasoning: 52,000 tokens (5%)
- Planning and extraction: 53,000 tokens (6%)
Total: 952,000 tokens (~$47 at GPT-4 rates)

89% of my token budget went to raw HTML. The AI was processing navigation menus, footers, scripts, and ads just to extract product names and prices.
I needed a better approach.
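To make the breakdown easier to check, here is a small sketch that recomputes the percentage shares and the implied price per million tokens from the counts above. The function name `token_shares` is my own, and the $47 figure is taken directly from the run described here.

```python
def token_shares(counts: dict[str, int]) -> dict[str, int]:
    """Return each category's share of the total as a rounded percentage."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


# Token counts from the naive run described above
breakdown = {
    "Raw HTML content": 847_000,
    "Agent reasoning": 52_000,
    "Planning and extraction": 53_000,
}

total_tokens = sum(breakdown.values())
shares = token_shares(breakdown)

# Implied blended price, derived from the reported $47 bill
usd_per_million = 47.00 / (total_tokens / 1_000_000)

for name, pct in shares.items():
    print(f"{name}: {breakdown[name]:,} tokens ({pct}%)")
print(f"Total: {total_tokens:,} tokens, about ${usd_per_million:.2f} per million")
```

The implied rate is a blended figure across input and output tokens, so it will not match any single published price exactly.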
The Naive Approach (Expensive)
My original architecture sent entire web pages to the AI:
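    import json

    import httpx
    from openai import AsyncOpenAI

    # Reads OPENAI_API_KEY from the environment
    client = AsyncOpenAI()


    async def scrape_product(url: str) -> dict:
        # Fetch raw HTML (httpx stands in for the original, undefined fetch helper)
        async with httpx.AsyncClient(follow_redirects=True) as http:
            response = await http.get(url)
        html = response.text

        # Send entire page to AI
        prompt = f"""
        Extract product information from this HTML:

        {html}

        Return JSON with: name, price, description, availability
        """

        result = await client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )

        # Assumes the reply is bare JSON; json.loads raises ValueError otherwise
        return json.loads(result.choices[0].message.content)

This worked but was expensive for several reasons: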
- Large context: Raw HTML includes scripts, styles, navigation
- Duplicate processing: Same elements processed repeatedly
- No caching: Every request starts from scratch
- Token-heavy prompts: Sending entire pages each time
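To see how much of a page is non-content, it helps to strip the obvious boilerplate locally and compare sizes. The sketch below uses only the standard library's `html.parser`; it is not what Firecrawl does internally, just a rough illustration. The names `MainTextExtractor`, `strip_boilerplate`, and the sample page are my own.

```python
from html.parser import HTMLParser

# Tags whose contents rarely carry product data
SKIP_TAGS = {"script", "style", "nav", "footer", "aside", "header"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside any of SKIP_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def strip_boilerplate(html: str) -> str:
    """Return the visible text outside scripts, styles, and page chrome."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = (
    "<html><head><style>body{margin:0}</style><script>track();</script></head>"
    "<body><nav><a href='/'>Home</a></nav>"
    "<main><h1>Widget</h1><p>$19.99</p></main>"
    "<footer>Copyright</footer></body></html>"
)
cleaned = strip_boilerplate(page)
print(f"{len(page)} chars of HTML -> {len(cleaned)} chars of text")
```

Character counts are only a proxy for tokens, but the ratio gives a quick sense of how much of the prompt is wasted on markup.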
The Hybrid Approach (Cheap)
I redesigned the system to separate concerns:
- Firecrawl handles extraction, cleaning, and structuring
- AI Agent handles planning, reasoning, and validation
Here’s the new architecture:
Hybrid Architecture:

    URL
      → Firecrawl (clean HTML, extract structure)
      → Structured Data (markdown, minimal)
      → AI Agent (plan extraction strategy)
      → Extracted Fields

The implementation:
    import asyncio
    import json

    from firecrawl import FirecrawlApp
    from openai import AsyncOpenAI


    class HybridWebScraper:
        def __init__(self, firecrawl_api_key: str, openai_api_key: str):
            self.firecrawl = FirecrawlApp(api_key=firecrawl_api_key)
            self.openai = AsyncOpenAI(api_key=openai_api_key)

        async def scrape_product(self, url: str) -> dict:
            # Step 1: Firecrawl extracts and cleans
            # Returns structured markdown, not raw HTML.
            # scrape_url is a blocking call, so it runs in a worker thread.
            # The params dict follows the original listing; the SDK's signature
            # has changed between releases, so check your installed version.
            firecrawl_result = await asyncio.to_thread(
                self.firecrawl.scrape_url,
                url,
                params={
                    'formats': ['markdown', 'html'],
                    'onlyMainContent': True,  # Skip nav, footer, ads
                    'excludeTags': ['nav', 'footer', 'aside', 'script', 'style'],
                },
            )

            markdown = firecrawl_result['markdown']

            # Step 2: AI plans extraction strategy
            plan = await self.plan_extraction(markdown[:5000])  # Use first 5K chars

            # Step 3: AI extracts based on plan
            product = await self.extract_fields(markdown, plan)

            return product

        async def plan_extraction(self, sample_content: str) -> dict:
            """AI plans how to extract data - small context"""
            prompt = f"""
            Analyze this product page sample and plan extraction:

            {sample_content}

            Return JSON with extraction strategy:
            {{
                "price_pattern": "regex or selector hint",
                "name_location": "likely location in content",
                "description_section": "marker to find description"
            }}
            """

            result = await self.openai.chat.completions.create(
                model="gpt-4o-mini",  # Cheaper model for planning
                messages=[{"role": "user", "content": prompt}],
                response_format={"type": "json_object"},
            )

            return json.loads(result.choices[0].message.content)

        async def extract_fields(self, content: str, plan: dict) -> dict:
            """Extract fields using plan - targeted extraction"""
            prompt = f"""
            Extract product info using this strategy:
            {json.dumps(plan, indent=2)}

            Content:
            {content[:10000]}

            Return JSON:
            {{
                "name": "",
                "price": "",
                "description": "",
                "availability": ""
            }}
            """

            result = await self.openai.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
                response_format={"type": "json_object"},
            )

            return json.loads(result.choices[0].message.content)

Cost Comparison
I ran both approaches on the same 50 URLs:
Naive Approach:
- Raw HTML per page: ~17,000 tokens average
- AI processing: ~19,000 tokens per page
- Total: 950,000 tokens
- Cost: $47.00
Hybrid Approach:
- Firecrawl processing: ~$0.02 per page = $1.00
- Cleaned markdown per page: ~3,000 tokens average
- AI planning: ~500 tokens per page
- AI extraction: ~2,500 tokens per page
- Total AI tokens: 150,000 tokens
- AI cost: $7.50 (about $50 per million tokens, the same rate as the naive run)
- Total cost: $8.50
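The comparison can be reproduced with a few lines of arithmetic. This is only a sketch built from the figures listed above; the helper name `savings_pct` and the variable names are my own.

```python
def savings_pct(baseline_usd: float, new_usd: float) -> float:
    """Percentage saved relative to the baseline cost."""
    return (baseline_usd - new_usd) / baseline_usd * 100


PAGES = 50

# Naive run: reported total cost
naive_cost = 47.00

# Hybrid run: Firecrawl fee plus AI tokens
firecrawl_cost = PAGES * 0.02                 # ~$0.02 per page
hybrid_ai_tokens = PAGES * (500 + 2_500)      # planning + extraction per page
hybrid_ai_cost = 7.50                         # reported AI cost for those tokens
hybrid_total = firecrawl_cost + hybrid_ai_cost

print(f"Hybrid AI tokens: {hybrid_ai_tokens:,}")
print(f"Hybrid total: ${hybrid_total:.2f}")
print(f"Savings vs naive: {savings_pct(naive_cost, hybrid_total):.0f}%")
```

Keeping the Firecrawl fee in the total matters: the savings figure compares full pipeline cost, not just token spend.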
Savings: 82%

The token reduction came from:
Token Reduction Breakdown:
- Removing scripts/styles: -40% tokens
- Removing nav/footer/ads: -25% tokens
- Converting HTML to markdown: -15% tokens
- Using smaller model for planning: -60% cost on that step
- Targeted extraction vs full parsing: -20% tokens

Advanced Optimization: Caching Plans
I noticed that pages on the same e-commerce site tend to share a layout. Instead of planning extraction for each page, I cache one plan per domain, on the assumption that a domain maps to one site template:
    import asyncio
    from urllib.parse import urlparse


    # Builds on HybridWebScraper from the previous listing
    class CachedHybridScraper(HybridWebScraper):
        def __init__(self, firecrawl_api_key: str, openai_api_key: str):
            super().__init__(firecrawl_api_key, openai_api_key)
            self.plan_cache = {}  # Domain -> extraction plan

        async def scrape_with_cache(self, url: str) -> dict:
            domain = urlparse(url).netloc

            # Check cache for extraction plan
            if domain in self.plan_cache:
                plan = self.plan_cache[domain]
            else:
                # First time seeing this domain
                sample = await self.get_sample_content(url)
                plan = await self.plan_extraction(sample)
                self.plan_cache[domain] = plan

            # Extract using cached plan
            content = await self.get_clean_content(url)
            return await self.extract_fields(content, plan)

        async def get_clean_content(self, url: str) -> str:
            """Fetch cleaned markdown via Firecrawl (blocking call, run in a thread)"""
            result = await asyncio.to_thread(
                self.firecrawl.scrape_url,
                url,
                params={'formats': ['markdown'], 'onlyMainContent': True},
            )
            return result['markdown']

        async def get_sample_content(self, url: str) -> str:
            """Get minimal sample for planning"""
            content = await self.get_clean_content(url)
            return content[:2000]  # Just first 2K chars

With caching, I reduced costs further:
Cached Hybrid Approach (50 URLs, 5 unique sites):
- Firecrawl: $1.00 (same)
- AI planning: 5 plans x 500 tokens = 2,500 tokens
- AI extraction: 50 pages x 2,500 tokens = 125,000 tokens
- Total AI tokens: 127,500 tokens
- AI cost: $6.37
- Total cost: $7.37
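The effect of plan caching on token volume can be estimated from the URL list alone, since only the number of planning calls changes. The sketch below assumes the per-call figures listed above (about 500 tokens per plan, 2,500 per extraction); the function name and the `example.com` URLs are placeholders of my own.

```python
from urllib.parse import urlparse

PLAN_TOKENS = 500        # approximate tokens per planning call
EXTRACT_TOKENS = 2_500   # approximate tokens per extraction call


def estimate_ai_tokens(urls: list[str], cache_plans: bool) -> int:
    """Estimate total AI tokens for a batch, with or without per-domain plan caching."""
    domains = {urlparse(u).netloc.lower() for u in urls}
    # With caching, one plan per domain; without, one plan per page
    planning_calls = len(domains) if cache_plans else len(urls)
    return planning_calls * PLAN_TOKENS + len(urls) * EXTRACT_TOKENS


# 50 placeholder URLs spread evenly over 5 domains
urls = [f"https://shop{i % 5}.example.com/product/{i}" for i in range(50)]

print(f"Without cache: {estimate_ai_tokens(urls, cache_plans=False):,} tokens")
print(f"With cache:    {estimate_ai_tokens(urls, cache_plans=True):,} tokens")
```

The fewer distinct domains a batch contains, the closer the total gets to extraction cost alone.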
Additional savings: 13%
Overall savings vs naive: 84%

When to Use This Approach
The hybrid approach works best when:
Ideal Use Cases:
- Structured data extraction (products, articles, listings)
- Multiple pages from same sites (caching helps)
- High-volume scraping (cost savings multiply)
- Regular monitoring tasks (weekly price checks, etc.)

Less Suitable:
- One-off pages (overhead doesn't pay off)
- Unstructured content (creative writing, opinions)
- Sites that block scrapers (Firecrawl handles some, not all)

Complete Implementation
Here’s the full production-ready scraper:
    import asyncio
    import json
    from dataclasses import dataclass
    from urllib.parse import urlparse

    from firecrawl import FirecrawlApp
    from openai import AsyncOpenAI


    @dataclass
    class ScraperConfig:
        firecrawl_api_key: str
        openai_api_key: str
        max_retries: int = 3
        cache_enabled: bool = True
        planning_model: str = "gpt-4o-mini"
        extraction_model: str = "gpt-4o-mini"


    class ProductionWebScraper:
        def __init__(self, config: ScraperConfig):
            self.config = config
            self.firecrawl = FirecrawlApp(api_key=config.firecrawl_api_key)
            # max_retries is handed to the OpenAI client, which retries failed requests
            self.openai = AsyncOpenAI(
                api_key=config.openai_api_key,
                max_retries=config.max_retries,
            )
            self.plan_cache: dict[str, dict] = {}

        async def scrape_batch(self, urls: list[str]) -> list[dict]:
            """Scrape multiple URLs with rate limiting"""
            results = []
            for i, url in enumerate(urls):
                try:
                    result = await self.scrape_single(url)
                    results.append({"url": url, "data": result, "success": True})
                except Exception as e:
                    results.append({"url": url, "error": str(e), "success": False})

                # Rate limiting: pause between requests, but not after the last one
                if i < len(urls) - 1:
                    await asyncio.sleep(1)

            return results

        async def scrape_single(self, url: str) -> dict:
            """Scrape single URL with hybrid approach"""
            # Step 1: Get cleaned content via Firecrawl
            content = await self._fetch_clean_content(url)

            # Step 2: Get or create extraction plan
            plan = await self._get_plan(url, content[:2000])

            # Step 3: Extract data
            data = await self._extract(content, plan)

            return data

        async def _fetch_clean_content(self, url: str) -> str:
            """Fetch and clean content via Firecrawl"""
            # scrape_url is a blocking call, so it runs in a worker thread.
            # The params dict follows the original listing; the SDK's signature
            # has changed between releases, so check your installed version.
            result = await asyncio.to_thread(
                self.firecrawl.scrape_url,
                url,
                params={
                    'formats': ['markdown'],
                    'onlyMainContent': True,
                    'excludeTags': ['nav', 'footer', 'aside', 'script', 'style', 'header'],
                },
            )
            return result['markdown']

        async def _get_plan(self, url: str, sample: str) -> dict:
            """Get cached plan or create new one"""
            domain = urlparse(url).netloc

            if self.config.cache_enabled and domain in self.plan_cache:
                return self.plan_cache[domain]

            plan = await self._create_plan(sample)

            if self.config.cache_enabled:
                self.plan_cache[domain] = plan

            return plan

        async def _create_plan(self, sample: str) -> dict:
            """Create extraction plan using AI"""
            prompt = f"""
            Analyze this content sample and create an extraction plan:

            {sample}

            Return JSON with:
            - content_type: "product" | "article" | "listing" | "other"
            - primary_fields: list of fields to extract
            - field_hints: location hints for each field
            """

            result = await self.openai.chat.completions.create(
                model=self.config.planning_model,
                messages=[{"role": "user", "content": prompt}],
                response_format={"type": "json_object"},
            )

            return json.loads(result.choices[0].message.content)

        async def _extract(self, content: str, plan: dict) -> dict:
            """Extract fields using plan"""
            prompt = f"""
            Extract data using this plan:
            {json.dumps(plan, indent=2)}

            Content:
            {content[:15000]}

            Return extracted data as JSON.
            """

            result = await self.openai.chat.completions.create(
                model=self.config.extraction_model,
                messages=[{"role": "user", "content": prompt}],
                response_format={"type": "json_object"},
            )

            return json.loads(result.choices[0].message.content)

Summary
In this post, I showed how a hybrid approach cut my AI agent web scraping costs by 82% to 84% on a 50-URL test. The key strategies:
- Use Firecrawl for cleaning: Remove scripts, styles, navigation before AI sees it
- Separate planning from execution: Plan on a short sample, then extract from the cleaned content, both with a small model
- Cache extraction plans: Similar pages don’t need re-planning
- Use smaller models: gpt-4o-mini is far cheaper per token than gpt-4 for extraction tasks (check current pricing for the exact ratio)
The hybrid approach separates concerns: Firecrawl handles the heavy lifting of parsing and cleaning HTML, while AI handles the intelligent work of understanding structure and extracting data. This division of labor dramatically reduces token consumption while maintaining extraction quality.
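One way to check that extraction quality holds up is to validate each result before accepting it. The sketch below is a minimal example that assumes the four fields requested in the extraction prompt (name, price, description, availability) come back as strings; `validate_product` and the price pattern are my own illustration, not part of the scraper above.

```python
import re

REQUIRED_FIELDS = ("name", "price", "description", "availability")

# Loose check: the price should contain at least one number like 19 or 19.99
PRICE_RE = re.compile(r"\d+(?:[.,]\d{1,2})?")


def validate_product(product: dict) -> list[str]:
    """Return a list of problems; an empty list means the record looks usable."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = product.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")

    price = product.get("price")
    if isinstance(price, str) and price.strip() and not PRICE_RE.search(price):
        problems.append(f"price has no numeric part: {price!r}")
    return problems


good = {"name": "Widget", "price": "$19.99",
        "description": "A widget", "availability": "In stock"}
bad = {"name": "Widget", "price": "call us",
       "description": "", "availability": "In stock"}

print(validate_product(good))
print(validate_product(bad))
```

Records that fail validation could be retried with a larger model or flagged for review, which keeps the cheap path as the default.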
Final Words + More Resources
My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!