How to optimize LangExtract for large documents without hitting rate limits or blowing your budget
Problem
When I tried to extract entities from Romeo & Juliet using LangExtract, I hit multiple issues:
    import langextract as lx

    # prompt_description and examples omitted here for brevity
    result = lx.extract(
        text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
        model_id="gemini-2.5-flash",
        max_workers=20,
        extraction_passes=3,
    )

I got this error:

    Resource exhausted: Rate limit exceeded. You can make 15 requests per minute.

Even when I reduced workers to 5, the processing was slow. When I tried gemini-2.5-pro for better accuracy, the cost blew out to $27 for a single document.
The official docs show basic usage examples but don’t explain how to handle production-scale documents efficiently.
Environment
- Python 3.11
- langextract 0.6.2
- Google Vertex AI (Gemini 2.5 Flash/Pro)
- Document: Romeo & Juliet (147K characters, ~44K tokens)
What happened?
I wanted to extract character names and locations from a full novel. LangExtract worked great on short texts, but failed at scale.
Here’s what I tried first:
    result = lx.extract(
        text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
        prompt_description="Extract character names and locations",
        examples=[...],
        model_id="gemini-2.5-flash",
        extraction_passes=3,
    )

This ran for a while, then failed with the rate limit error. The problem: LangExtract chunks the text and processes the chunks in parallel. With 20 workers and a large document, you hit the 15 requests/minute limit quickly.
Even when it worked, the costs were high. I calculated the actual expense and realized I needed a better approach.
Solution #1: Calculate costs before running
Before processing any large document, calculate the estimated cost:
    COSTS = {
        "gemini-2.5-flash": 0.60,  # $0.60 per 1M tokens
        "gemini-2.5-pro": 7.50,    # $7.50 per 1M tokens
        "gemini-2.0-flash": 0.30,  # $0.30 per 1M tokens (cheapest!)
    }

    def estimate_cost(text: str, model_id: str, passes: int = 1) -> float:
        """Estimate extraction cost in USD."""
        # Rough tokenization: ~4 chars per token
        estimated_tokens = len(text) / 4
        cost_per_million = COSTS.get(model_id, 1.0)
        return (estimated_tokens / 1_000_000) * cost_per_million * passes

    # Usage
    text = open("romeo_juliet.txt").read()
    flash_cost = estimate_cost(text, "gemini-2.5-flash", passes=3)
    pro_cost = estimate_cost(text, "gemini-2.5-pro", passes=3)

    print(f"Flash (3 passes): ${flash_cost:.2f}")
    print(f"Pro (3 passes): ${pro_cost:.2f}")

Output:

    Flash (3 passes): $0.20
    Pro (3 passes): $2.46

This showed me that Pro costs 12x more than Flash. For most documents, Flash provides good accuracy at much lower cost.
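The same estimate can be turned into a pre-flight budget check. The following is a minimal sketch, not part of LangExtract: the prices are copied from the COSTS table above and are not verified here, the 4-characters-per-token ratio is the same rough assumption, and `models_within_budget` is a helper name made up for illustration. The 1,000,000-character document is hypothetical.

```python
# Prices in USD per 1M tokens, copied from the article's COSTS table (assumption).
COSTS = {
    "gemini-2.5-flash": 0.60,
    "gemini-2.5-pro": 7.50,
    "gemini-2.0-flash": 0.30,
}


def estimate_cost_from_chars(num_chars, model_id, passes=1, chars_per_token=4.0):
    """Estimate cost in USD from a character count (rough: ~4 chars per token)."""
    tokens = num_chars / chars_per_token
    return tokens / 1_000_000 * COSTS[model_id] * passes


def models_within_budget(num_chars, budget_usd, passes=1):
    """Return (model_id, cost) pairs that fit the budget, cheapest first."""
    costs = [(m, estimate_cost_from_chars(num_chars, m, passes)) for m in COSTS]
    affordable = [(m, c) for m, c in costs if c <= budget_usd]
    return sorted(affordable, key=lambda pair: pair[1])


# Hypothetical 1,000,000-character document, 3 passes, $1.00 budget
for model_id, cost in models_within_budget(1_000_000, budget_usd=1.00, passes=3):
    print(f"{model_id}: ${cost:.2f}")
```

Running this before calling lx.extract makes it obvious when a model choice does not fit the budget, without spending anything.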
Solution #2: Use Batch API for 50% savings
The Batch API provides the same quality at 50% cost for large documents:
    result = lx.extract(
        text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
        model_id="gemini-2.5-flash",
        language_model_params={
            "vertexai": True,
            "batch": {"enabled": True},  # KEY: Enable batch mode
        },
        extraction_passes=3,
        max_workers=20,
    )

With batch mode enabled, the cost for Romeo & Juliet dropped from $0.20 to $0.10.
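The arithmetic behind that drop, as a small sketch. The 50% discount and the $0.20 figure are the ones quoted in this post; `batch_cost` is a helper name made up for illustration.

```python
def batch_cost(standard_cost_usd: float, discount: float = 0.5) -> float:
    """Cost under batch pricing, given the standard cost and a fractional discount."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0 and 1")
    return standard_cost_usd * (1.0 - discount)


standard = 0.20  # Flash, 3 passes, figure reported above
print(f"Standard: ${standard:.2f}")
print(f"Batch:    ${batch_cost(standard):.2f}")
```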
Solution #3: Handle rate limits properly
You have two options for rate limits:
Option A: Apply for Tier 2 quota
Visit Vertex AI Tier 2 and request higher limits. This allows you to use max_workers=20 without hitting the 15 requests/minute cap.
Option B: Adjust workers and add delays
If you don’t have Tier 2 quota:
import os
use_tier2 = os.environ.get('VERTEX_AI_TIER2', 'false').lower() == 'true'
    result = lx.extract(
        text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
        model_id="gemini-2.5-flash",
        extraction_passes=3,
        max_workers=20 if use_tier2 else 5,  # Respect rate limits
        language_model_params={
            "vertexai": True,
            "batch": {"enabled": True},
        },
    )

With max_workers=5, processing takes longer but won’t hit rate limits.
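For the "add delays" part, one common approach is to retry with exponential backoff when the rate limit error comes back. This is a minimal sketch under stated assumptions: `call_with_backoff` is a made-up helper, not a LangExtract function, and it recognizes rate-limit errors by looking for "Resource exhausted" or "429" in the message, which may not match every error the client raises.

```python
import random
import time


def call_with_backoff(fn, max_retries=5, base_delay=4.0, sleep=time.sleep):
    """Call fn(); on a rate-limit error, wait and retry with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            # Assumption: rate-limit errors can be recognized by their message.
            message = str(exc)
            retryable = "Resource exhausted" in message or "429" in message
            if not retryable or attempt == max_retries:
                raise
            # Delays grow as 4s, 8s, 16s, ... plus up to 1s of random jitter.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            sleep(delay)


# Intended usage (not executed here):
# result = call_with_backoff(lambda: lx.extract(text_or_documents=url, ...))
```

Note that wrapping the whole lx.extract call this way repeats all chunks on a retry, so it works best together with a low max_workers value.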
Solution #4: Control chunking to avoid memory issues
For very large documents, control the chunking manually:
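    import langextract as lx

    # Plain-Python paragraph splitter; it does not rely on any LangExtract chunking API.
    def split_paragraphs(text, max_chunk_size=3000):
        """Group paragraphs into chunks of at most max_chunk_size characters."""
        chunks, current = [], ""
        for para in text.split("\n\n"):  # Break at paragraphs
            if current and len(current) + len(para) + 2 > max_chunk_size:
                chunks.append(current)
                current = para
            else:
                current = f"{current}\n\n{para}" if current else para
        if current:
            chunks.append(current)  # A single oversized paragraph stays whole
        return chunks

    # Process in controlled chunks
    chunks = split_paragraphs(text)
    extractions = []

    for i, chunk in enumerate(chunks):
        print(f"Processing chunk {i+1}/{len(chunks)}")
        result = lx.extract(
            text_or_documents=chunk,
            prompt_description="Extract character names and locations",
            examples=[...],
            model_id="gemini-2.5-flash",
            extraction_passes=1,  # Single pass for individual chunks
        )
        # Merge results: collect the extractions from every chunk
        extractions.extend(result.extractions)

This gives you control over memory usage and processing speed. Keep in mind that each extraction's character offsets are relative to its own chunk, not to the full document. If you only need smaller chunks, the max_char_buffer argument of lx.extract sets the chunk size without any manual splitting.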
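Merging per-chunk results usually also means removing duplicates, because a character like Romeo shows up in many chunks. A minimal sketch, assuming each extraction has been reduced to a (kind, text) pair; the `Entity` type and `merge_entities` helper are hypothetical and are not LangExtract's own types or functions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical simplified record; not LangExtract's own type."""
    kind: str  # e.g. "character" or "location"
    text: str


def merge_entities(per_chunk):
    """Flatten per-chunk results, dropping case-insensitive duplicates.

    The first occurrence of each (kind, text) pair is kept, in order.
    """
    seen = set()
    merged = []
    for entities in per_chunk:
        for entity in entities:
            key = (entity.kind, entity.text.strip().lower())
            if key not in seen:
                seen.add(key)
                merged.append(entity)
    return merged


per_chunk = [
    [Entity("character", "Romeo"), Entity("location", "Verona")],
    [Entity("character", "ROMEO"), Entity("character", "Juliet")],
]
for entity in merge_entities(per_chunk):
    print(entity.kind, entity.text)
# prints:
# character Romeo
# location Verona
# character Juliet
```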
Solution #5: Use parallel processing for speed
I benchmarked serial vs parallel processing on Romeo & Juliet:
    import time
    import langextract as lx

    # Serial processing
    start = time.time()
    result_serial = lx.extract(
        text_or_documents=large_text,
        max_workers=1,  # Serial
        extraction_passes=1,
    )
    serial_time = time.time() - start

    # Parallel processing
    start = time.time()
    result_parallel = lx.extract(
        text_or_documents=large_text,
        max_workers=20,  # Parallel chunks
        extraction_passes=1,
    )
    parallel_time = time.time() - start

    print(f"Serial: {serial_time:.1f}s")
    print(f"Parallel (20 workers): {parallel_time:.1f}s")
    print(f"Speedup: {serial_time/parallel_time:.1f}x")

Results:

    Serial: 340s
    Parallel (20 workers): 45s
    Speedup: 7.5x faster

Parallel processing is 7.5x faster, but only works if you have Tier 2 quota.
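The start/stop pattern in the benchmark can be folded into a small helper. This is a sketch with a made-up name, `time_call`; it uses time.perf_counter, which is a monotonic clock and therefore a safer choice for measuring elapsed time than time.time.

```python
import time


def time_call(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in the benchmark this would be the lx.extract call.
result, elapsed = time_call(sum, range(1_000_000))
print(f"Result: {result}, elapsed: {elapsed:.3f}s")
```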
Complete optimized function
Here’s a production-ready function that combines all optimizations:
    import os

    import langextract as lx

    def process_large_document(
        url: str,
        prompt_description: str,
        examples: list,
        model_id: str = "gemini-2.5-flash",
        use_batch: bool = True,
        passes: int = 3,
    ):
        """Process large document with cost and speed optimization."""

        # Check for Tier 2 quota
        use_tier2 = os.environ.get("VERTEX_AI_TIER2", "false").lower() == "true"

        # Configure parameters
        params = {
            "text_or_documents": url,
            "prompt_description": prompt_description,
            "examples": examples,
            "model_id": model_id,
            "extraction_passes": passes,
            "max_workers": 20 if use_tier2 else 5,
            "max_char_buffer": 1000,  # Smaller contexts
            "language_model_params": {},
        }

        # Enable batch API for cost savings
        if use_batch:
            params["language_model_params"]["vertexai"] = True
            params["language_model_params"]["batch"] = {"enabled": True}

        # Show configuration
        print(f"Processing with {model_id}...")
        print(f"Mode: {'Batch (50% cheaper)' if use_batch else 'Standard'}")
        print(f"Workers: {params['max_workers']}")
        print(f"Passes: {passes}")

        # Run extraction
        result = lx.extract(**params)

        # Show results
        print(f"\nExtracted {len(result.extractions)} entities")
        print(f"Source text: {len(result.text):,} characters")

        return result

    # Usage
    result = process_large_document(
        url="https://www.gutenberg.org/files/1513/1513-0.txt",
        prompt_description="Extract character names and locations",
        examples=[...],
        model_id="gemini-2.5-flash",
        use_batch=True,
        passes=3,
    )

Decision guide
Here’s how to choose the right settings:
Large Document (>50K tokens)
├── Need max accuracy?
│   ├── Yes: gemini-2.5-pro + Batch API
│   └── No: gemini-2.5-flash (default)
├── Have Tier 2 quota?
│   ├── Yes: Use max_workers=20
│   └── No: Use max_workers=5
└── Budget constraints?
    ├── Yes: gemini-2.0-flash (cheapest, good accuracy)
    └── No: Use hybrid (Flash first, Pro validate)

The reason
The rate limit happens because LangExtract processes chunks in parallel. Each chunk sends an API request. With 20 workers and large documents, you quickly exceed the 15 requests/minute default quota.
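A back-of-the-envelope sketch of why the quota is exceeded so quickly. It assumes one API request per chunk per pass and chunks of exactly max_char_buffer characters (1000 here, the value used in the complete function above); a real chunker that respects sentence boundaries would likely produce somewhat more chunks. Both helper names are made up for illustration.

```python
import math


def estimate_requests(num_chars: int, max_char_buffer: int = 1000, passes: int = 1) -> int:
    """Rough request count, assuming one API request per chunk per pass."""
    chunks = math.ceil(num_chars / max_char_buffer)
    return chunks * passes


def min_minutes_at_quota(requests: int, requests_per_minute: int = 15) -> float:
    """Lower bound on wall-clock minutes if the quota is the only bottleneck."""
    return requests / requests_per_minute


# Romeo & Juliet: ~147K characters, 3 extraction passes
requests = estimate_requests(147_000, max_char_buffer=1000, passes=3)
print(f"Requests: {requests}")
print(f"Minimum time at 15 requests/minute: {min_minutes_at_quota(requests):.1f} min")
# prints:
# Requests: 441
# Minimum time at 15 requests/minute: 29.4 min
```

Under these assumptions, no number of workers can finish faster than the quota allows, which is why a higher quota matters more than more parallelism.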
The cost blowout occurs because:
- Large documents = many tokens
- Pro models cost 12x more than Flash
- Multiple extraction passes multiply the cost
The solutions work by:
- Batch API: Processes requests asynchronously at 50% discount
- Cost estimation: Lets you choose the right model before running
- Worker adjustment: Keeps the number of concurrent requests low enough to stay under the rate limit
- Manual chunking: Controls memory and processing speed
Summary
In this post, I showed how to optimize LangExtract for large documents. The key points are:
- Calculate costs before running to choose the right model
- Use Batch API for 50% cost savings on large documents
- Apply for Tier 2 quota or reduce workers to avoid rate limits
- Use parallel processing (with Tier 2) for 7.5x speed improvement
- Control chunking manually for very large documents
For Romeo & Juliet, these optimizations reduced cost from $27 (my first attempt with gemini-2.5-pro) to $0.10 (gemini-2.5-flash with the Batch API) and processing time from 340s to 45s.
Final Words + More Resources
My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can contact me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 Google Vertex AI Tier 2 Quota
- 👨💻 Gemini Pricing (Jan 2025)
- 👨💻 LangExtract Documentation
- 👨💻 Project Gutenberg Romeo & Juliet
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!