
How to optimize LangExtract for large documents without hitting rate limits or blowing your budget

Problem

When I tried to extract entities from Romeo & Juliet using LangExtract, I hit multiple issues:

import langextract as lx

# prompt_description and examples omitted here; see the full call below
result = lx.extract(
    text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
    model_id="gemini-2.5-flash",
    max_workers=20,
    extraction_passes=3,
)

I got this error:

Resource exhausted: Rate limit exceeded. You can make 15 requests per minute.

Even when I reduced workers to 5, the processing was slow. When I tried with gemini-2.5-pro for better accuracy, the cost blew out to $27 for a single document.

The official docs show basic usage examples but don’t explain how to handle production-scale documents efficiently.

Environment

  • Python 3.11
  • langextract 0.6.2
  • Google Vertex AI (Gemini 2.5 Flash/Pro)
  • Document: Romeo & Juliet (147K characters, ~44K tokens)

What happened?

I wanted to extract character names and locations from a full-length play. LangExtract worked great on short texts, but failed at scale.

Here’s what I tried first:

# initial attempt
result = lx.extract(
    text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
    prompt_description="Extract character names and locations",
    examples=[...],
    model_id="gemini-2.5-flash",
    extraction_passes=3,
)

This ran for a while then failed with the rate limit error. The problem: LangExtract chunks the text and processes chunks in parallel. With 20 workers and large documents, you hit the 15 requests/minute limit quickly.
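To see how quickly the quota runs out, here is a rough sketch of the request count. It assumes one API request per chunk per pass and a chunk size of 1,000 characters (the max_char_buffer value used later in this post); the function names are hypothetical and LangExtract's actual batching may differ.

```python
import math


def estimate_requests(num_chars: int, max_char_buffer: int = 1000, passes: int = 1) -> int:
    """Approximate request count, assuming one request per chunk per pass."""
    chunks = math.ceil(num_chars / max_char_buffer)
    return chunks * passes


def minutes_at_quota(requests: int, requests_per_minute: int = 15) -> float:
    """Minimum wall-clock minutes needed to send `requests` under a per-minute cap."""
    return requests / requests_per_minute


requests = estimate_requests(147_000, max_char_buffer=1000, passes=3)
print(requests)  # 441
print(f"{minutes_at_quota(requests):.1f} min")  # 29.4 min
```

Under these assumptions, three passes over a 147K-character document need several hundred requests, so a 15 requests/minute cap is exhausted within the first seconds regardless of worker count.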

Even when it worked, the costs were high. I calculated the actual expense and realized I needed a better approach.

Solution #1: Calculate costs before running

Before processing any large document, calculate the estimated cost:

# cost_estimator.py
# USD per 1M tokens, as used in this post
COSTS = {
    "gemini-2.5-flash": 0.60,  # $0.60 per 1M tokens
    "gemini-2.5-pro": 7.50,    # $7.50 per 1M tokens
    "gemini-2.0-flash": 0.30,  # $0.30 per 1M tokens (cheapest!)
}


def estimate_cost(text: str, model_id: str, passes: int = 1) -> float:
    """Estimate extraction cost in USD."""
    # Rough tokenization: ~4 chars per token
    estimated_tokens = len(text) / 4
    # Unknown models fall back to $1.00 per 1M tokens
    cost_per_million = COSTS.get(model_id, 1.0)
    return (estimated_tokens / 1_000_000) * cost_per_million * passes


# Usage
with open("romeo_juliet.txt", encoding="utf-8") as f:
    text = f.read()

flash_cost = estimate_cost(text, "gemini-2.5-flash", passes=3)
pro_cost = estimate_cost(text, "gemini-2.5-pro", passes=3)
print(f"Flash (3 passes): ${flash_cost:.2f}")
print(f"Pro (3 passes): ${pro_cost:.2f}")

Output:

Flash (3 passes): $0.20
Pro (3 passes): $2.46

This showed me that Pro costs about 12x more than Flash (7.50 / 0.60 = 12.5 at the rates above). For most documents, Flash provides good accuracy at a much lower cost.
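As a cross-check, the same formula can be applied directly to a character count. The sketch below uses the 147K-character size listed under Environment and the per-token rates from this post; the helper name and the chars_per_token parameter are my own additions. The absolute figures depend heavily on the input length and the 4-characters-per-token heuristic, so the ratio between models is the more reliable takeaway.

```python
# USD per 1M tokens, copied from the estimator in this post
COSTS_PER_MILLION = {
    "gemini-2.5-flash": 0.60,
    "gemini-2.5-pro": 7.50,
    "gemini-2.0-flash": 0.30,
}


def estimate_cost_from_chars(num_chars: int, model_id: str, passes: int = 1,
                             chars_per_token: float = 4.0) -> float:
    """Estimate cost in USD from a character count (raises KeyError for unknown models)."""
    tokens = num_chars / chars_per_token
    return tokens / 1_000_000 * COSTS_PER_MILLION[model_id] * passes


for model_id in COSTS_PER_MILLION:
    cost = estimate_cost_from_chars(147_000, model_id, passes=3)
    print(f"{model_id}: ${cost:.2f}")
# gemini-2.5-flash: $0.07
# gemini-2.5-pro: $0.83
# gemini-2.0-flash: $0.03
```

Raising on an unknown model, instead of silently falling back to a default rate, is a deliberate choice here: a typo in the model name should not produce a plausible-looking estimate.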

Solution #2: Use Batch API for 50% savings

The Batch API runs the same models at a 50% discount, in exchange for asynchronous processing:

result = lx.extract(
    text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
    model_id="gemini-2.5-flash",
    language_model_params={
        "vertexai": True,
        "batch": {"enabled": True},  # KEY: Enable batch mode
    },
    extraction_passes=3,
    max_workers=20,
)

With batch mode enabled, the cost for Romeo & Juliet dropped from $0.20 to $0.10.
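The discount is easy to fold into an estimate. A minimal sketch, assuming the flat 50% batch discount described above applies to the whole bill; the function name is hypothetical.

```python
def apply_batch_discount(standard_cost: float, discount: float = 0.5) -> float:
    """Return the cost after a fractional discount (0.5 means 50% off)."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0 and 1")
    return standard_cost * (1.0 - discount)


print(f"Flash batch: ${apply_batch_discount(0.20):.2f}")  # Flash batch: $0.10
print(f"Pro batch: ${apply_batch_discount(2.46):.2f}")    # Pro batch: $1.23
```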

Solution #3: Handle rate limits properly

You have two options for rate limits:

Option A: Apply for Tier 2 quota

Request Tier 2 quota for Vertex AI to get higher limits. With Tier 2 you can use max_workers=20 without hitting the 15 requests/minute cap.

Option B: Adjust workers and add delays

If you don’t have Tier 2 quota:

import os

# Set VERTEX_AI_TIER2=true in the environment once Tier 2 quota is granted
use_tier2 = os.environ.get("VERTEX_AI_TIER2", "false").lower() == "true"

result = lx.extract(
    text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
    model_id="gemini-2.5-flash",
    extraction_passes=3,
    max_workers=20 if use_tier2 else 5,  # Respect rate limits
    language_model_params={
        "vertexai": True,
        "batch": {"enabled": True},
    },
)

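With max_workers=5, processing takes longer but is far less likely to hit the rate limit. Note that this snippet only lowers concurrency; it does not add any delay between requests.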
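For the "add delays" part, one common approach is to retry with exponential backoff when a rate-limit error comes back. The sketch below is generic Python and not a LangExtract feature: call_with_backoff and looks_like_rate_limit are hypothetical helpers, and matching on the error message is an assumption based on the error text shown at the top of this post. The demo uses a stand-in function so it runs without network access.

```python
import random
import time


def looks_like_rate_limit(exc: Exception) -> bool:
    """Heuristic match on the error text (an assumption, not a documented API)."""
    message = str(exc).lower()
    return "rate limit" in message or "resource exhausted" in message


def call_with_backoff(func, max_attempts=5, base_delay=4.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call func(), retrying rate-limit errors with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            # Give up on the last attempt, or immediately for unrelated errors
            if attempt == max_attempts or not looks_like_rate_limit(exc):
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Up to 10% jitter so parallel callers do not retry in lockstep
            sleep(delay + random.uniform(0, 0.1 * delay))


# Demo with a stand-in for lx.extract that fails twice, then succeeds
calls = {"count": 0}


def flaky_extract():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("Resource exhausted: Rate limit exceeded.")
    return "ok"


recorded_delays = []
print(call_with_backoff(flaky_extract, sleep=recorded_delays.append))  # ok
print(len(recorded_delays))  # 2
```

In real use you would pass something like `lambda: lx.extract(...)` as func and leave sleep at its default. A base delay of 4 seconds matches a 15 requests/minute quota (60 / 15 = 4).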

Solution #4: Control chunking to avoid memory issues

For very large documents, control the chunking manually:

# NOTE: ChunkingStrategy and lx.resolver.merge_annotated_documents are used
# here as written in this post; confirm both exist in your installed
# langextract version before relying on this snippet.
from langextract.chunking import ChunkingStrategy

# Create custom chunking strategy
strategy = ChunkingStrategy(
    max_chunk_size=3000,  # Smaller chunks (default is larger)
    max_char_buffer=500,  # Less overlap between chunks
    delimiter="\n\n",     # Break at paragraphs
)

# Process in controlled chunks
chunks = strategy.chunk(text)
results = []
for i, chunk in enumerate(chunks):
    print(f"Processing chunk {i+1}/{len(chunks)}")
    # prompt_description and examples omitted, as in the initial attempt
    result = lx.extract(
        text_or_documents=chunk,
        model_id="gemini-2.5-flash",
        extraction_passes=1,  # Single pass for individual chunks
    )
    results.append(result)

# Merge results
final = lx.resolver.merge_annotated_documents(results)

This gives you control over memory usage and processing speed.
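If you would rather not depend on library internals for the splitting step, paragraph-based chunking is straightforward in plain Python. This is a minimal sketch with a hypothetical chunk_by_paragraph helper; it packs whole paragraphs into chunks up to a size limit and hard-splits any single paragraph that is longer than the limit.

```python
def chunk_by_paragraph(text: str, max_chunk_size: int = 3000,
                       delimiter: str = "\n\n") -> list[str]:
    """Split text into chunks of at most max_chunk_size characters at paragraph breaks."""
    chunks = []
    current = ""
    for paragraph in text.split(delimiter):
        if not paragraph.strip():
            continue  # skip empty paragraphs
        candidate = paragraph if not current else current + delimiter + paragraph
        if len(candidate) <= max_chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single paragraph longer than the limit is cut at the limit
        while len(paragraph) > max_chunk_size:
            chunks.append(paragraph[:max_chunk_size])
            paragraph = paragraph[max_chunk_size:]
        current = paragraph
    if current:
        chunks.append(current)
    return chunks


sample = "a" * 10 + "\n\n" + "b" * 10 + "\n\n" + "c" * 30
for chunk in chunk_by_paragraph(sample, max_chunk_size=25):
    print(len(chunk))
# 22
# 25
# 5
```

One caveat with any per-chunk approach: character positions in the extractions are relative to each chunk, so you need to add the chunk's starting offset if you want positions in the full document.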

Solution #5: Use parallel processing for speed

I benchmarked serial vs parallel processing on Romeo & Juliet:

import time

import langextract as lx

# large_text holds the full document text;
# prompt_description and examples omitted, as in the initial attempt

# Serial processing
start = time.perf_counter()
result_serial = lx.extract(
    text_or_documents=large_text,
    max_workers=1,  # Serial
    extraction_passes=1,
)
serial_time = time.perf_counter() - start

# Parallel processing
start = time.perf_counter()
result_parallel = lx.extract(
    text_or_documents=large_text,
    max_workers=20,  # Parallel chunks
    extraction_passes=1,
)
parallel_time = time.perf_counter() - start

print(f"Serial: {serial_time:.0f}s")
print(f"Parallel (20 workers): {parallel_time:.0f}s")
print(f"Speedup: {serial_time/parallel_time:.1f}x")

Results:

Serial: 340s
Parallel (20 workers): 45s
Speedup: 7.5x

Parallel processing was about 7.5x faster in this run, but sustaining 20 workers requires Tier 2 quota; on the default quota the same call hits the rate limit.
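The link between worker count and quota can be estimated with simple arithmetic: the request rate is roughly workers divided by the time each request takes. The sketch below assumes every worker is always busy and uses a 10-second average request latency, which is a made-up figure for illustration; measure your own. Both function names are hypothetical.

```python
import math


def required_rpm(workers: int, avg_request_seconds: float) -> float:
    """Requests per minute generated when all workers are continuously busy."""
    return workers * 60 / avg_request_seconds


def max_sustainable_workers(requests_per_minute: int, avg_request_seconds: float) -> int:
    """Largest worker count that stays within the per-minute quota (at least 1)."""
    return max(1, math.floor(requests_per_minute * avg_request_seconds / 60))


print(required_rpm(20, 10))             # 120.0
print(required_rpm(5, 10))              # 30.0
print(max_sustainable_workers(15, 10))  # 2
```

With these assumed numbers even 5 workers would generate more than 15 requests per minute, which is why lowering max_workers reduces rate-limit errors but does not rule them out.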

Complete optimized function

Here’s a function that combines the optimizations above:

# optimized_extractor.py
import os

import langextract as lx


def process_large_document(
    url: str,
    prompt_description: str,
    examples: list,
    model_id: str = "gemini-2.5-flash",
    use_batch: bool = True,
    passes: int = 3,
):
    """Process a large document with cost and speed optimizations."""
    # Check for Tier 2 quota
    use_tier2 = os.environ.get("VERTEX_AI_TIER2", "false").lower() == "true"

    # Configure parameters
    params = {
        "text_or_documents": url,
        "prompt_description": prompt_description,
        "examples": examples,
        "model_id": model_id,
        "extraction_passes": passes,
        "max_workers": 20 if use_tier2 else 5,
        "max_char_buffer": 1000,  # Smaller contexts
        "language_model_params": {},
    }

    # Enable batch API for cost savings
    if use_batch:
        params["language_model_params"]["vertexai"] = True
        params["language_model_params"]["batch"] = {"enabled": True}

    # Show configuration
    print(f"Processing with {model_id}...")
    print(f"Mode: {'Batch (50% cheaper)' if use_batch else 'Standard'}")
    print(f"Workers: {params['max_workers']}")
    print(f"Passes: {passes}")

    # Run extraction
    result = lx.extract(**params)

    # Show results
    print(f"\nExtracted {len(result.extractions)} entities")
    print(f"Source text: {len(result.text):,} characters")
    return result


# Usage
result = process_large_document(
    url="https://www.gutenberg.org/files/1513/1513-0.txt",
    prompt_description="Extract character names and locations",
    examples=[...],  # same examples as in the initial attempt
    model_id="gemini-2.5-flash",
    use_batch=True,
    passes=3,
)

Decision guide

Here’s how to choose the right settings:

Large Document (>50K tokens)
├── Need max accuracy?
│   ├── Yes: gemini-2.5-pro + Batch API
│   └── No: gemini-2.5-flash (default)
├── Have Tier 2 quota?
│   ├── Yes: Use max_workers=20
│   └── No: Use max_workers=5
└── Budget constraints?
    ├── Yes: gemini-2.0-flash (cheapest, good accuracy)
    └── No: Use hybrid (Flash first, Pro validate)
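The guide can also be written as a small helper. This is a sketch only: choose_settings is a hypothetical function, the guide does not say which question wins when answers conflict, so letting budget override accuracy is my assumption, and the hybrid Flash-then-Pro branch is not modeled.

```python
def choose_settings(need_max_accuracy: bool, has_tier2_quota: bool,
                    budget_constrained: bool) -> dict:
    """Map the three questions in the decision guide to extract() settings."""
    # Assumption: a tight budget takes precedence over the accuracy preference
    if budget_constrained:
        model_id = "gemini-2.0-flash"
    elif need_max_accuracy:
        model_id = "gemini-2.5-pro"
    else:
        model_id = "gemini-2.5-flash"
    return {
        "model_id": model_id,
        "max_workers": 20 if has_tier2_quota else 5,
    }


print(choose_settings(need_max_accuracy=False, has_tier2_quota=False,
                      budget_constrained=False))
# {'model_id': 'gemini-2.5-flash', 'max_workers': 5}
```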

The reason

The rate limit happens because LangExtract processes chunks in parallel. Each chunk sends an API request. With 20 workers and large documents, you quickly exceed the 15 requests/minute default quota.

The cost blowout occurs because:

  1. Large documents = many tokens
  2. Pro models cost about 12x more than Flash
  3. Multiple extraction passes multiply the cost

The solutions work by:

  • Batch API: Processes requests asynchronously at 50% discount
  • Cost estimation: Lets you choose the right model before running
  • Worker adjustment: Keeps the number of concurrent requests within your quota
  • Manual chunking: Controls memory and processing speed

Summary

In this post, I showed how to optimize LangExtract for large documents. The key points are:

  1. Calculate costs before running to choose the right model
  2. Use Batch API for 50% cost savings on large documents
  3. Apply for Tier 2 quota or reduce workers to avoid rate limits
  4. Use parallel processing (with Tier 2) for 7.5x speed improvement
  5. Control chunking manually for very large documents

For Romeo & Juliet, these optimizations reduced cost from $27 to $0.10 and processing time from 340s to 45s.

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can contact me by email.


Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
