How to optimize LangExtract for large documents without hitting rate limits or blowing your budget

Feb 12, 2026

Problem

When I tried to extract entities from Romeo & Juliet using LangExtract, I hit multiple issues:

import langextract as lx

result = lx.extract(
    text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
    model_id="gemini-2.5-flash",
    max_workers=20,
    extraction_passes=3
)

I got this error:

Resource exhausted: Rate limit exceeded. You can make 15 requests per minute.

Even when I reduced workers to 5, the processing was slow. When I tried with gemini-2.5-pro for better accuracy, the cost blew out to $27 for a single document.

The official docs show basic usage examples but don’t explain how to handle production-scale documents efficiently.

Environment

Python 3.11
langextract 0.6.2
Google Vertex AI (Gemini 2.5 Flash/Pro)
Document: Romeo & Juliet (147K characters, ~44K tokens)

What happened?

I wanted to extract character names and locations from a full-length play. LangExtract worked great on short texts, but failed at scale.

Here’s what I tried first:

result = lx.extract(
    text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
    prompt_description="Extract character names and locations",
    examples=[...],
    model_id="gemini-2.5-flash",
    extraction_passes=3
)

This ran for a while, then failed with the rate limit error. The problem: LangExtract splits the text into chunks and sends one request per chunk, several at a time. With many workers (20 in my first run) and a large document, you exceed the 15 requests/minute limit almost immediately.
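To see why the limit is reached so quickly, it helps to count requests. The sketch below is a back-of-the-envelope estimate, not something LangExtract reports. It assumes one API request per chunk per pass, and the chunk size (`chunk_chars`) and both helper names are illustrative assumptions.

```python
import math

def estimate_requests(num_chars: int, chunk_chars: int, passes: int = 1) -> int:
    """Estimate API requests, assuming one request per chunk per pass."""
    chunks = math.ceil(num_chars / chunk_chars)
    return chunks * passes

def min_minutes_at_limit(requests: int, requests_per_minute: int) -> float:
    """Lower bound on wall-clock minutes when capped at a fixed request rate."""
    return requests / requests_per_minute

# Document size from the Environment section; the 1,000-character chunk size is an assumption.
requests = estimate_requests(num_chars=147_000, chunk_chars=1_000, passes=3)
minutes = min_minutes_at_limit(requests, requests_per_minute=15)
print(f"{requests} requests, at least {minutes:.1f} minutes at 15 requests/minute")
```

Under these assumptions the run needs 441 requests, which cannot finish in less than 29.4 minutes at 15 requests/minute, no matter how many workers are used.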

Even when it worked, the costs were high. I calculated the actual expense and realized I needed a better approach.

Solution #1: Calculate costs before running

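Before processing any large document, estimate the cost. The rates below are the flat per-token figures I used; actual pricing distinguishes input and output tokens and changes over time, so check the current price list before relying on them: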

COSTS = {
    "gemini-2.5-flash": 0.60,     # $0.60 per 1M tokens
    "gemini-2.5-pro": 7.50,        # $7.50 per 1M tokens
    "gemini-2.0-flash": 0.30,      # $0.30 per 1M tokens (cheapest!)
}

def estimate_cost(text: str, model_id: str, passes: int = 1) -> float:
    """Estimate extraction cost in USD."""
    # Rough tokenization: ~4 chars per token
    estimated_tokens = len(text) / 4
    cost_per_million = COSTS.get(model_id, 1.0)
    return (estimated_tokens / 1_000_000) * cost_per_million * passes

# Usage
with open("romeo_juliet.txt", encoding="utf-8") as f:
    text = f.read()
flash_cost = estimate_cost(text, "gemini-2.5-flash", passes=3)
pro_cost = estimate_cost(text, "gemini-2.5-pro", passes=3)

print(f"Flash (3 passes): ${flash_cost:.2f}")
print(f"Pro (3 passes): ${pro_cost:.2f}")

Output:

Flash (3 passes): $0.20
Pro (3 passes): $2.46

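This showed me that Pro costs about 12x more than Flash (7.50 / 0.60 = 12.5 at the rates above). For most documents, Flash provides good accuracy at a much lower cost. Keep in mind that this estimate counts only the document’s own tokens; the prompt, the few-shot examples sent with every chunk, and the generated output are billed as well, so real costs will be higher.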

Solution #2: Use Batch API for 50% savings

The Batch API provides the same model quality at roughly half the price. The trade-off is that requests are processed asynchronously, so results take longer to come back:

result = lx.extract(
    text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
    model_id="gemini-2.5-flash",
    language_model_params={
        "vertexai": True,
        "batch": {"enabled": True}  # KEY: Enable batch mode
    },
    extraction_passes=3,
    max_workers=20
)

With batch mode enabled, the cost for Romeo & Juliet dropped from $0.20 to $0.10.
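The saving can be folded into the cost estimate from Solution #1. This is a minimal sketch, assuming a flat 50% batch discount and the same flat per-million-token rate as before. The function name, the `batch_discount` parameter, and the 400,000-character input are illustrative, not part of LangExtract or my measurements.

```python
def estimate_cost_with_batch(
    num_chars: int,
    cost_per_million: float,
    passes: int = 1,
    use_batch: bool = False,
    batch_discount: float = 0.5,   # Assumption: flat 50% discount in batch mode
    chars_per_token: float = 4.0,  # Rough heuristic, same as in Solution #1
) -> float:
    """Estimate extraction cost in USD, optionally applying a batch discount."""
    tokens = num_chars / chars_per_token
    cost = (tokens / 1_000_000) * cost_per_million * passes
    return cost * (1 - batch_discount) if use_batch else cost

# Illustrative input: 400,000 characters at $0.60 per 1M tokens, 3 passes
standard = estimate_cost_with_batch(400_000, 0.60, passes=3)
batch = estimate_cost_with_batch(400_000, 0.60, passes=3, use_batch=True)
print(f"Standard: ${standard:.2f}, Batch: ${batch:.2f}")
```

Keeping the discount as a parameter makes it easy to adjust if the actual batch pricing differs from 50%.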

Solution #3: Handle rate limits properly

You have two options for rate limits:

Option A: Apply for Tier 2 quota

Visit the Vertex AI Tier 2 page and request higher limits. With the higher quota you can use max_workers=20 without hitting the 15 requests/minute cap.

Option B: Adjust workers and add delays

If you don’t have Tier 2 quota:

import os

use_tier2 = os.environ.get('VERTEX_AI_TIER2', 'false').lower() == 'true'

result = lx.extract(
    text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
    model_id="gemini-2.5-flash",
    extraction_passes=3,
    max_workers=20 if use_tier2 else 5,  # Respect rate limits
    language_model_params={
        "vertexai": True,
        "batch": {"enabled": True}
    }
)

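With max_workers=5, processing takes longer but is much less likely to hit rate limits. It is not a guarantee: at 15 requests/minute, five workers still exceed the cap whenever each request finishes in under 20 seconds.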
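Reducing workers lowers the request rate but does not add any delay between attempts. One way to add delays is to wrap the call in a retry loop with exponential backoff. This is a generic sketch, not a LangExtract feature: `retry_with_backoff` and `looks_like_rate_limit` are hypothetical helpers, and matching on the error message is an assumption, since the exact exception type raised on a rate limit depends on the provider.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def looks_like_rate_limit(exc: Exception) -> bool:
    """Heuristic (assumption): detect rate-limit errors from the message text."""
    message = str(exc).lower()
    return (
        "resource exhausted" in message
        or "rate limit" in message
        or "429" in message
    )

def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 4.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying rate-limit errors with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception as exc:
            # Re-raise anything that is not a rate limit, or when out of attempts.
            if not looks_like_rate_limit(exc) or attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, 1))  # Jitter avoids synchronized retries
    raise RuntimeError("max_attempts must be at least 1")

# Usage (hypothetical): result = retry_with_backoff(lambda: lx.extract(...))
```

The `sleep` parameter is injectable so the helper can be tested without real waiting. Note that a retry restarts the whole call it wraps, so wrapping a full-document extraction repeats work that had already succeeded.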

Solution #4: Control chunking to avoid memory issues

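For very large documents, you can split the text yourself and process one piece at a time. Note that max_char_buffer, the lx.extract argument, sets the maximum number of characters per chunk sent to the model; it is not an overlap setting. The split below is plain Python, so it does not depend on LangExtract’s chunking internals:

import langextract as lx

def split_paragraphs(text: str, max_chars: int = 3000) -> list[str]:
    """Group paragraphs into pieces of at most max_chars characters.

    A single paragraph longer than max_chars is kept as one oversized piece.
    """
    pieces, current = [], ""
    for para in text.split("\n\n"):  # Break at paragraphs
        if current and len(current) + len(para) + 2 > max_chars:
            pieces.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        pieces.append(current)
    return pieces

# Process in controlled pieces
pieces = split_paragraphs(text)
extractions = []

for i, piece in enumerate(pieces):
    print(f"Processing piece {i+1}/{len(pieces)}")
    result = lx.extract(
        text_or_documents=piece,
        prompt_description="Extract character names and locations",
        examples=[...],
        model_id="gemini-2.5-flash",
        extraction_passes=1  # Single pass for individual pieces
    )
    # Collect results from every piece
    extractions.extend(result.extractions)

This gives you control over memory usage and processing speed. Character offsets in each result are relative to the piece it came from, so shift them by the piece’s start position if you need positions in the full document.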

Solution #5: Use parallel processing for speed

I benchmarked serial vs parallel processing on Romeo & Juliet:

import time
import langextract as lx

# `text` is the document string loaded in Solution #1
common = dict(
    text_or_documents=text,
    prompt_description="Extract character names and locations",
    examples=[...],
    model_id="gemini-2.5-flash",
    extraction_passes=1,
)

# Serial processing
start = time.perf_counter()
result_serial = lx.extract(**common, max_workers=1)
serial_time = time.perf_counter() - start

# Parallel processing
start = time.perf_counter()
result_parallel = lx.extract(**common, max_workers=20)
parallel_time = time.perf_counter() - start

print(f"Serial: {serial_time:.1f}s")
print(f"Parallel (20 workers): {parallel_time:.1f}s")
print(f"Speedup: {serial_time/parallel_time:.1f}x")

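Results (rounded):

Serial: 340s
Parallel (20 workers): 45s
Speedup: 7.5x

Parallel processing was about 7.5x faster in this run, well short of a linear 20x. It is also only sustainable if your quota allows that request rate, which in practice means Tier 2 quota.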

Complete optimized function

Here’s a function that combines the optimizations above:

import langextract as lx
import os

def process_large_document(
    url: str,
    prompt_description: str,
    examples: list,
    model_id: str = "gemini-2.5-flash",
    use_batch: bool = True,
    passes: int = 3
):
    """Process large document with cost and speed optimization."""

    # Check for Tier 2 quota
    use_tier2 = os.environ.get('VERTEX_AI_TIER2', 'false').lower() == 'true'

    # Configure parameters
    params = {
        "text_or_documents": url,
        "prompt_description": prompt_description,
        "examples": examples,
        "model_id": model_id,
        "extraction_passes": passes,
        "max_workers": 20 if use_tier2 else 5,
        "max_char_buffer": 1000,  # Max characters per chunk sent to the model
        "language_model_params": {}
    }

    # Enable batch API for cost savings
    if use_batch:
        params["language_model_params"]["vertexai"] = True
        params["language_model_params"]["batch"] = {"enabled": True}

    # Show configuration
    print(f"Processing with {model_id}...")
    print(f"Mode: {'Batch (50% cheaper)' if use_batch else 'Standard'}")
    print(f"Workers: {params['max_workers']}")
    print(f"Passes: {passes}")

    # Run extraction
    result = lx.extract(**params)

    # Show results
    print(f"\nExtracted {len(result.extractions)} entities")
    print(f"Source text: {len(result.text):,} characters")

    return result

# Usage
result = process_large_document(
    url="https://www.gutenberg.org/files/1513/1513-0.txt",
    prompt_description="Extract character names and locations",
    examples=[...],
    model_id="gemini-2.5-flash",
    use_batch=True,
    passes=3
)

Decision guide

Here’s how to choose the right settings:

Large Document (>50K tokens)
├── Need max accuracy?
│   ├── Yes: gemini-2.5-pro + Batch API
│   └── No: gemini-2.5-flash (default)
├── Have Tier 2 quota?
│   ├── Yes: Use max_workers=20
│   └── No: Use max_workers=5
└── Budget constraints?
    ├── Yes: gemini-2.0-flash (cheapest, good accuracy)
    └── No: Use hybrid (Flash first, Pro validate)
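The same guide can be written as a small helper. This is a sketch under stated assumptions: the tree does not say which question wins when accuracy and budget conflict, so the code checks accuracy first; it also switches batch mode on for tight budgets, which the tree does not state, and it leaves out the hybrid option. `ExtractionSettings` and `choose_settings` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExtractionSettings:
    model_id: str
    max_workers: int
    use_batch: bool

def choose_settings(
    need_max_accuracy: bool,
    has_tier2: bool,
    budget_constrained: bool,
) -> ExtractionSettings:
    """Map the decision guide to concrete settings (accuracy is checked first)."""
    if need_max_accuracy:
        model_id = "gemini-2.5-pro"
    elif budget_constrained:
        model_id = "gemini-2.0-flash"
    else:
        model_id = "gemini-2.5-flash"
    return ExtractionSettings(
        model_id=model_id,
        max_workers=20 if has_tier2 else 5,
        # Assumption: batch mode for Pro (as in the guide) and for tight budgets
        use_batch=need_max_accuracy or budget_constrained,
    )

settings = choose_settings(
    need_max_accuracy=False, has_tier2=False, budget_constrained=True
)
print(settings)
```

Returning a frozen dataclass keeps the chosen settings in one immutable value that can be logged or passed to a wrapper such as process_large_document.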

The reason

The rate limit happens because LangExtract processes chunks in parallel. Each chunk sends an API request. With 20 workers and large documents, you quickly exceed the 15 requests/minute default quota.

The cost blowout occurs because:

Large documents = many tokens
Pro models cost about 12x more than Flash (at the rates used above)
Multiple extraction passes multiply the cost

The solutions work by:

Batch API: Processes requests asynchronously at 50% discount
Cost estimation: Lets you choose the right model before running
Worker adjustment: Keeps the request rate closer to what your quota allows
Manual chunking: Controls memory and processing speed

Summary

In this post, I showed how to optimize LangExtract for large documents. The key points are:

Calculate costs before running to choose the right model
Use Batch API for 50% cost savings on large documents
Apply for Tier 2 quota or reduce workers to avoid rate limits
Use parallel processing (with Tier 2) for 7.5x speed improvement
Control chunking manually for very large documents

For Romeo & Juliet, the cost went from $27 (Pro, standard API) to $0.10 (Flash, Batch API), and parallel processing cut a single extraction pass from 340s to 45s.

Final Words + More Resources

My intention with this article was to share my knowledge and experience and help others. If you want to get in touch, you can contact me by email.
