Should You Use LLMs for Financial PDF Data Extraction?
I stared at the extraction results. Again.
    Revenue: $1,234,567.89
    Cost: ??? (see footnote)
    Profit: ERROR - column mismatch
Sixty percent accuracy. That’s what I got after three weeks of tweaking regex patterns, adjusting pdfplumber settings, and writing custom heuristics for every new PDF format that came my way.
Financial PDFs are a nightmare. Tables span multiple pages. Headers get repeated or disappear entirely. Some documents use commas for thousands separators, others use spaces. And don’t get me started on the “see notes” references that break your parsing logic.
I needed a better solution. So I started experimenting with LLMs.
The Traditional Approach (And Why It Fails)
Let me show you what I was dealing with. Here’s a typical financial statement PDF:
    Q3 2024 Financial Report
    Revenue                 $12,345,678
    Cost of Goods Sold      (8,234,567)
    Gross Profit            -----------
                             $4,111,111
    Operating Expenses:
      Marketing              $1,234,567
      R&D                       987,654
      G&A                       456,789
                            -----------
    Total OpEx               $2,678,910
My initial approach used pdfplumber with regex:
    import pdfplumber
    import re

    def extract_financial_data(pdf_path):
        results = {}
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                text = page.extract_text()
                # Try to find revenue
                revenue_match = re.search(r'Revenue\s+\$?([\d,]+)', text)
                if revenue_match:
                    results['revenue'] = revenue_match.group(1).replace(',', '')

                # Try to find costs
                cost_match = re.search(r'Cost.*?\$?([\d,]+)', text)
                if cost_match:
                    results['cost'] = cost_match.group(1).replace(',', '')

        return results
This worked great for the first ten PDFs. Then I hit the edge cases:
- One PDF had “Revenue (USD)” instead of just “Revenue”
- Another used parentheses for negative numbers: (8,234,567) instead of -8,234,567
- A third had the numbers in a table that pdfplumber couldn’t parse correctly
- Yet another had footnotes: Revenue $12,345,678¹
My accuracy plummeted to 60%. I was spending more time writing exceptions than extracting data.
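Several of these edge cases are really about number formatting, not layout. As a minimal sketch (the `parse_amount` helper and its rules are my own illustration, not part of the original pipeline), a single normalization step can absorb parenthesized negatives, footnote markers, and comma or space thousands separators before any regex or table logic runs. It assumes a dot as the decimal separator, so European-style amounts such as 1.234.567,89 would need extra handling.

```python
import re
from decimal import Decimal
from typing import Optional

# Superscript digits commonly used as footnote markers (e.g. "12,345,678¹").
_FOOTNOTE_MARKS = "⁰¹²³⁴⁵⁶⁷⁸⁹"


def parse_amount(raw: str) -> Optional[Decimal]:
    """Convert a printed amount such as "(8,234,567)" into a signed Decimal.

    Hypothetical helper: returns None when the cleaned text is not a plain number.
    """
    text = raw.strip().rstrip(_FOOTNOTE_MARKS).strip()

    # Accounting notation: a value wrapped in parentheses is negative.
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]

    # Drop currency symbols and thousands separators (commas or any whitespace).
    text = re.sub(r"[$€£,\s]", "", text)

    # Anything that is not an optionally signed decimal number is rejected.
    if not re.fullmatch(r"-?\d+(\.\d+)?", text):
        return None

    value = Decimal(text)
    return -value if negative else value
```

Returning `None` instead of raising keeps “see notes” cells from crashing the run; the caller can treat a missing value as low confidence.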
Enter LLMs
I was skeptical. “As much as I hate it,” I thought, “this is probably a task where LLMs can shine.”
The key insight: LLMs understand context. They can figure out that “Revenue (USD)” and “Total Revenue” mean the same thing. They can handle parenthetical negative numbers. They can even deal with messy table layouts.
First Attempt: Claude Vision
I started with Claude’s vision capabilities:
    import anthropic
    import base64

    def extract_with_claude(pdf_path):
        client = anthropic.Anthropic()

        with open(pdf_path, 'rb') as f:
            pdf_data = base64.b64encode(f.read()).decode()

        message = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "document",
                            "source": {
                                "type": "base64",
                                "media_type": "application/pdf",
                                "data": pdf_data
                            }
                        },
                        {
                            "type": "text",
                            "text": """Extract all financial data from this document.
                            Return as JSON with keys: revenue, cost_of_goods, gross_profit,
                            operating_expenses, net_income.
                            Use numeric values only (no $ or commas)."""
                        }
                    ]
                }
            ]
        )

        return message.content
The results were impressive. Claude Vision extracted data from tables that pdfplumber couldn’t even see. It handled edge cases I hadn’t even considered.
But then I hit a problem.
The Hallucination Problem
One day, my validation script flagged an anomaly:
    Expected revenue: ~$10M
    Extracted revenue: $1,234,567,890,123
Claude had hallucinated extra digits. The actual value was $1,234,567,890, but it added three more digits.
This is the production risk with vision-language models. They can return plausible-looking values that are not in the document, which is very risky when you’re dealing with financial data.
I needed a validation layer.
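One check that targets exactly this kind of failure is to confirm that every extracted number is literally printed somewhere in the document. The sketch below is my own illustration, not the validation the pipeline ends up using. It assumes the PDF has a text layer that you have already pulled out as a string (for example with pdfplumber), and the names `number_tokens` and `is_grounded` are hypothetical.

```python
import re
from decimal import Decimal


def number_tokens(text: str) -> set[Decimal]:
    """Collect every number printed in the text, ignoring thousands commas."""
    tokens = set()
    for match in re.finditer(r"\d[\d,]*(?:\.\d+)?", text):
        tokens.add(Decimal(match.group(0).replace(",", "")))
    return tokens


def is_grounded(value: float, source_text: str) -> bool:
    """Return True when the extracted value appears among the printed numbers.

    The sign is ignored because negatives are often printed as (8,234,567).
    """
    return Decimal(str(abs(value))) in number_tokens(source_text)
```

A value with three invented digits fails this check because no printed number matches it. Scanned PDFs without a text layer would need OCR output instead, so treat this as a complement to range checks and not a replacement.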
Building a Hybrid Pipeline
The solution wasn’t to abandon LLMs or traditional methods. It was to combine them.
    import base64
    import json
    from dataclasses import dataclass

    import anthropic
    import pdfplumber

    @dataclass
    class ExtractionResult:
        revenue: float | None = None
        cost_of_goods: float | None = None
        gross_profit: float | None = None
        operating_expenses: float | None = None
        net_income: float | None = None
        confidence: float = 0.0

    def extract_traditional(pdf_path: str) -> ExtractionResult:
        """First pass: traditional extraction"""
        result = ExtractionResult()

        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                tables = page.extract_tables()
                for table in tables:
                    for row in table:
                        if row and 'revenue' in str(row[0]).lower():
                            try:
                                val = str(row[1]).replace('$', '').replace(',', '')
                                result.revenue = float(val)
                                result.confidence = 0.6
                            except (ValueError, IndexError):
                                # Unparseable or missing cell: leave the field unset
                                pass

        return result

    def extract_with_llm(pdf_path: str, previous_result: ExtractionResult) -> ExtractionResult:
        """Second pass: LLM extraction for low-confidence results"""
        # Same threshold as the caller: 0.8 and above skips the LLM
        if previous_result.confidence >= 0.8:
            return previous_result

        client = anthropic.Anthropic()

        with open(pdf_path, 'rb') as f:
            pdf_data = base64.b64encode(f.read()).decode()

        prompt = f"""Extract financial data from this PDF.

        Previous extraction attempt found:
        - Revenue: {previous_result.revenue}

        Validate and correct these values. Return JSON with:
        - revenue, cost_of_goods, gross_profit, operating_expenses, net_income
        - confidence_score (0.0-1.0)
        - corrections_made (list of fields you changed)
        """

        # ... API call and parsing

        return llm_result

    def validate_result(result: ExtractionResult, pdf_path: str) -> bool:
        """Sanity checks on extracted data"""
        # Check for reasonable ranges
        if result.revenue and result.revenue > 1_000_000_000_000:  # > $1T
            return False

        # Cross-validate: gross profit should be revenue - COGS
        if result.revenue and result.cost_of_goods and result.gross_profit:
            expected = result.revenue - result.cost_of_goods
            if abs(expected - result.gross_profit) > 1000:
                return False

        return True

    def extract_financial_data(pdf_path: str) -> ExtractionResult:
        # Step 1: Try traditional methods
        result = extract_traditional(pdf_path)

        # Step 2: Use LLM for low-confidence extractions
        if result.confidence < 0.8:
            result = extract_with_llm(pdf_path, result)

        # Step 3: Validate
        if not validate_result(result, pdf_path):
            raise ValueError(f"Extraction failed validation for {pdf_path}")

        return result
This hybrid approach gave me the best of both worlds:
- Fast, cheap extraction for well-formatted PDFs (traditional methods)
- Intelligent fallback for messy documents (LLM)
- Validation layer to catch hallucinations
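The pipeline above leaves the API call and response parsing elided. As a rough sketch of what the parsing half could look like (the `parse_llm_reply` helper is hypothetical, and it assumes the model replies with one JSON object, possibly wrapped in extra prose), the idea is to accept only real JSON numbers and turn everything else into `None` so that later validation sees a gap, not a guess.

```python
import json
from typing import Optional

# Field names taken from the prompt used earlier in the article.
FIELDS = ("revenue", "cost_of_goods", "gross_profit",
          "operating_expenses", "net_income")


def parse_llm_reply(reply_text: str) -> dict[str, Optional[float]]:
    """Pull the JSON object out of a model reply and keep numeric fields only.

    Hypothetical helper: raises ValueError when no JSON object can be decoded.
    """
    # Models sometimes add prose around the JSON, so slice from the first
    # opening brace to the last closing brace before decoding.
    start, end = reply_text.find("{"), reply_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    payload = json.loads(reply_text[start:end + 1])

    parsed: dict[str, Optional[float]] = {}
    for field in FIELDS:
        value = payload.get(field)
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        parsed[field] = float(value) if is_number else None
    return parsed
```

Because the keys match the dataclass fields, the result could be passed on as `ExtractionResult(**parsed)`, with the confidence set separately.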
Model Selection: Which LLM to Use?
I tested several models for this task:
Claude Vision (claude-3-5-sonnet)
Pros:
- Excellent at understanding complex table layouts
- Handles multi-page tables well
- Good at inferring context from formatting
Cons:
- More expensive per document
- Occasional hallucinations on numeric data
- Rate limits can be restrictive for batch processing
Best for: Complex layouts, multi-page tables, documents with mixed content
GPT-4 Vision
Pros:
- Strong OCR capabilities
- Good at handling handwritten annotations
- Consistent output formatting
Cons:
- Can struggle with unusual table structures
- Higher latency than Claude
- More expensive than smaller models
Best for: Documents with mixed text and images, handwritten notes
Qwen 14B (Local)
Pros:
- Runs locally, no API costs
- Fast for structured field extraction
- No rate limits
Cons:
- Requires GPU for reasonable speed
- Less capable with complex layouts
- Needs more prompt engineering
Best for: High-volume extraction of structured fields, cost-sensitive applications
    import time
    from dataclasses import dataclass

    @dataclass
    class ModelMetrics:
        accuracy: float
        cost_per_1k_pages: float
        avg_latency_ms: float
        hallucination_rate: float

    # Real results from my testing
    MODEL_PERFORMANCE = {
        "claude-3.5-sonnet": ModelMetrics(
            accuracy=0.94,
            cost_per_1k_pages=15.00,
            avg_latency_ms=1200,
            hallucination_rate=0.02
        ),
        "gpt-4-vision": ModelMetrics(
            accuracy=0.91,
            cost_per_1k_pages=18.00,
            avg_latency_ms=1800,
            hallucination_rate=0.03
        ),
        "qwen-14b-local": ModelMetrics(
            accuracy=0.88,
            cost_per_1k_pages=0.50,  # GPU electricity
            avg_latency_ms=400,
            hallucination_rate=0.05
        ),
        "traditional-only": ModelMetrics(
            accuracy=0.60,
            cost_per_1k_pages=0.01,
            avg_latency_ms=50,
            hallucination_rate=0.00
        )
    }
The Results
After implementing the hybrid pipeline with Qwen 14B as the LLM layer:
    Before (traditional only): 60% accuracy
    After (hybrid with Qwen): 92% accuracy
The remaining 8% failures were:
- 4%: Corrupted PDFs that no method could read
- 2%: Documents in languages other than English
- 2%: Extreme edge cases requiring manual review
Production Architecture
Here’s the final architecture I deployed:
┌─────────────────────────────────────────────────────────────┐
│                        PDF Ingestion                        │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                Traditional Extraction Layer                 │
│     ┌─────────────┐   ┌─────────────┐   ┌─────────────┐     │
│     │ pdfplumber  │   │   PyMuPDF   │   │   Tabula    │     │
│     └─────────────┘   └─────────────┘   └─────────────┘     │
│                  Confidence Score: 0.0-1.0                  │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
                     ┌───────────────────┐
                     │ Confidence > 0.8? │
                     └───────────────────┘
                        │             │
                       Yes            No
                        │             │
                        ▼             ▼
                 ┌─────────────┐  ┌─────────────────────────────┐
                 │   Output    │  │    LLM Extraction Layer     │
                 │   Result    │  │  ┌─────────┐  ┌──────────┐  │
                 └─────────────┘  │  │ Qwen 14B│  │  Claude  │  │
                                  │  │ (local) │  │ (cloud)  │  │
                                  │  └─────────┘  └──────────┘  │
                                  └─────────────────────────────┘
                                                 │
                                                 ▼
                                  ┌─────────────────────────────┐
                                  │      Validation Layer       │
                                  │  - Range checks             │
                                  │  - Cross-field validation   │
                                  │  - Anomaly detection        │
                                  └─────────────────────────────┘
                                                 │
                                         ┌───────┴───────┐
                                        Pass            Fail
                                         │               │
                                         ▼               ▼
                                   ┌───────────┐   ┌───────────┐
                                   │  Output   │   │  Manual   │
                                   │  Result   │   │  Review   │
                                   └───────────┘   └───────────┘
Key Lessons Learned
1. LLMs Are Not a Silver Bullet
They’re a powerful tool in the toolbox, but they need guardrails. The hallucination problem is real, especially with financial data where accuracy is critical.
2. Hybrid Is the Way
Pure traditional methods fail on edge cases. Pure LLM methods are expensive and risky. The hybrid approach gives you:
- Speed and low cost for easy documents
- Intelligence for hard documents
- Validation for safety
3. Model Selection Matters
- Use smaller local models (Qwen 14B) for structured field extraction
- Use vision models (Claude, GPT-4) for complex layouts
- Always have a validation layer
4. Confidence Scores Are Essential
Every extraction should come with a confidence score. Low confidence triggers the LLM fallback. Very low confidence triggers manual review.
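The routing rule in that paragraph can be sketched as a small function. The 0.8 cutoff comes from the pipeline earlier in the article; the 0.3 cutoff for manual review is a placeholder I picked for illustration, and the `Route` enum and `route` function are hypothetical names.

```python
from enum import Enum


class Route(Enum):
    ACCEPT = "accept"
    LLM_FALLBACK = "llm_fallback"
    MANUAL_REVIEW = "manual_review"


LLM_THRESHOLD = 0.8     # same cutoff the hybrid pipeline uses
REVIEW_THRESHOLD = 0.3  # assumed value, tune against your own documents


def route(confidence: float) -> Route:
    """Decide what happens to an extraction based on its confidence score."""
    if confidence < REVIEW_THRESHOLD:
        return Route.MANUAL_REVIEW
    if confidence < LLM_THRESHOLD:
        return Route.LLM_FALLBACK
    return Route.ACCEPT
```

Keeping the thresholds as named constants makes it easy to adjust them once you have measured how often each route is actually correct.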
    def calculate_confidence(traditional_result: dict, llm_result: dict) -> float:
        """Calculate confidence based on agreement between methods"""
        score = 0.0

        # Check if traditional and LLM agree
        for key in ['revenue', 'cost_of_goods', 'net_income']:
            trad_val = traditional_result.get(key)
            llm_val = llm_result.get(key)

            if trad_val and llm_val:
                # Relative difference; abs() keeps the ratio valid for negative values
                rel_diff = abs(trad_val - llm_val) / max(abs(trad_val), abs(llm_val))
                if rel_diff < 0.01:
                    # Values within 1% = high confidence
                    score += 0.3
                elif rel_diff < 0.05:
                    # Values within 5% = medium confidence
                    score += 0.2
                else:
                    # Values disagree = low confidence
                    score += 0.1

        # Check for reasonable ranges (a missing revenue counts as 0)
        if (llm_result.get('revenue') or 0) < 1_000_000_000_000:  # < $1T
            score += 0.1

        return min(score, 1.0)
When to Use LLMs for PDF Extraction
Use LLMs when:
- PDFs have inconsistent formatting
- Tables span multiple pages
- Documents contain mixed content (text, tables, images)
- Traditional methods give < 80% accuracy
- You can tolerate some validation overhead
Stick with traditional methods when:
- PDFs are consistently formatted
- You need 100% accuracy (no hallucination tolerance)
- Processing millions of documents (cost matters)
- Real-time extraction with low latency requirements
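To put a number on the cost side of this trade-off, here is a back-of-the-envelope sketch. It assumes the traditional pass runs on every page and the LLM only on the fraction that falls below the confidence threshold; the function name and the 40% fallback share in the usage note are hypothetical, not measurements from my pipeline.

```python
def blended_cost_per_1k_pages(traditional_cost: float,
                              llm_cost: float,
                              llm_fraction: float) -> float:
    """Estimate hybrid cost per 1,000 pages.

    traditional_cost and llm_cost are costs per 1,000 pages;
    llm_fraction is the share of pages routed to the LLM (0.0 to 1.0).
    """
    if not 0.0 <= llm_fraction <= 1.0:
        raise ValueError("llm_fraction must be between 0.0 and 1.0")
    # Every page pays for the traditional pass; only the fallback pages
    # pay for the LLM pass on top of that.
    return traditional_cost + llm_fraction * llm_cost
```

With the per-model figures from the table above and an assumed 40% fallback share, this works out to roughly $0.21 per 1,000 pages with local Qwen and about $6.01 with Claude, which is why the fallback rate matters more than the model price once volumes get large.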
Final Thoughts
LLMs transformed my financial PDF extraction pipeline from a frustrating 60% accuracy mess to a reliable 92% system. But it wasn’t magic. It required:
- A solid traditional extraction foundation
- Intelligent fallback logic
- Rigorous validation
- Careful model selection
The key insight: LLMs excel at understanding context and handling edge cases, but they need supervision. Use them as a post-processing layer, not a replacement for traditional methods.
If you’re building a financial data extraction system, start with pdfplumber or PyMuPDF. When you hit the accuracy wall (and you will), add an LLM layer. But always validate the output. Your finance team will thank you.