
Should You Use LLMs for Financial PDF Data Extraction?

I stared at the extraction results. Again.

Revenue: $1,234,567.89
Cost: ??? (see footnote)
Profit: ERROR - column mismatch

Sixty percent accuracy. That’s what I got after three weeks of tweaking regex patterns, adjusting pdfplumber settings, and writing custom heuristics for every new PDF format that came my way.

Financial PDFs are a nightmare. Tables span multiple pages. Headers get repeated or disappear entirely. Some documents use commas for thousands separators, others use spaces. And don’t get me started on the “see notes” references that break your parsing logic.

I needed a better solution. So I started experimenting with LLMs.

The Traditional Approach (And Why It Fails)

Let me show you what I was dealing with. Here’s a typical financial statement PDF:

Q3 2024 Financial Report
Revenue $12,345,678
Cost of Goods Sold (8,234,567)
Gross Profit -----------
$4,111,111
Operating Expenses:
Marketing $1,234,567
R&D 987,654
G&A 456,789
-----------
Total OpEx $2,679,010

My initial approach used pdfplumber with regex:

extract_financial.py
import re

import pdfplumber


def extract_financial_data(pdf_path):
    results = {}
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Guard against pages with no extractable text
            text = page.extract_text() or ""
            # Try to find revenue
            revenue_match = re.search(r'Revenue\s+\$?([\d,]+)', text)
            if revenue_match:
                results['revenue'] = revenue_match.group(1).replace(',', '')
            # Try to find costs
            cost_match = re.search(r'Cost.*?\$?([\d,]+)', text)
            if cost_match:
                results['cost'] = cost_match.group(1).replace(',', '')
    return results

This worked great for the first ten PDFs. Then I hit the edge cases:

  • One PDF had “Revenue (USD)” instead of just “Revenue”
  • Another used parentheses for negative numbers: (8,234,567) instead of -8,234,567
  • A third had the numbers in a table that pdfplumber couldn’t parse correctly
  • Yet another had footnotes: Revenue $12,345,678¹

My accuracy plummeted to 60%. I was spending more time writing exceptions than extracting data.
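
Several of those edge cases are really about number formatting, and that part can be handled without an LLM. Here is a minimal sketch of a normalizer for parenthesized negatives, footnote markers, currency symbols, and space or comma thousands separators. The name `parse_amount` is mine, and it assumes a dot as the decimal separator, so a European-style `1.234,56` would need separate handling.

```python
import re
from typing import Optional

# Footnote markers such as ¹ ² ³ that text extraction often glues onto a number.
_SUPERSCRIPTS = "⁰¹²³⁴⁵⁶⁷⁸⁹"


def parse_amount(raw: str) -> Optional[float]:
    """Turn a statement-style amount into a float, or None if it is not a number."""
    text = raw.strip().rstrip(_SUPERSCRIPTS).strip()
    # Accounting convention: (8,234,567) means -8,234,567.
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    if text.startswith("-"):
        negative = True
        text = text[1:]
    # Drop the currency symbol and thousands separators (\s also covers
    # non-breaking spaces).
    text = re.sub(r"[$,\s]", "", text)
    if not re.fullmatch(r"[0-9]+(\.[0-9]+)?", text):
        return None
    value = float(text)
    return -value if negative else value
```

Anything that is not a clean number, like `??? (see footnote)` or a row of dashes, comes back as `None` instead of a guess, which makes it easy to flag for the next stage.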

Enter LLMs

I was skeptical. “As much as I hate it,” I thought, “this is probably a task where LLMs can shine.”

The key insight: LLMs understand context. They can figure out that “Revenue (USD)” and “Total Revenue” mean the same thing. They can handle parenthetical negative numbers. They can even deal with messy table layouts.

First Attempt: Claude Vision

I started with Claude’s vision capabilities:

claude_extraction.py
import base64

import anthropic


def extract_with_claude(pdf_path):
    client = anthropic.Anthropic()
    with open(pdf_path, 'rb') as f:
        pdf_data = base64.b64encode(f.read()).decode()
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "document",
                        "source": {
                            "type": "base64",
                            "media_type": "application/pdf",
                            "data": pdf_data
                        }
                    },
                    {
                        "type": "text",
                        "text": """Extract all financial data from this document.
Return as JSON with keys: revenue, cost_of_goods, gross_profit,
operating_expenses, net_income. Use numeric values only (no $ or commas)."""
                    }
                ]
            }
        ]
    )
    # message.content is a list of content blocks; the JSON is in the first text block
    return message.content[0].text

The results were impressive. Claude Vision extracted data from tables that pdfplumber couldn’t even see. It handled edge cases I hadn’t even considered.
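
The reply still has to be turned into numbers. With the Anthropic SDK the reply text lives in the `text` attribute of the first content block, and the model may wrap the JSON in a sentence or two. The sketch below is one way to handle that; `parse_llm_reply` and `FIELDS` are names I made up for illustration, and the field list simply mirrors the keys requested in the prompt.

```python
import json
import re
from typing import Dict, Optional

FIELDS = ("revenue", "cost_of_goods", "gross_profit",
          "operating_expenses", "net_income")


def parse_llm_reply(reply_text: str) -> Dict[str, Optional[float]]:
    """Pull the JSON object out of a model reply and coerce known fields to float."""
    # Grab everything from the first "{" to the last "}" in case the model
    # added prose around the JSON.
    match = re.search(r"\{.*\}", reply_text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")
    data = json.loads(match.group(0))
    parsed: Dict[str, Optional[float]] = {}
    for field in FIELDS:
        value = data.get(field)
        try:
            parsed[field] = float(value) if value is not None else None
        except (TypeError, ValueError):
            # Non-numeric answer such as "N/A": treat as missing
            parsed[field] = None
    return parsed
```

Missing or unparseable fields become `None` rather than `0`, so a later validation step can tell "not found" apart from "zero".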

But then I hit a problem.

The Hallucination Problem

One day, my validation script flagged an anomaly:

Expected revenue: ~$10M
Extracted revenue: $1,234,567,890,123

Claude had hallucinated extra digits. The actual value was $1,234,567,890, but it added three more digits.

This is the production risk with vision-language models. They can return plausible-looking but fabricated values, which is very risky when you’re dealing with financial data.

I needed a validation layer.
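
One cheap check for this kind of error, assuming the PDF has a text layer, is to confirm that the digits the model returned actually appear somewhere in the extracted text. This is a sketch of the idea rather than the exact check I run in production, and `is_grounded` is a hypothetical helper name. It will not work for scanned documents or for statements reported "in thousands" where the model rescales the value.

```python
import re


def is_grounded(value: float, page_text: str) -> bool:
    """True if the integer part of value appears as a number in the PDF text."""
    candidates = set()
    # Find number-like tokens such as 1,234,567 or 1,234,567.89
    for token in re.findall(r"[0-9][0-9,]*(?:\.[0-9]+)?", page_text):
        integer_part = token.split(".")[0]
        candidates.add(integer_part.replace(",", ""))
    # Compare magnitudes only: the sign may be encoded as parentheses.
    return str(int(abs(value))) in candidates
```

A value with three invented trailing digits fails this check immediately, because no token in the source text has that digit string.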

Building a Hybrid Pipeline

The solution wasn’t to abandon LLMs or traditional methods. It was to combine them.

hybrid_extraction.py
import base64
import json
from dataclasses import dataclass

import anthropic
import pdfplumber


@dataclass
class ExtractionResult:
    revenue: float | None = None
    cost_of_goods: float | None = None
    gross_profit: float | None = None
    operating_expenses: float | None = None
    net_income: float | None = None
    confidence: float = 0.0


def extract_traditional(pdf_path: str) -> ExtractionResult:
    """First pass: traditional extraction"""
    result = ExtractionResult()
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            tables = page.extract_tables()
            for table in tables:
                for row in table:
                    if row and 'revenue' in str(row[0]).lower():
                        try:
                            val = str(row[1]).replace('$', '').replace(',', '')
                            result.revenue = float(val)
                            result.confidence = 0.6
                        except (IndexError, ValueError):
                            # Missing or non-numeric cell: keep looking
                            pass
    return result


def extract_with_llm(pdf_path: str, previous_result: ExtractionResult) -> ExtractionResult:
    """Second pass: LLM extraction for low-confidence results"""
    if previous_result.confidence > 0.8:
        return previous_result
    client = anthropic.Anthropic()
    with open(pdf_path, 'rb') as f:
        pdf_data = base64.b64encode(f.read()).decode()
    prompt = f"""Extract financial data from this PDF.
Previous extraction attempt found:
- Revenue: {previous_result.revenue}
Validate and correct these values. Return JSON with:
- revenue, cost_of_goods, gross_profit, operating_expenses, net_income
- confidence_score (0.0-1.0)
- corrections_made (list of fields you changed)
"""
    # ... API call and parsing
    return llm_result


def validate_result(result: ExtractionResult, pdf_path: str) -> bool:
    """Sanity checks on extracted data"""
    # Check for reasonable ranges
    if result.revenue is not None and result.revenue > 1_000_000_000_000:  # > $1T
        return False
    # Cross-validate: gross profit should be revenue - COGS
    if None not in (result.revenue, result.cost_of_goods, result.gross_profit):
        expected = result.revenue - result.cost_of_goods
        if abs(expected - result.gross_profit) > 1000:
            return False
    return True


def extract_financial_data(pdf_path: str) -> ExtractionResult:
    # Step 1: Try traditional methods
    result = extract_traditional(pdf_path)
    # Step 2: Use LLM for low-confidence extractions
    if result.confidence < 0.8:
        result = extract_with_llm(pdf_path, result)
    # Step 3: Validate
    if not validate_result(result, pdf_path):
        raise ValueError(f"Extraction failed validation for {pdf_path}")
    return result

This hybrid approach gave me the best of both worlds:

  1. Fast, cheap extraction for well-formatted PDFs (traditional methods)
  2. Intelligent fallback for messy documents (LLM)
  3. Validation layer to catch hallucinations
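
The validation layer gets stronger with every accounting identity you can check. As a sketch, the function below tests two of them: gross profit against revenue minus cost of goods, and the operating expense total against its line items. `find_inconsistencies`, the `opex_items` argument, and the 0.1% relative tolerance are illustrative choices of mine, and it assumes cost of goods is stored as a positive number.

```python
import math
from typing import Dict, List, Optional


def find_inconsistencies(fields: Dict[str, Optional[float]],
                         opex_items: List[float],
                         rel_tol: float = 0.001) -> List[str]:
    """Describe every accounting identity the extracted numbers violate."""
    problems = []
    revenue = fields.get("revenue")
    cogs = fields.get("cost_of_goods")
    gross = fields.get("gross_profit")
    opex = fields.get("operating_expenses")

    # Identity 1: gross profit = revenue - cost of goods sold
    if None not in (revenue, cogs, gross):
        if not math.isclose(revenue - cogs, gross, rel_tol=rel_tol):
            problems.append(
                f"gross_profit {gross} != revenue - cost_of_goods {revenue - cogs}")

    # Identity 2: operating expense line items add up to the reported total
    if opex is not None and opex_items:
        if not math.isclose(sum(opex_items), opex, rel_tol=rel_tol):
            problems.append(
                f"operating_expenses {opex} != sum of line items {sum(opex_items)}")
    return problems
```

Returning a list of messages instead of a single boolean makes it easier to show a reviewer why a document was rejected.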

Model Selection: Which LLM to Use?

I tested several models for this task:

Claude Vision (claude-3-5-sonnet)

Pros:

  • Excellent at understanding complex table layouts
  • Handles multi-page tables well
  • Good at inferring context from formatting

Cons:

  • More expensive per document
  • Occasional hallucinations on numeric data
  • Rate limits can be restrictive for batch processing

Best for: Complex layouts, multi-page tables, documents with mixed content

GPT-4 Vision

Pros:

  • Strong OCR capabilities
  • Good at handling handwritten annotations
  • Consistent output formatting

Cons:

  • Can struggle with unusual table structures
  • Higher latency than Claude
  • More expensive than smaller models

Best for: Documents with mixed text and images, handwritten notes

Qwen 14B (Local)

Pros:

  • Runs locally, no API costs
  • Fast for structured field extraction
  • No rate limits

Cons:

  • Requires GPU for reasonable speed
  • Less capable with complex layouts
  • Needs more prompt engineering

Best for: High-volume extraction of structured fields, cost-sensitive applications

model_comparison.py
from dataclasses import dataclass


@dataclass
class ModelMetrics:
    accuracy: float
    cost_per_1k_pages: float
    avg_latency_ms: float
    hallucination_rate: float


# Real results from my testing
MODEL_PERFORMANCE = {
    "claude-3.5-sonnet": ModelMetrics(
        accuracy=0.94,
        cost_per_1k_pages=15.00,
        avg_latency_ms=1200,
        hallucination_rate=0.02
    ),
    "gpt-4-vision": ModelMetrics(
        accuracy=0.91,
        cost_per_1k_pages=18.00,
        avg_latency_ms=1800,
        hallucination_rate=0.03
    ),
    "qwen-14b-local": ModelMetrics(
        accuracy=0.88,
        cost_per_1k_pages=0.50,  # GPU electricity
        avg_latency_ms=400,
        hallucination_rate=0.05
    ),
    "traditional-only": ModelMetrics(
        accuracy=0.60,
        cost_per_1k_pages=0.01,
        avg_latency_ms=50,
        hallucination_rate=0.00
    )
}
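
In a hybrid setup only some pages reach the LLM, so the number that matters is the blended cost. A rough way to estimate it from the per-1k-page figures above is sketched here. The function name is mine, and the 40% fallback rate is a made-up illustration, not something I measured.

```python
def blended_cost_per_1k_pages(traditional_cost: float, llm_cost: float,
                              llm_fraction: float) -> float:
    """Expected cost per 1,000 pages when only a fraction of pages reach the LLM.

    Every page pays for the traditional pass; llm_fraction of them also pay
    for the LLM pass.
    """
    if not 0.0 <= llm_fraction <= 1.0:
        raise ValueError("llm_fraction must be between 0 and 1")
    return traditional_cost + llm_fraction * llm_cost


# Hypothetical 40% fallback rate, costs taken from MODEL_PERFORMANCE
for name, llm_cost in [("qwen-14b-local", 0.50), ("claude-3.5-sonnet", 15.00)]:
    print(name, round(blended_cost_per_1k_pages(0.01, llm_cost, 0.40), 2))
```

Under that assumed rate the local model lands around $0.21 per 1,000 pages and Claude around $6.01, so the fallback rate matters as much as the per-page price.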

The Results

After implementing the hybrid pipeline with Qwen 14B as the LLM layer:

Before (traditional only): 60% accuracy
After (hybrid with Qwen): 92% accuracy

The remaining 8% failures were:

  • 4%: Corrupted PDFs that no method could read
  • 2%: Documents in languages other than English
  • 2%: Extreme edge cases requiring manual review

Production Architecture

Here’s the final architecture I deployed:

┌─────────────────────────────────────────────────────────────┐
│                        PDF Ingestion                        │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│                Traditional Extraction Layer                 │
│   ┌─────────────┐    ┌─────────────┐    ┌─────────────┐     │
│   │ pdfplumber  │    │   PyMuPDF   │    │   Tabula    │     │
│   └─────────────┘    └─────────────┘    └─────────────┘     │
│                  Confidence Score: 0.0-1.0                  │
└─────────────────────────────────────────────────────────────┘
                ┌───────────────────┐
                │ Confidence > 0.8? │
                └───────────────────┘
                   │             │
                  Yes            No
                   │             │
                   ▼             ▼
            ┌─────────────┐   ┌─────────────────────────────┐
            │   Output    │   │    LLM Extraction Layer     │
            │   Result    │   │  ┌─────────┐  ┌──────────┐  │
            └─────────────┘   │  │ Qwen 14B│  │  Claude  │  │
                              │  │ (local) │  │ (cloud)  │  │
                              │  └─────────┘  └──────────┘  │
                              └─────────────────────────────┘
                              ┌─────────────────────────────┐
                              │      Validation Layer       │
                              │  - Range checks             │
                              │  - Cross-field validation   │
                              │  - Anomaly detection        │
                              └─────────────────────────────┘
                                     ┌───────┴───────┐
                                    Pass            Fail
                                     │               │
                                     ▼               ▼
                               ┌───────────┐   ┌───────────┐
                               │  Output   │   │  Manual   │
                               │  Result   │   │  Review   │
                               └───────────┘   └───────────┘
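
Read top to bottom, the diagram is a short control flow. The sketch below walks it with pluggable callables so each layer can be swapped out. `Attempt` and `run_pipeline` are hypothetical names, it assumes at least one traditional extractor is supplied, and it follows the diagram literally: high-confidence results skip the LLM and go straight to output.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Fields = Dict[str, Optional[float]]


@dataclass
class Attempt:
    fields: Fields
    confidence: float  # 0.0-1.0, as in the diagram


def run_pipeline(pdf_path: str,
                 traditional: List[Callable[[str], Attempt]],
                 llm: Callable[[str, Attempt], Attempt],
                 validate: Callable[[Fields], bool],
                 threshold: float = 0.8) -> Tuple[str, Fields]:
    """Walk the diagram; returns (status, fields)."""
    # Traditional layer: run every extractor, keep the most confident attempt.
    best = max((extract(pdf_path) for extract in traditional),
               key=lambda attempt: attempt.confidence)
    if best.confidence > threshold:
        return "output", best.fields  # "Yes" branch
    # "No" branch: give the weak attempt to the LLM, then validate the reply.
    candidate = llm(pdf_path, best)
    if validate(candidate.fields):
        return "output", candidate.fields
    return "manual_review", candidate.fields
```

Passing the extractors in as functions keeps the routing logic testable with stubs, without touching a real PDF or API.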

Key Lessons Learned

1. LLMs Are Not a Silver Bullet

They’re a powerful tool in the toolbox, but they need guardrails. The hallucination problem is real, especially with financial data where accuracy is critical.

2. Hybrid Is the Way

Pure traditional methods fail on edge cases. Pure LLM methods are expensive and risky. The hybrid approach gives you:

  • Speed and low cost for easy documents
  • Intelligence for hard documents
  • Validation for safety

3. Model Selection Matters

  • Use smaller local models (Qwen 14B) for structured field extraction
  • Use vision models (Claude, GPT-4) for complex layouts
  • Always have a validation layer

4. Confidence Scores Are Essential

Every extraction should come with a confidence score. Low confidence triggers the LLM fallback. Very low confidence triggers manual review.

confidence_scoring.py
def calculate_confidence(traditional_result: dict, llm_result: dict) -> float:
    """Calculate confidence based on agreement between methods"""
    score = 0.0
    # Check if traditional and LLM agree
    for key in ['revenue', 'cost_of_goods', 'net_income']:
        trad_val = traditional_result.get(key)
        llm_val = llm_result.get(key)
        if trad_val and llm_val:
            # Relative difference, using magnitudes so negative values work too
            rel_diff = abs(trad_val - llm_val) / max(abs(trad_val), abs(llm_val))
            # Values within 1% = high confidence
            if rel_diff < 0.01:
                score += 0.3
            # Values within 5% = medium confidence
            elif rel_diff < 0.05:
                score += 0.2
            # Values disagree = low confidence
            else:
                score += 0.1
    # Check for reasonable ranges (a missing revenue counts as 0)
    if (llm_result.get('revenue') or 0) < 1_000_000_000_000:  # < $1T
        score += 0.1
    return min(score, 1.0)
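
The score then has to be mapped to one of the three outcomes described above. A minimal sketch of that mapping follows. The 0.8 cutoff matches the threshold used earlier in this article, while the 0.3 cutoff for "very low" is a placeholder I picked for illustration, so tune it on your own data.

```python
from enum import Enum


class Route(Enum):
    ACCEPT = "accept"
    LLM_FALLBACK = "llm_fallback"
    MANUAL_REVIEW = "manual_review"


def route_by_confidence(confidence: float, accept_above: float = 0.8,
                        review_below: float = 0.3) -> Route:
    """Map a 0.0-1.0 confidence score to the next pipeline step."""
    if confidence > accept_above:
        return Route.ACCEPT
    if confidence < review_below:
        # Too unreliable to be worth an LLM call; send to a human.
        return Route.MANUAL_REVIEW
    return Route.LLM_FALLBACK
```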

When to Use LLMs for PDF Extraction

Use LLMs when:

  • PDFs have inconsistent formatting
  • Tables span multiple pages
  • Documents contain mixed content (text, tables, images)
  • Traditional methods give < 80% accuracy
  • You can tolerate some validation overhead

Stick with traditional methods when:

  • PDFs are consistently formatted
  • You need 100% accuracy (no hallucination tolerance)
  • Processing millions of documents (cost matters)
  • Real-time extraction with low latency requirements

Final Thoughts

LLMs transformed my financial PDF extraction pipeline from a frustrating 60% accuracy mess to a reliable 92% system. But it wasn’t magic. It required:

  1. A solid traditional extraction foundation
  2. Intelligent fallback logic
  3. Rigorous validation
  4. Careful model selection

The key insight: LLMs excel at understanding context and handling edge cases, but they need supervision. Use them as a post-processing layer, not a replacement for traditional methods.

If you’re building a financial data extraction system, start with pdfplumber or PyMuPDF. When you hit the accuracy wall (and you will), add an LLM layer. But always validate the output. Your finance team will thank you.
