Should You Use LLMs for Financial PDF Data Extraction?

Mar 16, 2026

I stared at the extraction results. Again.

Revenue: $1,234,567.89
Cost: ??? (see footnote)
Profit: ERROR - column mismatch

Sixty percent accuracy. That’s what I got after three weeks of tweaking regex patterns, adjusting pdfplumber settings, and writing custom heuristics for every new PDF format that came my way.

Financial PDFs are a nightmare. Tables span multiple pages. Headers get repeated or disappear entirely. Some documents use commas for thousands separators, others use spaces. And don’t get me started on the “see notes” references that break your parsing logic.

I needed a better solution. So I started experimenting with LLMs.

The Traditional Approach (And Why It Fails)

Let me show you what I was dealing with. Here’s a typical financial statement PDF:

Q3 2024 Financial Report
Revenue                    $12,345,678
Cost of Goods Sold         (8,234,567)
Gross Profit              -----------
                           $4,111,111
Operating Expenses:
  Marketing                $1,234,567
  R&D                        987,654
  G&A                        456,789
                           -----------
Total OpEx                $2,679,010

My initial approach used pdfplumber with regex:

import pdfplumber
import re

def extract_financial_data(pdf_path):
    results = {}
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text() or ""  # guard against pages that yield no text
            # Try to find revenue
            revenue_match = re.search(r'Revenue\s+\$?([\d,]+)', text)
            if revenue_match:
                results['revenue'] = revenue_match.group(1).replace(',', '')

            # Try to find costs
            cost_match = re.search(r'Cost.*?\$?([\d,]+)', text)
            if cost_match:
                results['cost'] = cost_match.group(1).replace(',', '')

    return results

This worked great for the first ten PDFs. Then I hit the edge cases:

One PDF had “Revenue (USD)” instead of just “Revenue”
Another used parentheses for negative numbers: (8,234,567) instead of -8,234,567
A third had the numbers in a table that pdfplumber couldn’t parse correctly
Yet another had footnotes: Revenue $12,345,678¹

My accuracy plummeted to 60%. I was spending more time writing exceptions than extracting data.

Enter LLMs

I was skeptical. “As much as I hate it,” I thought, “this is probably a task where LLMs can shine.”

The key insight: LLMs understand context. They can figure out that “Revenue (USD)” and “Total Revenue” mean the same thing. They can handle parenthetical negative numbers. They can even deal with messy table layouts.

First Attempt: Claude Vision

I started with Claude’s vision capabilities:

import anthropic
import base64

def extract_with_claude(pdf_path):
    client = anthropic.Anthropic()

    with open(pdf_path, 'rb') as f:
        pdf_data = base64.b64encode(f.read()).decode()

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "document",
                        "source": {
                            "type": "base64",
                            "media_type": "application/pdf",
                            "data": pdf_data
                        }
                    },
                    {
                        "type": "text",
                        "text": """Extract all financial data from this document.
                        Return as JSON with keys: revenue, cost_of_goods, gross_profit,
                        operating_expenses, net_income. Use numeric values only (no $ or commas)."""
                    }
                ]
            }
        ]
    )

    # message.content is a list of content blocks; the reply text is in the first block
    return message.content[0].text

The results were impressive. Claude Vision extracted data from tables that pdfplumber couldn’t even see. It handled edge cases I hadn’t even considered.

But then I hit a problem.

The Hallucination Problem

One day, my validation script flagged an anomaly:

Expected revenue: ~$10M
Extracted revenue: $1,234,567,890,123

Claude had hallucinated extra digits. The actual value was $1,234,567,890, but it added three more digits.

This is the production risk with vision-language models: they can return plausible-looking values that are not in the document, including invented digits and placeholders. With financial data, one silent error like that is enough to do real damage.

I needed a validation layer.
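One cheap check against invented digits is to confirm that the extracted number is actually printed in the PDF's text layer. The sketch below is an illustration, not the validation code from this pipeline. `value_is_grounded` is a hypothetical name, and it assumes the page text comes from a text extractor such as pdfplumber and uses commas as thousands separators.

```python
import re

# Integers as printed: either comma-grouped ("1,234,567") or a plain digit run.
_NUMBER = re.compile(r"\d{1,3}(?:,\d{3})+|\d+")


def printed_integers(page_text: str) -> set[str]:
    """Collect every integer printed in the text, with separators removed."""
    return {match.replace(",", "") for match in _NUMBER.findall(page_text)}


def value_is_grounded(value: float, page_text: str) -> bool:
    """Return True if the integer part of `value` is printed in `page_text`.

    The comparison is an exact match against whole numbers, so a value with
    extra or missing digits is rejected.
    """
    target = str(abs(int(value)))
    return target in printed_integers(page_text)
```

This cannot help with scanned PDFs that have no text layer, and it would wrongly reject values the model rescaled (for example, a statement reported "in thousands"), so it fits better as one signal among several than as a hard gate.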

Building a Hybrid Pipeline

The solution wasn’t to abandon LLMs or traditional methods. It was to combine them.

import base64  # needed by extract_with_llm below
import pdfplumber
import anthropic
import json
from dataclasses import dataclass

@dataclass
class ExtractionResult:
    revenue: float | None = None
    cost_of_goods: float | None = None
    gross_profit: float | None = None
    operating_expenses: float | None = None
    net_income: float | None = None
    confidence: float = 0.0

def extract_traditional(pdf_path: str) -> ExtractionResult:
    """First pass: traditional extraction"""
    result = ExtractionResult()

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            tables = page.extract_tables()
            for table in tables:
                for row in table:
                    if row and 'revenue' in str(row[0]).lower():
                        try:
                            val = str(row[1]).replace('$', '').replace(',', '')
                            result.revenue = float(val)
                            result.confidence = 0.6
                        except (ValueError, IndexError):
                            # Cell is missing or not a plain number; leave it for the LLM pass
                            pass

    return result

def extract_with_llm(pdf_path: str, previous_result: ExtractionResult) -> ExtractionResult:
    """Second pass: LLM extraction for low-confidence results"""
    if previous_result.confidence >= 0.8:  # same cutoff as the caller's `< 0.8` check
        return previous_result

    client = anthropic.Anthropic()

    with open(pdf_path, 'rb') as f:
        pdf_data = base64.b64encode(f.read()).decode()

    prompt = f"""Extract financial data from this PDF.

    Previous extraction attempt found:
    - Revenue: {previous_result.revenue}

    Validate and correct these values. Return JSON with:
    - revenue, cost_of_goods, gross_profit, operating_expenses, net_income
    - confidence_score (0.0-1.0)
    - corrections_made (list of fields you changed)
    """

    # ... API call and parsing (omitted here); assumed to produce `llm_result`

    return llm_result

def validate_result(result: ExtractionResult, pdf_path: str) -> bool:
    """Sanity checks on extracted data"""
    # Check for reasonable ranges
    if result.revenue and result.revenue > 1_000_000_000_000:  # > $1T
        return False

    # Cross-validate: gross profit should be revenue - COGS
    if result.revenue and result.cost_of_goods and result.gross_profit:
        expected = result.revenue - result.cost_of_goods
        if abs(expected - result.gross_profit) > 1000:
            return False

    return True

def extract_financial_data(pdf_path: str) -> ExtractionResult:
    # Step 1: Try traditional methods
    result = extract_traditional(pdf_path)

    # Step 2: Use LLM for low-confidence extractions
    if result.confidence < 0.8:
        result = extract_with_llm(pdf_path, result)

    # Step 3: Validate
    if not validate_result(result, pdf_path):
        raise ValueError(f"Extraction failed validation for {pdf_path}")

    return result

This hybrid approach gave me the best of both worlds:

Fast, cheap extraction for well-formatted PDFs (traditional methods)
Intelligent fallback for messy documents (LLM)
Validation layer to catch hallucinations
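One detail the cross-field check has to handle: the sample statement prints Cost of Goods Sold as (8,234,567), so a parser may store it as a negative number, and `revenue - cost_of_goods` would then add instead of subtract. A fixed tolerance of 1000 also means very different things for a $50K company and a $5B one. The sketch below shows a sign-aware, relative-tolerance variant; `gross_profit_consistent` is a hypothetical helper, and the default tolerance is an assumption to tune.

```python
import math


def gross_profit_consistent(
    revenue: float,
    cost_of_goods: float,
    gross_profit: float,
    rel_tol: float = 0.001,
) -> bool:
    """Check gross_profit == revenue - |cost_of_goods| within a relative tolerance.

    abs() makes the check indifferent to whether COGS was stored as a
    positive cost or as a negative (parenthesized) amount.
    """
    expected = revenue - abs(cost_of_goods)
    # abs_tol covers rounding to whole currency units on small amounts.
    return math.isclose(expected, gross_profit, rel_tol=rel_tol, abs_tol=1.0)
```

With the sample figures, `gross_profit_consistent(12_345_678, -8_234_567, 4_111_111)` passes whether or not COGS carries the minus sign.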

Model Selection: Which LLM to Use?

I tested several models for this task:

Claude Vision (claude-3-5-sonnet)

Pros:

Excellent at understanding complex table layouts
Handles multi-page tables well
Good at inferring context from formatting

Cons:

More expensive per document
Occasional hallucinations on numeric data
Rate limits can be restrictive for batch processing

Best for: Complex layouts, multi-page tables, documents with mixed content

GPT-4 Vision

Pros:

Strong OCR capabilities
Good at handling handwritten annotations
Consistent output formatting

Cons:

Can struggle with unusual table structures
Higher latency than Claude
More expensive than smaller models

Best for: Documents with mixed text and images, handwritten notes

Qwen 14B (Local)

Pros:

Runs locally, no API costs
Fast for structured field extraction
No rate limits

Cons:

Requires GPU for reasonable speed
Less capable with complex layouts
Needs more prompt engineering

Best for: High-volume extraction of structured fields, cost-sensitive applications

from dataclasses import dataclass

@dataclass
class ModelMetrics:
    accuracy: float
    cost_per_1k_pages: float
    avg_latency_ms: float
    hallucination_rate: float

# Real results from my testing
MODEL_PERFORMANCE = {
    "claude-3.5-sonnet": ModelMetrics(
        accuracy=0.94,
        cost_per_1k_pages=15.00,
        avg_latency_ms=1200,
        hallucination_rate=0.02
    ),
    "gpt-4-vision": ModelMetrics(
        accuracy=0.91,
        cost_per_1k_pages=18.00,
        avg_latency_ms=1800,
        hallucination_rate=0.03
    ),
    "qwen-14b-local": ModelMetrics(
        accuracy=0.88,
        cost_per_1k_pages=0.50,  # GPU electricity
        avg_latency_ms=400,
        hallucination_rate=0.05
    ),
    "traditional-only": ModelMetrics(
        accuracy=0.60,
        cost_per_1k_pages=0.01,
        avg_latency_ms=50,
        hallucination_rate=0.00
    )
}
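The per-model costs above translate into a blended cost for a hybrid pipeline, because every page goes through the traditional pass and only some fraction is forwarded to the LLM. A small sketch of that arithmetic follows; `blended_cost_per_1k_pages` is a hypothetical helper, and the 40% fallback rate in the usage note is an illustrative assumption, not a measured figure.

```python
def blended_cost_per_1k_pages(
    traditional_cost: float,
    llm_cost: float,
    llm_fraction: float,
) -> float:
    """Cost per 1,000 pages when every page gets the traditional pass
    and `llm_fraction` of pages also get an LLM pass."""
    if not 0.0 <= llm_fraction <= 1.0:
        raise ValueError("llm_fraction must be between 0 and 1")
    return traditional_cost + llm_fraction * llm_cost


# Figures from the table above, with a hypothetical 40% fallback rate.
qwen_hybrid = blended_cost_per_1k_pages(0.01, 0.50, 0.4)
claude_hybrid = blended_cost_per_1k_pages(0.01, 15.00, 0.4)
```

Under that assumed rate, the local-model hybrid comes to roughly $0.21 per 1,000 pages and the Claude hybrid to roughly $6.01, which shows how strongly the fallback rate drives the bill.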

The Results

After implementing the hybrid pipeline with Qwen 14B as the LLM layer:

Before (traditional only): 60% accuracy
After (hybrid with Qwen):   92% accuracy

The remaining 8% of documents failed for these reasons:

4%: Corrupted PDFs that no method could read
2%: Documents in languages other than English
2%: Extreme edge cases requiring manual review

Production Architecture

Here’s the final architecture I deployed:

┌─────────────────────────────────────────────────────────────┐
│                     PDF Ingestion                           │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│              Traditional Extraction Layer                    │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │ pdfplumber  │  │   PyMuPDF   │  │   Tabula    │         │
│  └─────────────┘  └─────────────┘  └─────────────┘         │
│                     Confidence Score: 0.0-1.0               │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
                    ┌─────────────────┐
                    │ Confidence > 0.8?│
                    └─────────────────┘
                     │              │
                    Yes            No
                     │              │
                     ▼              ▼
            ┌─────────────┐  ┌─────────────────────────────┐
            │   Output    │  │      LLM Extraction Layer    │
            │   Result    │  │  ┌─────────┐  ┌──────────┐  │
            └─────────────┘  │  │ Qwen 14B│  │  Claude  │  │
                             │  │ (local) │  │ (cloud)  │  │
                             │  └─────────┘  └──────────┘  │
                             └─────────────────────────────┘
                                            │
                                            ▼
                             ┌─────────────────────────────┐
                             │     Validation Layer         │
                             │  - Range checks              │
                             │  - Cross-field validation    │
                             │  - Anomaly detection         │
                             └─────────────────────────────┘
                                            │
                                    ┌───────┴───────┐
                                   Pass           Fail
                                    │               │
                                    ▼               ▼
                            ┌───────────┐   ┌───────────┐
                            │  Output   │   │  Manual   │
                            │  Result   │   │  Review   │
                            └───────────┘   └───────────┘
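The control flow in the diagram can be sketched as a single routing function. This is an illustration of the diagram, not the deployed code: `route_document` is a hypothetical name, the three stages are passed in as callables so the sketch stays independent of any particular extractor, and the status strings are assumptions.

```python
from typing import Callable

CONFIDENCE_THRESHOLD = 0.8  # the cutoff shown in the diagram


def route_document(
    pdf_path: str,
    extract_traditional: Callable[[str], tuple[dict, float]],
    extract_llm: Callable[[str, dict], dict],
    validate: Callable[[dict], bool],
) -> tuple[str, dict]:
    """Follow the diagram and return ("output", data) or ("manual_review", data)."""
    data, confidence = extract_traditional(pdf_path)

    # High-confidence traditional results go straight to output.
    if confidence > CONFIDENCE_THRESHOLD:
        return "output", data

    # Otherwise the LLM layer gets the document plus the first-pass values.
    data = extract_llm(pdf_path, data)

    # LLM results must pass validation; failures are queued for a human.
    if validate(data):
        return "output", data
    return "manual_review", data
```

Returning a status instead of raising on a validation failure matches the diagram's Manual Review branch and lets a batch job continue past a bad document.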

Key Lessons Learned

1. LLMs Are Not a Silver Bullet

They’re a powerful tool in the toolbox, but they need guardrails. The hallucination problem is real, especially with financial data where accuracy is critical.

2. Hybrid Is the Way

Pure traditional methods fail on edge cases. Pure LLM methods are expensive and risky. The hybrid approach gives you:

Speed and low cost for easy documents
Intelligence for hard documents
Validation for safety

3. Model Selection Matters

Use smaller local models (Qwen 14B) for structured field extraction
Use vision models (Claude, GPT-4) for complex layouts
Always have a validation layer

4. Confidence Scores Are Essential

Every extraction should come with a confidence score. Low confidence triggers the LLM fallback. Very low confidence triggers manual review.

def calculate_confidence(traditional_result: dict, llm_result: dict) -> float:
    """Calculate confidence based on agreement between methods"""
    score = 0.0

    # Check if traditional and LLM agree
    for key in ['revenue', 'cost_of_goods', 'net_income']:
        trad_val = traditional_result.get(key)
        llm_val = llm_result.get(key)

        # Compare only fields both methods produced (0 is a valid value)
        if trad_val is None or llm_val is None:
            continue

        # Relative difference; abs() keeps the scale positive for negative values
        scale = max(abs(trad_val), abs(llm_val))
        rel_diff = abs(trad_val - llm_val) / scale if scale else 0.0

        if rel_diff < 0.01:    # within 1% = high confidence
            score += 0.3
        elif rel_diff < 0.05:  # within 5% = medium confidence
            score += 0.2
        else:                  # values disagree = low confidence
            score += 0.1

    # Check for reasonable ranges (a missing revenue earns no credit)
    revenue = llm_result.get('revenue')
    if revenue is not None and revenue < 1_000_000_000_000:  # < $1T
        score += 0.1

    return min(score, 1.0)
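The rule stated at the start of this lesson (low confidence goes to the LLM, very low confidence goes to a person) can be written as a small triage step on top of the score. In this sketch the 0.8 cutoff is the one used by the pipeline code earlier, while the 0.3 manual-review cutoff and the name `triage` are assumptions for illustration.

```python
def triage(
    confidence: float,
    llm_threshold: float = 0.8,      # cutoff used by the pipeline above
    review_threshold: float = 0.3,   # hypothetical cutoff for manual review
) -> str:
    """Map a confidence score to the next action for a document."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence < review_threshold:
        return "manual_review"
    if confidence < llm_threshold:
        return "llm_fallback"
    return "accept"
```

Keeping both cutoffs as parameters makes it easy to tune them against a labeled sample instead of hard-coding them.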

When to Use LLMs for PDF Extraction

Use LLMs when:

PDFs have inconsistent formatting
Tables span multiple pages
Documents contain mixed content (text, tables, images)
Traditional methods give < 80% accuracy
You can tolerate some validation overhead

Stick with traditional methods when:

PDFs are consistently formatted
You need 100% accuracy (no hallucination tolerance)
Processing millions of documents (cost matters)
Real-time extraction with low latency requirements

Final Thoughts

LLMs transformed my financial PDF extraction pipeline from a frustrating 60% accuracy mess to a reliable 92% system. But it wasn’t magic. It required:

A solid traditional extraction foundation
Intelligent fallback logic
Rigorous validation
Careful model selection

The key insight: LLMs excel at understanding context and handling edge cases, but they need supervision. Use them as a post-processing layer, not a replacement for traditional methods.

If you’re building a financial data extraction system, start with pdfplumber or PyMuPDF. When you hit the accuracy wall (and you will), add an LLM layer. But always validate the output. Your finance team will thank you.

Final Words + More Resources

My intention with this article was to share my knowledge and experience in the hope that it helps others. If you want to get in touch, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
