pdfplumber vs PyMuPDF vs Tabula: Which is Best for Financial PDFs?

Mar 16, 2026

I needed to extract financial data from PDFs. Not simple PDFs - I’m talking about messy quarterly reports with tables spanning multiple columns, headers repeating on every page, and worst of all, a mix of text-based and image-based pages in the same document.

My first attempt with a popular library returned garbage:

Revenue    Q1    Q2    Q3    Q4
$1,234,567 $89,012 $345,678 $123,456

That’s not right. The numbers were completely misaligned with their columns. Financial decisions based on this would be disastrous.
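
A cheap guard against this kind of output is to compare each row's cell count with the header's before trusting the numbers. The sketch below is a minimal illustration, assuming a table is a list of rows (lists of cells) with the header first; find_misaligned_rows is a hypothetical helper name, not part of any library.

```python
def find_misaligned_rows(table):
    """Return (row_index, cell_count) for rows whose width differs from the header."""
    if not table:
        return []
    expected = len(table[0])
    return [
        (i, len(row))
        for i, row in enumerate(table[1:], start=1)
        if len(row) != expected
    ]

# The broken output shown above: five header cells, four values
table = [
    ["Revenue", "Q1", "Q2", "Q3", "Q4"],
    ["$1,234,567", "$89,012", "$345,678", "$123,456"],
]
problems = find_misaligned_rows(table)
print(problems)
```

A matching cell count does not prove the columns are right, but a mismatch is a reliable sign that something shifted.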

Let me walk through what I learned comparing the major Python PDF extraction libraries for financial document processing.

The Problem with Financial PDFs

Financial PDFs are uniquely painful for several reasons:

Tables with inconsistent borders - Some have visible grid lines, others rely on whitespace
Multi-column layouts - Numbers need to stay aligned with their row labels
Mixed content types - Cover pages as images, data pages as text
Headers and footers - Repeating on every page, breaking table continuity
Currency formatting - Dollar signs, commas, and parentheses for negatives
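
The header and footer problem can often be handled before table parsing. The sketch below assumes each page has already been reduced to a list of text lines, and treats any line that shows up on most pages as a running header or footer. strip_repeated_lines and its min_fraction parameter are hypothetical names for illustration.

```python
from collections import Counter

def strip_repeated_lines(pages, min_fraction=0.8):
    """Drop lines that appear on at least min_fraction of the pages."""
    if len(pages) < 2:
        # With a single page there is nothing to compare against
        return [list(page) for page in pages]
    # Count each distinct line once per page
    counts = Counter(line for page in pages for line in set(page))
    threshold = min_fraction * len(pages)
    repeated = {line for line, n in counts.items() if n >= threshold}
    return [[line for line in page if line not in repeated] for page in pages]

pages = [
    ["ACME Corp Q3 Report", "Revenue 1,200", "Confidential"],
    ["ACME Corp Q3 Report", "Expenses (800)", "Confidential"],
    ["ACME Corp Q3 Report", "Net income 400", "Confidential"],
]
body = strip_repeated_lines(pages)
print(body)
```

Footers that change per page, such as "Page 2 of 14", will not match exactly and would need normalizing (for example, masking digits) before counting.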

I tested four major libraries: pdfplumber, PyMuPDF (fitz), Tabula, and Camelot.

Library Comparison

pdfplumber - Best for Text-Based Financial Tables

pdfplumber became my go-to for text-based PDFs with tables. Here’s why:

import pdfplumber

with pdfplumber.open("quarterly_report.pdf") as pdf:
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            for row in table:
                print(row)

What it does well:

Handles column alignment better than Tabula for financial tables
Provides fine-grained control over table extraction settings
Gives you bounding boxes for each extracted element
Works without Java dependency (unlike Tabula)

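The key advantage: pdfplumber works from the PDF’s internal character positions and drawn lines rather than from a flattened text dump. Its default "lines" strategy follows ruled borders. For borderless financial tables, the "text" strategy infers column and row boundaries from how words align on the page: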

import pdfplumber

with pdfplumber.open("financial_report.pdf") as pdf:
    page = pdf.pages[0]

    # Configure table extraction for financial docs
    tables = page.extract_tables({
        "vertical_strategy": "text",
        "horizontal_strategy": "text",
        "snap_tolerance": 3,
        "join_tolerance": 3,
    })

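The snap_tolerance and join_tolerance parameters matter for financial PDFs where column edges are slightly misaligned. snap_tolerance snaps parallel lines that lie within that many points of each other to the same position, and join_tolerance joins line segments on the same line whose ends are within that distance. Both default to 3, so the values above only make the defaults explicit. Raise them when nearby edges are still being treated as separate columns.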
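
To build some intuition for what a snapping tolerance does, here is a toy version in plain Python. It is an illustration of the idea only, not pdfplumber's actual implementation: nearby x-coordinates are clustered and each cluster is replaced by its mean. snap_positions is a hypothetical name.

```python
def snap_positions(xs, tolerance=3):
    """Cluster nearby coordinates and replace each cluster with its mean."""
    clusters = []
    for x in sorted(xs):
        # Extend the current cluster while the gap to its last value is small
        if clusters and x - clusters[-1][-1] <= tolerance:
            clusters[-1].append(x)
        else:
            clusters.append([x])
    return [sum(c) / len(c) for c in clusters]

# Left edges of words that "should" form three columns
left_edges = [72.0, 73.5, 71.0, 200.0, 201.0, 330.0]
columns = snap_positions(left_edges, tolerance=3)
print(columns)
```

With a tolerance of 3 the six edges collapse into three column positions. With a tolerance of 0.5 every edge stays separate, which is the kind of over-splitting that a too-small tolerance tends to produce.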

PyMuPDF (fitz) - Best for Detection and Speed

PyMuPDF excels at answering a critical question: Is this page text-based or image-based?

import fitz  # PyMuPDF

def detect_page_type(pdf_path):
    """Classify the first page as "text-based" or "image-based"."""
    # The context manager closes the document when we are done
    with fitz.open(pdf_path) as doc:
        first_page = doc[0]

        # Get the text layer and any embedded images
        text = first_page.get_text().strip()
        images = first_page.get_images()

    # If minimal text, it's likely image-based
    if len(text) < 50:
        return "image-based"

    # A little text plus embedded images: likely a scan with a stray label
    if images and len(text) < 100:
        return "image-based"

    return "text-based"

# Result: "text-based" or "image-based"

I use PyMuPDF for the detection phase, then route to the appropriate extraction method:

import fitz

def route_extraction(pdf_path):
    """Return a ("text" | "image", page_index) tuple for every page."""
    results = []

    # The context manager closes the document when we are done
    with fitz.open(pdf_path) as doc:
        for i, page in enumerate(doc):
            text = page.get_text()

            if len(text.strip()) > 100:
                # Text-based: use pdfplumber
                results.append(("text", i))
            else:
                # Image-based: needs OCR
                results.append(("image", i))

    return results
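
Once every page is labeled, it can be convenient to collapse the labels into runs of consecutive pages so that each run goes to one extractor in a single call. This is a minimal sketch, assuming the (kind, page_index) tuples come back in page order as in route_extraction. group_pages is a hypothetical helper.

```python
from itertools import groupby

def group_pages(results):
    """Collapse [(kind, page_index), ...] into [(kind, first, last), ...] runs."""
    runs = []
    for kind, group in groupby(results, key=lambda item: item[0]):
        indices = [index for _, index in group]
        runs.append((kind, indices[0], indices[-1]))
    return runs

routed = [("image", 0), ("text", 1), ("text", 2), ("text", 3), ("image", 4)]
runs = group_pages(routed)
print(runs)
```

Here the image cover page, three text pages, and a trailing scanned page become three runs, so the OCR service and the table extractor each see contiguous page ranges.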

PyMuPDF is also fast. For bulk processing hundreds of financial PDFs, this matters:

Benchmark: 100 pages extraction
pdfplumber: 2.3 seconds
PyMuPDF:    0.8 seconds
Tabula:     4.1 seconds (includes JVM startup)
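
Timings like these depend heavily on the documents and the machine, so it is worth re-measuring on your own corpus. The harness below is one possible way to do that and is not necessarily how the numbers above were produced. time_extractor is a hypothetical helper, and fake_extract stands in for a real extraction function so the sketch runs without any PDF library installed.

```python
import time

def time_extractor(extract, paths, repeats=3):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for path in paths:
            extract(path)
        best = min(best, time.perf_counter() - start)
    return best

# Stand-in extractor; swap in a function that opens and extracts a real PDF
def fake_extract(path):
    return len(path)

elapsed = time_extractor(fake_extract, ["a.pdf", "b.pdf"])
print(f"best of 3: {elapsed:.6f} s")
```

Taking the best of several runs reduces noise from disk caching and, for Tabula, separates JVM startup from steady-state cost if the JVM stays warm between runs.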

Tabula - Good for Simple Tables

Tabula (used from Python through the tabula-py wrapper) works well when your PDFs have clear, bordered tables:

import tabula

# Extract all tables from PDF
tables = tabula.read_pdf("simple_report.pdf", pages="all")

for df in tables:
    print(df.to_string())

Pros:

Returns pandas DataFrames directly
Good GUI tool (Tabula Java) for debugging
Handles simple bordered tables well

Cons:

Requires Java (JVM dependency)
Struggles with borderless tables
Column alignment issues on financial statements

Camelot - Another Borderless Option

Camelot offers two extraction modes:

import camelot

# Lattice mode: for tables with visible borders
tables_lattice = camelot.read_pdf("bordered.pdf", flavor="lattice")

# Stream mode: for borderless tables
tables_stream = camelot.read_pdf("borderless.pdf", flavor="stream")

The stream mode should handle borderless tables, but in practice, I found it required significant tuning for financial documents:

import camelot

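tables = camelot.read_pdf(
    "financial.pdf",
    flavor="stream",
    # Column separators are x-coordinates in PDF points, not percentages
    # (illustrative values; read the real ones off your own document)
    columns=["72,180,290,400,510"],
    row_tol=10,  # Row tolerance
)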

Specifying column positions manually defeats the purpose of automatic extraction.
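
If it is easier to think of column separators as fractions of the page width, they need converting first, since Camelot's columns argument expects a string of comma-separated x-coordinates in PDF points. A minimal sketch, where columns_from_fractions is a hypothetical helper and the page width is an assumed value you would read from the actual document:

```python
def columns_from_fractions(fractions, page_width):
    """Convert fractions of the page width into a comma-separated x-coordinate string."""
    xs = [fraction * page_width for fraction in fractions]
    return ",".join(f"{x:.1f}" for x in xs)

# 612 points is the width of a US Letter page (assumed here)
spec = columns_from_fractions([0.1, 0.3, 0.5, 0.7, 0.9], page_width=612)
print(spec)
# Pass it as columns=[spec] to camelot.read_pdf(..., flavor="stream")
```

This keeps the separators proportional across documents with different page sizes, though it still assumes every report uses the same relative layout.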

Performance Comparison

I ran tests on a corpus of 50 financial PDFs (quarterly reports, balance sheets, income statements):

Library         Clean Tables    Messy Tables    Overall
pdfplumber      94%             78%             86%
Tabula          91%             52%             72%
Camelot         88%             61%             75%
PyMuPDF         82%             48%             65%

“Clean tables” have visible borders and consistent formatting. “Messy tables” are typical financial reports with merged cells, multi-page tables, and inconsistent spacing.
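
Percentages like these only mean something once the scoring rule is fixed. One plausible rule, and not necessarily the one behind the table above, is cell-level accuracy against a hand-checked ground truth: a cell counts as correct only if the same value lands in the same row and column. cell_accuracy is a hypothetical helper.

```python
def cell_accuracy(extracted, expected):
    """Fraction of ground-truth cells reproduced at the same row and column."""
    total = sum(len(row) for row in expected)
    if total == 0:
        return 0.0
    correct = 0
    for r, expected_row in enumerate(expected):
        extracted_row = extracted[r] if r < len(extracted) else []
        for c, cell in enumerate(expected_row):
            if c < len(extracted_row) and extracted_row[c] == cell:
                correct += 1
    return correct / total

expected = [["Revenue", "100", "200"], ["Costs", "(40)", "(90)"]]
# The second row lost a cell, so "(90)" shifted one column to the left
extracted = [["Revenue", "100", "200"], ["Costs", "(90)", None]]
score = cell_accuracy(extracted, expected)
print(f"{score:.2%}")
```

A positional metric like this punishes column shifts hard, which is the failure mode that matters most for financial statements.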

The Recommended Combination Approach

After much trial and error, here’s my production approach:

import fitz  # PyMuPDF
import pdfplumber
from typing import Optional

def extract_financial_pdf(pdf_path: str) -> list[dict]:
    """
    Extract financial data from PDF using optimal library combination.

    Strategy:
    1. PyMuPDF to detect page types
    2. pdfplumber for text-based pages
    3. Flag image-based pages for OCR processing
    """
    results = []

    # Open the file once per library instead of reopening it for every page
    with fitz.open(pdf_path) as doc, pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(doc):
            text = page.get_text()

            page_data = {
                "page": page_num + 1,
                "type": "unknown",
                "tables": [],
                "text": "",
                "needs_ocr": False
            }

            # Detection phase
            if len(text.strip()) < 50:
                page_data["type"] = "image-based"
                page_data["needs_ocr"] = True
            else:
                page_data["type"] = "text-based"
                page_data["text"] = text

                # Extraction phase using pdfplumber (same zero-based page index)
                tables = pdf.pages[page_num].extract_tables()
                if tables:
                    page_data["tables"] = clean_financial_tables(tables)

            results.append(page_data)

    return results

def clean_financial_tables(tables: list) -> list[dict]:
    """Clean and structure extracted financial tables."""
    cleaned = []

    for table in tables:
        if not table:
            continue

        # Find header row (first non-empty row)
        header_idx = 0
        for i, row in enumerate(table):
            if row and any(cell for cell in row):
                header_idx = i
                break

        headers = table[header_idx]
        data_rows = table[header_idx + 1:]

        for row in data_rows:
            if not row or not any(row):
                continue

            row_dict = {}
            for i, cell in enumerate(row):
                if i < len(headers):
                    key = headers[i] or f"col_{i}"
                    row_dict[key.strip()] = parse_financial_value(cell)

            if row_dict:
                cleaned.append(row_dict)

    return cleaned

def parse_financial_value(value: Optional[str]) -> Optional[float | str]:
    """Parse financial values like '$1,234.56' or '(500.00)'."""
    if not value:
        return None

    value = str(value).strip()

    # Handle negative numbers in parentheses
    if value.startswith("(") and value.endswith(")"):
        value = "-" + value[1:-1]

    # Remove currency symbols and commas
    cleaned = value.replace("$", "").replace(",", "").strip()

    try:
        return float(cleaned)
    except ValueError:
        return value

This gives me the best of both worlds: PyMuPDF’s fast detection and pdfplumber’s accurate table extraction.
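
One gap in a page-by-page pipeline is that a table continuing across pages comes back as several separate tables, each with its own copy of the header. A minimal sketch of stitching them back together, assuming raw tables are lists of rows in page order and that a repeated header row signals a continuation. merge_continued_tables is a hypothetical helper.

```python
def merge_continued_tables(page_tables):
    """Merge tables from consecutive pages when the header row repeats."""
    merged = []
    for table in page_tables:
        if not table:
            continue
        header, rows = table[0], table[1:]
        if merged and merged[-1][0] == header:
            # Same header as the previous table: treat it as a continuation
            merged[-1].extend(rows)
        else:
            merged.append([header] + rows)
    return merged

page_tables = [
    [["Item", "Q1", "Q2"], ["Revenue", "100", "120"]],
    [["Item", "Q1", "Q2"], ["Costs", "(40)", "(50)"]],
    [["Segment", "Share"], ["EMEA", "35%"]],
]
merged = merge_continued_tables(page_tables)
print(len(merged), "tables after merging")
```

This would run on the raw extract_tables() output before cleaning. It will wrongly merge two distinct tables that happen to share a header, so a stricter version might also check that they sit on adjacent pages.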

The Accuracy Boost: LLM Post-Processing

Even with the best extraction, you’ll get errors. I added an LLM post-processing pass:

import openai

def llm_validate_extraction(table_data: list[dict], context: str) -> list[dict]:
    """
    Use LLM to validate and correct extraction errors.

    Accuracy improvement in my tests: ~78% (pdfplumber alone) -> ~92%
    """
    prompt = f"""
    Review this extracted financial table data for errors.
    Original context: {context}

    Extracted data:
    {table_data}

    Check for:
    1. Misaligned columns
    2. Missing values (should be 0 or N/A?)
    3. Format inconsistencies
    4. Calculation errors (if sum/total present)

    Return corrected JSON array.
    """

    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )

    return parse_llm_response(response.choices[0].message.content)
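
The parse_llm_response helper is not defined in the code above. A minimal sketch of what it could look like, assuming the model replies with a JSON array that may be wrapped in extra prose. The empty-list fallback is an illustrative choice, not the only reasonable one.

```python
import json
import re

def parse_llm_response(content):
    """Extract a JSON array from an LLM reply; return [] if none can be parsed."""
    # Grab everything from the first "[" to the last "]"
    match = re.search(r"\[.*\]", content, flags=re.DOTALL)
    if not match:
        return []
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    return data if isinstance(data, list) else []

reply = 'Here is the corrected data:\n[{"Revenue": 1234567.0, "Q1": 89012.0}]\nLet me know if anything looks off.'
rows = parse_llm_response(reply)
print(rows)
```

In a pipeline, an empty result is better treated as "validation failed, keep the original extraction" than as "the table is empty", so the caller should check for it.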

The results speak for themselves:

Extraction Method              Accuracy
Pure regex/heuristics          ~60%
pdfplumber alone               ~78%
pdfplumber + LLM validation    ~92%
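
Some of the checks the prompt asks for are cheap to run deterministically before spending an API call. For example, a total row can be verified with plain arithmetic. The sketch below assumes rows are dicts with numeric values already parsed and exactly one row labeled as the total. check_total and its parameters are hypothetical.

```python
def check_total(rows, label_key, value_key, total_label="Total", tolerance=0.01):
    """Return True if the line items sum to the row labelled total_label."""
    items = [r[value_key] for r in rows if r[label_key] != total_label]
    totals = [r[value_key] for r in rows if r[label_key] == total_label]
    if len(totals) != 1:
        return False  # No total row, or more than one: cannot verify
    return abs(sum(items) - totals[0]) <= tolerance

rows = [
    {"Item": "Product revenue", "Q1": 700.0},
    {"Item": "Service revenue", "Q1": 300.0},
    {"Item": "Total", "Q1": 1000.0},
]
ok = check_total(rows, label_key="Item", value_key="Q1")
print(ok)
```

Tables that pass can skip the LLM pass entirely, and tables that fail give the model a concrete discrepancy to look for.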

What I Use Now

For production financial document processing:

def process_financial_pdf(pdf_path: str) -> dict:
    """
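    Complete financial PDF processing pipeline.

    send_to_ocr, merge_ocr_results and generate_summary are placeholders
    for your own OCR integration and reporting code.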
    """
    # Step 1: Detect and extract
    raw_data = extract_financial_pdf(pdf_path)

    # Step 2: Identify pages needing OCR
    ocr_pages = [p for p in raw_data if p["needs_ocr"]]
    if ocr_pages:
        # Send to OCR service (AWS Textract, Google Vision, etc.)
        ocr_results = send_to_ocr(pdf_path, ocr_pages)
        merge_ocr_results(raw_data, ocr_results)

    # Step 3: Validate with LLM
    for page in raw_data:
        if page["tables"]:
            page["tables"] = llm_validate_extraction(
                page["tables"],
                page["text"][:1000]  # Context from page
            )

    return {
        "pages": raw_data,
        "summary": generate_summary(raw_data)
    }
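
The generate_summary call is left undefined above. As a rough idea of what it might return, here is a sketch that only counts what the per-page dicts already contain. The field names in the result are assumptions for illustration.

```python
def generate_summary(pages):
    """Summarize page types and extracted rows from the per-page dicts."""
    return {
        "page_count": len(pages),
        "text_pages": sum(1 for p in pages if p["type"] == "text-based"),
        "ocr_pages": [p["page"] for p in pages if p["needs_ocr"]],
        "table_rows": sum(len(p["tables"]) for p in pages),
    }

# Two pages shaped like the output of extract_financial_pdf
pages = [
    {"page": 1, "type": "image-based", "tables": [], "text": "", "needs_ocr": True},
    {"page": 2, "type": "text-based", "tables": [{"Item": "Revenue", "Q1": 100.0}],
     "text": "Revenue 100", "needs_ocr": False},
]
summary = generate_summary(pages)
print(summary)
```

Listing the page numbers that still need OCR makes it easy to spot documents where most of the data never went through table extraction at all.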

Key Takeaways

Use pdfplumber for text-based financial PDFs - it handles column alignment better than Tabula
Use PyMuPDF for detection - quickly identify image vs text pages
Avoid Tabula/Camelot for borderless financial tables - they require too much manual tuning
Add LLM validation - the accuracy jump from ~78% (pdfplumber alone) to ~92% is worth the API cost
Process image-based pages separately - don’t try to force text extraction on scans

The right tool depends on your PDFs. For most financial documents, the pdfplumber + PyMuPDF + LLM combination gives you production-quality extraction.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can reach me by email.
