Best Python Library for Extracting Tables from Inconsistent Financial PDFs?

Mar 16, 2026

I stared at the PDF on my screen. Another annual report with tables that looked perfect visually but were a nightmare programmatically. No borders. Misaligned columns. Rows broken across pages. My tabula-py script returned garbage.

$ python extract_tables.py annual_report.pdf
[]  # Empty. Again.

If you’ve tried extracting tables from financial PDFs, you know the pain. Let me walk through what actually works.

The Problem with PDF Tables

Here’s what most people don’t realize: a PDF page has no built-in table object. PDFs store positioning data, not structural data. A table you see on screen is just text (and sometimes ruling lines) drawn at specific coordinates.
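To make that concrete, here is a minimal sketch of the first thing any extractor has to do: take words with coordinates and guess which ones share a row. The dictionaries mimic the `text`, `x0`, and `top` keys that pdfplumber's `page.extract_words()` returns. The `group_into_rows` helper, the sample words, and the 3-point tolerance are my own illustrative assumptions, not part of any library.

```python
def group_into_rows(words, y_tolerance=3.0):
    """Cluster word dicts into rows using their vertical position."""
    rows = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        # A word within y_tolerance of the current row's top joins that row;
        # anything further down the page starts a new row.
        if rows and abs(word["top"] - rows[-1]["top"]) <= y_tolerance:
            rows[-1]["words"].append(word)
        else:
            rows.append({"top": word["top"], "words": [word]})
    # Read each row left to right.
    return [
        [w["text"] for w in sorted(row["words"], key=lambda w: w["x0"])]
        for row in rows
    ]


# Hypothetical words, shaped like pdfplumber's extract_words() output.
sample_words = [
    {"text": "Revenue", "x0": 50.0, "top": 100.0},
    {"text": "2024", "x0": 300.0, "top": 100.4},
    {"text": "Services", "x0": 50.0, "top": 115.0},
    {"text": "$567,890", "x0": 290.0, "top": 114.8},
]
print(group_into_rows(sample_words))
# [['Revenue', '2024'], ['Services', '$567,890']]
```

Nothing in the input says "row" or "column". The structure is inferred from coordinates, which is why small layout quirks break extraction.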

When I first tried extracting tables, I assumed libraries would “just work”:

import tabula

# This should work, right?
tables = tabula.read_pdf("annual_report.pdf", pages="all")
print(f"Found {len(tables)} tables")

Found 0 tables

Zero tables. But I could clearly see tables in the document!

Why Financial PDFs Are Hard

Financial documents have unique challenges:

+----------------------------------+----------------------------------------+
| Issue                            | Example                                |
+----------------------------------+----------------------------------------+
| Invisible borders                | Tables with no grid lines              |
| Misaligned columns               | Numbers shifted left/right             |
| Multi-line rows                  | Long text wrapping to next line        |
| Inconsistent spacing             | Some cells have extra padding          |
| Merged cells                     | Category headers spanning columns      |
| Mixed content                    | Numbers, text, percentages together    |
+----------------------------------+----------------------------------------+
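The "mixed content" row is worth a quick illustration, because every library hands cells back as strings. Below is a small sketch of a normalizer for the formats that show up most in these tables. The function name `parse_financial_cell` is hypothetical, and it assumes US-style formatting (comma as thousands separator, dot as decimal point, parentheses for negatives). Percentages stay in percentage points.

```python
import re


def parse_financial_cell(cell):
    """Convert strings like '$1,234,567', '(1,200)' or '+12.2%' to float.

    Returns None for anything that is not recognisably numeric.
    """
    if cell is None:
        return None
    text = cell.strip()
    # Accounting convention: parentheses mean a negative value.
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    # Drop currency symbols, thousands separators, percent signs, spaces.
    cleaned = re.sub(r"[$€£,%\s]", "", text)
    if not re.fullmatch(r"[+-]?\d+(\.\d+)?", cleaned):
        return None
    value = float(cleaned)
    return -value if negative else value


for raw in ["$1,234,567", "(1,200)", "+12.2%", "Product Sales"]:
    print(repr(raw), "->", parse_financial_cell(raw))
```

Returning None for labels makes it easy to tell which cells are text and which are numbers when you validate a row later.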

I tested the major Python libraries. Here’s what I found.

Library 1: Tabula-py

import tabula

tables = tabula.read_pdf(
    "annual_report.pdf",
    pages="all",
    lattice=True,  # For tables with visible borders
)

Result: Works great for tables with visible grid lines. Fails silently on the invisible-border tables that dominate annual reports.

Switching to stream mode (stream=True) for borderless tables:

tables = tabula.read_pdf(
    "annual_report.pdf",
    pages="all",
    stream=True,  # lattice=False alone leaves the mode to tabula's auto-detection
    columns=[100, 200, 300, 400]  # Need to guess column positions
)

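Problem: When stream mode’s automatic column detection fails, you have to specify column positions by hand. That does not scale to batch processing, where every report has a different layout.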
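One way to avoid hand-typed positions is to derive them from the words themselves: any vertical strip that no word touches is a candidate column gap. The sketch below assumes word dicts with `x0` and `x1` keys (the shape pdfplumber's `extract_words()` produces). The helper name `infer_column_separators`, the sample coordinates, and the 10-point minimum gap are illustrative assumptions. Whether the resulting values line up with what tabula expects in `columns=` is something to verify on your own files.

```python
def infer_column_separators(words, min_gap=10.0):
    """Return x positions in whitespace gaps that no word overlaps."""
    # Merge the horizontal extents of all words into occupied bands.
    bands = []
    for x0, x1 in sorted((w["x0"], w["x1"]) for w in words):
        if bands and x0 <= bands[-1][1]:
            bands[-1][1] = max(bands[-1][1], x1)
        else:
            bands.append([x0, x1])
    # Place a separator at the midpoint of each sufficiently wide gap.
    separators = []
    for (_, left_end), (right_start, _) in zip(bands, bands[1:]):
        if right_start - left_end >= min_gap:
            separators.append((left_end + right_start) / 2)
    return separators


# Hypothetical words from two rows of a three-column table.
table_words = [
    {"text": "Product", "x0": 50.0, "x1": 90.0},
    {"text": "Sales", "x0": 93.0, "x1": 120.0},
    {"text": "$1,234,567", "x0": 290.0, "x1": 350.0},
    {"text": "$1,100,000", "x0": 400.0, "x1": 460.0},
    {"text": "Services", "x0": 50.0, "x1": 95.0},
    {"text": "$567,890", "x0": 300.0, "x1": 350.0},
    {"text": "$500,000", "x0": 410.0, "x1": 460.0},
]
print(infer_column_separators(table_words))
# [205.0, 375.0]
```

The small gap between "Product" and "Sales" is ignored because it is narrower than `min_gap`. This only works when you feed it words from the table region, since a full-width paragraph elsewhere on the page fills every gap.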

Library 2: Camelot

Camelot offers two modes: lattice (detects lines) and stream (detects whitespace).

import camelot

# Lattice mode for bordered tables
tables = camelot.read_pdf(
    "annual_report.pdf",
    pages="1-5",
    flavor="lattice"
)
print(f"Lattice accuracy: {tables[0].accuracy if tables else 0}%")

Lattice accuracy: 0%

Zero percent. Because my tables had no visible borders.

Trying stream mode:

tables = camelot.read_pdf(
    "annual_report.pdf",
    pages="1-5",
    flavor="stream"
)

Result: Better, but still struggled with column alignment. Financial tables often have numbers that “bleed” into adjacent columns.

Library 3: pdfplumber (The Winner for Text-Based PDFs)

pdfplumber takes a different approach. It exposes the raw character-level positioning data, letting you build tables from the ground up.

import pdfplumber

with pdfplumber.open("annual_report.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()

    for table in tables:
        for row in table:
            print(row)

['Revenue', '2024', '2023', 'Change']
['Product Sales', '$1,234,567', '$1,100,000', '+12.2%']
['Services', '$567,890', '$500,000', '+13.6%']

Why pdfplumber wins: In my tests it handled column alignment better than tabula-py or Camelot. Instead of guessing where columns should be, it works from the actual character positions.

For fine-grained control (the default strategies look for drawn lines, so borderless tables need the “text” strategies):

import pdfplumber

with pdfplumber.open("annual_report.pdf") as pdf:
    page = pdf.pages[0]

    # Customize table extraction settings
    tables = page.extract_tables({
        "vertical_strategy": "text",      # Infer column edges from text alignment
        "horizontal_strategy": "text",    # Infer row edges from text alignment
        "snap_tolerance": 5,               # Snap nearly aligned parallel lines together
        "join_tolerance": 5,               # Join collinear segments whose ends nearly touch
        "edge_min_length": 3,              # Discard edges shorter than this
    })
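With the text strategies, a label that wraps onto a second line usually comes back as its own row with empty number cells. Here is a minimal sketch of a post-processing step that folds such rows back into the row above. The helper `merge_wrapped_rows` and the sample table are hypothetical. The heuristic (only the first cell is filled) will also swallow section headings that sit on a line by themselves, so treat it as a starting point.

```python
def merge_wrapped_rows(table):
    """Fold rows that only continue the previous row's label into that row."""
    merged = []
    for row in table:
        # pdfplumber uses None for empty cells; normalise to stripped strings.
        cells = [(cell or "").strip() for cell in row]
        if not cells:
            continue
        is_continuation = bool(cells[0]) and not any(cells[1:])
        if is_continuation and merged:
            merged[-1][0] = f"{merged[-1][0]} {cells[0]}".strip()
        else:
            merged.append(cells)
    return merged


# Hypothetical extraction result with one wrapped label.
raw_table = [
    ["Item", "2024", "2023"],
    ["Depreciation and", "$120", "$110"],
    ["amortization", None, ""],
    ["Services", "$567", "$500"],
]
for row in merge_wrapped_rows(raw_table):
    print(row)
```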

The Real Solution: pdfplumber + LLM

For the messiest pages, even pdfplumber returned garbled data. Accuracy hovered around 60%.

I tried a hybrid approach: let pdfplumber do the heavy lifting, then use an LLM to clean up the results.

import pdfplumber
from openai import OpenAI

client = OpenAI()

def extract_table_with_llm_cleanup(pdf_path, page_num):
    # Step 1: Extract with pdfplumber
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_num]
        raw_text = page.extract_text()
        raw_tables = page.extract_tables()

    # Step 2: Use LLM to structure the messy data
    prompt = f"""
    Extract the financial table from this PDF text.
    Return as JSON with columns: category, current_year, prior_year, change.

    Raw text:
    {raw_text}

    Raw table extraction (may have errors):
    {raw_tables}
    """

    response = client.chat.completions.create(
        model="gpt-4o",  # JSON mode (response_format) needs a newer model than the original gpt-4
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )

    return response.choices[0].message.content

Result: Accuracy jumped from 60% to 92% on messy pages.

The LLM handles:

Merged cells that confuse traditional parsers
Multi-line row content
Missing/extra whitespace
Inconsistent number formatting
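One caution with financial data: an LLM can also change or invent a number while tidying the table. A cheap guard is to check that every number in the cleaned table actually occurs in the raw page text. The sketch below is one possible check, not part of the pipeline above. The names `find_unsupported_numbers` and `NUMBER_RE` are hypothetical, and the comparison ignores thousands separators only.

```python
import re

# Digits with optional thousands separators and an optional decimal part.
NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def _digits(token):
    """Normalise a numeric token by dropping thousands separators."""
    return token.replace(",", "")


def find_unsupported_numbers(cleaned_table, raw_text):
    """Return numbers in the cleaned table that never occur in raw_text."""
    seen = {_digits(t) for t in NUMBER_RE.findall(raw_text)}
    unsupported = []
    for row in cleaned_table:
        for cell in row:
            for token in NUMBER_RE.findall(str(cell or "")):
                if _digits(token) not in seen:
                    unsupported.append(token)
    return unsupported


# Hypothetical page text and LLM output; "500500" is a planted error.
page_text = (
    "Product Sales $1,234,567 $1,100,000 +12.2%\n"
    "Services $567,890 $500,000 +13.6%"
)
llm_table = [
    ["Product Sales", "1234567", "1100000", "12.2%"],
    ["Services", "567890", "500500", "13.6%"],
]
print(find_unsupported_numbers(llm_table, page_text))
# ['500500']
```

An empty result does not prove the table is right (a number can land in the wrong cell), but a non-empty one is a strong signal to fall back to the raw extraction or flag the page for review.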

Complete Working Example

Here’s my production-ready approach:

import pdfplumber
import json
from typing import Optional
from openai import OpenAI

class FinancialTableExtractor:
    def __init__(self, use_llm: bool = False):
        self.use_llm = use_llm
        self.client = OpenAI() if use_llm else None

    def extract(self, pdf_path: str, pages: str = "all") -> list[dict]:
        """Extract tables from a financial PDF."""
        results = []

        with pdfplumber.open(pdf_path) as pdf:
            page_nums = self._parse_pages(pages, len(pdf.pages))

            for i in page_nums:
                page = pdf.pages[i]
                tables = self._extract_page(page, i)
                results.extend(tables)

        return results

    def _parse_pages(self, pages: str, max_pages: int) -> list[int]:
        """Parse page specification like '1-5, 7, 10'."""
        if pages == "all":
            return list(range(max_pages))

        page_nums = []
        for part in pages.split(","):
            part = part.strip()
            if "-" in part:
                start, end = map(int, part.split("-"))
                page_nums.extend(range(start - 1, end))
            else:
                page_nums.append(int(part) - 1)
        return page_nums

    def _extract_page(self, page, page_num: int) -> list[dict]:
        """Extract tables from a single page."""
        # Try pdfplumber first
        raw_tables = page.extract_tables({
            "vertical_strategy": "text",
            "horizontal_strategy": "text",
            "snap_tolerance": 5,
        })

        if not raw_tables:
            return []

        tables = []
        for i, table in enumerate(raw_tables):
            if self._is_table_valid(table):
                if self.use_llm:
                    table = self._llm_cleanup(table, page.extract_text())
                tables.append({
                    "page": page_num + 1,
                    "table_index": i,
                    "data": table
                })

        return tables

    def _is_table_valid(self, table: list) -> bool:
        """Check if extracted table has meaningful content."""
        if not table or len(table) < 2:
            return False

        # Must have at least 2 columns and 2 rows
        if len(table[0]) < 2:
            return False

        # At least half the cells should have content
        total_cells = sum(len(row) for row in table)
        non_empty = sum(
            1 for row in table
            for cell in row
            if cell and cell.strip()
        )

        return non_empty / total_cells > 0.5 if total_cells > 0 else False

    def _llm_cleanup(self, table: list, context: str) -> list[list]:
        """Use LLM to fix table extraction errors."""
        prompt = f"""
        Clean up this extracted financial table.
        Fix any merged cells, misaligned columns, or data issues.
        Return a JSON object with one key, "table", holding the corrected table as an array of arrays.

        Context from page:
        {context[:2000]}

        Extracted table (may have errors):
        {json.dumps(table)}
        """

        response = self.client.chat.completions.create(
            model="gpt-4o",  # JSON mode (response_format) needs a newer model than the original gpt-4
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"}
        )

        result = json.loads(response.choices[0].message.content)
        return result.get("table", table)


# Usage
if __name__ == "__main__":
    # Without LLM (faster, good for clean PDFs)
    extractor = FinancialTableExtractor(use_llm=False)
    tables = extractor.extract("annual_report.pdf", pages="1-10")

    # With LLM (slower, better for messy PDFs)
    extractor_llm = FinancialTableExtractor(use_llm=True)
    tables = extractor_llm.extract("messy_annual_report.pdf", pages="5-8")

    print(json.dumps(tables, indent=2))
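The extractor works page by page, so a table whose rows break across pages comes back as two separate entries. Here is a rough sketch of stitching them together afterwards. It works on the dictionaries the class above returns (`page`, `table_index`, `data`). The function `stitch_across_pages`, the added `last_page` key, and the heuristic itself (first table on the next page, same column count) are my assumptions and will misfire on reports where unrelated tables share a layout.

```python
def stitch_across_pages(tables):
    """Merge tables that appear to continue from the previous page."""
    stitched = []
    for table in tables:
        prev = stitched[-1] if stitched else None
        continues = bool(
            prev is not None
            and prev["data"]
            and table["data"]
            and table["page"] == prev["last_page"] + 1
            and table["table_index"] == 0
            and len(table["data"][0]) == len(prev["data"][0])
        )
        if continues:
            rows = table["data"]
            # Skip a header row repeated at the top of the new page.
            if rows[0] == prev["data"][0]:
                rows = rows[1:]
            prev["data"].extend(rows)
            prev["last_page"] = table["page"]
        else:
            stitched.append({
                "page": table["page"],
                "last_page": table["page"],
                "table_index": table["table_index"],
                "data": list(table["data"]),
            })
    return stitched


# Hypothetical extractor output: pages 5 and 6 hold one logical table.
extracted = [
    {"page": 5, "table_index": 0, "data": [["Item", "2024"], ["Sales", "10"]]},
    {"page": 6, "table_index": 0, "data": [["Item", "2024"], ["Services", "5"]]},
    {"page": 8, "table_index": 0, "data": [["Item", "2024"], ["Other", "1"]]},
]
for t in stitch_across_pages(extracted):
    print(t["page"], t["last_page"], t["data"])
```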

Decision Matrix

+-------------------------+-------------------+-------------------+-------------------+
| Scenario                | pdfplumber alone  | pdfplumber + LLM  | OCR (pytesseract) |
+-------------------------+-------------------+-------------------+-------------------+
| Clean tables, borders   | Excellent         | Overkill          | Don't bother      |
| Invisible borders       | Good              | Excellent         | Poor              |
| Misaligned columns      | Good              | Excellent         | Poor              |
| Scanned PDF (images)    | Won't work        | Won't work        | Required          |
| Batch processing        | Fast              | Slow/costly       | Slow              |
| Accuracy requirement    | 70-85%            | 90%+              | 60-80%            |
+-------------------------+-------------------+-------------------+-------------------+
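The accuracy ranges in this table are rough figures. To get comparable numbers for your own documents, one simple option is cell-level accuracy against a table you have checked by hand. The sketch below is one possible metric, offered as an assumption for illustration rather than a description of how the figures above were computed. The name `cell_accuracy` is hypothetical.

```python
def cell_accuracy(extracted, expected):
    """Fraction of expected cells that the extraction reproduced exactly."""
    total = sum(len(row) for row in expected)
    if total == 0:
        return 0.0
    correct = 0
    for r, expected_row in enumerate(expected):
        extracted_row = extracted[r] if r < len(extracted) else []
        for c, want in enumerate(expected_row):
            got = extracted_row[c] if c < len(extracted_row) else None
            # Compare after trimming whitespace; None counts as empty.
            if (got or "").strip() == (want or "").strip():
                correct += 1
    return correct / total


# Hypothetical hand-checked table and an extraction with one bad cell.
truth = [["Sales", "$10"], ["Services", "$5"]]
attempt = [["Sales", "$10"], ["Services", "$ 5"]]
print(cell_accuracy(attempt, truth))
# 0.75
```

Exact string matching is strict on purpose. Comparing normalized values instead gives a more forgiving score if formatting differences do not matter to you.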

Key Takeaways

Start with pdfplumber for text-based financial PDFs. In my tests it handled column alignment better than tabula-py or Camelot.
Lattice modes in Camelot and Tabula fail on borderless tables, which are common in annual reports. Stream modes do better but still struggle with column alignment.
For messy pages, add an LLM cleanup step. On my messy pages it raised accuracy from about 60% to about 92%.
For scanned PDFs, you need OCR (pytesseract + cv2 for preprocessing) before any table extraction.
There’s no silver bullet. The “best” library depends on your PDFs’ specific quirks.
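On the scanned-PDF point, it helps to detect those pages up front so they can be routed to OCR. A minimal sketch, assuming a page object with `chars` and `images` lists as pdfplumber pages provide: a page with almost no characters but at least one image is probably a scan. The helper `looks_scanned`, the 20-character threshold, and the stand-in page objects are illustrative assumptions.

```python
from types import SimpleNamespace


def looks_scanned(page, min_chars=20):
    """Guess whether a page is an image with no usable text layer."""
    return len(page.chars) < min_chars and len(page.images) > 0


# Stand-ins for pdfplumber pages so the sketch runs without a PDF.
scanned_page = SimpleNamespace(chars=[], images=[{"name": "Im0"}])
text_page = SimpleNamespace(chars=[{"text": "a"}] * 500, images=[])

print(looks_scanned(scanned_page))  # True
print(looks_scanned(text_page))     # False

# With a real file, the same check would look like:
#     with pdfplumber.open("annual_report.pdf") as pdf:
#         needs_ocr = [i for i, p in enumerate(pdf.pages) if looks_scanned(p)]
```

Scans that already carry an OCR text layer will pass this check as text pages, which is usually what you want.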

Requirements

pdfplumber>=0.10.0
openai>=1.0.0
tabula-py>=2.9.0  # Optional, for bordered tables
camelot-py>=0.11.0  # Optional, alternative approach
