
Best Python Library for Extracting Tables from Inconsistent Financial PDFs?

I stared at the PDF on my screen. Another annual report with tables that looked perfect visually but were a nightmare programmatically. No borders. Misaligned columns. Rows broken across pages. My tabula-py script returned garbage.

Terminal output
$ python extract_tables.py annual_report.pdf
[] # Empty. Again.

If you’ve tried extracting tables from financial PDFs, you know the pain. Let me walk through what actually works.

The Problem with PDF Tables

Here’s what most people don’t realize: there is no standard table construct in the PDF format. PDFs store positioning data, not structural data. A table you see on screen is just a bunch of text positioned at specific coordinates.
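To make that concrete, here is a minimal sketch of what "rebuilding" rows from positioned text involves. The word boxes are made-up sample data, shaped like the dicts that pdfplumber's `page.extract_words()` returns; the function name `group_into_rows` and the 3-point tolerance are assumptions for illustration, not part of any library.

```python
# Hypothetical word boxes: x0 is the left edge, top is the distance
# from the top of the page (both in PDF points).
words = [
    {"text": "Revenue", "x0": 50.0, "top": 100.2},
    {"text": "2023", "x0": 400.0, "top": 100.0},
    {"text": "2024", "x0": 300.0, "top": 99.8},
    {"text": "Services", "x0": 50.0, "top": 115.1},
    {"text": "$500,000", "x0": 400.0, "top": 115.0},
    {"text": "$567,890", "x0": 300.0, "top": 114.9},
]


def group_into_rows(words, y_tolerance=3.0):
    """Group words into rows by vertical position, then order each row left to right."""
    rows = []
    for word in sorted(words, key=lambda w: w["top"]):
        # A word joins the current row if it sits close to that row's first word.
        if rows and abs(word["top"] - rows[-1][0]["top"]) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [[w["text"] for w in sorted(row, key=lambda w: w["x0"])] for row in rows]


for row in group_into_rows(words):
    print(row)
```

Nothing in the input says "this is a table". The rows only exist because the code decided that words with similar `top` values belong together, which is exactly the guess every extraction library has to make.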

When I first tried extracting tables, I assumed libraries would “just work”:

extract_tables.py
import tabula
# This should work, right?
tables = tabula.read_pdf("annual_report.pdf", pages="all")
print(f"Found {len(tables)} tables")
Output
Found 0 tables

Zero tables. But I could clearly see tables in the document!

Why Financial PDFs Are Hard

Financial documents have unique challenges:

Common issues in financial PDFs
+----------------------------------+----------------------------------------+
| Issue                            | Example                                |
+----------------------------------+----------------------------------------+
| Invisible borders                | Tables with no grid lines              |
| Misaligned columns               | Numbers shifted left/right             |
| Multi-line rows                  | Long text wrapping to next line        |
| Inconsistent spacing             | Some cells have extra padding          |
| Merged cells                     | Category headers spanning columns      |
| Mixed content                    | Numbers, text, percentages together    |
+----------------------------------+----------------------------------------+

I tested the major Python libraries. Here’s what I found.

Library 1: Tabula-py

tabula_attempt.py
import tabula
tables = tabula.read_pdf(
    "annual_report.pdf",
    pages="all",
    lattice=True,  # For tables with visible borders
)

Result: Works great for tables with visible grid lines. Fails silently on the invisible-border tables that dominate annual reports.

Switching to stream mode (stream=True) for borderless tables:

tabula_stream.py
import tabula
tables = tabula.read_pdf(
    "annual_report.pdf",
    pages="all",
    stream=True,  # Stream mode: infer columns from whitespace instead of ruling lines
    columns=[100, 200, 300, 400],  # Need to guess column positions (x coordinates)
)

Problem: To get clean columns you end up specifying column positions by hand. Not scalable for batch processing.
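For a sense of what automating that guess would take, here is a rough sketch that derives column boundaries from the horizontal gaps between words. The word boxes are invented sample data (with `x0`/`x1` as left and right edges in points), and `infer_column_boundaries` and the 15-point `min_gap` are hypothetical names and values, not tabula-py features.

```python
# Hypothetical word boxes from two rows of a borderless table.
words = [
    {"text": "Product", "x0": 50.0, "x1": 95.0},
    {"text": "Sales", "x0": 99.0, "x1": 130.0},
    {"text": "$1,234,567", "x0": 250.0, "x1": 310.0},
    {"text": "$1,100,000", "x0": 360.0, "x1": 420.0},
    {"text": "Services", "x0": 50.0, "x1": 100.0},
    {"text": "$567,890", "x0": 262.0, "x1": 310.0},
    {"text": "$500,000", "x0": 372.0, "x1": 420.0},
]


def infer_column_boundaries(words, min_gap=15.0):
    """Return x positions in the middle of each empty vertical band wider than min_gap."""
    # Merge the horizontal extents of all words into non-overlapping intervals.
    merged = []
    for start, end in sorted((w["x0"], w["x1"]) for w in words):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # A wide gap between two neighbouring intervals is treated as a column separator.
    boundaries = []
    for (_, left_end), (right_start, _) in zip(merged, merged[1:]):
        if right_start - left_end >= min_gap:
            boundaries.append((left_end + right_start) / 2)
    return boundaries


print(infer_column_boundaries(words))
```

On tidy input this yields a list of x coordinates similar in kind to what `columns=` expects. On real annual reports a single long label or a centred heading can close the gap, so this is a starting point at best.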

Library 2: Camelot

Camelot offers two modes: lattice (detects lines) and stream (detects whitespace).

camelot_attempt.py
import camelot
# Lattice mode for bordered tables
tables = camelot.read_pdf(
    "annual_report.pdf",
    pages="1-5",
    flavor="lattice",
)
print(f"Lattice accuracy: {tables[0].accuracy if tables else 0}%")
Output
Lattice accuracy: 0%

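Zero percent, which is what the script prints when lattice mode finds no tables at all. My tables had no visible borders, so there were no lines to detect.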

Trying stream mode:

camelot_stream.py
import camelot
# Stream mode for borderless tables
tables = camelot.read_pdf(
    "annual_report.pdf",
    pages="1-5",
    flavor="stream",
)

Result: Better, but still struggled with column alignment. Financial tables often have numbers that “bleed” into adjacent columns.

Library 3: pdfplumber (The Winner for Text-Based PDFs)

pdfplumber takes a different approach. It exposes the raw character-level positioning data, letting you build tables from the ground up.

pdfplumber_basic.py
import pdfplumber
with pdfplumber.open("annual_report.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()
    for table in tables:
        for row in table:
            print(row)
Sample output
['Revenue', '2024', '2023', 'Change']
['Product Sales', '$1,234,567', '$1,100,000', '+12.2%']
['Services', '$567,890', '$500,000', '+13.6%']

Why pdfplumber wins: In my tests it handled column alignment better than tabula-py or Camelot. Instead of guessing where columns should be, it uses the actual character positions.

For fine-grained control:

pdfplumber_settings.py
import pdfplumber
with pdfplumber.open("annual_report.pdf") as pdf:
    page = pdf.pages[0]
    # Customize table extraction settings
    tables = page.extract_tables({
        "vertical_strategy": "text",    # Infer column edges from text positions
        "horizontal_strategy": "text",  # Infer row edges from text positions
        "snap_tolerance": 5,   # Snap nearly aligned parallel lines to the same position
        "join_tolerance": 5,   # Join segments on the same line whose ends are this close
        "edge_min_length": 3,  # Discard edges shorter than this
    })
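Whatever settings are used, the cells come back as strings such as '$1,234,567' or '+12.2%', so a conversion step is usually needed before doing any arithmetic. Below is a small sketch of one way to do that. `parse_financial_cell` is a hypothetical helper, and treating parentheses as negative numbers is an assumption based on common accounting notation, not something pdfplumber does for you.

```python
import re
from typing import Optional


def parse_financial_cell(cell: Optional[str]) -> Optional[float]:
    """Convert strings like '$1,234,567', '(1,200)' or '+12.2%' to a float.

    Returns None for empty cells and non-numeric text. Percentages are
    returned as plain numbers (12.2, not 0.122).
    """
    if cell is None:
        return None
    text = cell.strip()
    if not text:
        return None
    # Assumption: a value wrapped in parentheses is negative.
    negative = text.startswith("(") and text.endswith(")")
    cleaned = re.sub(r"[()$,%\s]", "", text)
    if not re.fullmatch(r"[+-]?\d+(\.\d+)?", cleaned):
        return None
    value = float(cleaned)
    return -abs(value) if negative else value


row = ["Product Sales", "$1,234,567", "$1,100,000", "+12.2%"]
print([parse_financial_cell(cell) for cell in row])
```

Note that a header cell like '2024' also parses as a number, so header rows are best separated out before applying this.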

The Real Solution: pdfplumber + LLM

For the messiest pages, even pdfplumber returned garbled data. Accuracy hovered around 60%.

I tried a hybrid approach: let pdfplumber do the heavy lifting, then use an LLM to clean up the results.

llm_table_cleaner.py
import pdfplumber
from openai import OpenAI

client = OpenAI()

def extract_table_with_llm_cleanup(pdf_path, page_num):
    """Return the model's JSON string for one page (page_num is a 0-based index)."""
    # Step 1: Extract with pdfplumber
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_num]
        raw_text = page.extract_text()
        raw_tables = page.extract_tables()
    # Step 2: Use LLM to structure the messy data
    prompt = f"""
    Extract the financial table from this PDF text.
    Return as JSON with columns: category, current_year, prior_year, change.
    Raw text:
    {raw_text}
    Raw table extraction (may have errors):
    {raw_tables}
    """
    response = client.chat.completions.create(
        # JSON mode needs a model that supports response_format;
        # the original "gpt-4" model rejects it.
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return response.choices[0].message.content

Result: Accuracy jumped from 60% to 92% on messy pages.

The LLM handles:

  • Merged cells that confuse traditional parsers
  • Multi-line row content
  • Missing/extra whitespace
  • Inconsistent number formatting
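One caveat with any LLM step on financial data is that the model can alter a figure while tidying the layout. A cheap guard, sketched below, is to check that every number in the cleaned table also appears somewhere in the raw page text. The helper names `numbers_in` and `unsupported_numbers` are made up for this illustration, and the check is only a heuristic: it catches invented digits, not values placed in the wrong cell.

```python
import re

# Digits with optional thousands separators and an optional decimal part.
NUMBER_PATTERN = re.compile(r"\d[\d,]*(?:\.\d+)?")


def numbers_in(text: str) -> set[str]:
    """Return the numeric tokens in text, with thousands separators removed."""
    return {match.replace(",", "") for match in NUMBER_PATTERN.findall(text)}


def unsupported_numbers(cleaned_table: list[list[str]], raw_text: str) -> set[str]:
    """Numbers that appear in the LLM output but nowhere in the page text."""
    source_numbers = numbers_in(raw_text)
    table_numbers: set[str] = set()
    for row in cleaned_table:
        for cell in row:
            table_numbers |= numbers_in(cell or "")
    return table_numbers - source_numbers


raw_text = "Product Sales $1,234,567 $1,100,000 +12.2%"
good = [["Product Sales", "$1,234,567", "$1,100,000", "+12.2%"]]
bad = [["Product Sales", "$1,234,568", "$1,100,000", "+12.2%"]]
print(unsupported_numbers(good, raw_text))  # nothing flagged
print(unsupported_numbers(bad, raw_text))   # flags the altered figure
```

If the returned set is non-empty, falling back to the raw pdfplumber table for that page is a reasonable default.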

Complete Working Example

Here’s my production-ready approach:

financial_table_extractor.py
import json

import pdfplumber
from openai import OpenAI


class FinancialTableExtractor:
    def __init__(self, use_llm: bool = False):
        self.use_llm = use_llm
        self.client = OpenAI() if use_llm else None

    def extract(self, pdf_path: str, pages: str = "all") -> list[dict]:
        """Extract tables from a financial PDF."""
        results = []
        with pdfplumber.open(pdf_path) as pdf:
            page_nums = self._parse_pages(pages, len(pdf.pages))
            for i in page_nums:
                page = pdf.pages[i]
                tables = self._extract_page(page, i)
                results.extend(tables)
        return results

    def _parse_pages(self, pages: str, max_pages: int) -> list[int]:
        """Parse a 1-based page specification like '1-5, 7, 10' into 0-based indices."""
        if pages == "all":
            return list(range(max_pages))
        page_nums = []
        for part in pages.split(","):
            part = part.strip()
            if "-" in part:
                start, end = map(int, part.split("-"))
                page_nums.extend(range(start - 1, end))
            else:
                page_nums.append(int(part) - 1)
        # Drop pages that do not exist instead of failing with IndexError later
        return [p for p in page_nums if 0 <= p < max_pages]

    def _extract_page(self, page, page_num: int) -> list[dict]:
        """Extract tables from a single page."""
        # Try pdfplumber first
        raw_tables = page.extract_tables({
            "vertical_strategy": "text",
            "horizontal_strategy": "text",
            "snap_tolerance": 5,
        })
        if not raw_tables:
            return []
        tables = []
        for i, table in enumerate(raw_tables):
            if self._is_table_valid(table):
                if self.use_llm:
                    table = self._llm_cleanup(table, page.extract_text() or "")
                tables.append({
                    "page": page_num + 1,
                    "table_index": i,
                    "data": table,
                })
        return tables

    def _is_table_valid(self, table: list) -> bool:
        """Check if extracted table has meaningful content."""
        # Must have at least 2 rows and 2 columns
        if not table or len(table) < 2:
            return False
        if len(table[0]) < 2:
            return False
        # More than half the cells should have content
        total_cells = sum(len(row) for row in table)
        non_empty = sum(
            1 for row in table
            for cell in row
            if cell and cell.strip()
        )
        return non_empty / total_cells > 0.5 if total_cells > 0 else False

    def _llm_cleanup(self, table: list, context: str) -> list[list]:
        """Use LLM to fix table extraction errors."""
        # JSON mode returns an object, so ask for the table under a known key
        prompt = f"""
Clean up this extracted financial table.
Fix any merged cells, misaligned columns, or data issues.
Return a JSON object with a single key "table" whose value is the
corrected table as an array of arrays.
Context from page:
{context[:2000]}
Extracted table (may have errors):
{json.dumps(table)}
"""
        response = self.client.chat.completions.create(
            # JSON mode needs a model that supports response_format;
            # the original "gpt-4" model rejects it.
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
        )
        result = json.loads(response.choices[0].message.content)
        # Fall back to the pdfplumber table if the key is missing
        return result.get("table", table)


# Usage
if __name__ == "__main__":
    # Without LLM (faster, good for clean PDFs)
    extractor = FinancialTableExtractor(use_llm=False)
    tables = extractor.extract("annual_report.pdf", pages="1-10")
    # With LLM (slower, better for messy PDFs)
    extractor_llm = FinancialTableExtractor(use_llm=True)
    tables = extractor_llm.extract("messy_annual_report.pdf", pages="5-8")
    print(json.dumps(tables, indent=2))
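The extractor treats every page on its own, so a table whose rows break across pages comes back as two separate results. Here is a rough sketch of a post-processing step that stitches them together. It works on the list-of-dicts format returned by `extract()` above; the function name `stitch_continued_tables`, the added `last_page` key, and the heuristic itself are all assumptions for illustration. It does not check whether the earlier table really ran to the bottom of its page, so treat it as a first approximation.

```python
def stitch_continued_tables(tables):
    """Merge a table into the previous one when it looks like a continuation.

    Heuristic: the table is the first one on the page directly after the
    previous table's last page, and both have the same number of columns.
    A header row repeated on the new page is dropped.
    """
    stitched = []
    for table in tables:
        previous = stitched[-1] if stitched else None
        is_continuation = (
            previous is not None
            and previous["data"]
            and table["data"]
            and table["table_index"] == 0
            and table["page"] == previous["last_page"] + 1
            and len(table["data"][0]) == len(previous["data"][0])
        )
        if is_continuation:
            rows = table["data"]
            if rows[0] == previous["data"][0]:
                rows = rows[1:]  # header repeated on the new page
            previous["data"].extend(rows)
            previous["last_page"] = table["page"]
        else:
            stitched.append({
                "page": table["page"],
                "last_page": table["page"],
                "data": list(table["data"]),
            })
    return stitched


# Made-up results in the same shape FinancialTableExtractor.extract() returns.
sample = [
    {"page": 3, "table_index": 0, "data": [["Item", "2024"], ["Revenue", "100"]]},
    {"page": 4, "table_index": 0, "data": [["Item", "2024"], ["Costs", "60"]]},
    {"page": 6, "table_index": 0, "data": [["A", "B", "C"], ["1", "2", "3"]]},
]
merged = stitch_continued_tables(sample)
print(len(merged), "tables after stitching")
```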

Decision Matrix

When to use which approach
+-------------------------+-------------------+-------------------+-------------------+
| Scenario                | pdfplumber alone  | pdfplumber + LLM  | OCR (pytesseract) |
+-------------------------+-------------------+-------------------+-------------------+
| Clean tables, borders   | Excellent         | Overkill          | Don't bother      |
| Invisible borders       | Good              | Excellent         | Poor              |
| Misaligned columns      | Good              | Excellent         | Poor              |
| Scanned PDF (images)    | Won't work        | Won't work        | Required          |
| Batch processing        | Fast              | Slow/costly       | Slow              |
| Accuracy requirement    | 70-85%            | 90%+              | 60-80%            |
+-------------------------+-------------------+-------------------+-------------------+
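Accuracy figures like these only mean something against a fixed definition. One plausible definition, sketched here as an assumption and not necessarily the one behind the numbers above, is the share of cells in a hand-checked reference table that the extraction reproduces exactly at the same row and column. `cell_accuracy` is a hypothetical helper.

```python
def cell_accuracy(extracted, expected):
    """Fraction of expected cells reproduced exactly at the same row and column."""
    total = sum(len(row) for row in expected)
    if total == 0:
        return 0.0
    correct = 0
    for r, expected_row in enumerate(expected):
        extracted_row = extracted[r] if r < len(extracted) else []
        for c, expected_cell in enumerate(expected_row):
            got = extracted_row[c] if c < len(extracted_row) else None
            # Compare after trimming whitespace; None counts as an empty cell.
            if (got or "").strip() == (expected_cell or "").strip():
                correct += 1
    return correct / total


expected = [
    ["Revenue", "2024", "2023"],
    ["Services", "$567,890", "$500,000"],
]
# Two numbers bled into one cell, leaving the last column empty.
extracted = [
    ["Revenue", "2024", "2023"],
    ["Services", "$567,890 $500,000", None],
]
print(f"{cell_accuracy(extracted, expected):.0%}")
```

A position-based score like this is strict: one shifted column makes every later cell in the row count as wrong, which matches how painful such errors are downstream.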

Key Takeaways

  1. Start with pdfplumber for text-based financial PDFs. It handles column alignment better than tabula or camelot.

  2. Lattice mode in Camelot and Tabula finds nothing on borderless tables, which are common in annual reports. Stream mode does better but still struggles with column alignment.

  3. For messy pages, add an LLM cleanup step. It bumps accuracy from ~60% to ~90%+.

  4. For scanned PDFs, you need OCR (pytesseract + cv2 for preprocessing) before any table extraction.

  5. There’s no silver bullet. The “best” library depends on your PDFs’ specific quirks.
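For takeaway 4, it helps to detect scanned pages before choosing a pipeline. A minimal sketch, assuming pdfplumber page objects (which expose `chars` and `images` lists): a page with almost no text layer but at least one image is probably a scan. `needs_ocr` and the 20-character threshold are hypothetical choices, and the stand-in pages below exist only so the example runs without a PDF.

```python
from types import SimpleNamespace


def needs_ocr(page, min_chars=20):
    """Guess whether a page is a scan: almost no text layer, at least one image."""
    return len(page.chars) < min_chars and len(page.images) > 0


# Stand-ins for pdfplumber pages, used here only for demonstration.
text_page = SimpleNamespace(chars=[{"text": "x"}] * 500, images=[])
scanned_page = SimpleNamespace(chars=[], images=[{"name": "Im0"}])

print(needs_ocr(text_page), needs_ocr(scanned_page))

# With a real file it would look like this:
# import pdfplumber
# with pdfplumber.open("annual_report.pdf") as pdf:
#     scanned = [i + 1 for i, page in enumerate(pdf.pages) if needs_ocr(page)]
```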

requirements.txt
pdfplumber>=0.10.0
openai>=1.0.0
tabula-py>=2.9.0 # Optional, for bordered tables
camelot-py>=0.11.0 # Optional, alternative approach

Final Words

My intention with this article was to share my knowledge and experience. If you want to get in touch, you can contact me by email.