
pdfplumber vs PyMuPDF vs Tabula: Which is Best for Financial PDFs?

I needed to extract financial data from PDFs. Not simple PDFs - I’m talking about messy quarterly reports with tables spanning multiple columns, headers repeating on every page, and worst of all, a mix of text-based and image-based pages in the same document.

My first attempt with a popular library returned garbage:

Revenue Q1 Q2 Q3 Q4
$1,234,567 $89,012 $345,678 $123,456

That’s not right. The numbers were completely misaligned with their columns. Financial decisions based on this would be disastrous.

Let me walk through what I learned comparing the major Python PDF extraction libraries for financial document processing.

The Problem with Financial PDFs

Financial PDFs are uniquely painful for several reasons:

  1. Tables with inconsistent borders - Some have visible grid lines, others rely on whitespace
  2. Multi-column layouts - Numbers need to stay aligned with their row labels
  3. Mixed content types - Cover pages as images, data pages as text
  4. Headers and footers - Repeating on every page, breaking table continuity
  5. Currency formatting - Dollar signs, commas, and parentheses for negatives
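
Point 4 is often the easiest to handle up front: a line that shows up on nearly every page is almost certainly a running header or footer. A minimal sketch, assuming each page has already been split into text lines; `find_repeated_lines`, `strip_repeated_lines`, the 0.8 threshold, and the sample pages are all illustrative, not part of any library:

```python
from collections import Counter


def find_repeated_lines(pages: list[list[str]], min_share: float = 0.8) -> set[str]:
    """Return lines that appear on at least `min_share` of the pages."""
    counts = Counter()
    for lines in pages:
        # Count each distinct non-empty line once per page
        counts.update({line.strip() for line in lines if line.strip()})
    threshold = min_share * len(pages)
    return {line for line, n in counts.items() if n >= threshold}


def strip_repeated_lines(pages: list[list[str]], min_share: float = 0.8) -> list[list[str]]:
    """Drop lines identified as running headers or footers."""
    repeated = find_repeated_lines(pages, min_share)
    return [[line for line in lines if line.strip() not in repeated] for lines in pages]


# Made-up sample data: three pages sharing a header and a footer
pages = [
    ["ACME Corp Q3 Report", "Revenue 1,200", "Confidential"],
    ["ACME Corp Q3 Report", "Expenses (300)", "Confidential"],
    ["ACME Corp Q3 Report", "Net income 900", "Confidential"],
]
cleaned = strip_repeated_lines(pages)
print(cleaned)
```

Footers that change per page (for example "Page 3 of 12") will not match exactly, so they would need normalizing before counting.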

I tested four major libraries: pdfplumber, PyMuPDF (fitz), Tabula, and Camelot.

Library Comparison

pdfplumber - Best for Text-Based Financial Tables

pdfplumber became my go-to for text-based PDFs with tables. Here’s why:

pdfplumber_example.py
import pdfplumber

with pdfplumber.open("quarterly_report.pdf") as pdf:
    for page in pdf.pages:
        # Each table is a list of rows; each row is a list of cell strings (or None)
        tables = page.extract_tables()
        for table in tables:
            for row in table:
                print(row)

What it does well:

  • Handles column alignment better than Tabula for financial tables
  • Provides fine-grained control over table extraction settings
  • Gives you bounding boxes for each extracted element
  • Works without Java dependency (unlike Tabula)

The key advantage: pdfplumber uses the PDF’s internal character positioning data to detect columns. This means it respects the original layout rather than guessing based on whitespace.

pdfplumber_configured.py
import pdfplumber

with pdfplumber.open("financial_report.pdf") as pdf:
    page = pdf.pages[0]
    # Configure table extraction for financial docs
    tables = page.extract_tables({
        "vertical_strategy": "text",
        "horizontal_strategy": "text",
        "snap_tolerance": 3,
        "join_tolerance": 3,
    })

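The snap_tolerance and join_tolerance parameters are crucial for financial PDFs where columns might have slight misalignments: snap_tolerance pulls nearly parallel edges onto the same position, and join_tolerance merges line segments whose ends almost touch.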
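
To make the idea behind a snap tolerance concrete: edges that sit within a few points of each other get treated as one column boundary. The sketch below is a simplified illustration of that clustering idea, not pdfplumber's actual implementation; `snap_positions` is a hypothetical helper and the edge values are made up:

```python
def snap_positions(xs: list[float], tolerance: float = 3) -> list[float]:
    """Snap each position to the mean of its cluster of nearby positions."""
    if not xs:
        return []
    ordered = sorted(set(xs))
    clusters = [[ordered[0]]]
    for x in ordered[1:]:
        # Extend the current cluster while the gap to the previous value
        # is within the tolerance; otherwise start a new cluster
        if x - clusters[-1][-1] <= tolerance:
            clusters[-1].append(x)
        else:
            clusters.append([x])
    snapped = {}
    for cluster in clusters:
        mean = sum(cluster) / len(cluster)
        for x in cluster:
            snapped[x] = mean
    # Return the snapped values in the original input order
    return [snapped[x] for x in xs]


edges = [100.0, 101.5, 250.0, 252.0, 400.0]
print(snap_positions(edges))  # [100.75, 100.75, 251.0, 251.0, 400.0]
```

With a tolerance of 3, the two slightly offset edges near 100 and the two near 250 each collapse into a single boundary, which is the effect you want when a column's numbers drift by a point or two between rows.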

PyMuPDF (fitz) - Best for Detection and Speed

PyMuPDF excels at answering a critical question: Is this page text-based or image-based?

pymupdf_detection.py
import fitz  # PyMuPDF


def detect_page_type(pdf_path):
    # Use a context manager so the document is closed after inspection
    with fitz.open(pdf_path) as doc:
        first_page = doc[0]
        # Get text from page
        text = first_page.get_text()
        # If minimal text, it's likely image-based
        if len(text.strip()) < 50:
            return "image-based"
        # Check for embedded images
        images = first_page.get_images()
        if images and len(text.strip()) < 100:
            return "image-based"
        return "text-based"

# Returns "text-based" or "image-based" (based on the first page only)

I use PyMuPDF for the detection phase, then route to the appropriate extraction method:

pymupdf_routing.py
import fitz


def route_extraction(pdf_path):
    results = []
    with fitz.open(pdf_path) as doc:
        for i, page in enumerate(doc):
            text = page.get_text()
            if len(text.strip()) > 100:
                # Text-based: use pdfplumber
                results.append(("text", i))
            else:
                # Image-based: needs OCR
                results.append(("image", i))
    return results
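
The list of (type, page index) pairs is easier to act on once it is grouped, so each extractor receives its pages in one batch. A small sketch, assuming the same tuple shape as above; `group_pages_by_type` is a hypothetical helper:

```python
from collections import defaultdict


def group_pages_by_type(routes: list[tuple[str, int]]) -> dict[str, list[int]]:
    """Group page indices by their detected type, preserving page order."""
    grouped = defaultdict(list)
    for page_type, page_index in routes:
        grouped[page_type].append(page_index)
    return dict(grouped)


# Example routing result for a four-page document with scanned first and last pages
routes = [("image", 0), ("text", 1), ("text", 2), ("image", 3)]
batches = group_pages_by_type(routes)
print(batches)  # {'image': [0, 3], 'text': [1, 2]}
```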

PyMuPDF is also fast. For bulk processing hundreds of financial PDFs, this matters:

Benchmark: 100 pages extraction
pdfplumber: 2.3 seconds
PyMuPDF: 0.8 seconds
Tabula: 4.1 seconds (includes JVM startup)
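
Timings like these depend heavily on hardware and on the documents themselves, so it is worth re-measuring on your own files. A minimal sketch of a timing harness; `time_extractor` is a hypothetical helper, and `dummy_extract` is a stand-in so the example runs without any PDF library installed:

```python
import time
from typing import Callable


def time_extractor(extract: Callable[[str], object], pdf_path: str, repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        extract(pdf_path)
        best = min(best, time.perf_counter() - start)
    return best


def dummy_extract(path: str) -> list[str]:
    # Stand-in for a real extractor (e.g. a function wrapping pdfplumber or PyMuPDF)
    return [path] * 1000


elapsed = time_extractor(dummy_extract, "quarterly_report.pdf")
print(f"{elapsed:.6f} s")
```

Taking the best of several runs reduces noise from disk caching and, for Tabula, makes it easier to separate JVM startup from extraction time.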

Tabula - Good for Simple Tables

Tabula works well when your PDFs have clear, bordered tables:

tabula_example.py
import tabula

# Extract all tables from PDF (returns a list of pandas DataFrames)
tables = tabula.read_pdf("simple_report.pdf", pages="all")
for df in tables:
    print(df.to_string())

Pros:

  • Returns pandas DataFrames directly
  • Good GUI tool (Tabula Java) for debugging
  • Handles simple bordered tables well

Cons:

  • Requires Java (JVM dependency)
  • Struggles with borderless tables
  • Column alignment issues on financial statements

Camelot - Another Borderless Option

Camelot offers two extraction modes:

camelot_example.py
import camelot
# Lattice mode: for tables with visible borders
tables_lattice = camelot.read_pdf("bordered.pdf", flavor="lattice")
# Stream mode: for borderless tables
tables_stream = camelot.read_pdf("borderless.pdf", flavor="stream")

The stream mode should handle borderless tables, but in practice, I found it required significant tuning for financial documents:

camelot_tuned.py
import camelot

tables = camelot.read_pdf(
    "financial.pdf",
    flavor="stream",
    # Column separators as comma-separated x-coordinates in PDF points,
    # not percentages (the values here are illustrative)
    columns=["61,184,306,428,551"],
    row_tol=10,  # Row tolerance
)

Specifying column positions manually defeats the purpose of automatic extraction.
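
If you do end up specifying columns by hand, note that Camelot's `columns` argument takes comma-separated x-coordinates in PDF points, so positions you think of as fractions of the page need converting first. A minimal sketch; `columns_from_fractions` is a hypothetical helper, and the default width of 612 points assumes a US Letter page (read the real width from your document):

```python
def columns_from_fractions(fractions: list[float], page_width: float = 612.0) -> list[str]:
    """Convert fractional positions (0 to 1) into a Camelot-style columns list."""
    if any(not 0 <= f <= 1 for f in fractions):
        raise ValueError("fractions must be between 0 and 1")
    # Sort left to right and round to a tenth of a point
    xs = [round(f * page_width, 1) for f in sorted(fractions)]
    # One comma-separated string, wrapped in a list
    return [",".join(str(x) for x in xs)]


columns = columns_from_fractions([0.1, 0.3, 0.5, 0.7, 0.9])
print(columns)  # ['61.2,183.6,306.0,428.4,550.8']
```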

Performance Comparison

I ran tests on a corpus of 50 financial PDFs (quarterly reports, balance sheets, income statements):

extraction_accuracy.txt
Library      Clean Tables   Messy Tables   Overall
pdfplumber   94%            78%            86%
Tabula       91%            52%            72%
Camelot      88%            61%            75%
PyMuPDF      82%            48%            65%

“Clean tables” have visible borders and consistent formatting. “Messy tables” are typical financial reports with merged cells, multi-page tables, and inconsistent spacing.
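
If you want to run this kind of comparison on your own documents, one simple metric is cell-level accuracy against a hand-checked table. A minimal sketch under that assumption; `cell_accuracy` and the sample tables are illustrative and are not the scoring used for the numbers above:

```python
from typing import Optional


def cell_accuracy(extracted: list[list[Optional[str]]], expected: list[list[str]]) -> float:
    """Share of expected cells that the extraction reproduced in the same position."""
    total = sum(len(row) for row in expected)
    if total == 0:
        return 0.0
    correct = 0
    for r, expected_row in enumerate(expected):
        extracted_row = extracted[r] if r < len(extracted) else []
        for c, expected_cell in enumerate(expected_row):
            got = extracted_row[c] if c < len(extracted_row) else None
            # Missing cells (None) count as wrong; whitespace differences are ignored
            if got is not None and got.strip() == expected_cell.strip():
                correct += 1
    return correct / total


# Made-up example: the second row's values shifted one column to the left
expected = [["Revenue", "1,200", "1,350"], ["Expenses", "(300)", "(320)"]]
extracted = [["Revenue", "1,200", "1,350"], ["Expenses", "(320)", ""]]
score = cell_accuracy(extracted, expected)
print(f"{score:.0%}")
```

Because the comparison is positional, a value that lands in the wrong column counts as an error, which is exactly the failure mode that matters for financial tables.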

After much trial and error, here’s my production approach:

combined_extractor.py
import fitz  # PyMuPDF
import pdfplumber
from typing import Optional, Union


def extract_financial_pdf(pdf_path: str) -> list[dict]:
    """
    Extract financial data from PDF using optimal library combination.

    Strategy:
    1. PyMuPDF to detect page types
    2. pdfplumber for text-based pages
    3. Flag image-based pages for OCR processing
    """
    results = []
    # Open both handles once instead of reopening the file for every page
    with fitz.open(pdf_path) as doc, pdfplumber.open(pdf_path) as pdf:
        for page_num in range(len(doc)):
            text = doc[page_num].get_text()
            page_data = {
                "page": page_num + 1,
                "type": "unknown",
                "tables": [],
                "text": "",
                "needs_ocr": False,
            }
            # Detection phase
            if len(text.strip()) < 50:
                page_data["type"] = "image-based"
                page_data["needs_ocr"] = True
            else:
                page_data["type"] = "text-based"
                page_data["text"] = text
                # Extraction phase using pdfplumber
                tables = pdf.pages[page_num].extract_tables()
                if tables:
                    page_data["tables"] = clean_financial_tables(tables)
            results.append(page_data)
    return results


def clean_financial_tables(tables: list) -> list[dict]:
    """Clean and structure extracted financial tables."""
    cleaned = []
    for table in tables:
        if not table:
            continue
        # Find header row (first non-empty row)
        header_idx = 0
        for i, row in enumerate(table):
            if row and any(cell for cell in row):
                header_idx = i
                break
        headers = table[header_idx]
        data_rows = table[header_idx + 1:]
        for row in data_rows:
            if not row or not any(row):
                continue
            row_dict = {}
            for i, cell in enumerate(row):
                if i < len(headers):
                    # Fall back to a positional name for empty or blank headers
                    key = (headers[i] or "").strip() or f"col_{i}"
                    row_dict[key] = parse_financial_value(cell)
            if row_dict:
                cleaned.append(row_dict)
    return cleaned


def parse_financial_value(value: Optional[str]) -> Optional[Union[float, str]]:
    """Parse financial values like '$1,234.56' or '(500.00)'."""
    if not value:
        return None
    value = str(value).strip()
    # Handle negative numbers in parentheses
    if value.startswith("(") and value.endswith(")"):
        value = "-" + value[1:-1]
    # Remove currency symbols and commas
    cleaned = value.replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        # Not a number (e.g. a row label): return the text unchanged
        return value

This gives me the best of both worlds: PyMuPDF’s fast detection and pdfplumber’s accurate table extraction.
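
One caveat with parsing into float: binary floating point cannot represent most decimal amounts exactly, which can matter when you sum or reconcile figures. A sketch of a Decimal-based variant under the same formatting assumptions (dollar signs, commas, parentheses for negatives); `parse_financial_decimal` is a hypothetical alternative, not part of the extractor above:

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_financial_decimal(value: Optional[str]) -> Optional[Decimal]:
    """Parse '$1,234.56' or '(500.00)' into a Decimal, or None if not numeric."""
    if not value:
        return None
    text = str(value).strip()
    # Parentheses mark a negative amount in accounting notation
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    text = text.replace("$", "").replace(",", "").strip()
    try:
        number = Decimal(text)
    except InvalidOperation:
        return None
    # Reject special values such as NaN or Infinity
    if not number.is_finite():
        return None
    return -number if negative else number


print(parse_financial_decimal("$1,234.56"))  # 1234.56
print(parse_financial_decimal("(500.00)"))   # -500.00
print(parse_financial_decimal("N/A"))        # None
```

Unlike the float version, this returns None for non-numeric cells instead of passing the text through, so row labels would need to be handled separately.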

The Accuracy Boost: LLM Post-Processing

Even with the best extraction, you’ll get errors. I added an LLM post-processing pass:

llm_postprocessor.py
import json

import openai  # expects the OPENAI_API_KEY environment variable to be set


def llm_validate_extraction(table_data: list[dict], context: str) -> list[dict]:
    """
    Use LLM to validate and correct extraction errors.
    Accuracy improvement: ~78% (pdfplumber alone) -> ~92%
    """
    # Serialize as JSON so the model sees the same format it is asked to return
    prompt = f"""
    Review this extracted financial table data for errors.
    Original context: {context}
    Extracted data:
    {json.dumps(table_data, indent=2)}
    Check for:
    1. Misaligned columns
    2. Missing values (should be 0 or N/A?)
    3. Format inconsistencies
    4. Calculation errors (if sum/total present)
    Return corrected JSON array.
    """
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # parse_llm_response is a helper not shown here; it should turn the
    # model's reply into a list of dicts
    return parse_llm_response(response.choices[0].message.content)
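
The `parse_llm_response` helper is not shown above. One possible shape for it, as a sketch: take the outermost JSON array in the reply (models often add a sentence or a code fence around it) and fail loudly if it does not parse. The behavior here is an assumption about what the helper should do, not the original implementation:

```python
import json


def parse_llm_response(content: str) -> list[dict]:
    """Extract a JSON array of objects from an LLM reply; raise ValueError otherwise."""
    # Take everything from the first "[" to the last "]", which skips any
    # surrounding prose or code-fence markers
    start, end = content.find("["), content.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no JSON array found in LLM response")
    try:
        data = json.loads(content[start:end + 1])
    except json.JSONDecodeError as exc:
        raise ValueError(f"LLM response is not valid JSON: {exc}") from exc
    if not isinstance(data, list) or not all(isinstance(row, dict) for row in data):
        raise ValueError("expected a JSON array of objects")
    return data


reply = 'Here is the corrected data:\n[{"Item": "Revenue", "Q1": 1200.0, "Q2": null}]\nDone.'
print(parse_llm_response(reply))
```

Raising instead of returning an empty list makes it harder for a malformed reply to silently wipe out a page's tables.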

The results speak for themselves:

accuracy_comparison.txt
Extraction Method              Accuracy
Pure regex/heuristics          ~60%
pdfplumber alone               ~78%
pdfplumber + LLM validation    ~92%
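
Check 4 in the prompt (calculation errors) can also be done deterministically before paying for an LLM call. A minimal sketch, assuming rows shaped like the dicts produced by the table cleaner, with one label column and a row labelled "Total"; `check_column_total`, the key names, and the sample rows are hypothetical:

```python
from typing import Optional


def check_column_total(rows: list[dict], label_key: str, value_key: str,
                       total_label: str = "Total", tolerance: float = 0.01) -> Optional[bool]:
    """Return True/False if line items match the total row, None if nothing to check."""
    items, totals = [], []
    for row in rows:
        value = row.get(value_key)
        if not isinstance(value, (int, float)):
            continue  # skip blanks and unparsed strings
        if str(row.get(label_key, "")).strip().lower() == total_label.lower():
            totals.append(value)
        else:
            items.append(value)
    # Only check when there is exactly one total row and at least one line item
    if len(totals) != 1 or not items:
        return None
    return abs(sum(items) - totals[0]) <= tolerance


# Made-up rows for illustration
rows = [
    {"Item": "Product", "Q1": 700.0},
    {"Item": "Services", "Q1": 500.0},
    {"Item": "Total", "Q1": 1200.0},
]
print(check_column_total(rows, "Item", "Q1"))  # True
```

A False result is a strong hint that a digit was dropped or a value landed in the wrong column, which is a good reason to send that table for validation.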

What I Use Now

For production financial document processing:

production_pipeline.py
def process_financial_pdf(pdf_path: str) -> dict:
    """
    Complete financial PDF processing pipeline.

    send_to_ocr, merge_ocr_results and generate_summary are project-specific
    helpers that are not shown in this article.
    """
    # Step 1: Detect and extract
    raw_data = extract_financial_pdf(pdf_path)
    # Step 2: Identify pages needing OCR
    ocr_pages = [p for p in raw_data if p["needs_ocr"]]
    if ocr_pages:
        # Send to OCR service (AWS Textract, Google Vision, etc.)
        ocr_results = send_to_ocr(pdf_path, ocr_pages)
        merge_ocr_results(raw_data, ocr_results)
    # Step 3: Validate with LLM
    for page in raw_data:
        if page["tables"]:
            page["tables"] = llm_validate_extraction(
                page["tables"],
                page["text"][:1000],  # Context from page
            )
    return {
        "pages": raw_data,
        "summary": generate_summary(raw_data),
    }
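
The pipeline relies on helpers that are not shown here (`send_to_ocr`, `merge_ocr_results`, `generate_summary`). As one illustration, a summary step could simply count what was found. This is a hypothetical sketch that assumes the per-page dict shape used by the extractor above; the real helper may report something different:

```python
def generate_summary(pages: list[dict]) -> dict:
    """Summarize page types and extracted rows across a processed document."""
    return {
        "total_pages": len(pages),
        "text_pages": sum(1 for p in pages if p["type"] == "text-based"),
        # 1-based page numbers that still need OCR
        "ocr_pages": [p["page"] for p in pages if p["needs_ocr"]],
        # Each entry in "tables" is one cleaned row dict
        "table_rows": sum(len(p["tables"]) for p in pages),
    }


# Made-up pipeline output: a scanned cover page followed by one data page
sample = [
    {"page": 1, "type": "image-based", "tables": [], "text": "", "needs_ocr": True},
    {"page": 2, "type": "text-based", "tables": [{"Item": "Revenue", "Q1": 1200.0}],
     "text": "Revenue 1,200", "needs_ocr": False},
]
summary = generate_summary(sample)
print(summary)
```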

Key Takeaways

  1. Use pdfplumber for text-based financial PDFs - it handles column alignment better than Tabula
  2. Use PyMuPDF for detection - quickly identify image vs text pages
  3. Avoid Tabula/Camelot for borderless financial tables - they require too much manual tuning
  4. Add LLM validation - it lifted accuracy from ~78% (pdfplumber alone) to ~92%, which is worth the API cost
  5. Process image-based pages separately - don’t try to force text extraction on scans

The right tool depends on your PDFs. For most financial documents, the pdfplumber + PyMuPDF + LLM combination gives you production-quality extraction.

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can contact me by email.

