pdfplumber vs PyMuPDF vs Tabula: Which is Best for Financial PDFs?
I needed to extract financial data from PDFs. Not simple PDFs - I’m talking about messy quarterly reports with tables spanning multiple columns, headers repeating on every page, and worst of all, a mix of text-based and image-based pages in the same document.
My first attempt with a popular library returned garbage:
Revenue Q1 Q2 Q3 Q4
$1,234,567 $89,012 $345,678 $123,456

That’s not right. The numbers were completely misaligned with their columns. Financial decisions based on this would be disastrous.
Let me walk through what I learned comparing the major Python PDF extraction libraries for financial document processing.
The Problem with Financial PDFs
Financial PDFs are uniquely painful for several reasons:
- Tables with inconsistent borders - Some have visible grid lines, others rely on whitespace
- Multi-column layouts - Numbers need to stay aligned with their row labels
- Mixed content types - Cover pages as images, data pages as text
- Headers and footers - Repeating on every page, breaking table continuity
- Currency formatting - Dollar signs, commas, and parentheses for negatives
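To make the header/footer problem concrete, here is a minimal sketch of stitching one table per page back together while dropping header rows that repeat on later pages. The function names and the input shape (a list of row lists per page, first row of the first page being the header) are assumptions for illustration, not part of any library.

```python
from typing import List, Optional

Row = List[Optional[str]]


def normalize(row: Row) -> tuple:
    """Collapse whitespace and case so repeated headers compare equal."""
    return tuple(" ".join((cell or "").split()).lower() for cell in row)


def stitch_pages(page_tables: List[List[Row]]) -> List[Row]:
    """Merge one table per page into a single table.

    Assumes the first row seen is the header and that later pages
    may repeat it (hypothetical input shape).
    """
    merged: List[Row] = []
    header_key = None
    for rows in page_tables:
        for row in rows:
            key = normalize(row)
            if header_key is None:
                header_key = key      # first row seen is treated as the header
                merged.append(row)
            elif key == header_key:
                continue              # repeated header on a later page
            else:
                merged.append(row)
    return merged
```

This only handles headers that repeat nearly verbatim; page footers and "continued" markers would need their own rules.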
I tested four major libraries: pdfplumber, PyMuPDF (fitz), Tabula, and Camelot.
Library Comparison
pdfplumber - Best for Text-Based Financial Tables
pdfplumber became my go-to for text-based PDFs with tables. Here’s why:
import pdfplumber

with pdfplumber.open("quarterly_report.pdf") as pdf:
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            for row in table:
                print(row)

What it does well:
- Handles column alignment better than Tabula for financial tables
- Provides fine-grained control over table extraction settings
- Gives you bounding boxes for each extracted element
- Works without Java dependency (unlike Tabula)
The key advantage: pdfplumber uses the PDF’s internal character positioning data to detect columns. This means it respects the original layout rather than guessing based on whitespace.
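As a rough illustration of position-based column detection, the sketch below groups words into columns by their left edge. It is not pdfplumber's actual algorithm; the word dicts only mimic the "text" and "x0" keys that pdfplumber's extract_words() returns, and the gap threshold is an assumed value.

```python
from typing import Dict, List


def assign_columns(words: List[Dict], gap: float = 15.0) -> List[List[str]]:
    """Group words into columns by their left edge (x0).

    `gap` is an assumed threshold in PDF points: a jump larger than
    this between consecutive left edges starts a new column.
    """
    columns: List[List[str]] = []
    last_x0 = None
    for word in sorted(words, key=lambda w: w["x0"]):
        if last_x0 is None or word["x0"] - last_x0 > gap:
            columns.append([])
        columns[-1].append(word["text"])
        last_x0 = word["x0"]
    return columns
```

Financial figures are usually right-aligned, so for numeric columns clustering on the right edge (x1) may work better than x0.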
import pdfplumber

with pdfplumber.open("financial_report.pdf") as pdf:
    page = pdf.pages[0]

    # Configure table extraction for financial docs
    tables = page.extract_tables({
        "vertical_strategy": "text",
        "horizontal_strategy": "text",
        "snap_tolerance": 3,
        "join_tolerance": 3,
    })

The snap_tolerance and join_tolerance parameters are crucial for financial PDFs where columns might have slight misalignments.
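To give a feel for what a snap tolerance does, here is a simplified sketch: edge positions that sit within the tolerance of a neighbor are pulled to a shared position. This is an illustration of the idea only, not pdfplumber's implementation, and the function name is made up for the example.

```python
from typing import List


def snap_positions(positions: List[float], tolerance: float = 3.0) -> List[float]:
    """Replace positions that lie within `tolerance` of a neighbor with the group mean.

    Returns the snapped values in sorted order.
    """
    snapped: List[float] = []
    cluster: List[float] = []
    for pos in sorted(positions):
        # Compare with the previous member so a chain of close values stays together.
        if cluster and pos - cluster[-1] > tolerance:
            mean = sum(cluster) / len(cluster)
            snapped.extend([mean] * len(cluster))
            cluster = []
        cluster.append(pos)
    if cluster:
        mean = sum(cluster) / len(cluster)
        snapped.extend([mean] * len(cluster))
    return snapped
```

With a tolerance of 3, column edges at 100, 101 and 102 collapse to a single edge, while an edge at 200 stays separate. Raising the tolerance merges more aggressively, which can fuse genuinely distinct narrow columns.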
PyMuPDF (fitz) - Best for Detection and Speed
PyMuPDF excels at answering a critical question: Is this page text-based or image-based?
import fitz  # PyMuPDF

def detect_page_type(pdf_path):
    doc = fitz.open(pdf_path)
    first_page = doc[0]

    # Get text from page
    text = first_page.get_text()

    # If minimal text, it's likely image-based
    if len(text.strip()) < 50:
        return "image-based"

    # Check for embedded images
    images = first_page.get_images()
    if images and len(text.strip()) < 100:
        return "image-based"

    return "text-based"

# Result: "text-based" or "image-based"

I use PyMuPDF for the detection phase, then route to the appropriate extraction method:
import fitz

def route_extraction(pdf_path):
    doc = fitz.open(pdf_path)
    results = []

    for i, page in enumerate(doc):
        text = page.get_text()

        if len(text.strip()) > 100:
            # Text-based: use pdfplumber
            results.append(("text", i))
        else:
            # Image-based: needs OCR
            results.append(("image", i))

    return results

PyMuPDF is also fast. For bulk processing hundreds of financial PDFs, this matters:
Benchmark: 100 pages extraction
pdfplumber: 2.3 seconds
PyMuPDF: 0.8 seconds
Tabula: 4.1 seconds (includes JVM startup)

Tabula - Good for Simple Tables
Tabula works well when your PDFs have clear, bordered tables:
import tabula

# Extract all tables from PDF
tables = tabula.read_pdf("simple_report.pdf", pages="all")

for df in tables:
    print(df.to_string())

Pros:
- Returns pandas DataFrames directly
- Good GUI tool (Tabula Java) for debugging
- Handles simple bordered tables well
Cons:
- Requires Java (JVM dependency)
- Struggles with borderless tables
- Column alignment issues on financial statements
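One cheap way to catch the alignment issues listed above, whichever library produced the table, is to flag rows whose number of filled cells differs from the header's. This is a minimal sketch with a hypothetical function name; it operates on plain lists of rows rather than on a DataFrame.

```python
from typing import List, Optional


def find_ragged_rows(table: List[List[Optional[str]]]) -> List[int]:
    """Return indices of data rows whose filled-cell count differs from the header's."""
    if not table:
        return []

    def filled(row: List[Optional[str]]) -> int:
        # Count cells that contain something other than whitespace.
        return sum(1 for cell in row if cell and cell.strip())

    expected = filled(table[0])
    return [i for i, row in enumerate(table[1:], start=1) if filled(row) != expected]
```

Rows with legitimately blank cells get flagged too, so this is a prompt for manual review rather than a verdict.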
Camelot - Another Borderless Option
Camelot offers two extraction modes:
import camelot

# Lattice mode: for tables with visible borders
tables_lattice = camelot.read_pdf("bordered.pdf", flavor="lattice")

# Stream mode: for borderless tables
tables_stream = camelot.read_pdf("borderless.pdf", flavor="stream")

The stream mode should handle borderless tables, but in practice, I found it required significant tuning for financial documents:
import camelot

tables = camelot.read_pdf(
    "financial.pdf",
    flavor="stream",
    columns=["72,200,330,460,540"],  # Column separator x-coordinates in PDF points (illustrative values)
    row_tol=10,  # Row tolerance
)

Specifying column positions manually defeats the purpose of automatic extraction.
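Camelot's columns option takes comma-separated x-coordinates in PDF points, not percentages. If you think of column positions as fractions of the page width, a small helper can do the conversion. This is a minimal sketch: the helper name is hypothetical, and the 612-point width in the example assumes a US Letter page.

```python
from typing import Sequence


def column_separators(page_width: float, fractions: Sequence[float]) -> str:
    """Convert fractional positions (0 to 1) into a comma-separated string of x-coordinates."""
    for fraction in fractions:
        if not 0.0 <= fraction <= 1.0:
            raise ValueError(f"fraction out of range: {fraction}")
    return ",".join(f"{page_width * fraction:.1f}" for fraction in fractions)


# Possible usage with Camelot (not executed here):
# camelot.read_pdf("financial.pdf", flavor="stream",
#                  columns=[column_separators(612, [0.1, 0.3, 0.5, 0.7, 0.9])])
```

This still means hard-coding a layout per document family, so it helps most when many reports share one template.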
Performance Comparison
I ran tests on a corpus of 50 financial PDFs (quarterly reports, balance sheets, income statements):
| Library | Clean Tables | Messy Tables | Overall |
|---|---|---|---|
| pdfplumber | 94% | 78% | 86% |
| Tabula | 91% | 52% | 72% |
| Camelot | 88% | 61% | 75% |
| PyMuPDF | 82% | 48% | 65% |

“Clean tables” have visible borders and consistent formatting. “Messy tables” are typical financial reports with merged cells, multi-page tables, and inconsistent spacing.
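The article does not say how these percentages were scored. One plausible way to score your own corpus is cell-level accuracy against a hand-checked reference table, sketched below. The function name and the exact-match rule are assumptions, not the method behind the numbers above.

```python
from typing import List, Optional

Table = List[List[Optional[str]]]


def cell_accuracy(extracted: Table, expected: Table) -> float:
    """Fraction of expected cells reproduced exactly at the same row/column position."""
    total = 0
    correct = 0
    for r, expected_row in enumerate(expected):
        for c, expected_cell in enumerate(expected_row):
            total += 1
            try:
                got = extracted[r][c]
            except IndexError:
                continue  # a missing row or cell counts as an error
            if (got or "").strip() == (expected_cell or "").strip():
                correct += 1
    return correct / total if total else 0.0
```

A position-based metric like this is strict: one merged cell shifts everything after it in the row, which is usually the behavior you want when scoring financial tables.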
The Recommended Combination Approach
After much trial and error, here’s my production approach:
import fitz  # PyMuPDF
import pdfplumber
from typing import Optional

def extract_financial_pdf(pdf_path: str) -> list[dict]:
    """
    Extract financial data from PDF using optimal library combination.

    Strategy:
    1. PyMuPDF to detect page types
    2. pdfplumber for text-based pages
    3. Flag image-based pages for OCR processing
    """
    results = []

    # Open the file once with each library instead of reopening it for every page
    with fitz.open(pdf_path) as doc, pdfplumber.open(pdf_path) as pdf:
        for page_num in range(len(doc)):
            page = doc[page_num]
            text = page.get_text()

            page_data = {
                "page": page_num + 1,
                "type": "unknown",
                "tables": [],
                "text": "",
                "needs_ocr": False
            }

            # Detection phase
            if len(text.strip()) < 50:
                page_data["type"] = "image-based"
                page_data["needs_ocr"] = True
            else:
                page_data["type"] = "text-based"
                page_data["text"] = text

                # Extraction phase using pdfplumber
                plumb_page = pdf.pages[page_num]
                tables = plumb_page.extract_tables()

                if tables:
                    page_data["tables"] = clean_financial_tables(tables)

            results.append(page_data)

    return results

def clean_financial_tables(tables: list) -> list[dict]:
    """Clean and structure extracted financial tables."""
    cleaned = []

    for table in tables:
        if not table:
            continue

        # Find header row (first non-empty row)
        header_idx = 0
        for i, row in enumerate(table):
            if row and any(cell for cell in row):
                header_idx = i
                break

        headers = table[header_idx]
        data_rows = table[header_idx + 1:]

        for row in data_rows:
            if not row or not any(row):
                continue

            row_dict = {}
            for i, cell in enumerate(row):
                if i < len(headers):
                    key = headers[i] or f"col_{i}"
                    row_dict[key.strip()] = parse_financial_value(cell)

            if row_dict:
                cleaned.append(row_dict)

    return cleaned

def parse_financial_value(value: Optional[str]) -> Optional[float | str]:
    """Parse financial values like '$1,234.56' or '(500.00)'."""
    if not value:
        return None

    value = str(value).strip()

    # Handle negative numbers in parentheses
    if value.startswith("(") and value.endswith(")"):
        value = "-" + value[1:-1]

    # Remove currency symbols and commas
    cleaned = value.replace("$", "").replace(",", "").strip()

    try:
        return float(cleaned)
    except ValueError:
        return value

This gives me the best of both worlds: PyMuPDF’s fast detection and pdfplumber’s accurate table extraction.
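One design choice worth weighing in the value parser: float can introduce rounding noise when amounts are later summed, so for monetary values Decimal is often the safer type. Below is a minimal sketch of a Decimal-based variant. The name parse_money is hypothetical, and unlike parse_financial_value it returns None for non-numeric cells instead of passing the original string through.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_money(value: Optional[str]) -> Optional[Decimal]:
    """Parse '$1,234.56', '(500.00)' or '-$12' into a Decimal; return None if not numeric."""
    if value is None:
        return None
    text = value.strip()

    # Accounting notation: parentheses mean a negative amount.
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]

    text = text.replace("$", "").replace(",", "").strip()
    if not text:
        return None

    try:
        amount = Decimal(text)
    except InvalidOperation:
        return None
    if not amount.is_finite():
        return None  # reject strings such as "NaN" or "Infinity"
    return -amount if negative else amount
```

Only the dollar sign is stripped here, matching the article's parser; other currencies, percent signs, and trailing minus signs would need extra rules.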
The Accuracy Boost: LLM Post-Processing
Even with the best extraction, you’ll get errors. I added an LLM post-processing pass:
import openai

def llm_validate_extraction(table_data: list[dict], context: str) -> list[dict]:
    """
    Use LLM to validate and correct extraction errors.

    Accuracy improvement: ~60% -> ~92%
    """
    prompt = f"""
    Review this extracted financial table data for errors.
    Original context: {context}

    Extracted data: {table_data}

    Check for:
    1. Misaligned columns
    2. Missing values (should be 0 or N/A?)
    3. Format inconsistencies
    4. Calculation errors (if sum/total present)

    Return corrected JSON array.
    """

    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )

    # parse_llm_response: helper defined elsewhere that turns the JSON reply into a list of dicts
    return parse_llm_response(response.choices[0].message.content)

The results speak for themselves:
| Extraction Method | Accuracy |
|---|---|
| Pure regex/heuristics | ~60% |
| pdfplumber alone | ~78% |
| pdfplumber + LLM validation | ~92% |

What I Use Now
For production financial document processing:
def process_financial_pdf(pdf_path: str) -> dict:
    """
    Complete financial PDF processing pipeline.
    """
    # send_to_ocr, merge_ocr_results and generate_summary are
    # project-specific helpers defined elsewhere.

    # Step 1: Detect and extract
    raw_data = extract_financial_pdf(pdf_path)

    # Step 2: Identify tables needing OCR
    ocr_pages = [p for p in raw_data if p["needs_ocr"]]
    if ocr_pages:
        # Send to OCR service (AWS Textract, Google Vision, etc.)
        ocr_results = send_to_ocr(pdf_path, ocr_pages)
        merge_ocr_results(raw_data, ocr_results)

    # Step 3: Validate with LLM
    for page in raw_data:
        if page["tables"]:
            page["tables"] = llm_validate_extraction(
                page["tables"],
                page["text"][:1000]  # Context from page
            )

    return {
        "pages": raw_data,
        "summary": generate_summary(raw_data)
    }

Key Takeaways
- Use pdfplumber for text-based financial PDFs - it handles column alignment better than Tabula
- Use PyMuPDF for detection - quickly identify image vs text pages
- Avoid Tabula/Camelot for borderless financial tables - they require too much manual tuning
- Add LLM validation - it raised accuracy from ~78% with pdfplumber alone to ~92% (pure regex/heuristics managed ~60%), which is worth the API cost
- Process image-based pages separately - don’t try to force text extraction on scans
The right tool depends on your PDFs. For most financial documents, the pdfplumber + PyMuPDF + LLM combination gives you production-quality extraction.
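One loose end: llm_validate_extraction calls parse_llm_response, which is never defined in the article. A minimal sketch of what such a helper might look like follows. It is an assumption about the intended behavior, not the author's implementation: it takes the first JSON array found in the reply and raises ValueError otherwise.

```python
import json
from typing import Any, Dict, List


def parse_llm_response(content: str) -> List[Dict[str, Any]]:
    """Extract a JSON array from an LLM reply.

    Tolerates prose or code fences around the array by slicing from the
    first '[' to the last ']'. Raises ValueError if no valid array is found.
    """
    start = content.find("[")
    end = content.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no JSON array found in LLM response")
    try:
        return json.loads(content[start:end + 1])
    except json.JSONDecodeError as exc:
        raise ValueError(f"LLM response is not valid JSON: {exc}") from exc
```

Failing loudly matters here: if the reply cannot be parsed, it is safer to keep the unvalidated extraction and flag the page than to silently accept a partial result.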