Best Python Library for Extracting Tables from Inconsistent Financial PDFs?
I stared at the PDF on my screen. Another annual report with tables that looked perfect visually but were a nightmare programmatically. No borders. Misaligned columns. Rows broken across pages. My tabula-py script returned garbage.
$ python extract_tables.py annual_report.pdf
[]  # Empty. Again.

If you’ve tried extracting tables from financial PDFs, you know the pain. Let me walk through what actually works.
The Problem with PDF Tables
Here’s what most people don’t realize: there is no standard table construct in the PDF format. PDFs store positioning data, not structural data. A table you see on screen is just a bunch of text positioned at specific coordinates.
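To make the "text at coordinates" idea concrete, here is a minimal sketch of turning positioned words back into rows. The word records and the `group_into_rows` helper are hypothetical illustrations, not any library's API; a real parser would supply similar (text, x, y) data.

```python
# Hypothetical word records: (text, x0, top), the kind of positional
# data a PDF parser exposes instead of a table structure.
words = [
    ("Revenue", 50.0, 100.2),
    ("2024", 200.0, 100.0),
    ("2023", 300.0, 100.1),
    ("Services", 50.0, 115.0),
    ("$567,890", 200.0, 115.3),
    ("$500,000", 300.0, 114.9),
]


def group_into_rows(words, y_tolerance=2.0):
    """Cluster words whose vertical positions are close into rows."""
    rows = []  # each entry: [reference_top, [(x0, text), ...]]
    for text, x0, top in sorted(words, key=lambda w: w[2]):
        if rows and abs(top - rows[-1][0]) <= y_tolerance:
            rows[-1][1].append((x0, text))
        else:
            rows.append([top, [(x0, text)]])
    # Order each row left to right and keep only the text
    return [[text for _, text in sorted(cells)] for _, cells in rows]


for row in group_into_rows(words):
    print(row)
# ['Revenue', '2024', '2023']
# ['Services', '$567,890', '$500,000']
```

The `y_tolerance` value is an assumption; fonts with different baselines or wrapped cells need a looser or smarter grouping rule.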
When I first tried extracting tables, I assumed libraries would “just work”:
import tabula

# This should work, right?
tables = tabula.read_pdf("annual_report.pdf", pages="all")
print(f"Found {len(tables)} tables")

Found 0 tables

Zero tables. But I could clearly see tables in the document!
Why Financial PDFs Are Hard
Financial documents have unique challenges:
+----------------------+-------------------------------------+
| Issue                | Example                             |
+----------------------+-------------------------------------+
| Invisible borders    | Tables with no grid lines           |
| Misaligned columns   | Numbers shifted left/right          |
| Multi-line rows      | Long text wrapping to next line     |
| Inconsistent spacing | Some cells have extra padding       |
| Merged cells         | Category headers spanning columns   |
| Mixed content        | Numbers, text, percentages together |
+----------------------+-------------------------------------+

I tested the major Python libraries. Here’s what I found.
Library 1: Tabula-py
import tabula

tables = tabula.read_pdf(
    "annual_report.pdf",
    pages="all",
    lattice=True,  # For tables with visible borders
)

Result: Works great for tables with visible grid lines. Fails silently on the invisible-border tables that dominate annual reports.
Switching to stream mode (stream=True instead of lattice=True) for borderless tables:

tables = tabula.read_pdf(
    "annual_report.pdf",
    pages="all",
    stream=True,  # Whitespace-based detection for borderless tables
    columns=[100, 200, 300, 400],  # Need to guess column positions
)

Problem: To get usable columns, I had to specify the column positions manually. That doesn't scale to batch processing.
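One way to avoid hand-typing column positions is to derive them from the horizontal gaps that no word crosses. The sketch below is an assumption about how that could be done, not something tabula-py does for you; `infer_column_boundaries` and the sample spans are hypothetical.

```python
def infer_column_boundaries(spans, min_gap=10.0):
    """Return x positions that separate columns.

    spans: (x0, x1) horizontal extents of the words in a table region.
    A boundary is placed in the middle of every horizontal gap wider
    than min_gap that no word overlaps.
    """
    boundaries = []
    covered_until = None
    for x0, x1 in sorted(spans):
        if covered_until is not None and x0 - covered_until > min_gap:
            boundaries.append((covered_until + x0) / 2)
        covered_until = x1 if covered_until is None else max(covered_until, x1)
    return boundaries


# Hypothetical word extents from three visual columns
spans = [(50, 120), (50, 95), (200, 260), (210, 260), (300, 360), (305, 360)]
print(infer_column_boundaries(spans))  # [160.0, 280.0]
```

The resulting list could be passed as the `columns` argument shown above. The approach breaks down when a header or merged cell spans several columns, because that text covers the gaps.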
Library 2: Camelot
Camelot offers two modes: lattice (detects lines) and stream (detects whitespace).
import camelot

# Lattice mode for bordered tables
tables = camelot.read_pdf(
    "annual_report.pdf",
    pages="1-5",
    flavor="lattice",
)
print(f"Lattice accuracy: {tables[0].accuracy if tables else 0}%")

Lattice accuracy: 0%

Zero percent. Because my tables had no visible borders.
Trying stream mode:
tables = camelot.read_pdf(
    "annual_report.pdf",
    pages="1-5",
    flavor="stream",
)

Result: Better, but still struggled with column alignment. Financial tables often have numbers that “bleed” into adjacent columns.
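One plausible cause of the bleed is that numeric columns are right-aligned, so a wide number starts further left than its neighbours and crosses a boundary computed from left edges. As a hedged sketch (the helper name, coordinates, and right-alignment assumption are all mine, not Camelot's behaviour), words can be assigned by right edge instead:

```python
def assign_by_right_edge(words, column_right_edges):
    """Place each word in the column whose right edge is nearest to its own.

    words: (text, x0, x1) tuples for one row.
    column_right_edges: x position of each column's right edge.
    Assumes right-aligned numeric columns; label columns need left edges.
    """
    row = [""] * len(column_right_edges)
    for text, _x0, x1 in words:
        idx = min(
            range(len(column_right_edges)),
            key=lambda i: abs(column_right_edges[i] - x1),
        )
        row[idx] = (row[idx] + " " + text).strip()
    return row


# The second number starts at x=268, left of a boundary placed at x=280,
# so a left-edge rule would put it in the first column.
row_words = [("$567,890", 210, 260), ("$12,345,678", 268, 360)]
print(assign_by_right_edge(row_words, [260, 360]))
# ['$567,890', '$12,345,678']
```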
Library 3: pdfplumber (The Winner for Text-Based PDFs)
pdfplumber takes a different approach. It exposes the raw character-level positioning data, letting you build tables from the ground up.
import pdfplumber

with pdfplumber.open("annual_report.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()

for table in tables:
    for row in table:
        print(row)

['Revenue', '2024', '2023', 'Change']
['Product Sales', '$1,234,567', '$1,100,000', '+12.2%']
['Services', '$567,890', '$500,000', '+13.6%']

Why pdfplumber wins: It handled column alignment better than the other libraries I tested. Instead of guessing where columns should be, it uses the actual character positions.
For fine-grained control:
import pdfplumber

with pdfplumber.open("annual_report.pdf") as pdf:
    page = pdf.pages[0]

    # Customize table extraction settings
    tables = page.extract_tables({
        "vertical_strategy": "text",    # Use text positioning
        "horizontal_strategy": "text",
        "snap_tolerance": 5,            # Snap nearly aligned parallel lines together
        "join_tolerance": 5,            # Join line segments whose ends nearly touch
        "edge_min_length": 3,           # Minimum line length
    })

The Real Solution: pdfplumber + LLM
For the messiest pages, even pdfplumber returned garbled data. Accuracy hovered around 60%.
I tried a hybrid approach: let pdfplumber do the heavy lifting, then use an LLM to clean up the results.
import pdfplumber
from openai import OpenAI

client = OpenAI()

def extract_table_with_llm_cleanup(pdf_path, page_num):
    # Step 1: Extract with pdfplumber
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_num]
        raw_text = page.extract_text()
        raw_tables = page.extract_tables()

    # Step 2: Use LLM to structure the messy data
    prompt = f"""
    Extract the financial table from this PDF text.
    Return as JSON with columns: category, current_year, prior_year, change.

    Raw text:
    {raw_text}

    Raw table extraction (may have errors):
    {raw_tables}
    """

    response = client.chat.completions.create(
        model="gpt-4o",  # JSON mode needs a model that supports response_format
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )

    return response.choices[0].message.content

Result: Accuracy jumped from 60% to 92% on messy pages.
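The accuracy figures are easier to interpret with a concrete metric. The article does not say how accuracy was measured, so the following is only one plausible definition: the share of cells in a hand-checked reference table that the extraction reproduces exactly. The `cell_accuracy` helper and the sample tables are hypothetical.

```python
def cell_accuracy(extracted, expected):
    """Fraction of expected cells reproduced exactly (whitespace-trimmed)."""
    total = sum(len(row) for row in expected)
    if total == 0:
        return 0.0
    correct = 0
    for r, expected_row in enumerate(expected):
        extracted_row = extracted[r] if r < len(extracted) else []
        for c, expected_cell in enumerate(expected_row):
            got = extracted_row[c] if c < len(extracted_row) else None
            if (got or "").strip() == (expected_cell or "").strip():
                correct += 1
    return correct / total


# Hand-checked reference vs. an extraction where two numbers were merged
expected = [["Revenue", "2024", "2023"], ["Services", "$567,890", "$500,000"]]
extracted = [["Revenue", "2024", "2023"], ["Services", "$567,890 $500,000", ""]]
print(f"{cell_accuracy(extracted, expected):.0%}")  # 67%
```

Scoring the same labeled pages before and after the LLM step gives a like-for-like comparison, and it also catches cases where the model silently changes a number.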
The LLM handles:
- Merged cells that confuse traditional parsers
- Multi-line row content
- Missing/extra whitespace
- Inconsistent number formatting
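Of the items above, number formatting is the one that can also be handled deterministically, which is useful for checking LLM output or skipping the LLM entirely. A minimal sketch, assuming US-style formatting (comma thousands separators, parentheses for negatives); `parse_financial_number` is a hypothetical helper:

```python
import re


def parse_financial_number(cell):
    """Convert '$1,234,567', '(1,234)' or '+12.2%' to a float.

    Returns None for empty or non-numeric cells. The percent sign is
    dropped, so '+12.2%' becomes 12.2 rather than 0.122.
    """
    if cell is None:
        return None
    text = cell.strip()
    # Accounting convention: parentheses mark a negative value
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    cleaned = re.sub(r"[$,%\s]", "", text)
    if not re.fullmatch(r"[+-]?(\d+(\.\d*)?|\.\d+)", cleaned):
        return None
    value = float(cleaned)
    return -value if negative else value


print(parse_financial_number("$1,234,567"))  # 1234567.0
print(parse_financial_number("(1,234)"))     # -1234.0
print(parse_financial_number("+12.2%"))      # 12.2
print(parse_financial_number("Revenue"))     # None
```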
Complete Working Example
Here’s my production-ready approach:
import json

import pdfplumber
from openai import OpenAI


class FinancialTableExtractor:
    def __init__(self, use_llm: bool = False):
        self.use_llm = use_llm
        self.client = OpenAI() if use_llm else None

    def extract(self, pdf_path: str, pages: str = "all") -> list[dict]:
        """Extract tables from a financial PDF."""
        results = []

        with pdfplumber.open(pdf_path) as pdf:
            page_nums = self._parse_pages(pages, len(pdf.pages))

            for i in page_nums:
                page = pdf.pages[i]
                tables = self._extract_page(page, i)
                results.extend(tables)

        return results

    def _parse_pages(self, pages: str, max_pages: int) -> list[int]:
        """Parse page specification like '1-5, 7, 10'."""
        if pages == "all":
            return list(range(max_pages))

        page_nums = []
        for part in pages.split(","):
            part = part.strip()
            if "-" in part:
                start, end = map(int, part.split("-"))
                page_nums.extend(range(start - 1, end))
            else:
                page_nums.append(int(part) - 1)
        # Drop pages that fall outside the document
        return [n for n in page_nums if 0 <= n < max_pages]

    def _extract_page(self, page, page_num: int) -> list[dict]:
        """Extract tables from a single page."""
        # Try pdfplumber first
        raw_tables = page.extract_tables({
            "vertical_strategy": "text",
            "horizontal_strategy": "text",
            "snap_tolerance": 5,
        })

        if not raw_tables:
            return []

        tables = []
        for i, table in enumerate(raw_tables):
            if self._is_table_valid(table):
                if self.use_llm:
                    # extract_text() can come back empty on pages without text
                    table = self._llm_cleanup(table, page.extract_text() or "")
                tables.append({
                    "page": page_num + 1,
                    "table_index": i,
                    "data": table,
                })

        return tables

    def _is_table_valid(self, table: list) -> bool:
        """Check if extracted table has meaningful content."""
        # Must have at least 2 rows
        if not table or len(table) < 2:
            return False

        # Must have at least 2 columns
        if len(table[0]) < 2:
            return False

        # More than half the cells should have content
        total_cells = sum(len(row) for row in table)
        non_empty = sum(
            1 for row in table for cell in row
            if cell and cell.strip()
        )

        return non_empty / total_cells > 0.5 if total_cells > 0 else False

    def _llm_cleanup(self, table: list, context: str) -> list[list]:
        """Use LLM to fix table extraction errors."""
        # JSON mode returns an object, so ask for the table under a "table" key
        prompt = f"""
        Clean up this extracted financial table.
        Fix any merged cells, misaligned columns, or data issues.
        Return only a JSON object with a single key "table" whose value
        is the corrected table as an array of arrays.

        Context from page:
        {context[:2000]}

        Extracted table (may have errors):
        {json.dumps(table)}
        """

        response = self.client.chat.completions.create(
            model="gpt-4o",  # JSON mode needs a model that supports response_format
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
        )

        result = json.loads(response.choices[0].message.content)
        # Fall back to the raw table if the expected key is missing
        return result.get("table", table)


# Usage
if __name__ == "__main__":
    # Without LLM (faster, good for clean PDFs)
    extractor = FinancialTableExtractor(use_llm=False)
    tables = extractor.extract("annual_report.pdf", pages="1-10")

    # With LLM (slower, better for messy PDFs)
    extractor_llm = FinancialTableExtractor(use_llm=True)
    tables = extractor_llm.extract("messy_annual_report.pdf", pages="5-8")

    print(json.dumps(tables, indent=2))

Decision Matrix
+-----------------------+------------------+------------------+-------------------+
| Scenario              | pdfplumber alone | pdfplumber + LLM | OCR (pytesseract) |
+-----------------------+------------------+------------------+-------------------+
| Clean tables, borders | Excellent        | Overkill         | Don't bother      |
| Invisible borders     | Good             | Excellent        | Poor              |
| Misaligned columns    | Good             | Excellent        | Poor              |
| Scanned PDF (images)  | Won't work       | Won't work       | Required          |
| Batch processing      | Fast             | Slow/costly      | Slow              |
| Accuracy requirement  | 70-85%           | 90%+             | 60-80%            |
+-----------------------+------------------+------------------+-------------------+

Key Takeaways
- Start with pdfplumber for text-based financial PDFs. It handled column alignment better than tabula-py or Camelot in my tests.
- Lattice mode in Camelot and Tabula fails on borderless tables, which are common in annual reports, and stream mode still struggles with column alignment.
- For messy pages, add an LLM cleanup step. It bumped accuracy from ~60% to ~90%+ on my documents.
- For scanned PDFs, you need OCR (pytesseract + cv2 for preprocessing) before any table extraction.
- There’s no silver bullet. The “best” library depends on your PDFs’ specific quirks.
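The decision matrix and takeaways can be condensed into a small routing function. This is a sketch under my own assumptions: `choose_strategy` and its inputs are hypothetical, and the "any ruling line means bordered table" rule is deliberately crude.

```python
def choose_strategy(char_count, ruling_line_count, messy=False):
    """Pick an extraction route for one page.

    char_count: number of text characters found on the page.
    ruling_line_count: number of drawn lines or rectangles on the page.
    messy: set when a plain extraction already failed validation.
    """
    if char_count == 0:
        # No text layer: the page is a scanned image, so OCR must run first
        return {"route": "ocr", "table_settings": None}

    # Drawn lines suggest a bordered table; otherwise rely on text positions
    strategy = "lines" if ruling_line_count > 0 else "text"
    settings = {"vertical_strategy": strategy, "horizontal_strategy": strategy}
    route = "pdfplumber+llm" if messy else "pdfplumber"
    return {"route": route, "table_settings": settings}


print(choose_strategy(0, 0)["route"])                  # ocr
print(choose_strategy(1200, 0)["table_settings"])      # text strategies
print(choose_strategy(1200, 40, messy=True)["route"])  # pdfplumber+llm
```

With pdfplumber, the inputs could come from `len(page.chars)` and `len(page.lines) + len(page.rects)`, and the returned settings can be passed to `page.extract_tables()`.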
pdfplumber>=0.10.0
openai>=1.0.0
tabula-py>=2.9.0  # Optional, for bordered tables
camelot-py>=0.11.0  # Optional, alternative approach

Final Words + More Resources
My intention with this article was to share my knowledge and experience in the hope that it helps others.