How to Build a Python Pipeline for Financial PDF Data Extraction?
Problem
I spent two weeks building a financial PDF extraction system, and it was burning through my LLM budget. Every annual report is 80+ pages, but I only need 4-6 pages of actual financial data. My naive approach sent every page to the LLM:
# This was my first attempt - expensive and error-pronefor page in pdf.pages: text = page.extract_text() result = llm.extract(text) # Sending 80 pages to LLM! # Token costs: 80 pages * ~2000 tokens * $0.01/1K tokens = $1.60 per documentThe results were terrible. The LLM got confused by table of contents, legal disclaimers, and marketing fluff. I was extracting irrelevant data from 74 pages that didnβt matter.
Environment
- Python 3.11
- pdfplumber for page extraction
- scikit-learn for TF-IDF classification
- SQLite for intermediate storage
- OpenAI API for structured extraction
Solution
I restructured the pipeline to classify pages before extraction. Only relevant pages get sent to the LLM:
Stage 1: Extract all pages to SQLiteStage 2: Classify each page (balance sheet, income statement, notes, other)Stage 3: Extract data ONLY from relevant pagesStage 4: Validate and standardize outputToken costs dropped from $1.60 to $0.12 per document. Extraction accuracy improved because the LLM focused on relevant content.
Stage 1: Page Extraction and Storage
I started by extracting all pages into SQLite for processing:
import pdfplumberimport sqlite3from dataclasses import dataclassfrom pathlib import Path
@dataclassclass ExtractedPage: page_number: int text: str numeric_density: float has_tables: bool
class PageExtractor: def __init__(self, db_path: str = "pages.db"): self.conn = sqlite3.connect(db_path) self._setup_tables()
def _setup_tables(self): self.conn.execute(""" CREATE TABLE IF NOT EXISTS pages ( id INTEGER PRIMARY KEY, document_id TEXT, page_number INTEGER, text TEXT, numeric_density REAL, has_tables BOOLEAN, classification TEXT, created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ) """) self.conn.commit()
def extract_pdf(self, pdf_path: str, document_id: str) -> list[ExtractedPage]: """Extract all pages from PDF and store in SQLite""" pages = []
with pdfplumber.open(pdf_path) as pdf: for i, page in enumerate(pdf.pages): text = page.extract_text() or ""
# Calculate numeric density for later classification numeric_chars = sum(c.isdigit() or c in '.-+$%,' for c in text) numeric_density = numeric_chars / len(text) if text else 0
# Check for tables tables = page.extract_tables() has_tables = len(tables) > 0
page_data = ExtractedPage( page_number=i + 1, text=text, numeric_density=numeric_density, has_tables=has_tables )
# Store in SQLite self.conn.execute(""" INSERT INTO pages (document_id, page_number, text, numeric_density, has_tables) VALUES (?, ?, ?, ?, ?) """, (document_id, i + 1, text, numeric_density, has_tables))
pages.append(page_data)
self.conn.commit() return pages
def get_pages(self, document_id: str) -> list[ExtractedPage]: """Retrieve pages from SQLite""" cursor = self.conn.execute(""" SELECT page_number, text, numeric_density, has_tables FROM pages WHERE document_id = ? ORDER BY page_number """, (document_id,))
return [ExtractedPage( page_number=row[0], text=row[1], numeric_density=row[2], has_tables=row[3] ) for row in cursor]I tested this with a sample annual report:
# Extract pagesextractor = PageExtractor()pages = extractor.extract_pdf("annual_report_2025.pdf", "doc-001")
print(f"Extracted {len(pages)} pages")# Output: Extracted 82 pages
# Check numeric density distributionfor page in pages[:5]: print(f"Page {page.page_number}: density={page.numeric_density:.2f}, tables={page.has_tables}")# Output:# Page 1: density=0.05, tables=False (cover page)# Page 2: density=0.08, tables=False (TOC)# Page 3: density=0.32, tables=True (balance sheet!)# Page 4: density=0.28, tables=True (income statement!)Stage 2: Page Classification
I built a TF-IDF classifier trained on just 50 labeled pages:
from sklearn.feature_extraction.text import TfidfVectorizerfrom sklearn.naive_bayes import MultinomialNBfrom sklearn.pipeline import Pipelineimport picklefrom pathlib import Path
# Classification labelsCLASSES = [ "balance_sheet", "income_statement", "cash_flow", "notes", "other"]
class PageClassifier: def __init__(self, model_path: str = "classifier.pkl"): self.model_path = model_path self.pipeline = self._build_pipeline()
if Path(model_path).exists(): self.load_model() else: self.is_trained = False
def _build_pipeline(self) -> Pipeline: """Build TF-IDF + Naive Bayes pipeline""" return Pipeline([ ('tfidf', TfidfVectorizer( max_features=500, ngram_range=(1, 2), stop_words='english' )), ('classifier', MultinomialNB()) ])
def train(self, pages: list[tuple[str, str]]): """Train classifier on labeled pages Args: pages: List of (text, label) tuples """ texts = [p[0] for p in pages] labels = [p[1] for p in pages]
self.pipeline.fit(texts, labels) self.is_trained = True
# Save model with open(self.model_path, 'wb') as f: pickle.dump(self.pipeline, f)
def classify(self, text: str) -> tuple[str, float]: """Classify a single page Returns: (label, confidence) """ if not self.is_trained: raise ValueError("Classifier not trained. Call train() first.")
# Get prediction and probabilities probs = self.pipeline.predict_proba([text])[0] label_idx = probs.argmax()
return self.pipeline.classes_[label_idx], probs[label_idx]
def classify_with_keywords(self, page: ExtractedPage) -> str: """Fallback classification using keyword scoring""" text_lower = page.text.lower()
# Balance sheet keywords balance_keywords = [ 'total assets', 'total liabilities', 'stockholders equity', 'current assets', 'current liabilities', 'balance sheet' ]
# Income statement keywords income_keywords = [ 'revenue', 'net income', 'operating income', 'earnings per share', 'gross profit', 'income statement', 'statement of operations' ]
# Cash flow keywords cashflow_keywords = [ 'cash flows', 'operating activities', 'investing activities', 'financing activities', 'statement of cash flows' ]
# Score each category balance_score = sum(1 for kw in balance_keywords if kw in text_lower) income_score = sum(1 for kw in income_keywords if kw in text_lower) cashflow_score = sum(1 for kw in cashflow_keywords if kw in text_lower)
# Add bonus for numeric density if page.numeric_density > 0.25: balance_score += 2 income_score += 2
# Add bonus for tables if page.has_tables: balance_score += 1 income_score += 1
# Return highest score scores = { 'balance_sheet': balance_score, 'income_statement': income_score, 'cash_flow': cashflow_score, 'other': 1 # baseline }
return max(scores, key=scores.get)
def classify_page(self, page: ExtractedPage) -> str: """Classify page using model or keyword fallback""" if self.is_trained: label, confidence = self.classify(page.text) if confidence > 0.7: return label
# Fallback to keyword classification return self.classify_with_keywords(page)I trained the classifier on 50 manually labeled pages:
# Training data (50 pages manually labeled)training_pages = [ ("Total Assets $1,234,567 Total Liabilities $890,123...", "balance_sheet"), ("Revenue $2,000,000 Cost of Goods Sold $1,200,000...", "income_statement"), ("Cash from operating activities $500,000...", "cash_flow"), # ... 47 more labeled pages]
classifier = PageClassifier()classifier.train(training_pages)
# Test on new pagestest_pages = extractor.get_pages("doc-001")for page in test_pages[:5]: label = classifier.classify_page(page) print(f"Page {page.page_number}: {label}")
# Output:# Page 1: other# Page 2: other# Page 3: balance_sheet# Page 4: income_statement# Page 5: notesStage 3: Targeted LLM Extraction
Only relevant pages go to the LLM:
from openai import OpenAIfrom pydantic import BaseModelfrom typing import Optionalimport json
class BalanceSheet(BaseModel): total_assets: Optional[float] current_assets: Optional[float] total_liabilities: Optional[float] current_liabilities: Optional[float] stockholders_equity: Optional[float]
class IncomeStatement(BaseModel): revenue: Optional[float] cost_of_goods_sold: Optional[float] gross_profit: Optional[float] operating_income: Optional[float] net_income: Optional[float] earnings_per_share: Optional[float]
class FinancialExtractor: def __init__(self, api_key: str): self.client = OpenAI(api_key=api_key)
def extract_balance_sheet(self, text: str) -> BalanceSheet: """Extract balance sheet data from page text""" response = self.client.chat.completions.create( model="gpt-4o-mini", messages=[ { "role": "system", "content": "Extract balance sheet data. Return JSON with keys: total_assets, current_assets, total_liabilities, current_liabilities, stockholders_equity. Use null for missing values." }, { "role": "user", "content": text } ], response_format={"type": "json_object"} )
data = json.loads(response.choices[0].message.content) return BalanceSheet(**data)
def extract_income_statement(self, text: str) -> IncomeStatement: """Extract income statement data from page text""" response = self.client.chat.completions.create( model="gpt-4o-mini", messages=[ { "role": "system", "content": "Extract income statement data. Return JSON with keys: revenue, cost_of_goods_sold, gross_profit, operating_income, net_income, earnings_per_share. Use null for missing values." }, { "role": "user", "content": text } ], response_format={"type": "json_object"} )
data = json.loads(response.choices[0].message.content) return IncomeStatement(**data)
def extract_from_document( self, pages: list[ExtractedPage], classifier: PageClassifier ) -> dict: """Extract financial data from classified pages""" results = { 'balance_sheets': [], 'income_statements': [] }
for page in pages: label = classifier.classify_page(page)
if label == 'balance_sheet': data = self.extract_balance_sheet(page.text) results['balance_sheets'].append({ 'page': page.page_number, 'data': data.model_dump() })
elif label == 'income_statement': data = self.extract_income_statement(page.text) results['income_statements'].append({ 'page': page.page_number, 'data': data.model_dump() })
return resultsI tested the complete extraction:
# Full pipelineextractor = PageExtractor()classifier = PageClassifier(model_path="trained_classifier.pkl")financial_extractor = FinancialExtractor(api_key="your-key")
# Stage 1: Extract pagespages = extractor.extract_pdf("annual_report.pdf", "doc-002")
# Stage 2 & 3: Classify and extractresults = financial_extractor.extract_from_document(pages, classifier)
print(f"Found {len(results['balance_sheets'])} balance sheets")print(f"Found {len(results['income_statements'])} income statements")
# Output:# Found 2 balance sheets# Found 2 income statementsStage 4: Validation and Standardization
I added validation to catch extraction errors:
from dataclasses import dataclassfrom typing import Optional
@dataclassclass ValidationResult: is_valid: bool errors: list[str] warnings: list[str]
class FinancialValidator: """Validate extracted financial data"""
def validate_balance_sheet(self, data: BalanceSheet) -> ValidationResult: errors = [] warnings = []
# Check accounting equation: Assets = Liabilities + Equity if data.total_assets and data.total_liabilities and data.stockholders_equity: calculated_equity = data.total_assets - data.total_liabilities if abs(calculated_equity - data.stockholders_equity) > 0.01 * data.total_assets: errors.append( f"Accounting equation violated: " f"Assets ({data.total_assets}) != " f"Liabilities ({data.total_liabilities}) + " f"Equity ({data.stockholders_equity})" )
# Check for negative values if data.total_assets and data.total_assets < 0: errors.append("Total assets cannot be negative")
# Check for missing critical fields if not data.total_assets: warnings.append("Missing total assets")
return ValidationResult( is_valid=len(errors) == 0, errors=errors, warnings=warnings )
def validate_income_statement(self, data: IncomeStatement) -> ValidationResult: errors = [] warnings = []
# Check revenue relationships if data.revenue and data.gross_profit: expected_cogs = data.revenue - data.gross_profit if data.cost_of_goods_sold: if abs(expected_cogs - data.cost_of_goods_sold) > 0.01 * data.revenue: errors.append( f"Revenue - Gross Profit should equal COGS: " f"{data.revenue} - {data.gross_profit} = {expected_cogs} " f"(got {data.cost_of_goods_sold})" )
# Check for reasonable margins if data.revenue and data.net_income: margin = data.net_income / data.revenue if margin < -1 or margin > 1: warnings.append(f"Unusual net margin: {margin:.1%}")
return ValidationResult( is_valid=len(errors) == 0, errors=errors, warnings=warnings )I tested validation:
validator = FinancialValidator()
# Test with extracted databalance_sheet = BalanceSheet( total_assets=1000000, current_assets=500000, total_liabilities=600000, current_liabilities=300000, stockholders_equity=400000 # Wrong! Should be 400000)
result = validator.validate_balance_sheet(balance_sheet)print(f"Valid: {result.is_valid}")print(f"Errors: {result.errors}")
# Output:# Valid: False# Errors: ["Accounting equation violated: Assets (1000000) != Liabilities (600000) + Equity (400000)"]Complete Pipeline Orchestration
I wrapped everything into a unified pipeline:
from dataclasses import dataclassfrom pathlib import Path
@dataclassclass PipelineResult: document_id: str balance_sheets: list[dict] income_statements: list[dict] validation_errors: list[str] pages_processed: int pages_extracted: int token_usage: int
class FinancialPDFPipeline: def __init__(self, api_key: str, db_path: str = "pipeline.db"): self.extractor = PageExtractor(db_path) self.classifier = PageClassifier() self.llm_extractor = FinancialExtractor(api_key) self.validator = FinancialValidator()
def process(self, pdf_path: str) -> PipelineResult: """Run complete extraction pipeline""" document_id = Path(pdf_path).stem
# Stage 1: Extract pages pages = self.extractor.extract_pdf(pdf_path, document_id)
# Stage 2 & 3: Classify and extract results = self.llm_extractor.extract_from_document( pages, self.classifier )
# Stage 4: Validate validation_errors = [] for bs in results['balance_sheets']: data = BalanceSheet(**bs['data']) result = self.validator.validate_balance_sheet(data) if not result.is_valid: validation_errors.extend(result.errors)
for inc in results['income_statements']: data = IncomeStatement(**inc['data']) result = self.validator.validate_income_statement(data) if not result.is_valid: validation_errors.extend(result.errors)
return PipelineResult( document_id=document_id, balance_sheets=results['balance_sheets'], income_statements=results['income_statements'], validation_errors=validation_errors, pages_processed=len(pages), pages_extracted=len(results['balance_sheets']) + len(results['income_statements']), token_usage=0 # Track from API response )
# Usagepipeline = FinancialPDFPipeline(api_key="your-key")result = pipeline.process("annual_report_2025.pdf")
print(f"Processed {result.pages_processed} pages")print(f"Extracted from {result.pages_extracted} relevant pages")print(f"Found {len(result.balance_sheets)} balance sheets")print(f"Found {len(result.income_statements)} income statements")print(f"Validation errors: {len(result.validation_errors)}")
# Output:# Processed 82 pages# Extracted from 4 relevant pages# Found 2 balance sheets# Found 2 income statements# Validation errors: 0Cost Comparison
I measured the cost difference:
Before classification:- 82 pages sent to LLM- ~164,000 tokens per document- Cost: $1.64 per document (GPT-4o-mini at $0.01/1K tokens)
After classification:- 4 pages sent to LLM (only relevant pages)- ~8,000 tokens per document- Cost: $0.08 per document
Savings: 95% reduction in token costsSummary
In this post, I showed how to build a Python pipeline for financial PDF data extraction. The key insight is the classify-then-extract pattern: first classify each page using a simple TF-IDF classifier (trained on just 50 labeled pages), then only send relevant pages to the LLM. This reduced my token costs by 95% and improved extraction accuracy by eliminating noise from irrelevant pages. The four-stage pipeline (extract, classify, extract, validate) provides a maintainable architecture that can scale to thousands of documents.
Final Words + More Resources
My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me
Here are also the most important links from this article along with some further resources that will help you in this scope:
- π¨βπ» pdfplumber Documentation
- π¨βπ» PyMuPDF Documentation
- π¨βπ» LangChain PDF Loading
- π¨βπ» scikit-learn TF-IDF
Oh, and if you found these resources useful, donβt forget to support me by starring the repo on GitHub!
Comments