Skip to content

How to Detect if a PDF Needs OCR or Text Extraction in Python

I was processing a batch of financial PDFs last week when my pipeline suddenly took 10x longer than usual. The culprit? I was running OCR on every single PDF, including the ones that were already text-based.

The problem was obvious in hindsight: I didn’t check if the PDF actually needed OCR before firing up Tesseract. Let me show you how to fix this.

The 10x Overhead Problem

Running OCR on a text-based PDF is wasteful. Here’s what I saw:

Processing text-based PDF (direct extraction): ~0.5 seconds
Processing text-based PDF (with unnecessary OCR): ~5+ seconds
Processing scanned PDF (OCR required): ~8-15 seconds

When you’re processing thousands of documents, that 10x difference adds up fast. The solution is to detect the PDF type first, then route it to the appropriate processing path.

Method 1: pdfplumber Empty Text Check

The simplest approach I found is using pdfplumber to check if pages return empty text:

detect_pdf_type_pdfplumber.py
import pdfplumber
def needs_ocr(pdf_path: str) -> bool:
"""
Returns True if the PDF needs OCR (scanned/image-based),
False if text can be extracted directly.
"""
with pdfplumber.open(pdf_path) as pdf:
for page in pdf.pages:
text = page.extract_text()
if text and text.strip():
# Found selectable text, no OCR needed
return False
# No text found on any page, needs OCR
return True
# Usage
pdf_path = "financial_statement.pdf"
if needs_ocr(pdf_path):
print("This PDF needs OCR processing")
else:
print("This PDF has selectable text, extract directly")

This approach is straightforward but has a catch: some PDFs have mixed pages—some scanned, some text-based. Let me refine it:

detect_mixed_pdf.py
import pdfplumber
from typing import List, Tuple
def analyze_pdf_pages(pdf_path: str) -> List[Tuple[int, bool]]:
"""
Analyze each page and return a list of (page_number, needs_ocr).
"""
results = []
with pdfplumber.open(pdf_path) as pdf:
for i, page in enumerate(pdf.pages):
text = page.extract_text()
needs_ocr = not (text and text.strip())
results.append((i + 1, needs_ocr)) # 1-indexed
return results
# Usage
pdf_path = "mixed_document.pdf"
page_analysis = analyze_pdf_pages(pdf_path)
for page_num, needs_ocr in page_analysis:
status = "needs OCR" if needs_ocr else "has text"
print(f"Page {page_num}: {status}")

Method 2: PyMuPDF Image Detection

pdfplumber worked well, but I wanted a faster alternative. PyMuPDF (fitz) lets you check if pages are images:

detect_pdf_type_pymupdf.py
import fitz # PyMuPDF
def needs_ocr_pymupdf(pdf_path: str) -> bool:
"""
Use PyMuPDF to detect if PDF needs OCR.
Checks if first page has no selectable text.
"""
doc = fitz.open(pdf_path)
first_page = doc[0]
# Check for text
text = first_page.get_text()
if text.strip():
return False # Has text, no OCR needed
# Check if page is an image
images = first_page.get_images()
if images:
return True # Image-based page needs OCR
return False # No text and no images? Edge case
# Usage
pdf_path = "scanned_invoice.pdf"
if needs_ocr_pymupdf(pdf_path):
print("Running OCR pipeline...")
else:
print("Extracting text directly...")

I prefer this method for its speed. PyMuPDF is significantly faster than pdfplumber for the detection step:

benchmark_detection.py
import time
import pdfplumber
import fitz
def benchmark_detection(pdf_path: str):
# pdfplumber method
start = time.time()
with pdfplumber.open(pdf_path) as pdf:
for page in pdf.pages:
_ = page.extract_text()
plumber_time = time.time() - start
# PyMuPDF method
start = time.time()
doc = fitz.open(pdf_path)
for page in doc:
_ = page.get_text()
pymupdf_time = time.time() - start
print(f"pdfplumber: {plumber_time:.3f}s")
print(f"PyMuPDF: {pymupdf_time:.3f}s")
print(f"Speedup: {plumber_time/pymupdf_time:.1f}x")

Production-Ready Detection with Caching

For my financial document pipeline, I combined both approaches and added caching to avoid re-processing the same files:

production_pdf_detector.py
import fitz
import hashlib
import json
from pathlib import Path
from typing import Optional
from functools import lru_cache
class PDFDetector:
"""
Detects if PDFs need OCR with caching for performance.
"""
def __init__(self, cache_file: str = ".pdf_detection_cache.json"):
self.cache_file = Path(cache_file)
self.cache = self._load_cache()
def _load_cache(self) -> dict:
if self.cache_file.exists():
return json.loads(self.cache_file.read_text())
return {}
def _save_cache(self):
self.cache_file.write_text(json.dumps(self.cache, indent=2))
def _file_hash(self, pdf_path: str) -> str:
"""Generate hash for cache key."""
return hashlib.md5(Path(pdf_path).read_bytes()).hexdigest()
def needs_ocr(self, pdf_path: str) -> bool:
"""
Determine if PDF needs OCR, using cache when possible.
"""
file_hash = self._file_hash(pdf_path)
# Check cache first
if file_hash in self.cache:
return self.cache[file_hash]
# Detect using PyMuPDF
result = self._detect_ocr_need(pdf_path)
# Cache result
self.cache[file_hash] = result
self._save_cache()
return result
def _detect_ocr_need(self, pdf_path: str) -> bool:
"""Actual detection logic."""
doc = fitz.open(pdf_path)
for page in doc:
text = page.get_text()
if text.strip():
return False # Found text, no OCR needed
# No text found on any page
return True
# Usage
detector = PDFDetector()
pdf_files = ["statement1.pdf", "invoice2.pdf", "report3.pdf"]
for pdf in pdf_files:
if detector.needs_ocr(pdf):
print(f"{pdf}: needs OCR")
else:
print(f"{pdf}: extract directly")

Choosing the Right OCR Engine

Once you’ve determined a PDF needs OCR, which engine should you use? I tested three options on financial documents:

ocr_engine_comparison.py
import pytesseract
from PIL import Image
import fitz
# Option 1: pytesseract (classic)
def ocr_with_tesseract(pdf_path: str) -> str:
doc = fitz.open(pdf_path)
text = []
for page in doc:
pix = page.get_pixmap(dpi=300)
img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
text.append(pytesseract.image_to_string(img))
return "\n".join(text)
# Option 2: docTR (better for dense tables)
# pip install python-doctr
from doctr.models import ocr_predictor
def ocr_with_doctr(pdf_path: str) -> str:
model = ocr_predictor(pretrained=True)
doc = fitz.open(pdf_path)
results = []
for page in doc:
pix = page.get_pixmap(dpi=150)
img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
result = model([img])
results.append(result.export())
return results
# Option 3: Cloud (Mistral, Google Vision, etc.)
# Use when accuracy is critical and budget allows

For financial documents with dense number grids, docTR outperformed pytesseract significantly. The key insight: match your OCR engine to your document type.

Here’s my decision flow:

PDF Document
|
v
Check for text (pdfplumber/PyMuPDF)
|
+-- Has text? --> Extract directly (0.5s)
|
+-- No text?
|
v
Is it financial/document-heavy?
|
+-- Yes --> docTR (better table handling)
|
+-- No --> pytesseract (faster for simple docs)

The Complete Pipeline

Putting it all together:

smart_pdf_processor.py
import fitz
from pathlib import Path
from typing import Optional, Callable
import pytesseract
from PIL import Image
class SmartPDFProcessor:
"""
Processes PDFs with automatic OCR detection.
"""
def __init__(self, ocr_engine: Optional[Callable] = None):
self.ocr_engine = ocr_engine or self._default_ocr
def _default_ocr(self, pdf_path: str) -> str:
"""Default OCR using pytesseract."""
doc = fitz.open(pdf_path)
text = []
for page in doc:
pix = page.get_pixmap(dpi=300)
img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
text.append(pytesseract.image_to_string(img))
return "\n".join(text)
def _needs_ocr(self, pdf_path: str) -> bool:
"""Check if PDF has selectable text."""
doc = fitz.open(pdf_path)
for page in doc:
if page.get_text().strip():
return False
return True
def process(self, pdf_path: str) -> str:
"""
Process PDF, automatically choosing extraction method.
"""
if self._needs_ocr(pdf_path):
print(f"Running OCR on {pdf_path}...")
return self.ocr_engine(pdf_path)
else:
print(f"Extracting text from {pdf_path}...")
doc = fitz.open(pdf_path)
return "\n".join(page.get_text() for page in doc)
# Usage
processor = SmartPDFProcessor()
# This will automatically detect and use the right method
text = processor.process("financial_report.pdf")

Key Takeaways

  1. Always detect first - Running OCR on text-based PDFs wastes 10x time
  2. Use PyMuPDF for speed - Detection is faster than pdfplumber
  3. Cache detection results - Avoid re-checking the same files
  4. Match OCR to document type - docTR for financial docs, tesseract for simpler ones

The detection logic adds minimal overhead (milliseconds) but saves seconds per file. On my batch of 10,000 PDFs (70% text-based), this reduced total processing time from 14 hours to 2 hours.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!

Comments