How to Detect if a PDF Needs OCR or Text Extraction in Python
I was processing a batch of financial PDFs last week when my pipeline suddenly took 10x longer than usual. The culprit? I was running OCR on every single PDF, including the ones that were already text-based.
The problem was obvious in hindsight: I didn’t check if the PDF actually needed OCR before firing up Tesseract. Let me show you how to fix this.
The 10x Overhead Problem
Running OCR on a text-based PDF is wasteful. Here’s what I saw:
- Processing text-based PDF (direct extraction): ~0.5 seconds
- Processing text-based PDF (with unnecessary OCR): ~5+ seconds
- Processing scanned PDF (OCR required): ~8-15 seconds

When you’re processing thousands of documents, that 10x difference adds up fast. The solution is to detect the PDF type first, then route it to the appropriate processing path.
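To make "adds up fast" concrete, here is a rough back-of-the-envelope sketch that turns the per-file timings above into batch totals. The function name and its parameters are illustrative, and the 11.5-second figure for scanned PDFs is simply an assumed midpoint of the 8-15 second range, so treat the output as an estimate rather than a measurement.

```python
from typing import Tuple


def estimate_batch_seconds(
    n_docs: int,
    text_fraction: float,
    direct_s: float = 0.5,        # direct extraction, from the timings above
    wasted_ocr_s: float = 5.0,    # OCR on a text-based PDF, from the timings above
    scanned_ocr_s: float = 11.5,  # ASSUMPTION: midpoint of the 8-15 s range
) -> Tuple[float, float]:
    """Return (always_ocr_seconds, routed_seconds) for a batch of PDFs."""
    n_text = n_docs * text_fraction
    n_scanned = n_docs - n_text
    # Without detection, text-based files pay the OCR price too
    always_ocr = n_text * wasted_ocr_s + n_scanned * scanned_ocr_s
    # With detection, text-based files take the cheap path
    routed = n_text * direct_s + n_scanned * scanned_ocr_s
    return always_ocr, routed


# Hypothetical batch: 2,000 files, half of them text-based
always_ocr, routed = estimate_batch_seconds(n_docs=2000, text_fraction=0.5)
print(f"OCR everything: {always_ocr / 3600:.1f} h")
print(f"Detect first:   {routed / 3600:.1f} h")
```

The saving comes entirely from the text-based share of the batch, so the larger that share, the more detection pays off.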
Method 1: pdfplumber Empty Text Check
The simplest approach I found is using pdfplumber to check if pages return empty text:
import pdfplumber

def needs_ocr(pdf_path: str) -> bool:
    """
    Returns True if the PDF needs OCR (scanned/image-based),
    False if text can be extracted directly.
    """
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text()
            if text and text.strip():
                # Found selectable text, no OCR needed
                return False
    # No text found on any page, needs OCR
    return True

# Usage
pdf_path = "financial_statement.pdf"
if needs_ocr(pdf_path):
    print("This PDF needs OCR processing")
else:
    print("This PDF has selectable text, extract directly")

This approach is straightforward but has a catch: some PDFs have mixed pages, some scanned and some text-based. Let me refine it:
import pdfplumber
from typing import List, Tuple

def analyze_pdf_pages(pdf_path: str) -> List[Tuple[int, bool]]:
    """
    Analyze each page and return a list of (page_number, needs_ocr).
    """
    results = []
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            text = page.extract_text()
            needs_ocr = not (text and text.strip())
            results.append((i + 1, needs_ocr))  # 1-indexed
    return results

# Usage
pdf_path = "mixed_document.pdf"
page_analysis = analyze_pdf_pages(pdf_path)

for page_num, needs_ocr in page_analysis:
    status = "needs OCR" if needs_ocr else "has text"
    print(f"Page {page_num}: {status}")

Method 2: PyMuPDF Image Detection
pdfplumber worked well, but I wanted a faster alternative. PyMuPDF (fitz) lets you check if pages are images:
import fitz  # PyMuPDF

def needs_ocr_pymupdf(pdf_path: str) -> bool:
    """
    Use PyMuPDF to detect if PDF needs OCR.
    Checks if first page has no selectable text.
    """
    doc = fitz.open(pdf_path)
    first_page = doc[0]

    # Check for text
    text = first_page.get_text()
    if text.strip():
        return False  # Has text, no OCR needed

    # Check if page is an image
    images = first_page.get_images()
    if images:
        return True  # Image-based page needs OCR

    return False  # No text and no images? Edge case

# Usage
pdf_path = "scanned_invoice.pdf"
if needs_ocr_pymupdf(pdf_path):
    print("Running OCR pipeline...")
else:
    print("Extracting text directly...")

I prefer this method for its speed. PyMuPDF is significantly faster than pdfplumber for the detection step:
import time
import pdfplumber
import fitz

def benchmark_detection(pdf_path: str):
    # pdfplumber method (perf_counter is the clock intended for timing code)
    start = time.perf_counter()
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            _ = page.extract_text()
    plumber_time = time.perf_counter() - start

    # PyMuPDF method
    start = time.perf_counter()
    doc = fitz.open(pdf_path)
    for page in doc:
        _ = page.get_text()
    pymupdf_time = time.perf_counter() - start

    print(f"pdfplumber: {plumber_time:.3f}s")
    print(f"PyMuPDF: {pymupdf_time:.3f}s")
    print(f"Speedup: {plumber_time/pymupdf_time:.1f}x")

Production-Ready Detection with Caching
For my financial document pipeline, I wrapped the PyMuPDF check in a class and added caching to avoid re-processing the same files:
import fitz
import hashlib
import json
from pathlib import Path

class PDFDetector:
    """
    Detects if PDFs need OCR with caching for performance.
    """

    def __init__(self, cache_file: str = ".pdf_detection_cache.json"):
        self.cache_file = Path(cache_file)
        self.cache = self._load_cache()

    def _load_cache(self) -> dict:
        if self.cache_file.exists():
            return json.loads(self.cache_file.read_text())
        return {}

    def _save_cache(self):
        self.cache_file.write_text(json.dumps(self.cache, indent=2))

    def _file_hash(self, pdf_path: str) -> str:
        """Generate hash for cache key."""
        return hashlib.md5(Path(pdf_path).read_bytes()).hexdigest()

    def needs_ocr(self, pdf_path: str) -> bool:
        """
        Determine if PDF needs OCR, using cache when possible.
        """
        file_hash = self._file_hash(pdf_path)

        # Check cache first
        if file_hash in self.cache:
            return self.cache[file_hash]

        # Detect using PyMuPDF
        result = self._detect_ocr_need(pdf_path)

        # Cache result
        self.cache[file_hash] = result
        self._save_cache()

        return result

    def _detect_ocr_need(self, pdf_path: str) -> bool:
        """Actual detection logic."""
        doc = fitz.open(pdf_path)

        for page in doc:
            text = page.get_text()
            if text.strip():
                return False  # Found text, no OCR needed

        # No text found on any page
        return True

# Usage
detector = PDFDetector()

pdf_files = ["statement1.pdf", "invoice2.pdf", "report3.pdf"]
for pdf in pdf_files:
    if detector.needs_ocr(pdf):
        print(f"{pdf}: needs OCR")
    else:
        print(f"{pdf}: extract directly")

Choosing the Right OCR Engine
Once you’ve determined a PDF needs OCR, which engine should you use? I tested three options on financial documents:
import fitz
import numpy as np
import pytesseract
from PIL import Image

# Option 1: pytesseract (classic)
def ocr_with_tesseract(pdf_path: str) -> str:
    doc = fitz.open(pdf_path)
    text = []
    for page in doc:
        pix = page.get_pixmap(dpi=300)
        img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
        text.append(pytesseract.image_to_string(img))
    return "\n".join(text)

# Option 2: docTR (better for dense tables)
# pip install python-doctr
from doctr.models import ocr_predictor

def ocr_with_doctr(pdf_path: str) -> str:
    model = ocr_predictor(pretrained=True)
    doc = fitz.open(pdf_path)
    results = []
    for page in doc:
        pix = page.get_pixmap(dpi=150)
        img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
        # docTR expects numpy arrays of shape (H, W, 3), not PIL images
        result = model([np.asarray(img)])
        # render() returns the recognized text as a plain string,
        # so the function matches its declared str return type
        results.append(result.render())
    return "\n".join(results)

# Option 3: Cloud (Mistral, Google Vision, etc.)
# Use when accuracy is critical and budget allows

For financial documents with dense number grids, docTR outperformed pytesseract significantly. The key insight: match your OCR engine to your document type.
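If you want to check which engine does better on your own documents, one simple option is to score each engine's output against a few pages you have transcribed by hand. The sketch below uses the standard library's difflib for this; the helper names and the sample strings are illustrative and not part of any OCR library, and a character-level similarity is only a rough proxy for table accuracy.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Collapse runs of whitespace so layout differences are not counted as errors."""
    return " ".join(text.split())


def ocr_similarity(expected: str, actual: str) -> float:
    """Return a 0.0-1.0 similarity score between ground truth and OCR output."""
    matcher = SequenceMatcher(None, normalize(expected), normalize(actual), autojunk=False)
    return matcher.ratio()


# Hypothetical sample: one digit misread in an amount
truth = "Total: 1,234.56"
ocr_output = "Total:  1,284.56"
print(f"similarity: {ocr_similarity(truth, ocr_output):.2f}")
```

Run the same ground-truth pages through each candidate engine and compare the scores. For financial documents a single wrong digit matters more than the score suggests, so it is worth spot-checking the amounts as well.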
Here’s my decision flow:
PDF Document
     |
     v
Check for text (pdfplumber/PyMuPDF)
     |
     +-- Has text? --> Extract directly (0.5s)
     |
     +-- No text?
           |
           v
     Is it financial/document-heavy?
           |
           +-- Yes --> docTR (better table handling)
           |
           +-- No --> pytesseract (faster for simple docs)

The Complete Pipeline
Putting it all together:
import fitz
from typing import Optional, Callable
import pytesseract
from PIL import Image

class SmartPDFProcessor:
    """
    Processes PDFs with automatic OCR detection.
    """

    def __init__(self, ocr_engine: Optional[Callable] = None):
        self.ocr_engine = ocr_engine or self._default_ocr

    def _default_ocr(self, pdf_path: str) -> str:
        """Default OCR using pytesseract."""
        doc = fitz.open(pdf_path)
        text = []
        for page in doc:
            pix = page.get_pixmap(dpi=300)
            img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
            text.append(pytesseract.image_to_string(img))
        return "\n".join(text)

    def _needs_ocr(self, pdf_path: str) -> bool:
        """Return True if no page of the PDF has selectable text."""
        doc = fitz.open(pdf_path)
        for page in doc:
            if page.get_text().strip():
                return False
        return True

    def process(self, pdf_path: str) -> str:
        """
        Process PDF, automatically choosing extraction method.
        """
        if self._needs_ocr(pdf_path):
            print(f"Running OCR on {pdf_path}...")
            return self.ocr_engine(pdf_path)
        else:
            print(f"Extracting text from {pdf_path}...")
            doc = fitz.open(pdf_path)
            return "\n".join(page.get_text() for page in doc)

# Usage
processor = SmartPDFProcessor()

# This will automatically detect and use the right method
text = processor.process("financial_report.pdf")

Key Takeaways
- Always detect first - Running OCR on a text-based PDF takes roughly 10x longer than extracting its text directly
- Use PyMuPDF for speed - Detection is faster than with pdfplumber
- Cache detection results - Avoid re-checking the same files
- Match OCR to document type - docTR for financial docs, tesseract for simpler ones
The detection logic adds minimal overhead (milliseconds) but saves seconds per file. On my batch of 10,000 PDFs (70% text-based), this reduced total processing time from 14 hours to 2 hours.
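One refinement worth considering: the pipeline above decides once per file, so a single text page sends a mostly scanned document down the direct-extraction path. Below is a minimal sketch of a per-page plan for the mixed documents mentioned earlier. It assumes you have already collected one string per page (for example from page.get_text()); the function name and the min_chars cutoff are hypothetical and should be tuned on your own documents.

```python
from typing import List


def plan_pages(page_texts: List[str], min_chars: int = 20) -> List[str]:
    """Label each page "extract" or "ocr" based on how much text it carries.

    page_texts is assumed to hold one string per page, e.g. from
    page.get_text() (PyMuPDF) or page.extract_text() (pdfplumber).
    min_chars is a hypothetical cutoff, not a recommended value.
    """
    plan = []
    for text in page_texts:
        # Count non-whitespace characters only; treat None as empty
        visible = len("".join((text or "").split()))
        plan.append("extract" if visible >= min_chars else "ocr")
    return plan


# Hypothetical document: a text page, a blank scan, and a scan whose
# only extractable text is a stray page label
pages = ["Invoice 2024, total due: 1,200.00 EUR", "", "Page 3"]
print(plan_pages(pages))
```

Using a small character threshold instead of a plain empty-string check means a scanned page that carries only a page number or stamp in its text layer still gets routed to OCR.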
Final Words + More Resources
My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can contact me by email.