
How to Process Thousands of PDFs in Seconds with Python

I had three thousand PDFs to process. Single-threaded processing took hours. I needed a faster way.

Here’s what I found: with the right approach, you can process thousands of PDFs in seconds, not hours.

The Performance Reality

Let me share some real numbers from my testing:

| Metric | Value | What This Means |
|---|---|---|
| Mean extraction | 0.8 ms | ~75,000 PDFs/minute on a single thread |
| p99 latency | 9 ms | Predictable performance across the batch |
| Pass rate | 100% | No silent failures to worry about |

These numbers come from processing actual production PDFs using the pdf-oxide library. With extraction this fast, the per-file work is unlikely to be the bottleneck when you spread a batch across threads.
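The throughput figure in the table follows directly from the mean latency, and the same arithmetic applies to your own measurements. Here is a minimal sketch, assuming you have collected per-file timings in milliseconds yourself (for example with time.perf_counter). The helper name summarize_latencies and the sample timings are made up for illustration; they are not part of pdf-oxide.

```python
import statistics


def summarize_latencies(latencies_ms: list[float]) -> dict:
    """Summarize per-file extraction timings given in milliseconds."""
    mean_ms = statistics.fmean(latencies_ms)
    # quantiles(n=100) returns 99 cut points; the last one is the 99th percentile.
    p99_ms = statistics.quantiles(latencies_ms, n=100)[-1]
    return {
        "mean_ms": mean_ms,
        "p99_ms": p99_ms,
        # 60,000 ms per minute divided by the mean time per file.
        "files_per_minute": 60_000 / mean_ms,
    }


# Illustrative input: a constant 0.8 ms per file, matching the mean in the table.
summary = summarize_latencies([0.8] * 1000)
print(round(summary["files_per_minute"]))
```

Run it on timings from your own documents before trusting any throughput estimate, since page count and content vary a lot between PDF collections.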

Basic Batch Processing

Here’s the foundation - processing multiple PDFs in parallel using ThreadPoolExecutor:

batch_basic.py
# Basic batch processing: process multiple PDFs in parallel
from pathlib import Path
from pdf_oxide import PdfDocument
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_pdf(pdf_path: Path) -> dict:
    try:
        doc = PdfDocument(str(pdf_path))
        return {
            "path": str(pdf_path),
            "pages": doc.page_count,
            "text": "\n".join(doc.extract_text(i) for i in range(doc.page_count)),
            "status": "success"
        }
    except Exception as e:
        return {"path": str(pdf_path), "error": str(e), "status": "failed"}


def batch_process(pdf_dir: Path, max_workers: int = 8) -> list[dict]:
    pdf_files = list(pdf_dir.glob("*.pdf"))
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(process_pdf, pdf): pdf for pdf in pdf_files}
        results = []
        for future in as_completed(futures):
            results.append(future.result())
    return results

This gives you parallel processing with minimal overhead. The ThreadPoolExecutor handles the thread pool management, and as_completed lets you process results as they finish.
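One detail worth seeing in isolation: as_completed yields futures in completion order, not submission order, so the dict that maps each future back to its input is what lets you attribute a result (or an exception) to a specific file. The sketch below uses a stand-in worker, fake_extract, instead of pdf-oxide so that it runs without any PDFs. Both fake_extract and run_parallel are hypothetical names used only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_extract(name: str) -> dict:
    """Stand-in for process_pdf so the sketch runs without any PDFs."""
    if not name.endswith(".pdf"):
        raise ValueError(f"not a PDF: {name}")
    return {"path": name, "status": "success"}


def run_parallel(names: list[str], max_workers: int = 4) -> list[dict]:
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its input so failures can be attributed.
        futures = {executor.submit(fake_extract, n): n for n in names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results.append(future.result())
            except Exception as exc:
                # The worker raised; record which input caused it.
                results.append({"path": name, "error": str(exc), "status": "failed"})
    return results


results = run_parallel(["a.pdf", "b.pdf", "notes.txt"])
failed = [r["path"] for r in results if r["status"] == "failed"]
print(f"{len(results)} processed, failed: {failed}")
```

Because results arrive in completion order, sort them by path afterwards if you need deterministic output.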

Memory Management Matters

Processing thousands of PDFs can eat memory fast. Here’s what I learned:

batch_memory_safe.py
# Memory-efficient batch processing with garbage collection
import gc
from pathlib import Path
from pdf_oxide import PdfDocument
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_pdf(pdf_path: Path) -> dict:
    try:
        doc = PdfDocument(str(pdf_path))
        return {
            "path": str(pdf_path),
            "pages": doc.page_count,
            "text": "\n".join(doc.extract_text(i) for i in range(doc.page_count)),
            "status": "success"
        }
    except Exception as e:
        return {"path": str(pdf_path), "error": str(e), "status": "failed"}


def batch_process_memory_safe(pdf_dir: Path, batch_size: int = 100, max_workers: int = 8) -> list[dict]:
    all_results = []
    pdf_files = list(pdf_dir.glob("*.pdf"))
    for i in range(0, len(pdf_files), batch_size):
        batch = pdf_files[i:i + batch_size]
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = {executor.submit(process_pdf, pdf): pdf for pdf in batch}
            batch_results = []
            for future in as_completed(futures):
                batch_results.append(future.result())
        all_results.extend(batch_results)
        gc.collect()  # Force garbage collection between batches
    return all_results

The key changes:

  • Process in batches of 100 (adjust based on your memory constraints)
  • Force garbage collection between batches
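  • Each batch gets its own executor, so only one batch of futures is held at a time. Note that all_results still accumulates every extracted text, so total memory still grows with the dataset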
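The batching itself is just list slicing. As a small sketch, the same logic can be pulled into a generator so the batch boundaries are easy to check on their own. The name chunked and the generated file names are illustrative assumptions, not part of the original script.

```python
from collections.abc import Iterator, Sequence
from pathlib import Path


def chunked(items: Sequence[Path], size: int) -> Iterator[Sequence[Path]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 250 made-up paths split into batches of 100; the last batch is shorter.
files = [Path(f"doc_{i}.pdf") for i in range(250)]
sizes = [len(batch) for batch in chunked(files, 100)]
print(sizes)
```

A smaller batch size lowers the peak number of pending futures at the cost of creating more executors, so it is worth tuning against your own memory limits.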

Scaling Up

For truly large datasets, you might need more sophisticated approaches:

batch_streaming.py
# Advanced: streaming results to avoid memory buildup
import json
import threading
from pathlib import Path
from pdf_oxide import PdfDocument
from concurrent.futures import ThreadPoolExecutor, as_completed

# Serializes appends so lines written by different threads cannot interleave
_write_lock = threading.Lock()


def process_pdf_streaming(pdf_path: Path, output_file: Path) -> None:
    try:
        doc = PdfDocument(str(pdf_path))
        result = {
            "path": str(pdf_path),
            "pages": doc.page_count,
            "text": "\n".join(doc.extract_text(i) for i in range(doc.page_count)),
            "status": "success"
        }
    except Exception as e:
        result = {"path": str(pdf_path), "error": str(e), "status": "failed"}
    # Write immediately, don't keep in memory
    line = json.dumps(result) + "\n"
    with _write_lock:
        with open(output_file, "a", encoding="utf-8") as f:
            f.write(line)


def batch_process_streaming(pdf_dir: Path, output_file: Path, batch_size: int = 100, max_workers: int = 8) -> None:
    pdf_files = list(pdf_dir.glob("*.pdf"))
    # Clear output file
    output_file.write_text("")
    for i in range(0, len(pdf_files), batch_size):
        batch = pdf_files[i:i + batch_size]
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = {executor.submit(process_pdf_streaming, pdf, output_file): pdf for pdf in batch}
            for future in as_completed(futures):
                future.result()  # We're streaming, results go directly to file

This streams results to disk as they’re processed, one JSON object per line, so memory usage stays roughly flat regardless of dataset size.
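Since the output is one JSON object per line, it can be read back the same way it was written: lazily, one record at a time. Here is a minimal sketch of that, assuming the record fields used above ("path", "status"). The helpers iter_results and count_statuses are hypothetical names, and the two sample records are hand-written stand-ins for real output.

```python
import json
import tempfile
from collections import Counter
from collections.abc import Iterator
from pathlib import Path


def iter_results(output_file: Path) -> Iterator[dict]:
    """Yield one result dict per non-empty line, without loading the whole file."""
    with open(output_file, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def count_statuses(output_file: Path) -> Counter:
    """Tally results by their "status" field."""
    return Counter(record["status"] for record in iter_results(output_file))


# Demo with two hand-written records standing in for real output.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "results.jsonl"
    sample.write_text(
        json.dumps({"path": "a.pdf", "pages": 2, "text": "hi", "status": "success"}) + "\n"
        + json.dumps({"path": "b.pdf", "error": "bad file", "status": "failed"}) + "\n",
        encoding="utf-8",
    )
    counts = count_statuses(sample)
    failed_paths = [r["path"] for r in iter_results(sample) if r["status"] == "failed"]

print(counts, failed_paths)
```

Reading lazily keeps the consumer side as memory-light as the producer side, which matters once the output file holds the text of tens of thousands of documents.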

Choosing Your Approach

Here’s when to use each method:

Basic batch processing: When you have fewer than 1,000 PDFs and enough RAM to hold all the extracted text. Simple and effective.

Memory-safe batching: When you have 1,000 to 10,000 PDFs or memory constraints. The sweet spot for most use cases.

Streaming output: When you have more than 10,000 PDFs or need to process continuously. Results never accumulate in memory, so it keeps working as the dataset grows.
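If you want this rule of thumb in code, for example to pick a mode in a command-line wrapper, it can be sketched as a small function. The thresholds are the ones listed above. The function name, the return labels, and the choice to send exactly 10,000 files to memory-safe batching are my own assumptions.

```python
def choose_strategy(pdf_count: int, memory_constrained: bool = False) -> str:
    """Map a dataset size to one of the three approaches described above."""
    if pdf_count > 10_000:
        return "streaming"
    if pdf_count >= 1_000 or memory_constrained:
        return "memory_safe"
    return "basic"


for count in (300, 5_000, 50_000):
    print(count, choose_strategy(count))
```

Treat the cutoffs as starting points: a few hundred very large PDFs can need streaming, while many thousands of one-page files may fit comfortably in memory.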

Summary

In this post, I showed how to process thousands of PDFs efficiently. The key point is parallel processing with proper memory management.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can reach me by email.

If you found this article useful, don’t forget to support me by starring the repo on GitHub!
