How to Process Thousands of PDFs in Seconds with Python
I had three thousand PDFs to process. Single-threaded processing took hours. I needed a faster way.
Here’s what I found: with the right approach, you can process thousands of PDFs in seconds, not hours.
The Performance Reality
Let me share some real numbers from my testing:
| Metric | Value | What This Means |
|---|---|---|
| Mean extraction | 0.8ms | ~75,000 PDFs/minute on a single thread |
| p99 latency | 9ms | Predictable performance across the batch |
| Pass rate | 100% | No silent failures to worry about |
These numbers come from processing actual production PDFs using the pdf-oxide library. Fast extraction means you can scale horizontally with threads without hitting bottlenecks.
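The throughput figure in the table follows directly from the mean extraction time. As a quick sanity check of that arithmetic, here is a small sketch; the function name `pdfs_per_minute` is only illustrative and not part of any library:

```python
def pdfs_per_minute(mean_ms: float) -> float:
    """Convert a mean per-PDF extraction time (milliseconds) to PDFs per minute."""
    if mean_ms <= 0:
        raise ValueError("mean_ms must be positive")
    seconds_per_pdf = mean_ms / 1000.0
    return 60.0 / seconds_per_pdf


if __name__ == "__main__":
    # 0.8 ms per PDF is 1,250 PDFs per second, i.e. about 75,000 per minute
    print(f"{pdfs_per_minute(0.8):,.0f} PDFs/minute")
```

This covers extraction time on a single thread only; file I/O and whatever you do with the text afterwards come on top.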
Basic Batch Processing
Here’s the foundation: processing multiple PDFs in parallel with ThreadPoolExecutor.
    # Basic batch processing: process multiple PDFs in parallel
    from pathlib import Path
    from pdf_oxide import PdfDocument
    from concurrent.futures import ThreadPoolExecutor, as_completed


    def process_pdf(pdf_path: Path) -> dict:
        try:
            doc = PdfDocument(str(pdf_path))
            return {
                "path": str(pdf_path),
                "pages": doc.page_count,
                "text": "\n".join(doc.extract_text(i) for i in range(doc.page_count)),
                "status": "success"
            }
        except Exception as e:
            return {"path": str(pdf_path), "error": str(e), "status": "failed"}


    def batch_process(pdf_dir: Path, max_workers: int = 8) -> list[dict]:
        pdf_files = list(pdf_dir.glob("*.pdf"))
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = {executor.submit(process_pdf, pdf): pdf for pdf in pdf_files}
            results = []
            for future in as_completed(futures):
                results.append(future.result())
        return results

This gives you parallel processing with minimal overhead. The ThreadPoolExecutor handles the thread pool management, and as_completed lets you process results as they finish.
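Because each worker returns a dict with a "status" key instead of raising, you need to inspect the results to see what failed. A minimal sketch of that step, assuming the result shape produced by process_pdf above; `summarize` is a hypothetical helper name:

```python
from collections import Counter


def summarize(results: list[dict]) -> dict:
    """Tally successes, failures, and total pages from process_pdf-style results."""
    counts = Counter(r["status"] for r in results)
    return {
        "success": counts.get("success", 0),
        "failed": counts.get("failed", 0),
        # Failed entries carry no "pages" key, so default to 0
        "total_pages": sum(r.get("pages", 0) for r in results),
        "failed_paths": [r["path"] for r in results if r["status"] == "failed"],
    }


if __name__ == "__main__":
    sample = [
        {"path": "a.pdf", "pages": 3, "text": "...", "status": "success"},
        {"path": "b.pdf", "error": "could not parse", "status": "failed"},
    ]
    print(summarize(sample))
```

Running this on the output of batch_process gives a quick way to confirm the pass rate on your own files.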
Memory Management Matters
Processing thousands of PDFs can eat memory fast. Here’s what I learned:
    # Memory-efficient batch processing with garbage collection
    import gc
    from pathlib import Path
    from pdf_oxide import PdfDocument
    from concurrent.futures import ThreadPoolExecutor, as_completed


    def process_pdf(pdf_path: Path) -> dict:
        try:
            doc = PdfDocument(str(pdf_path))
            return {
                "path": str(pdf_path),
                "pages": doc.page_count,
                "text": "\n".join(doc.extract_text(i) for i in range(doc.page_count)),
                "status": "success"
            }
        except Exception as e:
            return {"path": str(pdf_path), "error": str(e), "status": "failed"}


    def batch_process_memory_safe(pdf_dir: Path, batch_size: int = 100, max_workers: int = 8) -> list[dict]:
        all_results = []
        pdf_files = list(pdf_dir.glob("*.pdf"))

        for i in range(0, len(pdf_files), batch_size):
            batch = pdf_files[i:i + batch_size]

            with ThreadPoolExecutor(max_workers=max_workers) as executor:
                futures = {executor.submit(process_pdf, pdf): pdf for pdf in batch}
                batch_results = []
                for future in as_completed(futures):
                    batch_results.append(future.result())

            all_results.extend(batch_results)
            gc.collect()  # Force garbage collection between batches

        return all_results

The key changes:
- Process in batches of 100 (adjust based on your memory constraints)
- Force garbage collection between batches
- Each batch is processed independently, which caps how many documents and futures are alive at once (the extracted text itself still accumulates in all_results)
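The batching described in the list above comes down to slicing the file list into fixed-size chunks. A minimal sketch of that step as a reusable generator; `chunked` is a hypothetical helper name, not something provided by pdf-oxide or the standard library:

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    files = [f"doc_{n}.pdf" for n in range(250)]
    for batch in chunked(files, 100):
        print(len(batch))  # the last batch holds the remainder
```

With this helper, the outer loop in batch_process_memory_safe could read `for batch in chunked(pdf_files, batch_size):` without changing its behavior.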
Scaling Up
For truly large datasets, you might need more sophisticated approaches:
    # Advanced: streaming results to avoid memory buildup
    import json
    import threading
    from pathlib import Path
    from pdf_oxide import PdfDocument
    from concurrent.futures import ThreadPoolExecutor, as_completed

    # Serializes appends so lines written by different threads cannot interleave
    _write_lock = threading.Lock()


    def process_pdf_streaming(pdf_path: Path, output_file: Path) -> None:
        try:
            doc = PdfDocument(str(pdf_path))
            result = {
                "path": str(pdf_path),
                "pages": doc.page_count,
                "text": "\n".join(doc.extract_text(i) for i in range(doc.page_count)),
                "status": "success"
            }
        except Exception as e:
            result = {"path": str(pdf_path), "error": str(e), "status": "failed"}

        # Write immediately, don't keep in memory
        with _write_lock, open(output_file, "a", encoding="utf-8") as f:
            f.write(json.dumps(result) + "\n")


    def batch_process_streaming(pdf_dir: Path, output_file: Path, batch_size: int = 100, max_workers: int = 8) -> None:
        pdf_files = list(pdf_dir.glob("*.pdf"))

        # Clear output file
        output_file.write_text("")

        for i in range(0, len(pdf_files), batch_size):
            batch = pdf_files[i:i + batch_size]

            with ThreadPoolExecutor(max_workers=max_workers) as executor:
                futures = {executor.submit(process_pdf_streaming, pdf, output_file): pdf for pdf in batch}
                for future in as_completed(futures):
                    future.result()  # Results are already on disk; this only surfaces unexpected errors

This streams results to disk as they’re processed, one JSON object per line, keeping memory usage roughly constant regardless of dataset size. The lock matters because several threads append to the same file: without it, large records written at the same time could interleave and corrupt a line.
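Once the results live in a file with one JSON object per line, you can read them back lazily as well, so the consumer side stays as memory-friendly as the producer. A minimal sketch, assuming the output format written above; `iter_results` and `failed_paths` are hypothetical helper names:

```python
import json
from pathlib import Path
from typing import Iterator


def iter_results(output_file: Path) -> Iterator[dict]:
    """Yield one result dict per non-empty line of a JSON Lines file."""
    with open(output_file, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def failed_paths(output_file: Path) -> list[str]:
    """Collect the paths of all PDFs whose status is 'failed'."""
    return [r["path"] for r in iter_results(output_file) if r.get("status") == "failed"]
```

Only one record is held in memory at a time while iterating, which fits the streaming approach on the writing side.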
Choosing Your Approach
Here’s when to use each method:
Basic batch processing: When you have fewer than 1,000 PDFs and enough RAM to hold every result. Simple and effective.
Memory-safe batching: When you have 1,000 to 10,000 PDFs or memory constraints. The sweet spot for most use cases.
Streaming output: When you have 10,000+ PDFs or need to process continuously. Memory use stays flat as the dataset grows.
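If you want to make that choice in code, the thresholds above translate into a small dispatch function. This is only a sketch of the rule of thumb: `choose_strategy` is a hypothetical name, and treating exactly 10,000 PDFs as the streaming case is my assumption, since the ranges above overlap at that point.

```python
def choose_strategy(pdf_count: int) -> str:
    """Map a PDF count to one of the three approaches described in this post."""
    if pdf_count < 0:
        raise ValueError("pdf_count must not be negative")
    if pdf_count < 1_000:
        return "basic"
    if pdf_count < 10_000:
        return "memory_safe"
    # Assumption: the boundary value of 10,000 goes to streaming
    return "streaming"


if __name__ == "__main__":
    for n in (300, 3_000, 30_000):
        print(n, choose_strategy(n))
```

Available memory matters as much as the count, so treat the cut-offs as starting points rather than hard limits.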
Summary
In this post, I showed how to process thousands of PDFs efficiently. The key point is parallel processing with proper memory management.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.