How to Process Thousands of PDFs in Seconds with Python

Mar 4, 2026

I had three thousand PDFs to process. Single-threaded processing took hours. I needed a faster way.

Here’s what I found: with the right approach, you can process thousands of PDFs in seconds, not hours.

The Performance Reality

Let me share some real numbers from my testing:

Metric	Value	What This Means
Mean extraction	0.8ms	~75,000 PDFs/minute on a single thread
p99 latency	9ms	Predictable performance across the batch
Pass rate	100%	No silent failures to worry about

These numbers come from processing actual production PDFs with the pdf-oxide library. With per-file extraction this fast, the extraction step itself is unlikely to be the bottleneck, so you can spread the work across threads to raise throughput further.
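As a quick sanity check, the throughput figure in the table follows directly from the mean latency. The sketch below only redoes that arithmetic; the function names are mine, the 3,000-file count is the batch from the intro, and the estimate ignores file I/O and thread overhead, so treat it as a rough floor rather than a measurement.

```python
# Back-of-the-envelope throughput from the mean extraction time in the table.
MEAN_EXTRACTION_MS = 0.8  # mean per-PDF extraction time reported above


def pdfs_per_minute(mean_ms: float) -> float:
    """PDFs one thread can extract per minute at the given mean latency."""
    return 60_000 / mean_ms


def estimated_seconds(num_pdfs: int, mean_ms: float) -> float:
    """Pure extraction time for num_pdfs files on one thread, in seconds."""
    return num_pdfs * mean_ms / 1000


print(f"{pdfs_per_minute(MEAN_EXTRACTION_MS):,.0f} PDFs/minute")
print(f"{estimated_seconds(3000, MEAN_EXTRACTION_MS):.1f} s for 3,000 PDFs")
```

At 0.8ms per file, three thousand PDFs come out to roughly 2.4 seconds of pure extraction time on one thread, which is where the "seconds, not hours" claim comes from.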

Basic Batch Processing

Here’s the foundation: a function that processes one PDF, plus a ThreadPoolExecutor that runs it across many files in parallel.

# Basic batch processing: process multiple PDFs in parallel
from pathlib import Path
from pdf_oxide import PdfDocument
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_pdf(pdf_path: Path) -> dict:
    try:
        doc = PdfDocument(str(pdf_path))
        return {
            "path": str(pdf_path),
            "pages": doc.page_count,
            "text": "\n".join(doc.extract_text(i) for i in range(doc.page_count)),
            "status": "success"
        }
    except Exception as e:
        return {"path": str(pdf_path), "error": str(e), "status": "failed"}

def batch_process(pdf_dir: Path, max_workers: int = 8) -> list[dict]:
    pdf_files = list(pdf_dir.glob("*.pdf"))
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(process_pdf, pdf): pdf for pdf in pdf_files}
        results = []
        for future in as_completed(futures):
            results.append(future.result())
    return results

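This gives you parallel processing with minimal overhead. The ThreadPoolExecutor handles the thread pool management, and as_completed lets you process results as they finish. Note that this means results come back in completion order, not in the order the files were submitted.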
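Because process_pdf returns a dict instead of raising, failures are easy to miss unless you count them. Here is a minimal sketch of a summary step, assuming the result dicts carry the "status", "path", and "error" keys used above. summarize_results is a hypothetical helper name of mine, not part of pdf-oxide, and the sample data is hand-written.

```python
from collections import Counter


def summarize_results(results: list[dict]) -> dict:
    """Count results by status and collect (path, error) pairs for failures."""
    counts = Counter(r["status"] for r in results)
    failures = [
        (r["path"], r.get("error", ""))
        for r in results
        if r["status"] == "failed"
    ]
    return {
        "total": len(results),
        "success": counts["success"],  # Counter returns 0 for missing keys
        "failed": counts["failed"],
        "failures": failures,
    }


# Hand-written sample in the same shape process_pdf returns (not real output)
sample = [
    {"path": "a.pdf", "pages": 2, "text": "hello", "status": "success"},
    {"path": "b.pdf", "error": "could not open file", "status": "failed"},
]
summary = summarize_results(sample)
print(f"{summary['success']}/{summary['total']} succeeded")
for path, error in summary["failures"]:
    print(f"FAILED {path}: {error}")
```

In a real run you would pass in the list returned by batch_process instead of the sample.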

Memory Management Matters

Processing thousands of PDFs can eat memory fast. Here’s what I learned:

# Memory-efficient batch processing with garbage collection
import gc
from pathlib import Path
from pdf_oxide import PdfDocument
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_pdf(pdf_path: Path) -> dict:
    try:
        doc = PdfDocument(str(pdf_path))
        return {
            "path": str(pdf_path),
            "pages": doc.page_count,
            "text": "\n".join(doc.extract_text(i) for i in range(doc.page_count)),
            "status": "success"
        }
    except Exception as e:
        return {"path": str(pdf_path), "error": str(e), "status": "failed"}

def batch_process_memory_safe(pdf_dir: Path, batch_size: int = 100, max_workers: int = 8) -> list[dict]:
    all_results = []
    pdf_files = list(pdf_dir.glob("*.pdf"))

    for i in range(0, len(pdf_files), batch_size):
        batch = pdf_files[i:i + batch_size]

        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = {executor.submit(process_pdf, pdf): pdf for pdf in batch}
            batch_results = []
            for future in as_completed(futures):
                batch_results.append(future.result())

        all_results.extend(batch_results)
        gc.collect()  # Force garbage collection between batches

    return all_results

The key changes:

Process in batches of 100 (adjust based on your memory constraints)
Force garbage collection between batches
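Each batch is processed independently, which caps the number of in-flight tasks and document objects at any one time (note that all_results still accumulates every extracted text, so the final list grows with the dataset)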
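The batching step itself is just list slicing. Here is a small, library-free sketch of it pulled out into a generator, so you can see how a file list gets split; chunked is a hypothetical helper name I'm introducing for illustration, and the file names are made up.

```python
from collections.abc import Iterator, Sequence
from typing import TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 250 made-up file names split into batches of 100
files = [f"doc_{n}.pdf" for n in range(250)]
sizes = [len(batch) for batch in chunked(files, 100)]
print(sizes)  # [100, 100, 50]
```

The last batch is simply shorter, which is why the loop above needs no special case for leftovers.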

Scaling Up

For truly large datasets, you might need more sophisticated approaches:

# Advanced: streaming results to avoid memory buildup
import json
import threading
from pathlib import Path
from pdf_oxide import PdfDocument
from concurrent.futures import ThreadPoolExecutor, as_completed

# All worker threads append to the same file, so serialize the writes
_write_lock = threading.Lock()

def process_pdf_streaming(pdf_path: Path, output_file: Path) -> None:
    try:
        doc = PdfDocument(str(pdf_path))
        result = {
            "path": str(pdf_path),
            "pages": doc.page_count,
            "text": "\n".join(doc.extract_text(i) for i in range(doc.page_count)),
            "status": "success"
        }
    except Exception as e:
        result = {"path": str(pdf_path), "error": str(e), "status": "failed"}

    # Write immediately, don't keep in memory; the lock keeps lines from interleaving
    with _write_lock:
        with open(output_file, "a") as f:
            f.write(json.dumps(result) + "\n")

def batch_process_streaming(pdf_dir: Path, output_file: Path, batch_size: int = 100, max_workers: int = 8) -> None:
    pdf_files = list(pdf_dir.glob("*.pdf"))

    # Clear output file
    output_file.write_text("")

    for i in range(0, len(pdf_files), batch_size):
        batch = pdf_files[i:i + batch_size]

        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = {executor.submit(process_pdf_streaming, pdf, output_file): pdf for pdf in batch}
            for future in as_completed(futures):
                future.result()  # We're streaming, results go directly to file

This streams results to disk as they’re processed, one JSON object per line, so memory usage stays roughly flat: it is bounded by the documents currently in flight rather than by the size of the dataset.
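Since the output is one JSON object per line, it can be read back the same way it was written: one record at a time. Here is a minimal sketch, assuming the file was produced by batch_process_streaming above; iter_results and count_failures are hypothetical helper names of mine. json.dumps escapes non-ASCII characters by default, so reading the file as UTF-8 should be safe.

```python
import json
from collections.abc import Iterator
from pathlib import Path


def iter_results(output_file: Path) -> Iterator[dict]:
    """Yield one result dict per non-empty line, without loading the whole file."""
    with open(output_file, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def count_failures(output_file: Path) -> int:
    """Count records whose status is 'failed'."""
    return sum(1 for r in iter_results(output_file) if r.get("status") == "failed")
```

Because iter_results is a generator, downstream steps such as indexing or filtering can also run in constant memory.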

Choosing Your Approach

Here’s when to use each method:

Basic batch processing: When you have fewer than 1,000 PDFs and enough RAM. Simple and effective.

Memory-safe batching: When you have 1,000 to 10,000 PDFs or memory constraints. The sweet spot for most use cases.

Streaming output: When you have 10,000+ PDFs or need to process continuously. Scales indefinitely.
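Those cut-offs can be written down as a tiny dispatcher. This is a sketch of the rule of thumb above, not a hard rule: the function name and the returned labels are hypothetical, and since the ranges overlap at exactly 10,000 files, I've arbitrarily sent that case to streaming.

```python
def choose_strategy(num_pdfs: int, memory_constrained: bool = False) -> str:
    """Map a file count to one of the three approaches described above."""
    if num_pdfs >= 10_000:
        return "streaming"
    if num_pdfs >= 1_000 or memory_constrained:
        return "memory_safe"
    return "basic"


for n in (300, 3_000, 30_000):
    print(n, choose_strategy(n))
```

The thresholds are worth tuning to your own files: a few hundred very large PDFs can need the memory-safe path sooner than thousands of one-page ones.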

Summary

In this post, I showed how to process thousands of PDFs efficiently. The key point is parallel processing with proper memory management.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.

Here are the most important links from this article, along with some further resources on this topic:

👨‍💻 pdf-oxide GitHub Repository
👨‍💻 Python concurrent.futures Documentation

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!