
How to Extract Text from PDF in Python Fast

I needed to extract text from hundreds of PDFs for a data processing pipeline. The standard Python libraries were too slow. PyPDF2 took around 12ms per page. pdfplumber was even worse at 15ms. When you’re processing thousands of documents, that adds up quickly.

I found pdf_oxide. It extracted text in about 0.8ms per page, a 15× speed improvement over PyPDF2. For 1,000 single-page PDFs, that is 12 seconds at 12ms per page versus 0.8 seconds at 0.8ms per page. For 10,000 files, it’s 2 minutes versus 8 seconds.
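
The arithmetic is easy to check. Here is a minimal sketch, assuming one page per PDF (the page counts are not stated) and treating the per-page timings as fixed averages. The helper name `estimate_seconds` is made up for illustration.

```python
def estimate_seconds(pages: int, ms_per_page: float) -> float:
    """Total extraction time in seconds for a given page count."""
    return pages * ms_per_page / 1000.0


# Per-page timings quoted above, in milliseconds.
SLOW_MS = 12.0  # PyPDF2
FAST_MS = 0.8   # pdf_oxide

for pages in (1_000, 10_000):
    slow = estimate_seconds(pages, SLOW_MS)
    fast = estimate_seconds(pages, FAST_MS)
    print(f"{pages:>6} pages: {slow:6.1f} s vs {fast:4.1f} s")
```

Multi-page documents scale the same way: multiply by the total page count rather than the file count.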

Let me show you how to use it.

Installation

First, install the package. It supports Python 3.8 through 3.14.

Terminal
pip install pdf_oxide

Basic Text Extraction

Open a PDF and extract text from a single page.

extract_single_page.py
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
text = doc.extract_text(0) # First page
print(text)

Extract text from all pages:

extract_all_pages.py
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
all_text = []
for i in range(doc.page_count):
    text = doc.extract_text(i)
    all_text.append(text)
full_text = "\n".join(all_text)

Character-Level Extraction

When you need character positions for layout-aware applications, use extract_chars. This gives you bounding boxes and coordinates.

extract_chars.py
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
chars = doc.extract_chars(0)
for ch in chars:
    print(f"'{ch.char}' at ({ch.x:.1f}, {ch.y:.1f})")

Each character object includes position data. The x and y values are in PDF coordinate space. The bottom-left corner is typically (0, 0).
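
One common use of character positions is rebuilding text lines. The sketch below groups characters by their y coordinate and reads each line left to right. It runs on a stand-in `Char` dataclass that only assumes the `char`, `x`, and `y` fields used above. The `group_into_lines` helper and its `y_tolerance` parameter are hypothetical and not part of pdf_oxide.

```python
from dataclasses import dataclass


@dataclass
class Char:
    """Stand-in for the objects returned by extract_chars (assumed fields)."""
    char: str
    x: float
    y: float


def group_into_lines(chars, y_tolerance=2.0):
    """Group characters into text lines, top of page first.

    PDF y grows upward from the bottom-left origin, so a larger y
    means higher on the page.
    """
    lines = []  # each entry: [y of the line's first char, [chars on that line]]
    for ch in sorted(chars, key=lambda c: -c.y):
        if lines and abs(lines[-1][0] - ch.y) <= y_tolerance:
            lines[-1][1].append(ch)
        else:
            lines.append([ch.y, [ch]])
    # Within each line, read left to right.
    return ["".join(c.char for c in sorted(row, key=lambda c: c.x))
            for _, row in lines]


sample = [
    Char("i", 20.0, 700.0), Char("H", 10.0, 700.5),
    Char("k", 20.0, 680.0), Char("O", 10.0, 680.0),
]
print(group_into_lines(sample))  # ['Hi', 'Ok']
```

The tolerance absorbs small baseline differences within a line. A value that works for one document may merge or split lines in another, so treat it as a tuning knob.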

Span-Level Extraction

For text with metadata like font information, use extract_spans.

extract_spans.py
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
spans = doc.extract_spans(0)
for span in spans:
    print(f"{span.text} (font: {span.font}, size: {span.size})")

This is useful when you need to identify headings, distinguish between different text styles, or preserve formatting information.

Password-Protected PDFs

Open encrypted PDFs by providing the password.

password_protected.py
from pdf_oxide import PdfDocument
doc = PdfDocument("secure.pdf", password="secret")
text = doc.extract_text(0)

Batch Processing

For processing multiple files efficiently, keep the document handle open as briefly as possible. Extract what you need, then let the document close.

batch_process.py
from pdf_oxide import PdfDocument
from pathlib import Path

def process_pdf(file_path):
    doc = PdfDocument(file_path)
    results = []
    for i in range(doc.page_count):
        text = doc.extract_text(i)
        results.append(text)
    # Document closes automatically when doc goes out of scope
    return "\n".join(results)

pdf_dir = Path("pdfs")
for pdf_file in pdf_dir.glob("*.pdf"):
    text = process_pdf(pdf_file)
    # Process or save the text

The performance advantage comes from pdf_oxide’s Rust-based implementation. The heavy lifting happens in compiled code, not interpreted Python.
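
In a large batch, one corrupt or unreadable file should not stop the whole run. The sketch below wraps the batch loop so failures are recorded and the loop continues. `process_directory` is a hypothetical helper that takes the extraction function as a parameter, so `process_pdf` from the example above can be passed in. It catches `Exception` broadly because this article does not cover the library's specific error types.

```python
from pathlib import Path


def process_directory(pdf_dir, extract):
    """Run `extract(path)` on every PDF in a folder, isolating failures.

    `extract` is any callable that takes a path and returns text.
    Returns (results, failures), both keyed by file name.
    """
    results, failures = {}, {}
    for pdf_file in sorted(Path(pdf_dir).glob("*.pdf")):
        try:
            results[pdf_file.name] = extract(pdf_file)
        except Exception as exc:  # broad on purpose: error types are not assumed
            failures[pdf_file.name] = str(exc)
    return results, failures
```

Calling `results, failures = process_directory("pdfs", process_pdf)` then gives the extracted text and a list of files to inspect by hand.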

Which Method to Use

The library provides four extraction methods for different needs:

| Method | Returns | Use Case |
| --- | --- | --- |
| extract_text(page) | String | Quick text extraction |
| extract_spans(page) | Text with metadata | Font, size, color info |
| extract_chars(page) | Per-character data | Position-sensitive apps |
| extract_images(page) | Image objects | Image extraction |

Start with extract_text for simple text retrieval. Move to extract_chars or extract_spans only when you need the additional data. The more detailed methods are slower than basic text extraction.
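
To see how much slower the detailed methods are on your own documents, you can time them directly. The sketch below is a generic timing helper built on `time.perf_counter`. `ms_per_call` is a hypothetical name, and the workload is a stand-in function so the example runs without any PDF.

```python
import time


def ms_per_call(func, args_list, repeats=3):
    """Best-of-`repeats` average wall-clock time per call, in milliseconds."""
    if not args_list:
        raise ValueError("args_list must not be empty")
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for args in args_list:
            func(*args)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed / len(args_list))
    return best * 1000.0


# Stand-in workload. To time real extraction, pass doc.extract_text or
# doc.extract_chars with args_list = [(i,) for i in range(doc.page_count)].
def fake_extract(page_index):
    return "x" * 1000 * (page_index + 1)


pages = [(i,) for i in range(50)]
print(f"{ms_per_call(fake_extract, pages):.4f} ms per call")
```

Taking the best of several repeats reduces noise from other processes. It also hides cold-start costs, so measure a single pass separately if first-run latency matters.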

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
