How to Extract Text from PDF in Python Fast
I needed to extract text from hundreds of PDFs for a data processing pipeline. The standard Python libraries were too slow. PyPDF2 took around 12ms per page. pdfplumber was even worse at 15ms. When you’re processing thousands of documents, that adds up quickly.
Then I found pdf_oxide. It extracted text in about 0.8ms per page, a 15× speed improvement. For 1,000 single-page PDFs, that is the difference between 12 seconds and 0.8 seconds. For 10,000, it’s 2 minutes versus 8 seconds, and multi-page documents scale those totals by their page count.
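Those totals are simple arithmetic, and it is worth plugging in your own page counts. Here is a minimal sketch that uses only the per-page timings quoted above; the function name `total_seconds` is just an illustrative choice.

```python
def total_seconds(pages: int, ms_per_page: float) -> float:
    """Total extraction time in seconds for `pages` pages."""
    return pages * ms_per_page / 1000.0


for pages in (1_000, 10_000):
    slow = total_seconds(pages, 12.0)  # PyPDF2 timing quoted above
    fast = total_seconds(pages, 0.8)   # pdf_oxide timing quoted above
    print(f"{pages:>6} pages: {slow:.1f} s vs {fast:.1f} s")
```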
Let me show you how to use it.
Installation
First, install the package. It supports Python 3.8 through 3.14.

    pip install pdf_oxide

Basic Text Extraction
Open a PDF and extract text from a single page.

    from pdf_oxide import PdfDocument

    doc = PdfDocument("report.pdf")
    text = doc.extract_text(0)  # First page
    print(text)

Extract text from all pages:

    from pdf_oxide import PdfDocument

    doc = PdfDocument("report.pdf")
    all_text = []

    for i in range(doc.page_count):
        text = doc.extract_text(i)
        all_text.append(text)

    full_text = "\n".join(all_text)

Character-Level Extraction
When you need character positions for layout-aware applications, use extract_chars. This gives you bounding boxes and coordinates.

    from pdf_oxide import PdfDocument

    doc = PdfDocument("report.pdf")
    chars = doc.extract_chars(0)

    for ch in chars:
        print(f"'{ch.char}' at ({ch.x:.1f}, {ch.y:.1f})")

Each character object includes position data. The x and y values are in PDF coordinate space. The bottom-left corner is typically (0, 0).
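One common layout-aware task is rebuilding lines of text from character positions. The sketch below assumes nothing beyond the `char`, `x`, and `y` attributes used above. `Char` is a stand-in class so the example runs without a PDF, and `group_into_lines` and its `y_tolerance` default are illustrative choices, not part of pdf_oxide.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Char:
    """Stand-in for a character record; only the fields used above."""
    char: str
    x: float
    y: float


def group_into_lines(chars: Iterable[Char], y_tolerance: float = 2.0) -> List[str]:
    """Group characters into text lines, top of the page first."""
    # With a bottom-left origin, a larger y is higher on the page.
    ordered = sorted(chars, key=lambda c: -c.y)
    lines: List[List[Char]] = []
    for ch in ordered:
        # Same line if y is within the tolerance of that line's first character.
        if lines and abs(lines[-1][0].y - ch.y) <= y_tolerance:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    # Within each line, read left to right.
    return ["".join(c.char for c in sorted(line, key=lambda c: c.x)) for line in lines]


sample = [
    Char("H", 72.0, 700.0), Char("i", 80.0, 700.4),
    Char("o", 72.0, 680.0), Char("k", 80.0, 680.0),
]
print(group_into_lines(sample))  # ['Hi', 'ok']
```

With a real document you would pass `doc.extract_chars(0)` instead of `sample`, and tune `y_tolerance` to the font sizes in your files.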
Span-Level Extraction
For text with metadata like font information, use extract_spans.

    from pdf_oxide import PdfDocument

    doc = PdfDocument("report.pdf")
    spans = doc.extract_spans(0)

    for span in spans:
        print(f"{span.text} (font: {span.font}, size: {span.size})")

This is useful when you need to identify headings, distinguish between different text styles, or preserve formatting information.
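As a rough illustration of heading detection, the sketch below flags spans whose font size is clearly larger than the body text. It relies only on the `text`, `font`, and `size` attributes printed above. `Span` is a stand-in class so the example runs on its own, and `find_headings` with its `ratio` threshold is a simple heuristic of my own, not a pdf_oxide feature.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Span:
    """Stand-in for a span record; only the fields printed above."""
    text: str
    font: str
    size: float


def find_headings(spans: Iterable[Span], ratio: float = 1.2) -> List[str]:
    """Return the text of spans noticeably larger than the body font size."""
    spans = list(spans)
    if not spans:
        return []
    # Treat the font size that covers the most characters as body text.
    weights: Counter = Counter()
    for s in spans:
        weights[round(s.size, 1)] += len(s.text)
    body_size = weights.most_common(1)[0][0]
    return [s.text for s in spans if s.size >= body_size * ratio]


sample = [
    Span("Quarterly Report", "Helvetica-Bold", 18.0),
    Span("Revenue grew in the third quarter.", "Helvetica", 11.0),
    Span("Outlook", "Helvetica-Bold", 14.0),
    Span("We expect similar growth next year.", "Helvetica", 11.0),
]
print(find_headings(sample))  # ['Quarterly Report', 'Outlook']
```

Real documents are messier than this sample, so treat the size ratio as a starting point and combine it with the font name where that helps.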
Password-Protected PDFs
Open encrypted PDFs by providing the password.

    from pdf_oxide import PdfDocument

    doc = PdfDocument("secure.pdf", password="secret")
    text = doc.extract_text(0)

Batch Processing
For processing multiple files efficiently, keep the document handle open as briefly as possible. Extract what you need, then let the document close.

    from pathlib import Path

    from pdf_oxide import PdfDocument


    def process_pdf(file_path):
        # str() in case the constructor expects a plain string path
        doc = PdfDocument(str(file_path))
        results = []

        for i in range(doc.page_count):
            text = doc.extract_text(i)
            results.append(text)

        # Document closes automatically when doc goes out of scope
        return "\n".join(results)


    pdf_dir = Path("pdfs")
    for pdf_file in pdf_dir.glob("*.pdf"):
        text = process_pdf(pdf_file)
        # Process or save the text

The performance advantage comes from pdf_oxide’s Rust-based implementation. The heavy lifting happens in compiled code, not interpreted Python.
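Speed aside, a long batch run also needs to survive the odd corrupt or unreadable file. The sketch below wraps the loop with per-file error handling and writes each result to a `.txt` file. The `extract` argument is any callable that maps a path to text, such as `process_pdf` above. `run_batch` and the output layout are illustrative choices, and since I am not assuming which exception types the library raises, the handler is deliberately broad.

```python
from pathlib import Path
from typing import Callable, Dict


def run_batch(pdf_dir: Path, out_dir: Path, extract: Callable[[Path], str]) -> Dict[str, str]:
    """Extract every PDF in pdf_dir to a .txt file; return failures by file name."""
    out_dir.mkdir(parents=True, exist_ok=True)
    failures: Dict[str, str] = {}
    for pdf_file in sorted(pdf_dir.glob("*.pdf")):
        try:
            text = extract(pdf_file)
        except Exception as exc:  # broad on purpose: library exception types are not assumed
            failures[pdf_file.name] = repr(exc)
            continue
        # Write report.pdf -> report.txt in the output directory.
        (out_dir / (pdf_file.stem + ".txt")).write_text(text, encoding="utf-8")
    return failures
```

A call would look like `failures = run_batch(Path("pdfs"), Path("text"), process_pdf)`, after which `failures` lists every file that needs a second look.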
Which Method to Use
The library provides four extraction methods for different needs:
| Method | Returns | Use Case |
|---|---|---|
| extract_text(page) | String | Quick text extraction |
| extract_spans(page) | Text with metadata | Font, size, color info |
| extract_chars(page) | Per-character data | Position-sensitive apps |
| extract_images(page) | Image objects | Image extraction |
Start with extract_text for simple text retrieval. Move to extract_chars or extract_spans only when you need the additional data. The more detailed methods are slower than basic text extraction.
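How much slower depends on your documents, so it is worth measuring. The sketch below is a small timing helper built on the standard library only. `ms_per_page` is an illustrative name, and it accepts any callable that takes a page index, so the same helper can time each extraction method in turn.

```python
import time
from typing import Callable


def ms_per_page(extract: Callable[[int], object], page_count: int, repeats: int = 3) -> float:
    """Best-of-`repeats` average time per page, in milliseconds."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for i in range(page_count):
            extract(i)
        # Keep the fastest run to reduce noise from other processes.
        best = min(best, time.perf_counter() - start)
    return best * 1000.0 / page_count


# Stand-in workload so the sketch runs without a PDF.
print(f"{ms_per_page(lambda i: str(i) * 100, page_count=50):.4f} ms/page")
```

With a document open, comparing `ms_per_page(doc.extract_text, doc.page_count)` against `ms_per_page(doc.extract_chars, doc.page_count)` shows what the extra detail costs on your files.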
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
Here are the most important links from this article, along with some further resources on this topic:
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!