
How to Use OpenDataLoader PDF: Python Quick Start Guide for PDF Extraction

Purpose

This post demonstrates how to install OpenDataLoader PDF and use its Python SDK for batch PDF extraction. You will get clean Markdown and structured JSON data from your PDFs in about 30 seconds.

Prerequisites

  • Java 11+: Check with java -version. Install from Adoptium if needed.
  • Python 3.10+: Check with python --version.
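If you would rather verify both prerequisites from a script, here is a minimal sketch using only the standard library. The helper names check_prerequisites and java_version_banner are my own, not part of OpenDataLoader PDF. The sketch only confirms that a java executable is on your PATH; it does not parse the version number, so you still need to read the banner and confirm it reports 11 or newer.

```python
import shutil
import subprocess
import sys


def check_prerequisites(min_python=(3, 10)):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    if sys.version_info < min_python:
        found = ".".join(str(part) for part in sys.version_info[:3])
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, found {found}"
        )
    if shutil.which("java") is None:
        problems.append("java not found on PATH")
    return problems


def java_version_banner():
    """Return the raw `java -version` banner, or None if java is missing."""
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` writes its banner to stderr rather than stdout.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return (result.stderr or result.stdout).strip()


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("Missing prerequisite:", problem)
    print(java_version_banner())
```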

Installation

Install the package
pip install -U opendataloader-pdf

For hybrid mode features (OCR, complex tables, formulas):

Install with hybrid extras
pip install "opendataloader-pdf[hybrid]"

Basic Usage

The primary API is opendataloader_pdf.convert(). Pass a list of input paths (files and directories), an output directory, and a comma-separated format string such as "markdown,json".

Batch PDF extraction
import opendataloader_pdf

opendataloader_pdf.convert(
    input_path=["report.pdf", "invoice.pdf", "documents/"],
    output_dir="output/",
    format="markdown,json",
)

Performance Tip

Each convert() call spawns a JVM process. Calling it per-file is slow. Always batch all files into a single call.
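When your paths come from several places (a config file, command-line arguments, a folder scan), it helps to gather them into one list first and call convert() only once. The sketch below is one way to do that. collect_inputs and convert_all are hypothetical helpers of mine, not library functions. Since convert() already accepts directories, expanding them yourself is optional; I do it here only so the final list is explicit and free of duplicates. Note that the "*.pdf" pattern is case-sensitive on most Linux file systems.

```python
from pathlib import Path


def collect_inputs(*sources):
    """Flatten files and directories into one de-duplicated list of path strings."""
    collected = []
    for source in sources:
        path = Path(source)
        # Expand directories recursively; keep explicitly named files as given.
        candidates = sorted(path.rglob("*.pdf")) if path.is_dir() else [path]
        for candidate in candidates:
            text = str(candidate)
            if text not in collected:
                collected.append(text)
    return collected


def convert_all(sources, output_dir="output/", fmt="markdown,json"):
    """Run a single convert() call, and so a single JVM start, for all inputs."""
    # Imported lazily so collect_inputs() works without the package installed.
    import opendataloader_pdf

    opendataloader_pdf.convert(
        input_path=collect_inputs(*sources),
        output_dir=output_dir,
        format=fmt,
    )
```

Calling convert_all(["report.pdf", "documents/"]) then replaces a loop of per-file convert() calls with one batched call.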

CLI equivalent (batch processing)
opendataloader-pdf report.pdf invoice.pdf documents/ -o output/ --format markdown,json
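If you want to drive the CLI from a Python script (for example in a scheduled job), a small wrapper around subprocess is enough. This is a sketch under the assumption that the opendataloader-pdf executable is on your PATH. It uses only the -o and --format flags shown above, and build_command and run_cli are names I made up for illustration.

```python
import shlex
import subprocess


def build_command(inputs, output_dir="output/", fmt="markdown,json"):
    """Build the argument list for one batched CLI invocation."""
    return ["opendataloader-pdf", *inputs, "-o", output_dir, "--format", fmt]


def run_cli(inputs, **kwargs):
    """Run the CLI once for all inputs and raise if it exits with an error."""
    command = build_command(inputs, **kwargs)
    print("Running:", shlex.join(command))
    return subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.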

Advanced Options

With custom options
import opendataloader_pdf

opendataloader_pdf.convert(
    input_path=["paper.pdf"],
    output_dir="output/",
    format="json,markdown",
    reading_order="xycut",
    image_output="embedded",
    quiet=True,
)

Output Formats

Format          Use Case
JSON            Structured data with bounding boxes
Markdown        Clean text for LLM context
HTML            Web display
Text            Plain text
Annotated PDF   Visual debugging
Tagged PDF      Accessibility (screen readers)
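Once the conversion has finished, you will usually want to load the results back into Python. The sketch below assumes the JSON and Markdown outputs land in the output directory with .json and .md extensions. I have not documented the exact file naming or the JSON schema here, so the helper (load_outputs, a name of my own) simply groups whatever it finds by file stem and leaves the JSON structure untouched.

```python
import json
from pathlib import Path


def load_outputs(output_dir):
    """Map each output file stem to its parsed JSON and/or Markdown text."""
    results = {}
    for path in sorted(Path(output_dir).rglob("*")):
        if path.suffix == ".json":
            with path.open(encoding="utf-8") as handle:
                results.setdefault(path.stem, {})["json"] = json.load(handle)
        elif path.suffix == ".md":
            text = path.read_text(encoding="utf-8")
            results.setdefault(path.stem, {})["markdown"] = text
    return results
```

From there, load_outputs("output/") returns a dictionary you can feed into an LLM prompt (the Markdown) or a downstream parser (the JSON).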

Summary

In this post, I showed how to install OpenDataLoader PDF and run your first extraction. The key point is to batch all files in a single convert() call to avoid repeated JVM startup overhead.

Final Words + More Resources

My intention with this article was to share my knowledge and experience with others. If you want to contact me, you can reach me by email: Email me

Here are the most important links from this article, along with some further resources on this topic:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
