How to Use OpenDataLoader PDF: Quick Start Guide for Python, Node.js, and Java

Mar 22, 2026

Purpose

This post demonstrates how to get started with OpenDataLoader PDF for extracting text from PDFs programmatically in Python, Node.js, and Java.

I needed to extract text from PDF files for a RAG (Retrieval-Augmented Generation) project. I tried several PDF parsing libraries, but most of them struggled with complex layouts, tables, and multi-column documents. Then I found OpenDataLoader PDF, which converts PDFs to Markdown, JSON, or HTML formats with excellent accuracy.

Environment

Python 3.10+
Node.js 20.19.0+ (for Node.js wrapper)
Java 11+ (required for the core engine)
macOS / Linux / Windows

What is OpenDataLoader PDF?

OpenDataLoader PDF is a PDF parsing library that converts PDF documents into structured formats like Markdown, JSON, and HTML. It is written in Java and provides thin wrappers for Python and Node.js.

The key features:

Converts PDFs to Markdown, JSON, HTML, or annotated PDF
Preserves document structure (headings, tables, lists)
Extracts images with Base64 embedding support
Uses native PDF structure tags for better accuracy

Prerequisites: Java 11+

OpenDataLoader PDF requires Java 11+ because the core parsing engine is written in Java; the Python and Node.js packages are thin wrappers around the Java CLI. Make sure Java is installed before you start.

Check your Java version:

java -version

You should see output like:

openjdk version "17.0.2" 2022-01-18
OpenJDK Runtime Environment Temurin-17.0.2+8 (build 17.0.2+8)
OpenJDK 64-Bit Server VM Temurin-17.0.2+8 (build 17.0.2+8, mixed mode)

If Java is not installed, download JDK 11+ from Adoptium.
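If you prefer to verify the prerequisite from a script rather than by eye, the check can be sketched in Python. This is a minimal sketch, not part of OpenDataLoader PDF: the helper names parse_java_major and java_is_ready are made up for this example, and the parser only handles the common `version "..."` banner shown above.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Legacy scheme: Java 8 and older report themselves as "1.8.0_292".
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def java_is_ready(minimum: int = 11) -> bool:
    """Check that `java` is on PATH and reports at least `minimum`."""
    if shutil.which("java") is None:
        return False
    # `java -version` writes its banner to stderr, not stdout.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    major = parse_java_major(result.stderr + result.stdout)
    return major is not None and major >= minimum


if __name__ == "__main__":
    print("Java 11+ available:", java_is_ready())
```

Running this before the first conversion gives a clear yes/no answer instead of a failure deep inside the wrapper.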

Installation

Python

pip install -U opendataloader-pdf

Node.js

npm install @opendataloader/pdf

Java (Maven)

If you are using Java directly, add the Maven dependency:

<dependency>
  <groupId>org.opendataloader</groupId>
  <artifactId>opendataloader-pdf-core</artifactId>
  <version>1.0.0</version>
</dependency>

Basic Usage

Python Example

Here is a simple example to convert a PDF file to Markdown:

import opendataloader_pdf

# Convert single file
opendataloader_pdf.convert(
    input_path=["document.pdf"],
    output_dir="output/",
    format="markdown"
)

Save the script as convert_pdf.py and run it:

python convert_pdf.py

You will find the output in the output/ directory:

output/
└── document.md

Convert Multiple Files

You can convert multiple files at once:

import opendataloader_pdf

# Convert multiple files
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    format="markdown,json"
)

Process Entire Directory

To process an entire directory recursively:

import opendataloader_pdf

# Process entire directory (recursive)
opendataloader_pdf.convert(
    input_path="documents/",
    output_dir="output/",
    format="json"
)

Performance Tip: Batch Your Files

Each convert() call spawns a new JVM process, which adds approximately 1-2 seconds of startup overhead per invocation. For best performance, pass all files in a single call:

import opendataloader_pdf

# GOOD: One call, one JVM startup
opendataloader_pdf.convert(
    input_path=["f1.pdf", "f2.pdf", "f3.pdf", "folder/"],
    output_dir="output/",
    format="markdown"
)

# BAD: Multiple calls, multiple JVM startups (slow!)
# for f in files:
#     opendataloader_pdf.convert(input_path=[f], ...)

The difference adds up quickly. At 1-2 seconds per JVM startup, converting 100 PDF files with 100 separate convert() calls costs roughly 100-200 seconds of startup overhead alone, while a single call with all 100 files pays that cost once.
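When the files are scattered across several folders, one way to follow the batching advice is to collect every path first and then call convert() once. The sketch below assumes that workflow; collect_pdfs and convert_in_one_call are hypothetical helper names, and the convert() arguments are the same ones used in the examples above.

```python
from pathlib import Path


def collect_pdfs(*sources: str) -> list[str]:
    """Return a sorted list of PDF paths from files and folders, each listed once."""
    found = set()
    for source in sources:
        path = Path(source)
        if path.is_dir():
            # rglob walks the folder recursively.
            found.update(
                p for p in path.rglob("*")
                if p.is_file() and p.suffix.lower() == ".pdf"
            )
        elif path.is_file() and path.suffix.lower() == ".pdf":
            found.add(path)
    return sorted(str(p) for p in found)


def convert_in_one_call(sources: list[str], output_dir: str = "output/") -> int:
    """Convert every collected PDF with a single convert() call (one JVM startup)."""
    # Imported lazily so collect_pdfs() can be used without the package installed.
    import opendataloader_pdf

    files = collect_pdfs(*sources)
    if files:
        opendataloader_pdf.convert(
            input_path=files, output_dir=output_dir, format="markdown"
        )
    return len(files)
```

Because convert() already accepts folders, the collection step is mainly useful when you want to filter or log the file list before converting.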

Node.js Example

import { convert } from '@opendataloader/pdf';

await convert(['file1.pdf', 'file2.pdf', 'folder/'], {
  outputDir: 'output/',
  format: 'markdown,json'
});

Command Line Usage

You can also use the CLI directly:

# Batch all files in one call
opendataloader-pdf file1.pdf file2.pdf folder/ --output-dir output/ --format json,markdown

A fuller example with additional options:

opendataloader-pdf \
    file1.pdf file2.pdf folder/ \
    --output-dir output/ \
    --format json,markdown \
    --use-struct-tree \
    --quiet
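If you want to drive the CLI from a Python script (for example in a build step), the command can be assembled as an argument list. This is a rough sketch that uses only the flags shown above; build_cli_command and run_cli are names invented for this example, and it assumes the opendataloader-pdf executable is on your PATH.

```python
import shlex
import subprocess


def build_cli_command(inputs, output_dir, formats, use_struct_tree=False, quiet=False):
    """Build the argument list for the opendataloader-pdf CLI."""
    command = [
        "opendataloader-pdf",
        *inputs,
        "--output-dir", output_dir,
        "--format", ",".join(formats),
    ]
    if use_struct_tree:
        command.append("--use-struct-tree")
    if quiet:
        command.append("--quiet")
    return command


def run_cli(command):
    """Run the command; check=True raises CalledProcessError on a non-zero exit."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    cmd = build_cli_command(
        ["file1.pdf", "folder/"], "output/", ["json", "markdown"], quiet=True
    )
    # Print the command instead of running it, so the sketch works without the CLI.
    print(shlex.join(cmd))
```

Passing a list to subprocess.run avoids shell quoting problems with file names that contain spaces.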

Output Formats

Format	Use Case
`markdown`	Clean text for LLM context, RAG chunks
`json`	Structured data with bounding boxes
`html`	Web display with styling
`pdf`	Annotated PDF (visual debugging)

You can combine multiple formats:

opendataloader_pdf.convert(
    input_path=["report.pdf"],
    output_dir="output/",
    format="json,markdown"  # Output both formats
)
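Since the markdown format is aimed at RAG chunks, a natural next step is splitting the generated file into sections. The following is a minimal sketch, assuming the Markdown output marks headings with leading # characters (an assumption worth checking against your own output); split_markdown_by_heading is a hypothetical helper, not a library function.

```python
import re

# Matches "# Title" through "###### Title" and captures the heading text.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_markdown_by_heading(markdown_text):
    """Split Markdown into chunks, starting a new chunk at every heading."""
    sections = [(None, [])]  # text before the first heading has heading=None
    in_fence = False
    for line in markdown_text.splitlines():
        # Lines starting with "#" inside fenced code are not headings.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            sections.append((match.group(2), []))
        else:
            sections[-1][1].append(line)

    chunks = []
    for heading, lines in sections:
        text = "\n".join(lines).strip()
        if heading is not None or text:
            chunks.append({"heading": heading, "text": text})
    return chunks
```

Each chunk keeps its heading alongside the body text, which is handy as retrieval metadata. For long sections you would still need a size-based split on top of this.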

Common Options

Here are the commonly used options:

import opendataloader_pdf

opendataloader_pdf.convert(
    input_path=["document.pdf"],
    output_dir="output/",
    format="json,markdown",
    use_struct_tree=True,      # Use native PDF structure tags
    image_output="embedded",   # Embed images as Base64
    quiet=True,                # Suppress logging
)

Option	Description
`use_struct_tree`	Use native PDF structure tags for better accuracy
`image_output`	How to handle images: `embedded` (Base64), `file`, or `skip`
`quiet`	Suppress console output
`ocr_language`	OCR language for scanned PDFs (e.g., `eng`, `chi_sim`)

Complete Python Example: Extract and Process

Here is a complete example that converts a PDF and processes the JSON output:

import opendataloader_pdf
import json

# Step 1: Convert PDF
opendataloader_pdf.convert(
    input_path=["report.pdf"],
    output_dir="output/",
    format="json,markdown",
    quiet=True
)

# Step 2: Read JSON output
with open("output/report.json", encoding="utf-8") as f:
    doc = json.load(f)

# Step 3: Extract elements
for element in doc["kids"]:
    print(f"Type: {element['type']}")
    print(f"Page: {element.get('page number')}")
    print(f"Content: {element.get('content', '')[:100]}...")

The JSON output contains structured data with element types, page numbers, and bounding boxes, making it easy to process programmatically.
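The loop above only visits top-level elements. If elements can contain their own "kids" (an assumption here, for example list items inside a list), a recursive walk reaches all of them. The sketch below groups text by page; iter_elements and text_by_page are illustrative names, and the sample dictionary is hand-written, not real converter output.

```python
from collections import defaultdict


def iter_elements(node):
    """Yield every element below `node`, depth-first, following nested "kids"."""
    for child in node.get("kids", []):
        yield child
        yield from iter_elements(child)


def text_by_page(doc):
    """Group non-empty "content" strings by their "page number"."""
    pages = defaultdict(list)
    for element in iter_elements(doc):
        content = element.get("content")
        if content:
            pages[element.get("page number")].append(content)
    return dict(pages)


# Hand-written sample that mimics the keys used in the example above.
sample = {
    "kids": [
        {"type": "heading", "page number": 1, "content": "Quarterly Report"},
        {"type": "list", "page number": 1, "kids": [
            {"type": "list item", "page number": 1, "content": "Revenue grew"},
        ]},
        {"type": "paragraph", "page number": 2, "content": "Outlook"},
    ]
}

print(text_by_page(sample))
```

To use it on real output, replace the sample with the doc loaded from output/report.json.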

Troubleshooting

Java Not Found

If you see this error:

Error: Java executable not found

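Solution: Install JDK 11+ and ensure java is on your PATH. On macOS, you can use Homebrew (its openjdk formulae are keg-only, so follow the caveats printed after installation to make java available on your PATH):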

brew install openjdk@17

Slow Processing

If processing is slow, check if you are making multiple convert() calls:

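import opendataloader_pdf

files = ["f1.pdf", "f2.pdf", "f3.pdf"]

# BAD: Multiple calls, each starts a new JVM
for f in files:
    opendataloader_pdf.convert(input_path=[f], output_dir="output/", format="markdown")

# GOOD: Single call, one JVM
opendataloader_pdf.convert(input_path=files, output_dir="output/", format="markdown")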

Permission Denied

If you get permission errors, ensure the output directory is writable:

ls -la output/
chmod 755 output/
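The same check can be done from Python before calling convert(), which is useful on Windows where chmod is not available. This is a small sketch using only the standard library; ensure_writable_dir is a name made up for this example.

```python
import tempfile
from pathlib import Path


def ensure_writable_dir(path: str) -> bool:
    """Create `path` if needed and confirm a file can actually be written there."""
    directory = Path(path)
    try:
        directory.mkdir(parents=True, exist_ok=True)
        # Creating a real temporary file is a stronger check than os.access().
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
    except OSError:
        # Covers permission errors and the case where `path` is an existing file.
        return False
    return True


if __name__ == "__main__":
    print("output/ is writable:", ensure_writable_dir("output/"))
```

The temporary file is deleted automatically when the with block exits, so the check leaves nothing behind in the output directory.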

Summary

In this post, I showed how to use OpenDataLoader PDF to extract text from PDFs in Python, Node.js, and Java. The key points:

Install with pip install -U opendataloader-pdf (Python) or npm install @opendataloader/pdf (Node.js)
Requires Java 11+ because the core engine is Java-based
Use the convert() function to process PDFs
Batch all files in one call to avoid repeated JVM startup overhead
Output formats include Markdown, JSON, HTML, and annotated PDF

For complex documents with tables and multi-column layouts, OpenDataLoader PDF handles them well. If you need more advanced features like OCR for scanned PDFs or hybrid mode for difficult documents, check the documentation on the GitHub repository.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.

Here are the most important links from this article, along with some further resources on this topic:

👨‍💻 OpenDataLoader PDF GitHub Repository
👨‍💻 Adoptium - Download JDK 11+
👨‍💻 Python pip documentation
👨‍💻 npm documentation

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!