How to Use OpenDataLoader PDF: Quick Start Guide for Python, Node.js, and Java

Purpose

This post demonstrates how to get started with OpenDataLoader PDF for extracting text from PDFs programmatically in Python, Node.js, and Java.

I needed to extract text from PDF files for a RAG (Retrieval-Augmented Generation) project. I tried several PDF parsing libraries, but most of them struggled with complex layouts, tables, and multi-column documents. Then I found OpenDataLoader PDF, which converts PDFs to Markdown, JSON, or HTML formats with excellent accuracy.

Environment

  • Python 3.10+
  • Node.js 20.19.0+ (for Node.js wrapper)
  • Java 11+ (required for the core engine)
  • macOS / Linux / Windows

What is OpenDataLoader PDF?

OpenDataLoader PDF is a PDF parsing library that converts PDF documents into structured formats like Markdown, JSON, and HTML. It is written in Java and provides thin wrappers for Python and Node.js.

The key features:

  • Converts PDFs to Markdown, JSON, HTML, or annotated PDF
  • Preserves document structure (headings, tables, lists)
  • Extracts images with Base64 embedding support
  • Uses native PDF structure tags for better accuracy

Prerequisites: Java 11+

Before you start, make sure Java 11+ is installed. OpenDataLoader PDF requires it because the core parsing engine is written in Java; the Python and Node.js packages are thin wrappers around the Java CLI.

Check your Java version:

Check Java version
java -version

You should see output like:

openjdk version "17.0.2" 2022-01-18
OpenJDK Runtime Environment Temurin-17.0.2+8 (build 17.0.2+8)
OpenJDK 64-Bit Server VM Temurin-17.0.2+8 (build 17.0.2+8, mixed mode)

If Java is not installed, download JDK 11+ from Adoptium.
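
The same check can be scripted from Python before you call the library. This is a minimal sketch using only the standard library; the helper names parse_java_major and installed_java_major are my own and are not part of OpenDataLoader PDF.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Java 8 and older report themselves as "1.8.0_xxx", so use the second number.
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_major() -> Optional[int]:
    """Return the major version of the `java` found on PATH, or None."""
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` prints to stderr, so capture both streams.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr + result.stdout)


if __name__ == "__main__":
    major = installed_java_major()
    if major is None or major < 11:
        print("Java 11+ not found on PATH")
    else:
        print(f"Found Java {major}")
```

Keeping the parsing separate from the subprocess call makes the version logic easy to test without Java installed.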

Installation

Python

Install via pip
pip install -U opendataloader-pdf

Node.js

Install via npm
npm install @opendataloader/pdf

Java (Maven)

If you are using Java directly, add the Maven dependency:

pom.xml
<dependency>
  <groupId>org.opendataloader</groupId>
  <artifactId>opendataloader-pdf-core</artifactId>
  <version>1.0.0</version>
</dependency>

Basic Usage

Python Example

Here is a simple example to convert a PDF file to Markdown:

convert_pdf.py
import opendataloader_pdf

# Convert single file
opendataloader_pdf.convert(
    input_path=["document.pdf"],
    output_dir="output/",
    format="markdown",
)

After running the script:

Run the script
python convert_pdf.py

You will find the output in the output/ directory:

output/
└── document.md

Convert Multiple Files

You can convert multiple files at once:

convert_multiple.py
import opendataloader_pdf

# Convert multiple files
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    format="markdown,json",
)

Process Entire Directory

To process an entire directory recursively:

convert_directory.py
import opendataloader_pdf

# Process entire directory (recursive)
opendataloader_pdf.convert(
    input_path="documents/",
    output_dir="output/",
    format="json",
)

Performance Tip: Batch Your Files

Each convert() call spawns a new JVM process, which adds roughly 1-2 seconds of startup overhead per invocation. For best performance, pass all files in a single call:

batch_example.py
import opendataloader_pdf

# GOOD: One call, one JVM startup
opendataloader_pdf.convert(
    input_path=["f1.pdf", "f2.pdf", "f3.pdf", "folder/"],
    output_dir="output/",
    format="markdown",
)

# BAD: Multiple calls, multiple JVM startups (slow!)
# for f in files:
#     opendataloader_pdf.convert(input_path=[f], ...)

This is a significant performance consideration. If you have 100 PDF files, calling convert() once with all files is much faster than calling it 100 times.
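
To put rough numbers on this, the sketch below estimates startup overhead alone, using the 1-2 seconds per invocation mentioned above. The function name is hypothetical, and the actual conversion time (which is the same in both cases) is not modeled.

```python
def jvm_startup_overhead(num_calls: int, seconds_per_startup: float) -> float:
    """Total JVM startup overhead in seconds for a number of convert() calls."""
    return num_calls * seconds_per_startup


# 100 files, one convert() call per file, at 1-2 s per JVM startup
per_file_low = jvm_startup_overhead(100, 1.0)
per_file_high = jvm_startup_overhead(100, 2.0)

# 100 files batched into a single convert() call
batched_low = jvm_startup_overhead(1, 1.0)
batched_high = jvm_startup_overhead(1, 2.0)

print(f"per-file calls: {per_file_low:.0f}-{per_file_high:.0f} s of startup overhead")
print(f"single batched call: {batched_low:.0f}-{batched_high:.0f} s of startup overhead")
# prints:
# per-file calls: 100-200 s of startup overhead
# single batched call: 1-2 s of startup overhead
```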

Node.js Example

convert.ts
import { convert } from '@opendataloader/pdf';

await convert(['file1.pdf', 'file2.pdf', 'folder/'], {
  outputDir: 'output/',
  format: 'markdown,json',
});

Command Line Usage

You can also use the CLI directly:

CLI usage
# Batch all files in one call
opendataloader-pdf file1.pdf file2.pdf folder/ --output-dir output/ --format json,markdown

A CLI call with additional options:

CLI with additional options
opendataloader-pdf \
  file1.pdf file2.pdf folder/ \
  --output-dir output/ \
  --format json,markdown \
  --use-struct-tree \
  --quiet
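
If you drive the CLI from a Python script, it helps to build the argument list in one place so that everything goes into a single invocation. This sketch only uses the flags shown above (--output-dir, --format, --use-struct-tree, --quiet); build_cli_args is a hypothetical helper of mine, not something the package provides.

```python
from typing import List, Sequence


def build_cli_args(
    inputs: Sequence[str],
    output_dir: str,
    formats: Sequence[str],
    use_struct_tree: bool = False,
    quiet: bool = False,
) -> List[str]:
    """Build an argv list for one batched opendataloader-pdf invocation."""
    args = [
        "opendataloader-pdf",
        *inputs,
        "--output-dir", output_dir,
        "--format", ",".join(formats),
    ]
    if use_struct_tree:
        args.append("--use-struct-tree")
    if quiet:
        args.append("--quiet")
    return args


if __name__ == "__main__":
    argv = build_cli_args(
        ["file1.pdf", "folder/"], "output/", ["json", "markdown"], quiet=True
    )
    print(" ".join(argv))
    # To actually run it (needs the CLI and Java on PATH), import subprocess
    # and call subprocess.run(argv, check=True).
```

Passing a list to subprocess.run avoids shell quoting problems with file names that contain spaces.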

Output Formats

Format     Use Case
markdown   Clean text for LLM context, RAG chunks
json       Structured data with bounding boxes
html       Web display with styling
pdf        Annotated PDF (visual debugging)

You can combine multiple formats:

multi_format.py
import opendataloader_pdf

opendataloader_pdf.convert(
    input_path=["report.pdf"],
    output_dir="output/",
    format="json,markdown",  # Output both formats
)

Common Options

Here are the commonly used options:

options_example.py
import opendataloader_pdf

opendataloader_pdf.convert(
    input_path=["document.pdf"],
    output_dir="output/",
    format="json,markdown",
    use_struct_tree=True,     # Use native PDF structure tags
    image_output="embedded",  # Embed images as Base64
    quiet=True,               # Suppress logging
)

Option            Description
use_struct_tree   Use native PDF structure tags for better accuracy
image_output      How to handle images: embedded (Base64), file, or skip
quiet             Suppress console output
ocr_language      OCR language for scanned PDFs (e.g., eng, chi_sim)

Complete Python Example: Extract and Process

Here is a complete example that converts a PDF and processes the JSON output:

extract_and_process.py
import opendataloader_pdf
import json

# Step 1: Convert PDF
opendataloader_pdf.convert(
    input_path=["report.pdf"],
    output_dir="output/",
    format="json,markdown",
    quiet=True,
)

# Step 2: Read JSON output
with open("output/report.json", encoding="utf-8") as f:
    doc = json.load(f)

# Step 3: Extract elements
for element in doc["kids"]:
    print(f"Type: {element['type']}")
    print(f"Page: {element.get('page number')}")
    print(f"Content: {element.get('content', '')[:100]}...")

The JSON output contains structured data with element types, page numbers, and bounding boxes, making it easy to process programmatically.
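
The loop above only visits top-level elements. If elements can nest (for example, list items stored under their parent's own kids key, which I am assuming here rather than quoting from the schema), a recursive walk can collect text per page, which is handy for RAG chunking. The key names kids, type, page number, and content are the ones used above; iter_elements and text_by_page are hypothetical helpers.

```python
from collections import defaultdict
from typing import Any, Dict, Iterator, List


def iter_elements(node: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield every element below `node`, depth-first, in document order."""
    for child in node.get("kids", []):
        yield child
        # Assumption: nested elements also keep their children under "kids".
        yield from iter_elements(child)


def text_by_page(doc: Dict[str, Any]) -> Dict[int, List[str]]:
    """Group non-empty `content` strings by their `page number`."""
    pages: Dict[int, List[str]] = defaultdict(list)
    for element in iter_elements(doc):
        content = element.get("content")
        page = element.get("page number")
        if content and page is not None:
            pages[page].append(content)
    return dict(pages)


# Small hand-written document for illustration, not real library output.
sample = {
    "kids": [
        {"type": "heading", "page number": 1, "content": "Intro"},
        {"type": "list", "page number": 1, "kids": [
            {"type": "list item", "page number": 1, "content": "First"},
        ]},
        {"type": "paragraph", "page number": 2, "content": "Body"},
    ]
}
print(text_by_page(sample))
# prints: {1: ['Intro', 'First'], 2: ['Body']}
```

With a real file, pass the doc loaded from output/report.json instead of sample.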

Troubleshooting

Java Not Found

If you see this error:

Error: Java executable not found

Solution: Install JDK 11+ and ensure java is on your PATH. On macOS, you can use:

Install Java on macOS
brew install openjdk@17

Slow Processing

If processing is slow, check if you are making multiple convert() calls:

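import opendataloader_pdf

files = ["f1.pdf", "f2.pdf", "f3.pdf"]
# BAD: Multiple calls, each starts a new JVM
# for f in files:
#     opendataloader_pdf.convert(input_path=[f], output_dir="output/", format="markdown")
# GOOD: Single call, one JVM
opendataloader_pdf.convert(input_path=files, output_dir="output/", format="markdown")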

Permission Denied

If you get permission errors, ensure the output directory is writable:

Check permissions
ls -la output/
chmod 755 output/
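
You can also catch this from Python before calling convert(). A minimal sketch with the standard library only; ensure_writable_dir is a hypothetical helper name.

```python
import os
from pathlib import Path


def ensure_writable_dir(path: str) -> Path:
    """Create `path` if needed and raise PermissionError if it is not writable."""
    directory = Path(path)
    # parents=True creates missing parent folders; exist_ok tolerates an existing one.
    directory.mkdir(parents=True, exist_ok=True)
    # Creating files in a directory needs write and execute permission on it.
    if not os.access(directory, os.W_OK | os.X_OK):
        raise PermissionError(f"Output directory is not writable: {directory}")
    return directory


if __name__ == "__main__":
    out = ensure_writable_dir("output")
    print(f"OK to write to {out.resolve()}")
```

Failing early with a clear message is cheaper than waiting for a JVM startup only to hit the same error.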

Summary

In this post, I showed how to use OpenDataLoader PDF to extract text from PDFs in Python, Node.js, and Java. The key points:

  1. Install with pip install -U opendataloader-pdf (Python) or npm install @opendataloader/pdf (Node.js)
  2. Requires Java 11+ because the core engine is Java-based
  3. Use the convert() function to process PDFs
  4. Batch all files in one call to avoid repeated JVM startup overhead
  5. Output formats include Markdown, JSON, HTML, and annotated PDF

OpenDataLoader PDF handles complex documents with tables and multi-column layouts well. If you need more advanced features like OCR for scanned PDFs or hybrid mode for difficult documents, check the documentation on the GitHub repository.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
