How to Use OpenDataLoader PDF: Quick Start Guide for Python, Node.js, and Java
Purpose
This post demonstrates how to get started with OpenDataLoader PDF for extracting text from PDFs programmatically in Python, Node.js, and Java.
I needed to extract text from PDF files for a RAG (Retrieval-Augmented Generation) project. I tried several PDF parsing libraries, but most of them struggled with complex layouts, tables, and multi-column documents. Then I found OpenDataLoader PDF, which converts PDFs to Markdown, JSON, or HTML formats with excellent accuracy.
Environment
- Python 3.10+
- Node.js 20.19.0+ (for Node.js wrapper)
- Java 11+ (required for the core engine)
- macOS / Linux / Windows
What is OpenDataLoader PDF?
OpenDataLoader PDF is a PDF parsing library that converts PDF documents into structured formats like Markdown, JSON, and HTML. It is written in Java and provides thin wrappers for Python and Node.js.
The key features:
- Converts PDFs to Markdown, JSON, HTML, or annotated PDF
- Preserves document structure (headings, tables, lists)
- Extracts images with Base64 embedding support
- Uses native PDF structure tags for better accuracy
Prerequisites: Java 11+
Before you start, make sure Java 11+ is installed. The core parsing engine is written in Java, and the Python and Node.js packages are thin wrappers around the Java CLI.
Check your Java version:

    java -version

You should see output like:

    openjdk version "17.0.2" 2022-01-18
    OpenJDK Runtime Environment Temurin-17.0.2+8 (build 17.0.2+8)
    OpenJDK 64-Bit Server VM Temurin-17.0.2+8 (build 17.0.2+8, mixed mode)

If Java is not installed, download JDK 11+ from Adoptium.
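The same check can be done from Python before calling the library. The sketch below is my own helper, not part of OpenDataLoader PDF: `parse_java_major` and `installed_java_major` are hypothetical names, and the regular expression assumes the usual `version "..."` line in the output shown above.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Legacy scheme: Java 8 reports itself as "1.8.0_...".
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_major() -> Optional[int]:
    """Return the major version of the `java` on PATH, or None if unavailable."""
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` usually writes to stderr, so read both streams.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr + result.stdout)
```

Calling `installed_java_major()` at startup and stopping when the result is `None` or below 11 gives a clearer message than a failure deep inside a conversion.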
Installation
Python

    pip install -U opendataloader-pdf

Node.js

    npm install @opendataloader/pdf

Java (Maven)
If you are using Java directly, add the Maven dependency:

    <dependency>
        <groupId>org.opendataloader</groupId>
        <artifactId>opendataloader-pdf-core</artifactId>
        <version>1.0.0</version>
    </dependency>

Basic Usage
Python Example
Here is a simple example to convert a PDF file to Markdown:

    import opendataloader_pdf

    # Convert single file
    opendataloader_pdf.convert(
        input_path=["document.pdf"],
        output_dir="output/",
        format="markdown"
    )

Save the script as convert_pdf.py and run it:

    python convert_pdf.py

You will find the output in the output/ directory:

    output/
    └── document.md

Convert Multiple Files
You can convert multiple files at once:

    import opendataloader_pdf

    # Convert multiple files
    opendataloader_pdf.convert(
        input_path=["file1.pdf", "file2.pdf", "folder/"],
        output_dir="output/",
        format="markdown,json"
    )

Process Entire Directory
To process an entire directory recursively:

    import opendataloader_pdf

    # Process entire directory (recursive)
    opendataloader_pdf.convert(
        input_path="documents/",
        output_dir="output/",
        format="json"
    )

Performance Tip: Batch Your Files
Each convert() call spawns a JVM process, so every invocation adds roughly 1-2 seconds of startup overhead. For best performance, batch all files in one call:

    import opendataloader_pdf

    # GOOD: One call, one JVM startup
    opendataloader_pdf.convert(
        input_path=["f1.pdf", "f2.pdf", "f3.pdf", "folder/"],
        output_dir="output/",
        format="markdown"
    )

    # BAD: Multiple calls, multiple JVM startups (slow!)
    # for f in files:
    #     opendataloader_pdf.convert(input_path=[f], ...)

This adds up quickly. With 100 PDF files, calling convert() once per file costs roughly 100-200 seconds in JVM startup alone, while a single call with all files pays that cost once.
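When the file list is built dynamically (for example, filtered by name or date), one way to keep a single JVM startup is to collect the paths first and pass them all at once. This is a minimal sketch: `collect_pdfs` and `convert_all` are hypothetical helper names of mine, and only the `convert()` arguments already shown in this post are used.

```python
from pathlib import Path


def collect_pdfs(root: str) -> list[str]:
    """Return every .pdf under `root` (recursively), sorted for a stable order."""
    return sorted(
        str(p)
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def convert_all(root: str, output_dir: str = "output/") -> int:
    """Convert all PDFs under `root` in one convert() call; return the file count."""
    # Imported lazily so collect_pdfs() works without the package installed.
    import opendataloader_pdf

    files = collect_pdfs(root)
    if files:
        # One call means one JVM startup for the whole batch.
        opendataloader_pdf.convert(
            input_path=files,
            output_dir=output_dir,
            format="markdown",
        )
    return len(files)
```

If no filtering is needed, passing the folder itself as `input_path` is simpler and has the same effect.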
Node.js Example
    import { convert } from '@opendataloader/pdf';

    await convert(['file1.pdf', 'file2.pdf', 'folder/'], {
      outputDir: 'output/',
      format: 'markdown,json'
    });

Command Line Usage
You can also use the CLI directly:

    # Batch all files in one call
    opendataloader-pdf file1.pdf file2.pdf folder/ --output-dir output/ --format json,markdown

Full CLI options:

    opendataloader-pdf \
      file1.pdf file2.pdf folder/ \
      --output-dir output/ \
      --format json,markdown \
      --use-struct-tree \
      --quiet

Output Formats
| Format | Use Case |
|---|---|
| markdown | Clean text for LLM context, RAG chunks |
| json | Structured data with bounding boxes |
| html | Web display with styling |
| pdf | Annotated PDF (visual debugging) |
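Because the format is passed as a comma-separated string, a typo is easy to miss. A small guard like the one below can catch it before the JVM starts. This is only a sketch: `build_format_arg` is a hypothetical helper, and the list of names is simply copied from the table above, not read from the library.

```python
# Format names taken from the table above (an assumption, not queried from the library).
SUPPORTED_FORMATS = ("markdown", "json", "html", "pdf")


def build_format_arg(formats: list[str]) -> str:
    """Validate format names and join them into a comma-separated string."""
    cleaned: list[str] = []
    for name in formats:
        key = name.strip().lower()
        if key not in SUPPORTED_FORMATS:
            raise ValueError(f"Unsupported format: {name!r}")
        if key not in cleaned:  # drop duplicates, keep first-seen order
            cleaned.append(key)
    if not cleaned:
        raise ValueError("At least one format is required")
    return ",".join(cleaned)
```

The returned string can be passed straight to the `format` argument of `convert()`.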
You can combine multiple formats:

    import opendataloader_pdf

    opendataloader_pdf.convert(
        input_path=["report.pdf"],
        output_dir="output/",
        format="json,markdown"  # Output both formats
    )

Common Options
Here are the commonly used options:

    import opendataloader_pdf

    opendataloader_pdf.convert(
        input_path=["document.pdf"],
        output_dir="output/",
        format="json,markdown",
        use_struct_tree=True,     # Use native PDF structure tags
        image_output="embedded",  # Embed images as Base64
        quiet=True,               # Suppress logging
    )

| Option | Description |
|---|---|
| use_struct_tree | Use native PDF structure tags for better accuracy |
| image_output | How to handle images: embedded (Base64), file, or skip |
| quiet | Suppress console output |
| ocr_language | OCR language for scanned PDFs (e.g., eng, chi_sim) |
Complete Python Example: Extract and Process
Here is a complete example that converts a PDF and processes the JSON output:

    import opendataloader_pdf
    import json

    # Step 1: Convert PDF
    opendataloader_pdf.convert(
        input_path=["report.pdf"],
        output_dir="output/",
        format="json,markdown",
        quiet=True
    )

    # Step 2: Read JSON output
    with open("output/report.json", encoding="utf-8") as f:
        doc = json.load(f)

    # Step 3: Extract elements
    for element in doc["kids"]:
        print(f"Type: {element['type']}")
        print(f"Page: {element.get('page number')}")
        print(f"Content: {element.get('content', '')[:100]}...")

The JSON output contains structured data with element types, page numbers, and bounding boxes, making it easy to process programmatically.
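The loop above only visits top-level elements. For a RAG pipeline it is often handy to walk the whole tree and group text by page. The sketch below assumes that nested elements reuse the same "kids" key as the document root and that "page number" and "content" are the field names, as in the example above. `iter_elements` and `text_by_page` are hypothetical helpers of mine, so check them against your actual JSON before relying on them.

```python
from typing import Any, Iterator


def iter_elements(node: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield every element below `node`, depth-first, following nested "kids"."""
    for child in node.get("kids", []):
        if not isinstance(child, dict):
            continue
        yield child
        # Assumption: children nest under the same "kids" key.
        yield from iter_elements(child)


def text_by_page(doc: dict[str, Any]) -> dict[int, list[str]]:
    """Group non-empty "content" strings by their "page number"."""
    pages: dict[int, list[str]] = {}
    for element in iter_elements(doc):
        content = element.get("content")
        page = element.get("page number")
        if content and page is not None:
            pages.setdefault(page, []).append(content)
    return pages
```

With the `doc` loaded in Step 2, `text_by_page(doc)` returns a dictionary that maps each page number to its text fragments in reading order, which is a convenient starting point for chunking.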
Troubleshooting
Java Not Found
If you see this error:

    Error: Java executable not found

Solution: Install JDK 11+ and ensure java is on your PATH. On macOS, you can use:

    brew install openjdk@17

Slow Processing
If processing is slow, check if you are making multiple convert() calls:

    # BAD: Multiple calls, each starts a new JVM
    for f in files:
        opendataloader_pdf.convert(input_path=[f], ...)

    # GOOD: Single call, one JVM
    opendataloader_pdf.convert(input_path=files, ...)

Permission Denied
If you get permission errors, ensure the output directory is writable:

    ls -la output/
    chmod 755 output/

Summary
In this post, I showed how to use OpenDataLoader PDF to extract text from PDFs in Python, Node.js, and Java. The key points:
- Install with pip install -U opendataloader-pdf (Python) or npm install @opendataloader/pdf (Node.js)
- Requires Java 11+ because the core engine is Java-based
- Use the convert() function to process PDFs
- Batch all files in one call to avoid repeated JVM startup overhead
- Output formats include Markdown, JSON, HTML, and annotated PDF
OpenDataLoader PDF handles complex documents with tables and multi-column layouts well. If you need more advanced features like OCR for scanned PDFs or hybrid mode for difficult documents, check the documentation on the GitHub repository.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 OpenDataLoader PDF GitHub Repository
- 👨💻 Adoptium - Download JDK 11+
- 👨💻 Python pip documentation
- 👨💻 npm documentation
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!