OpenDataLoader vs Docling vs Marker vs PyMuPDF4LLM: PDF Parsing Benchmark Comparison

Jun 4, 2026

Purpose

When choosing a PDF parser for your RAG pipeline, you have many open-source options. This post compares the top contenders using published benchmark data across 200 real-world PDFs.

Overall Accuracy

Engine	Overall	Reading Order	Table	Heading	Speed	License
OpenDataLoader hybrid	0.907	0.934	0.928	0.821	0.463s/p	Apache-2.0
nutrient	0.885	0.925	0.708	0.819	0.008s/p	Commercial
docling	0.882	0.898	0.887	0.824	0.762s/p	MIT
marker	0.861	0.890	0.808	0.796	53.9s/p	GPL-3.0
unstructured hi_res	0.841	0.904	0.588	0.749	3.0s/p	Apache-2.0
OpenDataLoader local	0.831	0.902	0.489	0.739	0.015s/p	Apache-2.0
MinerU	0.831	0.857	0.873	0.743	6.0s/p	AGPL-3.0
PyMuPDF4LLM	0.732	0.885	0.401	0.412	0.091s/p	AGPL-3.0

Key Findings

OpenDataLoader hybrid leads overall at 0.907. The gap is widest in table extraction (0.928 vs 0.887 for docling, +4.6%) and reading order (0.934 vs 0.898 for docling, +3.6%).

Speed vs accuracy: OpenDataLoader local mode is the fastest open-source option at 0.015s/page (66 pages/s). Hybrid mode at 0.463s/page is competitive with docling (0.762s/page) while being more accurate. marker is 100x slower at 53.9s/page with lower accuracy.

What Benchmarks Don’t Capture

Bounding boxes: OpenDataLoader provides them for every element by default. No other open-source parser does this.
AI safety: Built-in prompt injection protection (hidden text, zero-size fonts, off-page content).
Accessibility: Auto-tagging to Tagged PDF is unique to OpenDataLoader.
License: Apache 2.0 is fully permissive vs GPL-3.0 (marker) and AGPL-3.0 (MinerU, PyMuPDF4LLM).
No GPU: OpenDataLoader runs on CPU; marker requires GPU.

Choosing Based on This Data

import opendataloader_pdf

# Local mode (fastest)
opendataloader_pdf.convert(input_path=["doc.pdf"], output_dir="output/", format="markdown")

# Hybrid mode (most accurate)
opendataloader_pdf.convert(input_path=["doc.pdf"], output_dir="output/", format="markdown", hybrid="docling-fast")

Summary

In this post, I compared the top open-source PDF parsers using published benchmarks. The key point is that OpenDataLoader hybrid leads in accuracy (0.907), is the only option with bounding boxes for every element, includes AI safety, offers PDF auto-tagging, and runs without GPU under Apache 2.0.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 OpenDataLoader PDF Benchmark
👨‍💻 Docling by IBM
👨‍💻 Marker
👨‍💻 PyMuPDF4LLM

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!