OpenDataLoader vs Docling vs Marker vs PyMuPDF4LLM: PDF Parsing Benchmark Comparison
Purpose
When choosing a PDF parser for your RAG pipeline, you have many open-source options. This post compares the top contenders, plus one commercial engine (nutrient) for reference, using published benchmark data across 200 real-world PDFs.
Overall Accuracy
| Engine | Overall | Reading Order | Table | Heading | Speed (s/page) | License |
|---|---|---|---|---|---|---|
| OpenDataLoader hybrid | 0.907 | 0.934 | 0.928 | 0.821 | 0.463 | Apache-2.0 |
| nutrient | 0.885 | 0.925 | 0.708 | 0.819 | 0.008 | Commercial |
| docling | 0.882 | 0.898 | 0.887 | 0.824 | 0.762 | MIT |
| marker | 0.861 | 0.890 | 0.808 | 0.796 | 53.9 | GPL-3.0 |
| unstructured hi_res | 0.841 | 0.904 | 0.588 | 0.749 | 3.0 | Apache-2.0 |
| OpenDataLoader local | 0.831 | 0.902 | 0.489 | 0.739 | 0.015 | Apache-2.0 |
| MinerU | 0.831 | 0.857 | 0.873 | 0.743 | 6.0 | AGPL-3.0 |
| PyMuPDF4LLM | 0.732 | 0.885 | 0.401 | 0.412 | 0.091 | AGPL-3.0 |
Key Findings
OpenDataLoader hybrid leads overall at 0.907. Against docling, the closest open-source competitor, the gap is widest in table extraction (0.928 vs 0.887, +0.041) and reading order (0.934 vs 0.898, +0.036). Heading detection is effectively tied (0.821 vs 0.824 for docling).
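The gaps quoted above can be stated either as absolute score differences or as relative percentages, and the two are easy to mix up. The sketch below recomputes both from the table for OpenDataLoader hybrid vs docling. The `SCORES` dictionary and the `gaps` helper are illustrative names, not part of any benchmark tooling.

```python
# Scores copied from the benchmark table (OpenDataLoader hybrid vs docling).
SCORES = {
    "OpenDataLoader hybrid": {
        "overall": 0.907, "reading_order": 0.934, "table": 0.928, "heading": 0.821,
    },
    "docling": {
        "overall": 0.882, "reading_order": 0.898, "table": 0.887, "heading": 0.824,
    },
}


def gaps(engine_a, engine_b):
    """Return {metric: (absolute_gap, relative_gap_percent)} for engine_a vs engine_b."""
    result = {}
    for metric, score_a in SCORES[engine_a].items():
        score_b = SCORES[engine_b][metric]
        absolute = round(score_a - score_b, 3)
        relative = round(100 * (score_a - score_b) / score_b, 1)
        result[metric] = (absolute, relative)
    return result


if __name__ == "__main__":
    for metric, (abs_gap, rel_gap) in gaps("OpenDataLoader hybrid", "docling").items():
        print(f"{metric:14s} {abs_gap:+.3f} ({rel_gap:+.1f}%)")
```

A negative value means docling scores higher on that metric, which is the case for heading detection by a small margin.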
Speed vs accuracy: OpenDataLoader local mode is the fastest open-source option at 0.015s/page (about 67 pages/s); only the commercial nutrient engine is faster (0.008s/page). Hybrid mode at 0.463s/page is faster than docling (0.762s/page) while being more accurate. marker is over 100x slower than hybrid mode at 53.9s/page, with lower accuracy.
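To make the per-page timings more tangible, here is a small sketch that converts them into throughput, slowdown ratios, and a rough batch estimate. It assumes pages are processed one after another at the benchmark's measured rate, which will not hold exactly on different hardware or with parallelism. The function names are made up for this example.

```python
# Seconds per page, copied from the benchmark table.
SECONDS_PER_PAGE = {
    "OpenDataLoader local": 0.015,
    "OpenDataLoader hybrid": 0.463,
    "docling": 0.762,
    "marker": 53.9,
}


def pages_per_second(engine):
    """Throughput is the inverse of the per-page time."""
    return 1.0 / SECONDS_PER_PAGE[engine]


def slowdown(engine, baseline):
    """How many times slower `engine` is than `baseline`."""
    return SECONDS_PER_PAGE[engine] / SECONDS_PER_PAGE[baseline]


def batch_minutes(engine, pages):
    """Naive sequential estimate for a batch of `pages` pages, in minutes."""
    return SECONDS_PER_PAGE[engine] * pages / 60


if __name__ == "__main__":
    print(f"local: {pages_per_second('OpenDataLoader local'):.1f} pages/s")
    print(f"marker vs hybrid: {slowdown('marker', 'OpenDataLoader hybrid'):.0f}x slower")
    for name in SECONDS_PER_PAGE:
        print(f"{name}: {batch_minutes(name, 10_000):.1f} min for 10,000 pages")
```

Under that sequential assumption, a 10,000-page corpus takes about 2.5 minutes in local mode and about 77 minutes in hybrid mode.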
What Benchmarks Don’t Capture
- Bounding boxes: OpenDataLoader provides them for every element by default. No other open-source parser does this.
- AI safety: Built-in prompt injection protection (hidden text, zero-size fonts, off-page content).
- Accessibility: Auto-tagging to Tagged PDF is unique to OpenDataLoader.
- License: Apache 2.0 is fully permissive vs GPL-3.0 (marker) and AGPL-3.0 (MinerU, PyMuPDF4LLM).
- No GPU: OpenDataLoader runs on CPU; marker requires GPU.
Choosing Based on This Data
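One way to turn the table into a decision is to filter by your constraints first (speed budget, license) and then take the best score among what remains. The sketch below does that with the numbers from the table. `Engine`, `PERMISSIVE`, and `pick_engine` are hypothetical names for illustration, and treating only Apache-2.0 and MIT as permissive is an assumption that fits the licenses listed here.

```python
from typing import NamedTuple, Optional


class Engine(NamedTuple):
    name: str
    overall: float
    table: float
    seconds_per_page: float
    license: str


# Rows copied from the benchmark table (overall, table score, speed, license).
ENGINES = [
    Engine("OpenDataLoader hybrid", 0.907, 0.928, 0.463, "Apache-2.0"),
    Engine("nutrient", 0.885, 0.708, 0.008, "Commercial"),
    Engine("docling", 0.882, 0.887, 0.762, "MIT"),
    Engine("marker", 0.861, 0.808, 53.9, "GPL-3.0"),
    Engine("unstructured hi_res", 0.841, 0.588, 3.0, "Apache-2.0"),
    Engine("OpenDataLoader local", 0.831, 0.489, 0.015, "Apache-2.0"),
    Engine("MinerU", 0.831, 0.873, 6.0, "AGPL-3.0"),
    Engine("PyMuPDF4LLM", 0.732, 0.401, 0.091, "AGPL-3.0"),
]

# Assumption: only these two licenses count as permissive for this example.
PERMISSIVE = {"Apache-2.0", "MIT"}


def pick_engine(max_seconds_per_page, metric="overall", permissive_only=True) -> Optional[Engine]:
    """Return the highest-scoring engine within the speed budget, or None."""
    candidates = [
        e for e in ENGINES
        if e.seconds_per_page <= max_seconds_per_page
        and (not permissive_only or e.license in PERMISSIVE)
    ]
    return max(candidates, key=lambda e: getattr(e, metric), default=None)


if __name__ == "__main__":
    print(pick_engine(1.0))    # budget of 1 s/page
    print(pick_engine(0.05))   # tight budget of 50 ms/page
```

With a budget of 1s/page and a permissive license, this selects hybrid mode; with a 50ms/page budget it falls back to local mode. Both modes are invoked as follows: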
    import opendataloader_pdf

    # Local mode (fastest)
    opendataloader_pdf.convert(input_path=["doc.pdf"], output_dir="output/", format="markdown")

    # Hybrid mode (most accurate)
    opendataloader_pdf.convert(input_path=["doc.pdf"], output_dir="output/", format="markdown", hybrid="docling-fast")

Summary
In this post, I compared the top open-source PDF parsers using published benchmarks. The key point is that OpenDataLoader hybrid leads in overall accuracy (0.907), is the only option with bounding boxes for every element, includes built-in prompt injection protection, offers PDF auto-tagging, and runs without a GPU under Apache 2.0.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- OpenDataLoader PDF Benchmark
- Docling by IBM
- Marker
- PyMuPDF4LLM
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!