
OpenDataLoader vs Docling vs Marker vs PyMuPDF4LLM: PDF Parsing Benchmark Comparison

Purpose

When choosing a PDF parser for your RAG pipeline, you have many open-source options. This post compares the top contenders using published benchmark data across 200 real-world PDFs.

Overall Accuracy

| Engine | Overall | Reading Order | Table | Heading | Speed (s/page) | License |
|---|---|---|---|---|---|---|
| OpenDataLoader hybrid | 0.907 | 0.934 | 0.928 | 0.821 | 0.463 | Apache-2.0 |
| nutrient | 0.885 | 0.925 | 0.708 | 0.819 | 0.008 | Commercial |
| docling | 0.882 | 0.898 | 0.887 | 0.824 | 0.762 | MIT |
| marker | 0.861 | 0.890 | 0.808 | 0.796 | 53.9 | GPL-3.0 |
| unstructured hi_res | 0.841 | 0.904 | 0.588 | 0.749 | 3.0 | Apache-2.0 |
| OpenDataLoader local | 0.831 | 0.902 | 0.489 | 0.739 | 0.015 | Apache-2.0 |
| MinerU | 0.831 | 0.857 | 0.873 | 0.743 | 6.0 | AGPL-3.0 |
| PyMuPDF4LLM | 0.732 | 0.885 | 0.401 | 0.412 | 0.091 | AGPL-3.0 |
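To make the table easier to query, here is a minimal sketch that stores the rows as plain Python records and looks up the leader for each accuracy column. The `Result` type and the `leader` helper are names made up for this illustration; only the numbers come from the benchmark table.

```python
from typing import NamedTuple


class Result(NamedTuple):
    engine: str
    overall: float
    reading_order: float
    table: float
    heading: float
    sec_per_page: float
    license: str


# Scores copied from the benchmark table above.
RESULTS = [
    Result("OpenDataLoader hybrid", 0.907, 0.934, 0.928, 0.821, 0.463, "Apache-2.0"),
    Result("nutrient", 0.885, 0.925, 0.708, 0.819, 0.008, "Commercial"),
    Result("docling", 0.882, 0.898, 0.887, 0.824, 0.762, "MIT"),
    Result("marker", 0.861, 0.890, 0.808, 0.796, 53.9, "GPL-3.0"),
    Result("unstructured hi_res", 0.841, 0.904, 0.588, 0.749, 3.0, "Apache-2.0"),
    Result("OpenDataLoader local", 0.831, 0.902, 0.489, 0.739, 0.015, "Apache-2.0"),
    Result("MinerU", 0.831, 0.857, 0.873, 0.743, 6.0, "AGPL-3.0"),
    Result("PyMuPDF4LLM", 0.732, 0.885, 0.401, 0.412, 0.091, "AGPL-3.0"),
]

METRICS = ("overall", "reading_order", "table", "heading")


def leader(metric: str) -> Result:
    """Return the engine with the highest score for one accuracy metric."""
    return max(RESULTS, key=lambda r: getattr(r, metric))


if __name__ == "__main__":
    for metric in METRICS:
        top = leader(metric)
        print(f"{metric:14s} {top.engine} ({getattr(top, metric):.3f})")
```

Running it shows that the per-column leaders are not all the same engine: docling is marginally ahead on heading detection (0.824 vs 0.821), while OpenDataLoader hybrid leads the other three columns.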

Key Findings

OpenDataLoader hybrid leads overall at 0.907. Against docling, the next-best open-source engine overall, the gap is widest in table extraction (0.928 vs 0.887, +0.041 or about 4.6% relative), followed by reading order (0.934 vs 0.898, +0.036 or about 4.0% relative).
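Score gaps like these can be quoted either as an absolute difference in score or as a relative improvement over the baseline, and the two are easy to mix up. The sketch below computes both for the docling comparison; `score_gap` is a helper name invented for this example.

```python
def score_gap(score: float, baseline: float) -> tuple[float, float]:
    """Return (absolute difference, relative improvement in percent)."""
    absolute = score - baseline
    relative_pct = absolute / baseline * 100
    return absolute, relative_pct


if __name__ == "__main__":
    comparisons = {
        "table": (0.928, 0.887),          # OpenDataLoader hybrid vs docling
        "reading order": (0.934, 0.898),  # OpenDataLoader hybrid vs docling
    }
    for name, (odl, docling) in comparisons.items():
        absolute, relative_pct = score_gap(odl, docling)
        print(f"{name}: +{absolute:.3f} absolute, +{relative_pct:.1f}% relative")
```

For table extraction this gives +0.041 absolute and about +4.6% relative; for reading order, +0.036 absolute and about +4.0% relative.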

Speed vs accuracy: OpenDataLoader local mode is the fastest open-source option at 0.015s/page (about 67 pages/s). Hybrid mode at 0.463s/page is faster than docling (0.762s/page) while also being more accurate. marker is roughly 116x slower than hybrid mode at 53.9s/page, with lower accuracy.
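The per-page timings translate directly into throughput and total processing time for a corpus. Here is a small sketch of that arithmetic; the function names and the 10,000-page corpus size are assumptions chosen for illustration, and real throughput will depend on hardware and document mix.

```python
def pages_per_second(sec_per_page: float) -> float:
    """Convert seconds per page into pages per second."""
    return 1.0 / sec_per_page


def slowdown(slow_sec_per_page: float, fast_sec_per_page: float) -> float:
    """How many times slower one engine is than another."""
    return slow_sec_per_page / fast_sec_per_page


def corpus_hours(pages: int, sec_per_page: float) -> float:
    """Estimated single-process wall time, in hours, for a corpus."""
    return pages * sec_per_page / 3600


if __name__ == "__main__":
    timings = {  # seconds per page, from the benchmark table
        "OpenDataLoader local": 0.015,
        "OpenDataLoader hybrid": 0.463,
        "docling": 0.762,
        "marker": 53.9,
    }
    for engine, spp in timings.items():
        print(
            f"{engine}: {pages_per_second(spp):.1f} pages/s, "
            f"{corpus_hours(10_000, spp):.2f} h for 10,000 pages"
        )
    print(f"marker vs hybrid: {slowdown(53.9, 0.463):.0f}x slower")
```

On a hypothetical 10,000-page corpus, the difference between 0.463s/page and 53.9s/page is the difference between a little over an hour and about six days of single-process runtime.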

What Benchmarks Don’t Capture

  • Bounding boxes: OpenDataLoader provides them for every element by default. No other open-source parser does this.
  • AI safety: OpenDataLoader has built-in prompt injection protection against hidden text, zero-size fonts, and off-page content.
  • Accessibility: Auto-tagging to Tagged PDF is unique to OpenDataLoader.
  • License: Apache 2.0 is fully permissive vs GPL-3.0 (marker) and AGPL-3.0 (MinerU, PyMuPDF4LLM).
  • No GPU: OpenDataLoader runs on CPU; marker can run on CPU but is built around deep-learning models that are designed for GPU acceleration.
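License and speed constraints can be applied to the benchmark numbers before looking at accuracy at all. The sketch below filters the table by an allowed-license set and a per-page time budget, then ranks what is left by overall score. The `shortlist` helper, the `PERMISSIVE` set, and the example budgets are assumptions for illustration; check the license terms against your own requirements.

```python
# (engine, overall score, seconds per page, license) from the benchmark table
ENGINES = [
    ("OpenDataLoader hybrid", 0.907, 0.463, "Apache-2.0"),
    ("nutrient", 0.885, 0.008, "Commercial"),
    ("docling", 0.882, 0.762, "MIT"),
    ("marker", 0.861, 53.9, "GPL-3.0"),
    ("unstructured hi_res", 0.841, 3.0, "Apache-2.0"),
    ("OpenDataLoader local", 0.831, 0.015, "Apache-2.0"),
    ("MinerU", 0.831, 6.0, "AGPL-3.0"),
    ("PyMuPDF4LLM", 0.732, 0.091, "AGPL-3.0"),
]

# Assumption: treat only these licenses as acceptable for the project.
PERMISSIVE = frozenset({"Apache-2.0", "MIT"})


def shortlist(max_sec_per_page, allowed_licenses=PERMISSIVE):
    """Engines within the license set and time budget, best overall first."""
    candidates = [
        e for e in ENGINES
        if e[3] in allowed_licenses and e[2] <= max_sec_per_page
    ]
    return sorted(candidates, key=lambda e: e[1], reverse=True)


if __name__ == "__main__":
    for budget in (1.0, 0.1):
        names = [name for name, *_ in shortlist(budget)]
        print(f"<= {budget}s/page, permissive license: {names}")
```

With a 1.0s/page budget the shortlist is OpenDataLoader hybrid, docling, and OpenDataLoader local; tightening the budget to 0.1s/page leaves only OpenDataLoader local.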

Choosing Based on This Data

If you choose OpenDataLoader
```python
import opendataloader_pdf

# Local mode (fastest)
opendataloader_pdf.convert(
    input_path=["doc.pdf"],
    output_dir="output/",
    format="markdown",
)

# Hybrid mode (most accurate)
opendataloader_pdf.convert(
    input_path=["doc.pdf"],
    output_dir="output/",
    format="markdown",
    hybrid="docling-fast",
)
```
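For a RAG pipeline you will usually convert a whole folder rather than a single file. The following is a rough sketch of a batch wrapper, assuming `convert` accepts exactly the keyword arguments shown in the calls above. `convert_folder` is a hypothetical helper, not part of the library, and the `convert` parameter exists only so the function can be exercised without the package installed.

```python
from pathlib import Path
from typing import Callable, List, Optional


def convert_folder(
    pdf_dir: str,
    output_dir: str,
    hybrid: Optional[str] = None,
    convert: Optional[Callable[..., object]] = None,
) -> List[str]:
    """Convert every *.pdf directly under pdf_dir to markdown.

    Returns the list of input paths that were passed to the converter.
    """
    if convert is None:
        # Imported lazily so the helper can be tested with a stand-in.
        import opendataloader_pdf

        convert = opendataloader_pdf.convert

    pdfs = sorted(str(p) for p in Path(pdf_dir).glob("*.pdf"))
    if not pdfs:
        return []  # nothing to do; avoid calling the converter with no input

    # Same keyword arguments as the single-file calls in this post.
    kwargs = {"input_path": pdfs, "output_dir": output_dir, "format": "markdown"}
    if hybrid is not None:
        kwargs["hybrid"] = hybrid  # e.g. "docling-fast", as used above

    convert(**kwargs)
    return pdfs
```

Calling `convert_folder("pdfs/", "output/")` would use local mode, and `convert_folder("pdfs/", "output/", hybrid="docling-fast")` would use hybrid mode; the folder names here are placeholders.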

Summary

In this post, I compared the top open-source PDF parsers using published benchmarks. The key point is that OpenDataLoader hybrid leads in overall accuracy (0.907), is the only open-source option with bounding boxes for every element by default, includes prompt injection protection, offers PDF auto-tagging, and runs without a GPU under Apache 2.0.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
