
How to Extract Tables from PDF with Highest Accuracy Using OpenDataLoader

Problem

PDF tables are notoriously difficult to extract. Borderless tables, merged cells, nested tables, and multi-page tables break most parsers. Many open-source tools score below 0.5 TEDS (Tree-Edit-Distance-based Similarity, where 1.0 means the extracted table matches the ground truth exactly) on complex tables.

Two Modes for Two Table Types

Local Mode (Simple Tables)

The default mode uses border analysis and text clustering. It works well for PDFs with clear borders and simple row/column structure. Speed: 60+ pages per second.

# Local mode - simple tables
import opendataloader_pdf

opendataloader_pdf.convert(
    input_path=["report.pdf"],
    output_dir="output/",
    format="json",
)
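
To inspect what local mode produced, the JSON output can be loaded and scanned for table nodes. The sketch below assumes the output is a nested structure of dicts and lists in which each element carries a "type" field and tables are marked "table". That schema, the key names, and the helper names find_tables and load_tables are assumptions for illustration, not documented behavior, so check them against a real output file first.

```python
import json
from pathlib import Path


def find_tables(node, type_key="type", table_value="table"):
    """Recursively collect every dict whose `type_key` equals `table_value`.

    The key name and value are assumptions about the JSON schema.
    """
    found = []
    if isinstance(node, dict):
        if node.get(type_key) == table_value:
            found.append(node)
        # Keep descending so tables nested inside other elements are found too.
        for value in node.values():
            found.extend(find_tables(value, type_key, table_value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_tables(item, type_key, table_value))
    return found


def load_tables(json_path):
    """Load one output file and return the table nodes found in it."""
    with Path(json_path).open(encoding="utf-8") as handle:
        return find_tables(json.load(handle))
```

A call such as load_tables("output/report.json") would then return the table nodes, assuming the converter names its output after the input file.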

Hybrid Mode (Complex Tables)

For borderless, nested, or multi-column tables, install the hybrid extra and start the AI backend.

# Install hybrid mode
pip install "opendataloader-pdf[hybrid]"

# Terminal 1: Start backend
opendataloader-pdf-hybrid --port 5002

# Terminal 2: Process with hybrid mode (Python)
import opendataloader_pdf

opendataloader_pdf.convert(
    input_path=["complex_report.pdf"],
    output_dir="output/",
    format="json",
    hybrid="docling-fast",
)
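
Because hybrid mode depends on the backend from Terminal 1, it can help to confirm that something is listening before starting a long conversion. This is a minimal sketch using only the standard library. It assumes the backend accepts TCP connections on 127.0.0.1 at the port passed to --port, and the helper name backend_is_reachable is made up for this example.

```python
import socket


def backend_is_reachable(host="127.0.0.1", port=5002, timeout=1.0):
    """Return True if something accepts TCP connections on host:port.

    This only proves a listener exists; it does not verify that the
    listener is actually the hybrid backend.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if not backend_is_reachable():
        print("Nothing is listening on port 5002; start the backend first.")
```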

TEDS scores: local mode 0.489, hybrid mode 0.928. That is a relative improvement of about 90% ((0.928 - 0.489) / 0.489 ≈ 0.90).

How Hybrid Mode Works

The TriageProcessor analyzes each page using lightweight heuristics: line/text ratio, grid patterns, and existing table detection. Pages with simple tables stay local. Pages with suspected complex tables are routed to the AI backend. Both paths run in parallel, and the results are merged in page order.
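
The routing idea can be sketched in a few lines. This is not OpenDataLoader's actual TriageProcessor. The PageStats fields, the needs_backend rule, and the 0.05 threshold are all hypothetical, chosen only to show how per-page heuristics, parallel execution, and an order-preserving merge fit together.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical per-page features; names and units are illustrative."""
    page_number: int
    line_text_ratio: float   # drawn ruling lines relative to text items
    has_grid: bool           # a closed grid of ruling lines was detected
    table_candidates: int    # table-like regions found by the local pass


def needs_backend(stats, min_ratio=0.05):
    """Route to the backend when a table is suspected but has no clear grid."""
    if stats.table_candidates == 0:
        return False
    return not stats.has_grid or stats.line_text_ratio < min_ratio


def process_pages(pages, local_fn, backend_fn, workers=4):
    """Run both paths concurrently and return results in page order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            p.page_number: pool.submit(
                backend_fn if needs_backend(p) else local_fn, p
            )
            for p in pages
        }
        # Sorting by page number restores document order regardless of
        # which path finished first.
        return [futures[n].result() for n in sorted(futures)]
```

Keying the futures by page number is what keeps the merge simple: each path can finish in any order, and the final list is still assembled page by page.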

Benchmark Context

Engine                   Table Accuracy
OpenDataLoader hybrid    0.928
docling                  0.887
MinerU                   0.873
marker                   0.808
nutrient                 0.708
unstructured hi_res      0.588
PyMuPDF4LLM              0.401

Summary

In this post, I showed how to extract tables from PDFs with OpenDataLoader. The key point is to use local mode for simple bordered tables and hybrid mode for complex or borderless tables. Hybrid mode reaches 0.928 TEDS, the highest table accuracy among the engines in the benchmark above.

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to contact me, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don't forget to support me by starring the repo on GitHub!
