How to Extract Tables from PDF with Highest Accuracy Using OpenDataLoader

Jun 4, 2026

Problem

PDF tables are notoriously difficult to extract. Borderless tables, merged cells, nested tables, and multi-page tables break most parsers. Most open-source tools achieve less than 0.5 TEDS (Tree-Edit-Distance-based Similarity, a score from 0 to 1 where 1 is a perfect match) on complex tables.
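For reference, TEDS compares the predicted table and the ground-truth table as HTML trees: it is 1 minus the tree edit distance, normalized by the node count of the larger tree. The sketch below shows only that final normalization step. The function name and arguments are illustrative, and computing the tree edit distance itself is left to a dedicated library.

```python
def teds_score(edit_distance: float, nodes_pred: int, nodes_true: int) -> float:
    """TEDS = 1 - EditDist(T_pred, T_true) / max(|T_pred|, |T_true|)."""
    largest = max(nodes_pred, nodes_true)
    if largest == 0:
        return 1.0  # two empty trees are identical
    return 1.0 - edit_distance / largest
```

A predicted table that needs 5 edits to match a 20-node reference tree would score 0.75 under this definition.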

Two Modes for Two Table Types

Local Mode (Simple Tables)

The default mode uses border analysis and text clustering. It works well for PDFs with clear borders and simple row/column structure. Speed: 60+ pages per second.

import opendataloader_pdf

opendataloader_pdf.convert(
    input_path=["report.pdf"],
    output_dir="output/",
    format="json"
)
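Once the conversion has finished, you will probably want to look at the result. The helper below is a minimal sketch that assumes the JSON files end up somewhere under the output directory. It does not assume anything about the JSON schema, and the function name is my own, not part of the library.

```python
import json
from pathlib import Path


def load_json_outputs(output_dir: str) -> dict:
    """Parse every .json file found under output_dir, keyed by file name."""
    results = {}
    for path in sorted(Path(output_dir).rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            results[path.name] = json.load(handle)
    return results
```

Calling load_json_outputs("output/") returns a dictionary you can explore interactively before writing any table-specific code.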

Hybrid Mode (Complex Tables)

For borderless, nested, or multi-column tables, install the hybrid extra and start the AI backend.

pip install "opendataloader-pdf[hybrid]"

opendataloader-pdf-hybrid --port 5002

import opendataloader_pdf

# Requires the hybrid backend started above to be running
opendataloader_pdf.convert(
    input_path=["complex_report.pdf"],
    output_dir="output/",
    format="json",
    hybrid="docling-fast"
)
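Hybrid conversion depends on the backend process being up, so a quick connectivity check before converting can save some confusion. This sketch uses only the standard library and assumes the backend listens on localhost at the port passed to --port. The helper name is hypothetical.

```python
import socket


def backend_is_reachable(host: str = "localhost", port: int = 5002,
                         timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False
```

This only confirms that something accepts connections on that port, not that it is the hybrid backend, so treat it as a first sanity check.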

TEDS scores: local mode 0.489, hybrid mode 0.928, a relative improvement of about 90%.
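The improvement figure is relative to the local-mode score, not a difference in percentage points. The arithmetic, using only the two scores quoted above:

```python
local_teds = 0.489
hybrid_teds = 0.928

# Relative gain of hybrid mode over local mode
relative_gain = (hybrid_teds - local_teds) / local_teds
print(f"{relative_gain:.1%}")  # prints 89.8%
```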

How Hybrid Mode Works

The TriageProcessor analyzes each page using lightweight heuristics: line/text ratio, grid patterns, and existing table detection. Pages with simple tables stay local, while pages with suspected complex tables are routed to the AI backend. Both paths run in parallel, and the results are merged in the original page order.
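To make the routing idea concrete, here is a simplified sketch of triage plus an order-preserving merge. It is not the actual TriageProcessor code: the PageStats fields, the threshold, and the function names are all assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical per-page features; not the library's real data model."""
    page_number: int
    ruling_lines: int      # drawn line segments found on the page
    text_blocks: int       # text chunks found on the page
    has_grid: bool         # ruling lines form a closed grid
    table_like_text: bool  # text is aligned in columns


def route(page: PageStats, min_line_text_ratio: float = 0.5) -> str:
    """Return "local" or "backend" for one page (illustrative threshold)."""
    ratio = page.ruling_lines / max(page.text_blocks, 1)
    if page.has_grid and ratio >= min_line_text_ratio:
        return "local"      # clearly bordered table
    if page.table_like_text:
        return "backend"    # table suspected but borders are missing
    return "local"          # no table suspected


def process(pages, local_fn, backend_fn):
    """Run both paths concurrently and return results in page order."""
    with ThreadPoolExecutor() as pool:
        futures = {
            p.page_number: pool.submit(
                backend_fn if route(p) == "backend" else local_fn, p
            )
            for p in pages
        }
    # Leaving the with-block waits for all tasks; sort keys to restore order.
    return [futures[n].result() for n in sorted(futures)]
```

Keying the futures by page number is what lets slow backend pages finish late without scrambling the final output.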

Benchmark Context

Engine	Table Accuracy (TEDS)
OpenDataLoader hybrid	0.928
docling	0.887
MinerU	0.873
marker	0.808
nutrient	0.708
unstructured hi_res	0.588
PyMuPDF4LLM	0.401

Summary

In this post, I showed how to extract tables from PDFs with OpenDataLoader. The key point is to use local mode for simple bordered tables and hybrid mode for complex or borderless tables. Hybrid mode reaches 0.928 TEDS, the highest table accuracy in the benchmark above.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 OpenDataLoader PDF GitHub Repository
👨‍💻 OpenDataLoader Hybrid Mode Design

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!