How to Extract Tables from PDF with Highest Accuracy Using OpenDataLoader
Problem
PDF tables are notoriously difficult to extract. Borderless tables, merged cells, nested tables, and multi-page tables break most parsers. Several open-source tools score below 0.5 TEDS (Tree-Edit-Distance-based Similarity, where 1.0 means a perfect match) on complex tables.
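For reference, TEDS is usually defined by comparing the predicted table and the ground-truth table as HTML trees. The formula below is that common definition, shown as a reminder; the benchmark quoted in this post may use a variant of it (for example a structure-only version), which is an assumption on my side.

```latex
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\left(|T_a|, |T_b|\right)}
```

Here $T_a$ and $T_b$ are the two table trees, $\mathrm{EditDist}$ is the tree edit distance between them, and $|T|$ is the number of nodes in a tree. A score of 1.0 means the trees are identical.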
Two Modes for Two Table Types
Local Mode (Simple Tables)
The default mode uses border analysis and text clustering. It works well for PDFs with clear borders and simple row/column structure. Speed: 60+ pages per second.
    import opendataloader_pdf

    opendataloader_pdf.convert(
        input_path=["report.pdf"],
        output_dir="output/",
        format="json",
    )

Hybrid Mode (Complex Tables)
For borderless, nested, or multi-column tables, install the hybrid extra and start the AI backend.
    pip install "opendataloader-pdf[hybrid]"
    opendataloader-pdf-hybrid --port 5002

With the backend running, pass the hybrid option to the converter:

    import opendataloader_pdf

    opendataloader_pdf.convert(
        input_path=["complex_report.pdf"],
        output_dir="output/",
        format="json",
        hybrid="docling-fast",
    )

TEDS scores: local mode 0.489, hybrid mode 0.928, a relative improvement of about 90%.
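As a quick sanity check on the improvement figure, the relative gain can be recomputed from the two TEDS scores quoted above. This is plain arithmetic and does not call the library; the variable names are my own.

```python
# TEDS scores quoted in this post
local_teds = 0.489
hybrid_teds = 0.928

# Relative improvement of hybrid mode over local mode
relative_gain = (hybrid_teds - local_teds) / local_teds

print(f"Relative improvement: {relative_gain:.1%}")  # Relative improvement: 89.8%
```

So the "+90%" figure is a relative gain, not a 90-point jump in the score itself.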
How Hybrid Mode Works
The TriageProcessor analyzes each page using lightweight heuristics - line/text ratio, grid patterns, existing table detection. Simple tables stay local. Pages with suspected complex tables are routed to the AI backend. Both paths run in parallel, and results merge preserving page order.
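To make the routing idea concrete, here is a minimal sketch of a triage step. It is not the actual TriageProcessor code: the `PageStats` fields, the thresholds, and the `process_local` / `process_backend` helpers are all hypothetical stand-ins that I made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical per-page features, not the library's real data model."""
    page_number: int
    ruling_lines: int   # drawn line segments found on the page
    text_blocks: int    # text chunks found on the page
    has_grid: bool      # ruling lines form a closed grid
    tables_found: int   # tables the local detector already found


def is_complex(page: PageStats, max_ratio: float = 0.2) -> bool:
    """Guess whether a page should go to the AI backend (arbitrary thresholds)."""
    if page.has_grid and page.tables_found > 0:
        return False  # clearly bordered table: keep it local
    ratio = page.ruling_lines / max(page.text_blocks, 1)
    # Many text blocks with few ruling lines is treated as a suspected
    # borderless table in this sketch.
    return page.text_blocks >= 20 and ratio < max_ratio


def process_local(page: PageStats) -> tuple[int, str]:
    return (page.page_number, "local")    # placeholder for local extraction


def process_backend(page: PageStats) -> tuple[int, str]:
    return (page.page_number, "backend")  # placeholder for the AI backend call


def convert_pages(pages: list[PageStats]) -> list[tuple[int, str]]:
    """Run both paths concurrently, then merge results in page order."""
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(process_backend if is_complex(p) else process_local, p)
            for p in pages
        ]
        results = [f.result() for f in futures]
    return sorted(results, key=lambda r: r[0])


pages = [
    PageStats(page_number=2, ruling_lines=1, text_blocks=40, has_grid=False, tables_found=0),
    PageStats(page_number=1, ruling_lines=30, text_blocks=25, has_grid=True, tables_found=1),
]
print(convert_pages(pages))  # [(1, 'local'), (2, 'backend')]
```

The point of the design is that the cheap check runs on every page, so only the pages that look difficult pay the cost of the AI backend.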
Benchmark Context
| Engine | Table Accuracy (TEDS) |
|---|---|
| OpenDataLoader hybrid | 0.928 |
| docling | 0.887 |
| MinerU | 0.873 |
| marker | 0.808 |
| nutrient | 0.708 |
| unstructured hi_res | 0.588 |
| PyMuPDF4LLM | 0.401 |
Summary
In this post, I showed how to extract tables from PDFs with OpenDataLoader. The key point is to use local mode for simple bordered tables and hybrid mode for complex or borderless tables. Hybrid mode reaches 0.928 TEDS, the highest accuracy among the engines benchmarked above.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨‍💻 OpenDataLoader PDF GitHub Repository
- 👨‍💻 OpenDataLoader Hybrid Mode Design
Oh, and if you found these resources useful, don't forget to support me by starring the repo on GitHub!