
How to Extract Formulas and Charts from Scientific PDFs with OpenDataLoader

Problem

When I processed scientific PDFs through my RAG pipeline, I noticed that mathematical formulas and chart data were completely missing from the output. Standard text extraction just skips them — formulas are rendered as vector graphics, and charts are embedded images. For researchers and analysts relying on these documents, this means critical information is lost.

Typical text extraction from a scientific PDF
The results show that as temperature increases, [missing formula]
Figure 1 illustrates the relationship between [missing chart description]

I needed a way to extract LaTeX formulas and generate descriptions for charts and figures automatically.

How OpenDataLoader handles it

OpenDataLoader’s hybrid mode has two enrichment flags for this exact purpose:

  • --enrich-formula — detects mathematical formulas and converts them to LaTeX strings
  • --enrich-picture-description — uses a lightweight vision model (SmolVLM 256M) to describe charts and images

Both require hybrid mode enabled on the client side with --hybrid-mode full.

Step 1: Start the backend with enrichment

First, start the hybrid backend with the enrichment flags you need:

Start backend with formula enrichment
opendataloader-pdf-hybrid --enrich-formula

For chart descriptions:

Start backend with chart description enrichment
opendataloader-pdf-hybrid --enrich-picture-description

Or both:

Start backend with both enrichments
opendataloader-pdf-hybrid --enrich-formula --enrich-picture-description

You can also customize the chart description prompt:

Custom prompt for chart descriptions
opendataloader-pdf-hybrid --enrich-picture-description \
    --picture-description-prompt "Describe this chart in detail, focusing on trends and data points"
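If you launch the backend from a script rather than a shell, the flag combinations above can be assembled programmatically. This is a minimal sketch: `build_backend_command` is a hypothetical helper of mine, not part of OpenDataLoader, and it only uses the three flags shown in this post.

```python
import shlex


def build_backend_command(enrich_formula=False, enrich_pictures=False, picture_prompt=None):
    """Assemble the argv list for the hybrid backend (hypothetical helper)."""
    cmd = ["opendataloader-pdf-hybrid"]
    if enrich_formula:
        cmd.append("--enrich-formula")
    if enrich_pictures:
        cmd.append("--enrich-picture-description")
    if picture_prompt is not None:
        # A custom prompt is only meaningful when picture descriptions are on.
        if not enrich_pictures:
            raise ValueError("picture_prompt requires enrich_pictures=True")
        cmd += ["--picture-description-prompt", picture_prompt]
    return cmd


cmd = build_backend_command(
    enrich_formula=True,
    enrich_pictures=True,
    picture_prompt="Describe this chart in detail, focusing on trends and data points",
)
# Print a shell-safe version of the command for logging or copy-paste.
print(shlex.join(cmd))

# To actually start the backend (assumes the CLI is installed and on PATH):
# import subprocess
# backend = subprocess.Popen(cmd)
```

Passing the command as a list avoids quoting problems with the prompt text, which contains spaces and a comma.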

Step 2: Process PDFs with full hybrid mode

On the client side, you must use --hybrid-mode full — otherwise enrichment flags are silently ignored:

Process with full hybrid mode
opendataloader-pdf --hybrid docling-fast --hybrid-mode full paper.pdf -o output/

Or using the Python API:

Python API with full hybrid mode
import opendataloader_pdf
opendataloader_pdf.convert(
    input_path=["paper.pdf"],
    output_dir="output/",
    format="json",
    hybrid="docling-fast",
    hybrid_mode="full"
)

What you get

Formula extraction output

Each detected formula is output as a JSON element with the LaTeX content and bounding box coordinates:

Formula extraction example
{
  "type": "formula",
  "page number": 1,
  "bounding box": [226.2, 144.7, 377.1, 168.7],
  "content": "\\frac{f(x+h) - f(x)}{h}"
}

This means you can render the formula in a LaTeX-compatible viewer or keep it as structured data for search indexing.
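To pull the formulas out of the output for indexing, something like the following could work. It is a sketch under assumptions: this post only shows single elements, so the code walks the parsed JSON recursively instead of relying on a particular top-level layout. The `kids` wrapper in the sample is a placeholder I made up for illustration.

```python
def iter_elements(node, wanted_type):
    """Yield every dict whose "type" equals wanted_type, at any nesting depth."""
    if isinstance(node, dict):
        if node.get("type") == wanted_type:
            yield node
        for value in node.values():
            yield from iter_elements(value, wanted_type)
    elif isinstance(node, list):
        for item in node:
            yield from iter_elements(item, wanted_type)


# Sample document; the "kids" nesting is an assumption, the formula
# element itself matches the example shown in this post.
sample = {
    "kids": [
        {"type": "paragraph", "page number": 1, "content": "The results show that"},
        {
            "type": "formula",
            "page number": 1,
            "bounding box": [226.2, 144.7, 377.1, 168.7],
            "content": r"\frac{f(x+h) - f(x)}{h}",
        },
    ]
}

formulas = list(iter_elements(sample, "formula"))
for formula in formulas:
    print(formula["page number"], formula["content"])

# For a real file, load it first (the file name here is a guess):
# import json
# with open("output/paper.json", encoding="utf-8") as fh:
#     formulas = list(iter_elements(json.load(fh), "formula"))
```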

Chart/image description output

Charts and images get natural language descriptions generated by SmolVLM:

Chart description example
{
  "type": "picture",
  "page number": 3,
  "bounding box": [72.0, 400.0, 540.0, 600.0],
  "description": "A bar chart showing waste generation by region from 2016 to 2030. East Asia shows the highest values with a steady upward trend."
}

These descriptions are full-text searchable, which is a huge advantage for RAG systems. A query can never match raw binary image data, but it can match a description, so you can now retrieve chart content by what the chart shows.
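As a rough illustration of what "searchable" means here, the sketch below does a naive keyword match over picture descriptions. The function `find_pictures` is hypothetical, and a real RAG pipeline would embed the descriptions rather than match substrings. The element fields are the ones shown in the example above.

```python
def find_pictures(elements, query):
    """Return picture elements whose description contains every query word."""
    words = query.lower().split()
    hits = []
    for element in elements:
        if element.get("type") != "picture":
            continue
        text = element.get("description", "").lower()
        if all(word in text for word in words):
            hits.append(element)
    return hits


elements = [
    {
        "type": "picture",
        "page number": 3,
        "bounding box": [72.0, 400.0, 540.0, 600.0],
        "description": (
            "A bar chart showing waste generation by region from 2016 to 2030. "
            "East Asia shows the highest values with a steady upward trend."
        ),
    },
    {
        "type": "formula",
        "page number": 1,
        "bounding box": [226.2, 144.7, 377.1, 168.7],
        "content": r"\frac{f(x+h) - f(x)}{h}",
    },
]

hits = find_pictures(elements, "waste region")
for hit in hits:
    print(f'page {hit["page number"]}: {hit["description"]}')
```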

Important: full hybrid mode is required

Both enrichments only run on the AI backend. If you don’t pass --hybrid-mode full on the client side, the enrichment flags are silently ignored. No warning, no error — just no enrichment. I missed this the first time and wasted 30 minutes debugging.

Correct: full mode
opendataloader-pdf --hybrid docling-fast --hybrid-mode full paper.pdf -o output/

Wrong: enrichment will be ignored
opendataloader-pdf --hybrid docling-fast --hybrid-mode light paper.pdf -o output/
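Because nothing warns you when enrichment is skipped, a small sanity check on the output can save the debugging time. This sketch assumes that unenriched output has no formula elements with content and no `description` field on pictures. That is my reading of "just no enrichment", not documented behavior, and the `paragraph` type in the sample is a placeholder.

```python
def enrichment_summary(elements):
    """Count elements that carry enrichment output."""
    summary = {"formulas": 0, "described_pictures": 0}
    for element in elements:
        if element.get("type") == "formula" and element.get("content"):
            summary["formulas"] += 1
        elif element.get("type") == "picture" and element.get("description"):
            summary["described_pictures"] += 1
    return summary


def looks_unenriched(elements):
    """True when no element shows any sign of enrichment."""
    return not any(enrichment_summary(elements).values())


enriched = [
    {"type": "formula", "page number": 1, "content": r"\frac{f(x+h) - f(x)}{h}"},
    {"type": "picture", "page number": 3, "description": "A bar chart showing waste generation by region."},
]
# Assumed shape of output produced without full hybrid mode.
plain = [
    {"type": "paragraph", "page number": 1, "content": "The results show that"},
    {"type": "picture", "page number": 3, "bounding box": [72.0, 400.0, 540.0, 600.0]},
]

for name, elements in (("enriched", enriched), ("plain", plain)):
    if looks_unenriched(elements):
        print(f"{name}: no enrichment found, check --hybrid-mode full")
    else:
        print(f"{name}: {enrichment_summary(elements)}")
```

A document with no formulas or figures at all will also report zero, so treat the check as a hint rather than proof.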

The reason this matters

Scientific PDFs are full of non-text content. Standard parsers treat formulas and charts as opaque objects. By enriching the output with LaTeX strings and AI descriptions, you:

  • Make math searchable in your RAG index
  • Enable accessibility alt text for visually impaired users
  • Preserve the semantic meaning of charts and graphs
  • Get bounding box coordinates for precise source citation
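To make the last two points concrete, here is a minimal sketch that turns an enriched element into an index record with a source citation. `cite` and `to_index_record` are hypothetical helpers. The bounding box values are passed through unchanged because this post does not say which corner each number refers to.

```python
def cite(element, source="paper.pdf"):
    """Format a human-readable source citation for one element."""
    x0, y0, x1, y1 = element["bounding box"]
    return f'{source}, page {element["page number"]}, bbox ({x0}, {y0}, {x1}, {y1})'


def to_index_record(element, source="paper.pdf"):
    """Build a record for a search index: searchable text plus its citation."""
    # Formulas carry LaTeX in "content"; pictures carry text in "description".
    text = element.get("content") or element.get("description", "")
    return {"type": element["type"], "text": text, "citation": cite(element, source)}


formula = {
    "type": "formula",
    "page number": 1,
    "bounding box": [226.2, 144.7, 377.1, 168.7],
    "content": r"\frac{f(x+h) - f(x)}{h}",
}

record = to_index_record(formula)
print(record["citation"])
# prints: paper.pdf, page 1, bbox (226.2, 144.7, 377.1, 168.7)
```

For pictures, the same `text` value can double as alt text when the figure is rendered on a web page.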

All of this runs locally on CPU and is free under Apache 2.0.

Summary

In this post, I showed how to extract LaTeX formulas and AI-generated chart descriptions from scientific PDFs using OpenDataLoader hybrid mode. The key point is enabling --enrich-formula and/or --enrich-picture-description on the backend and using --hybrid-mode full on the client — without full mode, enrichment flags are silently ignored.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
