How to Extract Formulas and Charts from Scientific PDFs with OpenDataLoader

Jun 4, 2026

Problem

When I processed scientific PDFs through my RAG pipeline, I noticed that mathematical formulas and chart data were completely missing from the output. Standard text extraction skips them because formulas are often rendered as vector graphics and charts are embedded as images. For researchers and analysts relying on these documents, this means critical information is lost. The extracted text ends up with gaps like these:

The results show that as temperature increases, [missing formula]
Figure 1 illustrates the relationship between [missing chart description]

I needed a way to extract LaTeX formulas and generate descriptions for charts and figures automatically.

How OpenDataLoader handles it

OpenDataLoader’s hybrid mode has two enrichment flags for this exact purpose:

--enrich-formula — detects mathematical formulas and converts them to LaTeX strings
--enrich-picture-description — uses a lightweight vision model (SmolVLM 256M) to describe charts and images

Both require hybrid mode enabled on the client side with --hybrid-mode full.

Step 1: Start the backend with enrichment

First, start the hybrid backend with the enrichment flags you need:

opendataloader-pdf-hybrid --enrich-formula

For chart descriptions:

opendataloader-pdf-hybrid --enrich-picture-description

Or both:

opendataloader-pdf-hybrid --enrich-formula --enrich-picture-description

You can also customize the chart description prompt:

opendataloader-pdf-hybrid --enrich-picture-description \
  --picture-description-prompt "Describe this chart in detail, focusing on trends and data points"
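If you launch the backend from a script rather than a shell, the same flags can be assembled programmatically. The following is a minimal sketch, assuming the opendataloader-pdf-hybrid executable is on your PATH. The helper name build_backend_command is hypothetical, and it uses only the flags shown above.

```python
import shlex
from typing import List, Optional


def build_backend_command(
    enrich_formula: bool = False,
    enrich_picture_description: bool = False,
    picture_prompt: Optional[str] = None,
) -> List[str]:
    """Assemble the argv list for the hybrid backend (hypothetical helper)."""
    cmd = ["opendataloader-pdf-hybrid"]
    if enrich_formula:
        cmd.append("--enrich-formula")
    if enrich_picture_description:
        cmd.append("--enrich-picture-description")
    if picture_prompt is not None:
        # Assumption: a custom prompt is only meaningful when picture
        # descriptions are enabled, so reject the combination early.
        if not enrich_picture_description:
            raise ValueError("picture_prompt requires enrich_picture_description=True")
        cmd += ["--picture-description-prompt", picture_prompt]
    return cmd


if __name__ == "__main__":
    cmd = build_backend_command(
        enrich_formula=True,
        enrich_picture_description=True,
        picture_prompt="Describe this chart in detail, focusing on trends and data points",
    )
    # Print a shell-safe command line; pass `cmd` to subprocess.Popen to launch it.
    print(shlex.join(cmd))
```

Building the argument list instead of a single string avoids quoting problems when the prompt contains spaces or punctuation.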

Step 2: Process PDFs with full hybrid mode

On the client side, you must use --hybrid-mode full — otherwise enrichment flags are silently ignored:

opendataloader-pdf --hybrid docling-fast --hybrid-mode full paper.pdf -o output/

Or using the Python API:

import opendataloader_pdf

opendataloader_pdf.convert(
    input_path=["paper.pdf"],
    output_dir="output/",
    format="json",
    hybrid="docling-fast",
    hybrid_mode="full"
)

What you get

Formula extraction output

Each detected formula is output as a JSON element with the LaTeX content and bounding box coordinates:

{
  "type": "formula",
  "page number": 1,
  "bounding box": [226.2, 144.7, 377.1, 168.7],
  "content": "\\frac{f(x+h) - f(x)}{h}"
}

This means you can render the formula in a LaTeX-compatible viewer or keep it as structured data for search indexing.
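As a rough illustration of the rendering path, the sketch below walks a parsed JSON document, collects every formula element, and wraps its LaTeX in $$...$$ so a Markdown or MathJax-style viewer can display it. The function names are hypothetical, and because I am not assuming any particular nesting of the output file, the walker simply searches every dict and list it finds.

```python
import json
from typing import Any, Dict, Iterator, List


def iter_elements(node: Any) -> Iterator[Dict[str, Any]]:
    """Yield every dict that has a "type" key, at any nesting depth."""
    if isinstance(node, dict):
        if "type" in node:
            yield node
        for value in node.values():
            yield from iter_elements(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_elements(item)


def formulas_as_display_math(document: Any) -> List[str]:
    """Wrap the LaTeX content of each formula element in $$...$$."""
    return [
        f"$${element['content']}$$"
        for element in iter_elements(document)
        if element.get("type") == "formula" and element.get("content")
    ]


if __name__ == "__main__":
    # The formula element from the example above, placed in a plain list.
    sample = json.loads(r'''
    [
      {
        "type": "formula",
        "page number": 1,
        "bounding box": [226.2, 144.7, 377.1, 168.7],
        "content": "\\frac{f(x+h) - f(x)}{h}"
      }
    ]
    ''')
    for line in formulas_as_display_math(sample):
        print(line)
```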

Chart/image description output

Charts and images get natural language descriptions generated by SmolVLM:

{
  "type": "picture",
  "page number": 3,
  "bounding box": [72.0, 400.0, 540.0, 600.0],
  "description": "A bar chart showing waste generation by region from 2016 to 2030. East Asia shows the highest values with a steady upward trend."
}

These descriptions are full-text searchable, which is a huge advantage for RAG systems. Instead of coming up empty whenever a query touches content locked inside binary image data, you can now retrieve chart content by its description.
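To make the retrieval idea concrete, here is a minimal sketch that turns picture elements into text chunks ready for indexing. It assumes you already have a flat list of element dicts with the keys shown above. The chunk layout ("text" plus "metadata") and the function name picture_chunks are hypothetical, so adapt them to whatever your vector store expects.

```python
from typing import Any, Dict, Iterable, List


def picture_chunks(elements: Iterable[Dict[str, Any]], source: str) -> List[Dict[str, Any]]:
    """Build one indexable chunk per picture element that has a description."""
    chunks = []
    for element in elements:
        if element.get("type") != "picture":
            continue
        description = (element.get("description") or "").strip()
        if not description:
            continue  # no enrichment for this picture, so nothing to index
        chunks.append(
            {
                "text": description,
                "metadata": {
                    "source": source,
                    "page": element.get("page number"),
                    "bbox": element.get("bounding box"),
                },
            }
        )
    return chunks


if __name__ == "__main__":
    elements = [
        {
            "type": "picture",
            "page number": 3,
            "bounding box": [72.0, 400.0, 540.0, 600.0],
            "description": "A bar chart showing waste generation by region from 2016 to 2030.",
        },
        {"type": "picture", "page number": 4},  # picture without a description
    ]
    for chunk in picture_chunks(elements, source="paper.pdf"):
        print(chunk["metadata"]["page"], chunk["text"])
```

Keeping the page number and bounding box in the metadata means a retrieved chunk can still point back to the exact figure it came from.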

Important: full hybrid mode is required

Both enrichments only run on the AI backend. If you don’t pass --hybrid-mode full on the client side, the enrichment flags are silently ignored. No warning, no error — just no enrichment. I missed this the first time and wasted 30 minutes debugging.

This runs the enrichments:

opendataloader-pdf --hybrid docling-fast --hybrid-mode full paper.pdf -o output/

This does not, because the client is not in full mode:

opendataloader-pdf --hybrid docling-fast --hybrid-mode light paper.pdf -o output/
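Since nothing warns you when enrichment is skipped, a small sanity check after conversion can save the debugging time. The sketch below counts formula elements that carry LaTeX content and picture elements that carry a description, and prints a warning when it finds none. The function names are hypothetical, and I am assuming only the element keys shown earlier, not any particular nesting of the JSON file.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Any


def count_enriched(document: Any) -> Counter:
    """Count formulas with content and pictures with a description."""
    counts: Counter = Counter()
    stack = [document]
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            if node.get("type") == "formula" and node.get("content"):
                counts["formula"] += 1
            elif node.get("type") == "picture" and node.get("description"):
                counts["picture_description"] += 1
            stack.extend(node.values())
        elif isinstance(node, list):
            stack.extend(node)
    return counts


def check_output(json_path: str) -> Counter:
    """Load one output file and warn if no enriched element was found."""
    document = json.loads(Path(json_path).read_text(encoding="utf-8"))
    counts = count_enriched(document)
    if not counts:
        print(
            f"warning: no enriched elements in {json_path}; "
            "was --hybrid-mode full passed on the client?"
        )
    return counts
```

Call check_output with the path of a JSON file in your output directory. A zero count is only a hint: a PDF that genuinely contains no formulas or figures produces the same result.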

The reason this matters

Scientific PDFs are full of non-text content. Standard parsers treat formulas and charts as opaque objects. By enriching the output with LaTeX strings and AI descriptions, you:

Make math searchable in your RAG index
Enable accessibility alt text for visually impaired users
Preserve the semantic meaning of charts and graphs
Get bounding box coordinates for precise source citation

All of this runs locally on CPU and is free under Apache 2.0.
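The source-citation point can be sketched in a few lines. This is a minimal example, assuming the "page number" and "bounding box" keys from the output shown earlier. The function name format_citation and the citation layout are hypothetical, and the coordinates are passed through as-is without assuming a particular corner order.

```python
from typing import Any, Dict


def format_citation(element: Dict[str, Any], source: str) -> str:
    """Build a short human-readable pointer to where an element sits in the PDF."""
    parts = [source]
    page = element.get("page number")
    if page is not None:
        parts.append(f"p. {page}")
    bbox = element.get("bounding box")
    if bbox:
        coords = ", ".join(f"{value:.1f}" for value in bbox)
        parts.append(f"bbox [{coords}]")
    return ", ".join(parts)


if __name__ == "__main__":
    formula = {
        "type": "formula",
        "page number": 1,
        "bounding box": [226.2, 144.7, 377.1, 168.7],
        "content": "\\frac{f(x+h) - f(x)}{h}",
    }
    print(format_citation(formula, "paper.pdf"))
    # paper.pdf, p. 1, bbox [226.2, 144.7, 377.1, 168.7]
```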

Summary

In this post, I showed how to extract LaTeX formulas and AI-generated chart descriptions from scientific PDFs using OpenDataLoader hybrid mode. The key point is enabling --enrich-formula and/or --enrich-picture-description on the backend and using --hybrid-mode full on the client — without full mode, enrichment flags are silently ignored.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.

Oh, and if you found this article useful, don’t forget to support me by starring the repo on GitHub!