How to Extract Formulas and Charts from Scientific PDFs with OpenDataLoader
Problem
When I processed scientific PDFs through my RAG pipeline, I noticed that mathematical formulas and chart data were completely missing from the output. Standard text extraction just skips them — formulas are rendered as vector graphics, and charts are embedded images. For researchers and analysts relying on these documents, this means critical information is lost.
What I got looked like this:
The results show that as temperature increases, [missing formula]
Figure 1 illustrates the relationship between [missing chart description]
I needed a way to extract LaTeX formulas and generate descriptions for charts and figures automatically.
How OpenDataLoader handles it
OpenDataLoader’s hybrid mode has two enrichment flags for this exact purpose:
- --enrich-formula: detects mathematical formulas and converts them to LaTeX strings
- --enrich-picture-description: uses a lightweight vision model (SmolVLM 256M) to describe charts and images
Both require hybrid mode enabled on the client side with --hybrid-mode full.
Step 1: Start the backend with enrichment
First, start the hybrid backend with the enrichment flags you need:
opendataloader-pdf-hybrid --enrich-formula
For chart descriptions:
opendataloader-pdf-hybrid --enrich-picture-description
Or both:
opendataloader-pdf-hybrid --enrich-formula --enrich-picture-description
You can also customize the chart description prompt:
opendataloader-pdf-hybrid --enrich-picture-description \
  --picture-description-prompt "Describe this chart in detail, focusing on trends and data points"
Step 2: Process PDFs with full hybrid mode
On the client side, you must use --hybrid-mode full — otherwise enrichment flags are silently ignored:
opendataloader-pdf --hybrid docling-fast --hybrid-mode full paper.pdf -o output/
Or using the Python API:
import opendataloader_pdf

opendataloader_pdf.convert(
    input_path=["paper.pdf"],
    output_dir="output/",
    format="json",
    hybrid="docling-fast",
    hybrid_mode="full",
)
What you get
Formula extraction output
Each detected formula is output as a JSON element with the LaTeX content and bounding box coordinates:
{
  "type": "formula",
  "page number": 1,
  "bounding box": [226.2, 144.7, 377.1, 168.7],
  "content": "\\frac{f(x+h) - f(x)}{h}"
}
This means you can render the formula in a LaTeX-compatible viewer or keep it as structured data for search indexing.
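As a rough sketch of the rendering side, the snippet below wraps each formula's LaTeX in display-math delimiters that most Markdown and LaTeX viewers accept. It assumes the elements have already been loaded into a flat Python list; the real output file may nest elements differently. The `paragraph` element and the helper name `formulas_as_markdown` are made up for illustration.

```python
# Hypothetical, already-loaded elements. The formula entry mirrors the
# JSON element shown above; the paragraph entry is invented for the demo.
SAMPLE = [
    {"type": "paragraph", "page number": 1, "content": "The derivative is"},
    {
        "type": "formula",
        "page number": 1,
        "bounding box": [226.2, 144.7, 377.1, 168.7],
        "content": "\\frac{f(x+h) - f(x)}{h}",
    },
]


def formulas_as_markdown(elements):
    """Return (page, display-math string) pairs for every formula element."""
    rendered = []
    for el in elements:
        # Skip non-formula elements and formulas without LaTeX content.
        if el.get("type") == "formula" and el.get("content"):
            rendered.append((el["page number"], f"$${el['content']}$$"))
    return rendered


for page, markdown in formulas_as_markdown(SAMPLE):
    print(f"page {page}: {markdown}")
```

Keeping the page number next to the rendered string means a search hit on the LaTeX can still be traced back to its page.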
Chart/image description output
Charts and images get natural language descriptions generated by SmolVLM:
{
  "type": "picture",
  "page number": 3,
  "bounding box": [72.0, 400.0, 540.0, 600.0],
  "description": "A bar chart showing waste generation by region from 2016 to 2030. East Asia shows the highest values with a steady upward trend."
}
These descriptions are full-text searchable, which is a huge advantage for RAG systems. Instead of a search coming up empty against binary image data, you can now retrieve chart content by its description.
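To illustrate what "searchable" means here, this is a minimal keyword search over picture descriptions. It is a plain substring match, not what a real RAG system would use (that would typically be embeddings or a full-text index), and it again assumes a flat list of already-loaded elements. `search_pictures` is a hypothetical helper, not part of OpenDataLoader.

```python
# Picture element copied from the example output above.
PICTURES = [
    {
        "type": "picture",
        "page number": 3,
        "bounding box": [72.0, 400.0, 540.0, 600.0],
        "description": (
            "A bar chart showing waste generation by region from 2016 to "
            "2030. East Asia shows the highest values with a steady "
            "upward trend."
        ),
    },
]


def search_pictures(elements, query):
    """Return picture elements whose description contains every query word."""
    words = query.lower().split()
    hits = []
    for el in elements:
        if el.get("type") != "picture":
            continue
        text = el.get("description", "").lower()
        # Case-insensitive substring match on each query word.
        if all(word in text for word in words):
            hits.append(el)
    return hits


for hit in search_pictures(PICTURES, "waste region"):
    print(hit["page number"], hit["description"])
```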
Important: full hybrid mode is required
Both enrichments only run on the AI backend. If you don’t pass --hybrid-mode full on the client side, the enrichment flags are silently ignored. No warning, no error — just no enrichment. I missed this the first time and wasted 30 minutes debugging.
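Since nothing warns you when enrichment is skipped, a small sanity check on the output can save that debugging time. The sketch below counts formula elements that carry LaTeX content and picture elements that carry a description. It walks the whole JSON tree because I am not assuming any particular nesting; the `kids` key in the demo data is hypothetical.

```python
from collections import Counter


def walk(node):
    """Yield every dict found anywhere inside a nested JSON structure."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from walk(value)
    elif isinstance(node, list):
        for item in node:
            yield from walk(item)


def enrichment_report(doc):
    """Count enriched elements; zero counts suggest enrichment never ran."""
    counts = Counter()
    for el in walk(doc):
        if el.get("type") == "formula" and el.get("content"):
            counts["formula"] += 1
        elif el.get("type") == "picture" and el.get("description"):
            counts["picture_description"] += 1
    return counts


# Demo data: one enriched formula and one picture without a description.
doc = {"kids": [{"type": "formula", "content": "E = mc^2"}, {"type": "picture"}]}
report = enrichment_report(doc)
if report["picture_description"] == 0:
    print("No picture descriptions found: check --hybrid-mode full")
```

On a real run you would pass in the result of `json.load` on the generated JSON file; its exact name and location depend on your output settings.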
For comparison, the first command below runs the enrichments and the second does not:
opendataloader-pdf --hybrid docling-fast --hybrid-mode full paper.pdf -o output/
opendataloader-pdf --hybrid docling-fast --hybrid-mode light paper.pdf -o output/
The reason this matters
Scientific PDFs are full of non-text content. Standard parsers treat formulas and charts as opaque objects. By enriching the output with LaTeX strings and AI descriptions, you:
- Make math searchable in your RAG index
- Enable accessibility alt text for visually impaired users
- Preserve the semantic meaning of charts and graphs
- Get bounding box coordinates for precise source citation
All of this runs locally on CPU and is free under Apache 2.0.
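The benefits listed above can be sketched as a single mapping step: each enriched element becomes an index record holding the text to index plus a citation built from the page number and bounding box. The record layout and the helper name `to_index_record` are my own assumptions, and I leave the bounding box values in the order they appear rather than guessing which corner is which.

```python
def to_index_record(element, source):
    """Turn a formula or picture element into a text chunk with a citation."""
    # Formulas carry LaTeX in "content", pictures carry "description".
    text = element.get("content") or element.get("description") or ""
    bbox = ", ".join(str(value) for value in element["bounding box"])
    return {
        "text": text,
        "citation": f"{source}, page {element['page number']}, bbox ({bbox})",
    }


picture = {
    "type": "picture",
    "page number": 3,
    "bounding box": [72.0, 400.0, 540.0, 600.0],
    "description": "A bar chart showing waste generation by region.",
}
record = to_index_record(picture, "paper.pdf")
print(record["citation"])
```

The same `text` value can double as alt text when the picture is re-published in HTML.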
Summary
In this post, I showed how to extract LaTeX formulas and AI-generated chart descriptions from scientific PDFs using OpenDataLoader hybrid mode. The key point is enabling --enrich-formula and/or --enrich-picture-description on the backend and using --hybrid-mode full on the client — without full mode, enrichment flags are silently ignored.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!