How to Convert PDF to Markdown for RAG Pipelines with OpenDataLoader

Jun 4, 2026

Purpose

This post demonstrates how to convert PDF documents into clean, structured Markdown suitable for RAG chunking and embedding. The key decision is whether you need Markdown for simple chunking or JSON for element-level control with source citations.

Why Markdown for RAG

LLMs understand structured text better than raw PDF output. Markdown preserves:

Headings → semantic chunk boundaries
Lists → structured information
Tables → relational data
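To make the "headings as chunk boundaries" idea concrete, here is a minimal sketch that splits a Markdown string at every ATX heading (`#` to `######`). It uses only the standard library. The function name `split_by_headings` and the sample text are hypothetical, not part of OpenDataLoader. Note that it does not skip fenced code blocks, so a `# comment` line inside a code fence would be treated as a heading.

```python
import re

# Matches ATX headings such as "# Title" or "### Subsection"
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown_text):
    """Split Markdown into chunks, starting a new chunk at every heading."""
    chunks = []
    current = {"heading": None, "level": 0, "lines": []}

    def has_content(chunk):
        return chunk["heading"] is not None or any(l.strip() for l in chunk["lines"])

    for line in markdown_text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            # Close the previous chunk before opening a new one
            if has_content(current):
                chunks.append(current)
            current = {
                "heading": match.group(2).strip(),
                "level": len(match.group(1)),
                "lines": [],
            }
        else:
            current["lines"].append(line)

    if has_content(current):
        chunks.append(current)

    return [
        {"heading": c["heading"], "level": c["level"], "text": "\n".join(c["lines"]).strip()}
        for c in chunks
    ]


# Hypothetical sample input
sample = """# Report

Intro paragraph.

## Results

- item one
- item two
"""

chunks = split_by_headings(sample)
for chunk in chunks:
    print(chunk["level"], chunk["heading"], repr(chunk["text"]))
```

Each chunk keeps its heading and level, so you can prepend the heading to the text before embedding if you want the topic to travel with the chunk.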

Two Strategies

Markdown Mode (Simple Chunking)

import opendataloader_pdf

opendataloader_pdf.convert(
    input_path=["document.pdf"],
    output_dir="output/",
    format="markdown"
)

The resulting Markdown in output/ is clean text that can be fed directly into LangChain's RecursiveCharacterTextSplitter.

JSON Mode (Element-Level Control)

import opendataloader_pdf

# Request both JSON and Markdown output for the same document
opendataloader_pdf.convert(
    input_path=["document.pdf"],
    output_dir="output/",
    format="json,markdown",
    reading_order="xycut"
)

Each element in JSON output includes type, content, page number, bounding box, and heading level. This enables precise source citations.
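As a rough illustration of how those fields could feed a citation, the sketch below turns a list of elements into chunks with metadata. The key names ("page number", "bounding box", "heading level"), the flat list layout, and the sample values are assumptions made for this example; check them against the JSON your version of OpenDataLoader actually writes. The helpers `to_chunks` and `format_citation` are hypothetical.

```python
import json

# Hypothetical sample that mimics the fields named above.
# Real output may use different key names or nest elements.
sample_json = """
[
  {"type": "heading", "content": "Results", "page number": 3,
   "bounding box": [72.0, 700.0, 300.0, 720.0], "heading level": 2},
  {"type": "paragraph", "content": "Revenue grew 12% year over year.",
   "page number": 3, "bounding box": [72.0, 640.0, 540.0, 690.0]}
]
"""


def to_chunks(elements):
    """Turn each non-empty element into a chunk with citation metadata."""
    chunks = []
    for el in elements:
        text = el.get("content", "").strip()
        if not text:
            continue  # skip elements without text
        chunks.append({
            "text": text,
            "metadata": {
                "type": el.get("type"),
                "page": el.get("page number"),
                "bbox": el.get("bounding box"),
                "heading_level": el.get("heading level"),  # None for non-headings
            },
        })
    return chunks


def format_citation(chunk):
    """Build a human-readable source reference for a retrieved chunk."""
    meta = chunk["metadata"]
    return f"p. {meta['page']}, bbox {meta['bbox']}"


citation_chunks = to_chunks(json.loads(sample_json))
print(format_citation(citation_chunks[1]))
```

Storing the page and bounding box as metadata next to the text means the citation survives whatever vector store you put the chunks into.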

Chunking Strategies

The OpenDataLoader example codebase provides three strategies:

By element: one chunk per paragraph, heading, or list. Best for fine-grained retrieval and precise citations.
By section: groups content under headings. Best for context-rich retrieval and topic-based search.
Merged with minimum size: combines small paragraphs. Best for balanced chunk sizes and reduced noise.
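The following is my own minimal sketch of the second and third strategies, not the code from the OpenDataLoader examples. It assumes each element is a dict with "type" and "content" keys; the function names `chunk_by_section` and `merge_small` and the `min_chars` parameter are hypothetical. The first strategy, by element, is simply one chunk per item in the list.

```python
def chunk_by_section(elements):
    """Group elements under the most recent heading."""
    sections = []
    current = None
    for el in elements:
        if el["type"] == "heading":
            # Every heading opens a new section
            current = {"heading": el["content"], "parts": []}
            sections.append(current)
        else:
            if current is None:
                # Content that appears before the first heading
                current = {"heading": None, "parts": []}
                sections.append(current)
            current["parts"].append(el["content"])
    return [
        {"heading": s["heading"], "text": "\n\n".join(s["parts"])}
        for s in sections
    ]


def merge_small(texts, min_chars=200):
    """Combine consecutive texts until each chunk has at least min_chars."""
    merged = []
    buffer = ""
    for text in texts:
        buffer = f"{buffer}\n\n{text}" if buffer else text
        if len(buffer) >= min_chars:
            merged.append(buffer)
            buffer = ""
    if buffer:
        # Attach a short trailing remainder to the previous chunk if there is one
        if merged:
            merged[-1] = f"{merged[-1]}\n\n{buffer}"
        else:
            merged.append(buffer)
    return merged


# Hypothetical sample elements
elements = [
    {"type": "paragraph", "content": "Preface."},
    {"type": "heading", "content": "A"},
    {"type": "paragraph", "content": "one"},
    {"type": "paragraph", "content": "two"},
    {"type": "heading", "content": "B"},
    {"type": "list", "content": "- x"},
]

sections = chunk_by_section(elements)
merged = merge_small(["aaaa", "bb", "cccccc", "d"], min_chars=5)
print(sections)
print(merged)
```

Merging by character count is the simplest option; a token count from your embedding model's tokenizer would be a closer fit if chunk size limits matter.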

LangChain Integration

from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

loader = OpenDataLoaderPDFLoader(
    file_path=["document.pdf"],
    format="text"
)
documents = loader.load()

Source Citations

The JSON bounding box [left, bottom, right, top] in PDF points (72pt = 1 inch) lets you map each chunk back to its exact location. When your RAG pipeline returns a chunk, you can highlight the exact paragraph, table, or figure in the original PDF.
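To highlight a chunk on a rendered page image, the box has to be converted from PDF points to pixels. Here is a small sketch of that arithmetic, assuming the box uses the usual PDF convention of a bottom-left origin while the image uses a top-left origin. The function `bbox_to_pixels`, the page height, and the DPI value are illustrative assumptions.

```python
POINTS_PER_INCH = 72.0


def bbox_to_pixels(bbox, page_height_pt, dpi=150):
    """Convert [left, bottom, right, top] in PDF points (bottom-left origin)
    to (x0, y0, x1, y1) in pixels with a top-left origin."""
    left, bottom, right, top = bbox
    scale = dpi / POINTS_PER_INCH  # pixels per point
    x0 = left * scale
    x1 = right * scale
    # Flip the vertical axis: the top edge of the box becomes the smaller y
    y0 = (page_height_pt - top) * scale
    y1 = (page_height_pt - bottom) * scale
    return (round(x0), round(y0), round(x1), round(y1))


# Hypothetical box on a US Letter page (612 x 792 pt), rendered at 144 DPI
rect = bbox_to_pixels([72.0, 640.0, 540.0, 690.0], page_height_pt=792.0, dpi=144)
print(rect)
```

At 144 DPI the scale is exactly 2 pixels per point, which makes the result easy to check by hand: a left edge at 72 pt (one inch) lands at pixel 144.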

Summary

In this post, I showed how to convert PDF to Markdown for RAG pipelines using OpenDataLoader PDF. The key point is to use Markdown for simple chunking or JSON for element-level control with bounding-box source citations.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.

Here are the most important links from this article, along with some further resources on this topic:

👨‍💻 LangChain OpenDataLoader Integration
👨‍💻 OpenDataLoader PDF GitHub Repository

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!