How to Convert PDF to Markdown for RAG Pipelines with OpenDataLoader
Purpose
This post demonstrates how to convert PDF documents into clean, structured Markdown suitable for RAG chunking and embedding. The key decision is whether you need Markdown for simple chunking or JSON for element-level control with source citations.
Why Markdown for RAG
LLMs understand structured text better than raw PDF output. Markdown preserves:
- Headings → semantic chunk boundaries
- Lists → structured information
- Tables → relational data
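To make the "headings as chunk boundaries" idea concrete, here is a minimal sketch that splits a Markdown string at every ATX heading (`#` to `######`). It uses only the standard library. The function name `split_by_headings` is my own for illustration and is not part of OpenDataLoader. The sketch does not skip `#` lines inside code blocks, so treat it as a starting point.

```python
import re

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into chunks, starting a new chunk at every heading."""
    chunks = []
    current = {"heading": None, "level": 0, "lines": []}

    def has_content(chunk: dict) -> bool:
        return chunk["heading"] is not None or any(l.strip() for l in chunk["lines"])

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # A heading closes the previous chunk and opens a new one.
            if has_content(current):
                chunks.append(current)
            current = {
                "heading": match.group(2).strip(),
                "level": len(match.group(1)),
                "lines": [],
            }
        else:
            current["lines"].append(line)

    if has_content(current):
        chunks.append(current)

    return [
        {"heading": c["heading"], "level": c["level"], "text": "\n".join(c["lines"]).strip()}
        for c in chunks
    ]


if __name__ == "__main__":
    sample = "# Title\n\nIntro text.\n\n## Setup\n\n- step one\n- step two\n"
    for chunk in split_by_headings(sample):
        print(chunk["level"], chunk["heading"], "->", repr(chunk["text"]))
```

Each chunk keeps its heading and level, so you can prepend the heading to the text before embedding if you want the topic to travel with the chunk.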
Two Strategies
Markdown Mode (Simple Chunking)
import opendataloader_pdf
opendataloader_pdf.convert(
    input_path=["document.pdf"],
    output_dir="output/",
    format="markdown"
)
The output is clean text that feeds directly into RecursiveCharacterTextSplitter from LangChain.
JSON Mode (Element-Level Control)
opendataloader_pdf.convert(
    input_path=["document.pdf"],
    output_dir="output/",
    format="json,markdown",
    reading_order="xycut"
)
Each element in JSON output includes type, content, page number, bounding box, and heading level. This enables precise source citations.
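As a rough sketch of how those fields can become citation metadata, the snippet below walks a JSON document and emits one chunk per element. The key names (`"page number"`, `"bounding box"`, `"heading level"`, and a `"kids"` list for nested elements) are assumptions on my part, and the sample data is made up. Check both against the JSON your own conversion produces and adjust the constants at the top.

```python
import json

# ASSUMED key names; verify them against your actual JSON output.
KEY_TYPE = "type"
KEY_CONTENT = "content"
KEY_PAGE = "page number"
KEY_BBOX = "bounding box"
KEY_LEVEL = "heading level"
KEY_CHILDREN = "kids"  # assumed name of the nested-elements list


def iter_elements(node):
    """Yield every dict that carries text content, depth-first."""
    if isinstance(node, dict):
        if KEY_CONTENT in node:
            yield node
        for child in node.get(KEY_CHILDREN, []):
            yield from iter_elements(child)
    elif isinstance(node, list):
        for item in node:
            yield from iter_elements(item)


def to_chunks(document, source: str) -> list[dict]:
    """Turn elements into chunks that keep enough metadata for a citation."""
    return [
        {
            "text": el[KEY_CONTENT],
            "metadata": {
                "source": source,
                "type": el.get(KEY_TYPE),
                "page": el.get(KEY_PAGE),
                "bbox": el.get(KEY_BBOX),
                "heading_level": el.get(KEY_LEVEL),
            },
        }
        for el in iter_elements(document)
    ]


# Made-up sample data shaped like the assumed schema.
SAMPLE = json.loads("""
{
  "kids": [
    {"type": "heading", "heading level": 1, "page number": 1,
     "bounding box": [72.0, 700.0, 540.0, 720.0], "content": "Introduction"},
    {"type": "paragraph", "page number": 1,
     "bounding box": [72.0, 640.0, 540.0, 690.0], "content": "First paragraph."}
  ]
}
""")

if __name__ == "__main__":
    for chunk in to_chunks(SAMPLE, "document.pdf"):
        print(chunk["metadata"]["page"], chunk["metadata"]["type"], chunk["text"])
```

Storing the page and bounding box next to the text means the retriever can return a citation without reopening the JSON file.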
Chunking Strategies
The OpenDataLoader example codebase provides three strategies:
- By element: One chunk per paragraph/heading/list — fine-grained retrieval, precise citations
- By section: Groups content under headings — context-rich retrieval, topic-based search
- Merged with minimum size: Combines small paragraphs — balanced chunk sizes, reduced noise
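The three strategies above can be sketched in a few lines each. This is my own simplified illustration, not the code from the OpenDataLoader examples. It assumes each element is a dict with a `"type"` and a `"content"` key, and the function names and the `min_chars` parameter are hypothetical.

```python
def chunk_by_element(elements: list[dict]) -> list[str]:
    """One chunk per non-empty element."""
    return [el["content"] for el in elements if el["content"].strip()]


def chunk_by_section(elements: list[dict]) -> list[str]:
    """Group each heading with the elements that follow it."""
    sections, current = [], []
    for el in elements:
        # A heading starts a new section once the current one has content.
        if el["type"] == "heading" and current:
            sections.append("\n".join(current))
            current = []
        current.append(el["content"])
    if current:
        sections.append("\n".join(current))
    return sections


def merge_small(chunks: list[str], min_chars: int = 200) -> list[str]:
    """Concatenate consecutive chunks until each reaches min_chars."""
    merged, buffer = [], ""
    for chunk in chunks:
        buffer = f"{buffer}\n{chunk}" if buffer else chunk
        if len(buffer) >= min_chars:
            merged.append(buffer)
            buffer = ""
    if buffer:
        # Attach a short tail to the previous chunk rather than keeping it undersized.
        if merged:
            merged[-1] = f"{merged[-1]}\n{buffer}"
        else:
            merged.append(buffer)
    return merged


if __name__ == "__main__":
    elements = [
        {"type": "heading", "content": "Intro"},
        {"type": "paragraph", "content": "Short."},
        {"type": "heading", "content": "Details"},
        {"type": "paragraph", "content": "A longer paragraph about details."},
        {"type": "list", "content": "- item"},
    ]
    print(len(chunk_by_element(elements)), "element chunks")
    print(len(chunk_by_section(elements)), "section chunks")
    print(len(merge_small(chunk_by_element(elements), min_chars=20)), "merged chunks")
```

Note that `merge_small` ignores heading boundaries, so a heading can end up at the tail of the previous chunk. If that matters for your retrieval quality, run it on the output of `chunk_by_section` instead of on raw elements.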
LangChain Integration
from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
loader = OpenDataLoaderPDFLoader(
    file_path=["document.pdf"],
    format="text"
)
documents = loader.load()
Source Citations
The JSON bounding box [left, bottom, right, top] in PDF points (72pt = 1 inch) lets you map each chunk back to its exact location. When your RAG pipeline returns a chunk, you can highlight the exact paragraph, table, or figure in the original PDF.
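To highlight a chunk on a rendered page image, the box has to be converted from PDF points (origin at the bottom-left of the page) to pixels (origin at the top-left). The sketch below shows that arithmetic. It assumes an unrotated page whose lower-left corner is at (0, 0) and whose height in points you already know. The function names are my own.

```python
POINTS_PER_INCH = 72.0


def bbox_to_inches(bbox):
    """Convert [left, bottom, right, top] from PDF points to inches."""
    return tuple(value / POINTS_PER_INCH for value in bbox)


def bbox_to_pixels(bbox, page_height_pt: float, dpi: int = 150):
    """Convert a PDF-space box to an image-space rectangle (x0, y0, x1, y1).

    PDF space has its origin at the bottom-left; image space at the top-left,
    so the vertical axis is flipped using the page height.
    """
    left, bottom, right, top = bbox
    scale = dpi / POINTS_PER_INCH
    x0 = left * scale
    y0 = (page_height_pt - top) * scale     # top edge becomes the smaller y
    x1 = right * scale
    y1 = (page_height_pt - bottom) * scale  # bottom edge becomes the larger y
    return (round(x0), round(y0), round(x1), round(y1))


if __name__ == "__main__":
    box = [72, 648, 540, 720]  # a block one inch below the top of a US Letter page
    print(bbox_to_inches(box))
    print(bbox_to_pixels(box, page_height_pt=792, dpi=144))
```

The resulting rectangle can be passed to whatever drawing or viewer library you use for the highlight overlay.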
Summary
In this post, I showed how to convert PDF to Markdown for RAG pipelines using OpenDataLoader PDF. The key point is to use Markdown for simple chunking or JSON for element-level control with bounding-box source citations.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!