
How to Convert PDFs to Markdown for RAG Pipelines

Building a RAG pipeline? You need to convert PDFs to clean Markdown for embedding. Most tools are slow, produce poor-quality output, or can’t handle complex layouts like tables and headings.

I’ve been there. You find a PDF with important data, convert it to text, and end up with a jumbled mess. Headings get lost in the middle of paragraphs, tables become unreadable grid text, and images disappear entirely.

Here’s how to properly convert PDFs to Markdown for RAG pipelines.

Why Markdown is ideal for RAG

Markdown gives you the structure LLMs need without the token overhead of HTML or XML. Headings (#, ##) create clear document boundaries. Tables use the pipe syntax LLMs recognize naturally.

example_output.md
# Annual Report 2024
## Financial Overview
| Metric | Q1 | Q2 | Q3 | Q4 |
|--------|----|----|----|----|
| Revenue | $1.2M | $1.5M | $1.8M | $2.1M |
| Growth | 15% | 25% | 20% | 16% |
![Chart: Revenue Growth](img_chart_001.png)
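
Because headings mark section boundaries, they also make natural chunk boundaries before embedding. The following is a minimal sketch and not part of pdf_oxide: `split_by_headings` is a hypothetical helper that cuts Markdown at each heading and records the heading trail for every chunk. It assumes ATX-style headings (`#`, `##`, ...) and does not skip `#` lines inside code blocks.

```python
import re

# Matches ATX headings such as "# Title" or "## Section".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into sections, one per heading, keeping the heading trail."""
    sections: list[dict] = []
    path: list[tuple[int, str]] = []  # (level, title) of the enclosing headings
    body: list[str] = []              # lines collected for the current section

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            sections.append({
                "headings": [title for _, title in path],
                "content": text,
            })
        body.clear()

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Leave any heading at the same or a deeper level before entering the new one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return sections
```

Each returned dict carries the section text plus the headings above it, so the heading trail can be stored as metadata next to the embedding.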

Basic PDF to Markdown

The simplest approach uses pdf_oxide:

basic_convert.py
from pdf_oxide import PdfDocument
doc = PdfDocument("research_paper.pdf")
# Convert single page with heading detection
markdown = doc.to_markdown(0, detect_headings=True)
print(markdown)
# Convert all pages
all_markdown = doc.to_markdown_all(detect_headings=True)
with open("output.md", "w") as f:
    f.write(all_markdown)

Advanced Options

For better-quality output, use these options:

advanced_convert.py
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
markdown = doc.to_markdown(
    0,
    detect_headings=True,  # Auto-detect H1/H2/H3
    extract_tables=True,   # Format tables as markdown
    preserve_layout=True,  # Keep column alignment
    include_images=True    # Include image references
)

Heading Detection

Most PDFs don’t carry semantic heading markup, so the library infers heading levels from font size:

| Font Size | Heading Level |
|-----------|---------------|
| 24pt+ | H1 |
| 18-23pt | H2 |
| 14-17pt | H3 |
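
To make the thresholds concrete, here is a small sketch of the mapping. It is an illustration of the size-to-level rule above, not pdf_oxide's actual implementation; `heading_level` and `to_markdown_line` are hypothetical names, and treating fractional sizes such as 23.5pt as the lower level is my own assumption.

```python
from typing import Optional


def heading_level(font_size_pt: float) -> Optional[int]:
    """Map a font size in points to a heading level, or None for body text."""
    if font_size_pt >= 24:
        return 1
    if font_size_pt >= 18:
        return 2
    if font_size_pt >= 14:
        return 3
    return None


def to_markdown_line(text: str, font_size_pt: float) -> str:
    """Prefix a line with '#' markers when its font size qualifies as a heading."""
    level = heading_level(font_size_pt)
    return f"{'#' * level} {text}" if level else text
```

A real detector would likely also look at font weight and at the document's dominant body size, since fixed point thresholds break on PDFs set in unusually large or small type.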

Table Extraction

Grid-aligned text becomes proper markdown tables:

table_extract.py
markdown = doc.to_markdown(0, extract_tables=True)
# Output:
# | Name | Age | City |
# |-------|-----|------|
# | Alice | 30 | NYC |
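
For intuition, the conversion can be pictured in two steps: split each grid-aligned line into cells, then render the cells with pipe syntax. The sketch below is a simplified illustration and not how pdf_oxide works internally; both helper names are hypothetical, and it assumes columns are separated by at least two spaces.

```python
import re


def grid_text_to_rows(text: str) -> list[list[str]]:
    """Split grid-aligned text into rows of cells (columns separated by 2+ spaces)."""
    return [
        re.split(r"\s{2,}", line.strip())
        for line in text.splitlines()
        if line.strip()
    ]


def rows_to_markdown_table(rows: list[list[str]]) -> str:
    """Render rows of cells as a pipe table; the first row is the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    # Pad short rows and escape pipes so cell text cannot break the table.
    padded = [
        [cell.strip().replace("|", "\\|") for cell in row] + [""] * (width - len(row))
        for row in rows
    ]
    header, *body = padded
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join(["---"] * width) + "|",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)
```

Calling `rows_to_markdown_table(grid_text_to_rows(page_text))` on a block of aligned text yields a table in the same shape as the output above.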

LangChain Integration

Create a custom document loader:

langchain_loader.py
from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document
from pdf_oxide import PdfDocument
class PdfOxideLoader(BaseLoader):
    """Load PDFs using pdf_oxide for RAG pipelines."""
    def __init__(self, file_path: str, detect_headings: bool = True):
        self.file_path = file_path
        self.detect_headings = detect_headings
    def load(self) -> list[Document]:
        doc = PdfDocument(self.file_path)
        documents = []
        for i in range(doc.page_count):
            markdown = doc.to_markdown(i, detect_headings=self.detect_headings)
            documents.append(Document(
                page_content=markdown,
                metadata={"source": self.file_path, "page": i}
            ))
        return documents
# Usage
loader = PdfOxideLoader("paper.pdf", detect_headings=True)
docs = loader.load()

Batch Processing for Corpora

Process entire directories:

batch_convert.py
from pathlib import Path
from pdf_oxide import PdfDocument
from concurrent.futures import ThreadPoolExecutor
def process_for_rag(pdf_path: Path) -> list[dict]:
    """Convert PDF to RAG-ready documents."""
    doc = PdfDocument(str(pdf_path))
    chunks = []
    for i in range(doc.page_count):
        markdown = doc.to_markdown(i, detect_headings=True, extract_tables=True)
        chunks.append({
            "content": markdown,
            "source": str(pdf_path),
            "page": i
        })
    return chunks
# Process directory
pdf_files = list(Path("corpus/").glob("*.pdf"))
with ThreadPoolExecutor(max_workers=8) as executor:
    all_chunks = []
    for chunks in executor.map(process_for_rag, pdf_files):
        all_chunks.extend(chunks)
print(f"Processed {len(all_chunks)} document chunks")
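
With `executor.map`, the first PDF that raises an exception stops the whole loop, which is painful on a large corpus with a few corrupt files. One possible variation, sketched under the assumption that a failed file should be recorded and skipped: `convert_corpus` is a hypothetical helper that takes any per-file converter (for example `process_for_rag` from above) and returns the chunks together with the failures.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable


def convert_corpus(
    pdf_files: list[Path],
    convert: Callable[[Path], list[dict]],
    max_workers: int = 8,
) -> tuple[list[dict], dict[str, str]]:
    """Run `convert` on every file; return (chunks, failures keyed by path)."""
    chunks: list[dict] = []
    failures: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(convert, path): path for path in pdf_files}
        for future in as_completed(futures):
            path = futures[future]
            try:
                chunks.extend(future.result())
            except Exception as exc:  # one bad PDF should not abort the run
                failures[str(path)] = repr(exc)
    return chunks, failures
```

Usage would look like `all_chunks, failed = convert_corpus(pdf_files, process_for_rag)`. Note that chunks arrive in completion order rather than input order, so sort by `source` and `page` afterwards if order matters.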

Quality Comparison

| Tool | Speed | Quality | Tables |
|------|-------|---------|--------|
| pdf_oxide | 0.8ms | High | Yes |
| pymupdf4llm | 55ms | Medium | No |
| markitdown | 108ms | Low | No |
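
Timings like these depend heavily on the documents and the hardware, so it is worth measuring on your own corpus. A minimal timing sketch follows; `time_converter` is a hypothetical helper, and it assumes each tool is wrapped in a function that takes a file path and returns Markdown.

```python
import statistics
import time
from typing import Callable


def time_converter(
    convert: Callable[[str], str], pdf_path: str, repeats: int = 5
) -> float:
    """Return the median wall-clock time of convert(pdf_path) in milliseconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        convert(pdf_path)
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)
```

The median is used instead of the mean so that a single slow first run (cold file cache, lazy imports) does not dominate the result.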

Summary

In this post, I showed how to convert PDFs to Markdown for RAG pipelines. The key point is using proper heading detection and table extraction for clean LLM input.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
