How to Convert PDFs to Markdown for RAG Pipelines
Building a RAG pipeline? You need to convert PDFs to clean Markdown for embedding. Most tools are slow, produce poor-quality output, or can’t handle structured content like tables and headings.
I’ve been there. You find a PDF with important data, convert it to text, and end up with a jumbled mess. Headings get lost in the middle of paragraphs, tables become unreadable grid text, and images disappear entirely.
Here’s how to properly convert PDFs to Markdown for RAG pipelines.
Why Markdown is ideal for RAG
Markdown gives you the structure LLMs need without the token overhead of HTML or XML. Headings (#, ##) create clear document boundaries. Tables use the pipe syntax LLMs recognize naturally.
# Annual Report 2024
## Financial Overview
| Metric | Q1 | Q2 | Q3 | Q4 |
|--------|----|----|----|----|
| Revenue | $1.2M | $1.5M | $1.8M | $2.1M |
| Growth | 15% | 25% | 20% | 16% |
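Because headings mark section boundaries, they also make natural split points when chunking for embedding. Here is a minimal sketch of that idea in plain Python. The function name `split_by_headings` and the chunk dictionary layout are my own illustration, not part of any library, and the sketch assumes the Markdown contains no fenced code blocks (a `# comment` line inside one would be mistaken for a heading).

```python
import re

# Matches ATX headings such as "# Title" or "## Section" (levels 1-6).
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into one chunk per heading, keeping the heading trail."""
    chunks = []
    trail = []  # headings currently in scope, outermost first
    body = []   # lines collected for the current chunk

    def flush():
        # Emit the collected lines as a chunk, skipping empty sections.
        text = "\n".join(body).strip()
        if text:
            chunks.append({"headings": list(trail), "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level, then add this one.
            del trail[level - 1:]
            trail.append(match.group(2))
        else:
            body.append(line)
    flush()
    return chunks


sample = (
    "# Annual Report 2024\n"
    "Intro text.\n"
    "## Financial Overview\n"
    "| Metric | Q1 |\n"
    "|---|---|\n"
    "| Revenue | $1.2M |\n"
)
for chunk in split_by_headings(sample):
    print(" > ".join(chunk["headings"]), "::", len(chunk["text"]), "chars")
```

Storing the heading trail with each chunk means a table embedded on its own still carries the context of the section it came from.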
Basic PDF to Markdown
The simplest approach uses pdf_oxide:
from pdf_oxide import PdfDocument
doc = PdfDocument("research_paper.pdf")
# Convert single page with heading detection
markdown = doc.to_markdown(0, detect_headings=True)
print(markdown)

# Convert all pages
all_markdown = doc.to_markdown_all(detect_headings=True)
with open("output.md", "w") as f:
    f.write(all_markdown)

Advanced Options
For better quality output, use these options:
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
markdown = doc.to_markdown(
    0,
    detect_headings=True,   # Auto-detect H1/H2/H3
    extract_tables=True,    # Format tables as markdown
    preserve_layout=True,   # Keep column alignment
    include_images=True,    # Include image references
)

Heading Detection
PDFs typically store positioned text rather than semantic headings, so the library infers heading levels from font size:
| Font Size | Heading Level |
|---|---|
| 24pt+ | H1 |
| 18-23pt | H2 |
| 14-17pt | H3 |
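To make the mapping concrete, here is a small sketch that applies the thresholds from the table above. It only illustrates the rule and is not pdf_oxide's actual implementation. The function names are hypothetical, and I'm assuming fractional sizes fall into the band below them (17.5pt counts as H3) and that anything under 14pt is body text.

```python
from typing import Optional


def heading_level(font_size_pt: float) -> Optional[int]:
    """Map a font size in points to a heading level, or None for body text."""
    if font_size_pt >= 24:
        return 1
    if font_size_pt >= 18:
        return 2
    if font_size_pt >= 14:
        return 3
    return None  # assumed: below 14pt is ordinary body text


def to_markdown_line(text: str, font_size_pt: float) -> str:
    """Prefix the text with '#' markers when its size qualifies as a heading."""
    level = heading_level(font_size_pt)
    return f"{'#' * level} {text}" if level else text


print(to_markdown_line("Annual Report 2024", 26))
print(to_markdown_line("Financial Overview", 18))
print(to_markdown_line("Revenue grew every quarter.", 11))
```

A rule based on absolute sizes is easy to reason about, but a document set entirely in large type would need thresholds relative to its body font instead.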
Table Extraction
Grid-aligned text becomes proper markdown tables:
markdown = doc.to_markdown(0, extract_tables=True)
# Output:
# | Name  | Age | City |
# |-------|-----|------|
# | Alice | 30  | NYC  |

LangChain Integration
Create a custom document loader:
from langchain.document_loaders import BaseLoader
from langchain.schema import Document
from pdf_oxide import PdfDocument


class PdfOxideLoader(BaseLoader):
    """Load PDFs using pdf_oxide for RAG pipelines."""

    def __init__(self, file_path: str, detect_headings: bool = True):
        self.file_path = file_path
        self.detect_headings = detect_headings

    def load(self) -> list[Document]:
        doc = PdfDocument(self.file_path)
        documents = []

        for i in range(doc.page_count):
            markdown = doc.to_markdown(i, detect_headings=self.detect_headings)
            documents.append(Document(
                page_content=markdown,
                metadata={"source": self.file_path, "page": i}
            ))

        return documents


# Usage
loader = PdfOxideLoader("paper.pdf", detect_headings=True)
docs = loader.load()

Batch Processing for Corpora
Process entire directories:
from pathlib import Path
from pdf_oxide import PdfDocument
from concurrent.futures import ThreadPoolExecutor


def process_for_rag(pdf_path: Path) -> list[dict]:
    """Convert PDF to RAG-ready documents."""
    doc = PdfDocument(str(pdf_path))
    chunks = []

    for i in range(doc.page_count):
        markdown = doc.to_markdown(i, detect_headings=True, extract_tables=True)
        chunks.append({
            "content": markdown,
            "source": str(pdf_path),
            "page": i
        })

    return chunks


# Process directory
pdf_files = list(Path("corpus/").glob("*.pdf"))

with ThreadPoolExecutor(max_workers=8) as executor:
    all_chunks = []
    for chunks in executor.map(process_for_rag, pdf_files):
        all_chunks.extend(chunks)

print(f"Processed {len(all_chunks)} document chunks")

Quality Comparison
| Tool | Speed | Quality | Tables |
|---|---|---|---|
| pdf_oxide | 0.8ms | High | Yes |
| pymupdf4llm | 55ms | Medium | No |
| markitdown | 108ms | Low | No |
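Timings like these depend heavily on the document and the machine, so it is worth measuring on your own files. Below is a minimal timing harness, a sketch rather than the method used for the table above. The helper `time_per_call_ms` and the stand-in `fake_convert` are hypothetical names; the stand-in exists only so the example runs without any PDF library installed.

```python
import statistics
import time
from typing import Callable


def time_per_call_ms(convert: Callable[[], object], repeats: int = 20) -> float:
    """Return the median wall-clock time of convert() in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        convert()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-in workload (assumption): builds some table rows instead of reading a PDF.
def fake_convert() -> str:
    return "\n".join(f"| row {i} | value |" for i in range(1000))


elapsed_ms = time_per_call_ms(fake_convert)
print(f"median: {elapsed_ms:.3f} ms per call")
```

To time a real conversion, swap `fake_convert` for a callable that wraps the call shown earlier, for example `lambda: doc.to_markdown(0, detect_headings=True)`. Taking the median over several runs keeps a single slow run from skewing the result.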
Summary
In this post, I showed how to convert PDFs to Markdown for RAG pipelines. The key point is using proper heading detection and table extraction for clean LLM input.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!