How to Convert PDFs to Markdown for RAG Pipelines

Mar 4, 2026

Building a RAG pipeline? You need to convert PDFs to clean Markdown before embedding. Most tools are slow, produce poor-quality output, or can’t handle complex layouts with tables and multi-level headings.

I’ve been there. You find a PDF with important data, convert it to text, and end up with a jumbled mess. Headings get lost in the middle of paragraphs, tables become unreadable grid text, and images disappear entirely.

Here’s how to properly convert PDFs to Markdown for RAG pipelines.

Why Markdown is ideal for RAG

Markdown gives you the structure LLMs need without the token overhead of HTML or XML. Headings (#, ##) create clear document boundaries. Tables use the pipe syntax LLMs recognize naturally.

# Annual Report 2024

## Financial Overview

| Metric | Q1 | Q2 | Q3 | Q4 |
|--------|----|----|----|----|
| Revenue | $1.2M | $1.5M | $1.8M | $2.1M |
| Growth | 15% | 25% | 20% | 16% |

![Chart: Revenue Growth](img_chart_001.png)
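
To make the "clear document boundaries" point concrete, here is a minimal sketch of heading-based chunking in plain Python. It is not part of pdf_oxide; the function name `split_by_headings` and the chunk dictionary layout are my own assumptions for illustration, and it does not try to skip `#` lines that sit inside fenced code blocks.

```python
import re

# Matches ATX headings such as "# Title" or "## Section".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into one chunk per heading section (illustrative sketch)."""
    sections = []
    current = {"heading": None, "level": 0, "lines": []}

    def has_content(section: dict) -> bool:
        # Keep a section if it has a heading or any non-blank text.
        return section["heading"] is not None or any(
            line.strip() for line in section["lines"]
        )

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            # A new heading closes the section collected so far.
            if has_content(current):
                sections.append(current)
            current = {
                "heading": match.group(2).strip(),
                "level": len(match.group(1)),
                "lines": [],
            }
        else:
            current["lines"].append(line)

    if has_content(current):
        sections.append(current)

    return [
        {
            "heading": s["heading"],
            "level": s["level"],
            "text": "\n".join(s["lines"]).strip(),
        }
        for s in sections
    ]
```

Each chunk keeps its heading and level, so you can store them as metadata next to the embedding and show the section title when a chunk is retrieved.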

Basic PDF to Markdown

The simplest approach uses pdf_oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("research_paper.pdf")

# Convert single page with heading detection
markdown = doc.to_markdown(0, detect_headings=True)
print(markdown)

# Convert all pages
all_markdown = doc.to_markdown_all(detect_headings=True)
with open("output.md", "w", encoding="utf-8") as f:
    f.write(all_markdown)

Advanced Options

For better quality output, use these options:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")

markdown = doc.to_markdown(
    0,
    detect_headings=True,    # Auto-detect H1/H2/H3
    extract_tables=True,     # Format tables as markdown
    preserve_layout=True,    # Keep column alignment
    include_images=True      # Include image references
)

Heading Detection

Most PDFs don’t carry semantic heading markup (untagged PDFs store only positioned text and font information), so the library infers headings from font size:

Font Size	Heading Level
24pt+	H1
18-23pt	H2
14-17pt	H3
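
The mapping in the table can be sketched in a few lines of Python. This only illustrates the thresholds above, it is not pdf_oxide's internal implementation. The function names are hypothetical, and I'm assuming each range's lower bound acts as the cutoff, so a size like 17.5pt falls into H3.

```python
from typing import Optional


def heading_level(font_size_pt: float) -> Optional[int]:
    """Map a font size in points to a heading level, or None for body text."""
    if font_size_pt >= 24:
        return 1
    if font_size_pt >= 18:
        return 2
    if font_size_pt >= 14:
        return 3
    return None


def to_heading(text: str, font_size_pt: float) -> str:
    """Prefix text with Markdown '#' markers when its size suggests a heading."""
    level = heading_level(font_size_pt)
    return f"{'#' * level} {text}" if level else text
```

Keep in mind that fixed thresholds are a heuristic: a document set entirely in 12pt with bold section titles would produce no headings at all under this rule.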

Table Extraction

Grid-aligned text becomes proper markdown tables:

markdown = doc.to_markdown(0, extract_tables=True)
# Output:
# | Name  | Age | City |
# |-------|-----|------|
# | Alice | 30  | NYC  |
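
Once a table is in pipe syntax, it is easy to read back into structured rows, for example to sanity-check an extraction or to embed each row separately. The helper below is a plain-Python sketch of my own (`parse_markdown_table` is not a pdf_oxide function), and it assumes cells contain no escaped pipe characters.

```python
def parse_markdown_table(table: str) -> list[dict[str, str]]:
    """Parse a pipe-style Markdown table into a list of row dictionaries."""
    rows = []
    for line in table.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore anything that is not a table row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the separator row, e.g. |-------|-----|------|
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)

    if not rows:
        return []

    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]
```

For the output shown above, this yields a single row with the keys Name, Age and City.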

LangChain Integration

Create a custom document loader:

from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document
from pdf_oxide import PdfDocument

class PdfOxideLoader(BaseLoader):
    """Load PDFs using pdf_oxide for RAG pipelines."""

    def __init__(self, file_path: str, detect_headings: bool = True):
        self.file_path = file_path
        self.detect_headings = detect_headings

    def load(self) -> list[Document]:
        doc = PdfDocument(self.file_path)
        documents = []

        for i in range(doc.page_count):
            markdown = doc.to_markdown(i, detect_headings=self.detect_headings)
            documents.append(Document(
                page_content=markdown,
                metadata={"source": self.file_path, "page": i}
            ))

        return documents

# Usage
loader = PdfOxideLoader("paper.pdf", detect_headings=True)
docs = loader.load()

Batch Processing for Corpora

Process entire directories:

from pathlib import Path
from pdf_oxide import PdfDocument
from concurrent.futures import ThreadPoolExecutor

def process_for_rag(pdf_path: Path) -> list[dict]:
    """Convert PDF to RAG-ready documents."""
    doc = PdfDocument(str(pdf_path))
    chunks = []

    for i in range(doc.page_count):
        markdown = doc.to_markdown(i, detect_headings=True, extract_tables=True)
        chunks.append({
            "content": markdown,
            "source": str(pdf_path),
            "page": i
        })

    return chunks

# Process directory
pdf_files = list(Path("corpus/").glob("*.pdf"))

with ThreadPoolExecutor(max_workers=8) as executor:
    all_chunks = []
    for chunks in executor.map(process_for_rag, pdf_files):
        all_chunks.extend(chunks)

print(f"Processed {len(all_chunks)} document chunks")
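
One thing to watch with `executor.map`: if a single PDF raises an exception, iterating over the results re-raises it and the rest of the batch is lost. Here is a hedged sketch of a more forgiving variant using `as_completed`. The helper name `convert_all` and its return shape are my own choices, and `convert` can be any function that takes a path and returns a list of chunk dictionaries.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable


def convert_all(
    paths: list[Path],
    convert: Callable[[Path], list[dict]],
    max_workers: int = 8,
) -> tuple[list[dict], dict[Path, Exception]]:
    """Run convert over all paths, collecting failures instead of aborting."""
    chunks: list[dict] = []
    failures: dict[Path, Exception] = {}

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which future belongs to which file.
        futures = {executor.submit(convert, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                chunks.extend(future.result())
            except Exception as exc:  # one bad file should not stop the batch
                failures[path] = exc

    return chunks, failures
```

Calling `convert_all(pdf_files, process_for_rag)` would then give you the chunks plus a dictionary of files to inspect by hand. Note that chunks arrive in completion order, not in the order of `pdf_files`.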

Quality Comparison

Tool	Speed	Quality	Tables
pdf_oxide	0.8ms	High	Yes
pymupdf4llm	55ms	Medium	No
markitdown	108ms	Low	No
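
Timings like these depend heavily on the documents and the machine, so it is worth measuring on your own corpus. Below is a small, generic timing helper as a sketch. `time_call` is a hypothetical name, and it works with any callable, so you can wrap whichever converter you want to compare.

```python
import statistics
import time


def time_call(func, *args, repeats: int = 5, **kwargs) -> dict:
    """Call func several times and report wall-clock timings in milliseconds."""
    timings_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "median_ms": statistics.median(timings_ms),
        "min_ms": min(timings_ms),
        "runs": repeats,
    }
```

The median is usually the more honest number to report, since the first run often includes one-off costs such as file caching.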

Summary

In this post, I showed how to convert PDFs to Markdown for RAG pipelines. The key point is using proper heading detection and table extraction for clean LLM input.

Final Words + More Resources

My intention with this article was to share my knowledge and experience in a way that helps others. If you want to get in touch, you can reach me by email.

Here are the most important links from this article, along with some further resources on this topic:

👨‍💻 pdf-oxide Documentation
👨‍💻 LangChain Documentation

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!