
How to Convert PDF to Markdown in Python: MarkItDown Guide

Purpose

PDFs are everywhere. Reports, documentation, research papers - they all come as PDFs. But when I needed to feed PDF content into an LLM for analysis, I hit a wall. LLMs don’t process PDFs directly. I needed to convert them to text, but simple text extraction lost all the structure - tables became meaningless blobs, headings disappeared, and the document hierarchy vanished.

I found MarkItDown. It converts PDFs to Markdown, preserving the document structure while producing output that’s ready for LLM processing. This post shows you how to use it.

Install PDF Support

MarkItDown requires extra dependencies for PDF processing. Install them with:

install-pdf.sh
pip install 'markitdown[pdf]'

This pulls in two key libraries:

  • pdfplumber: Primary extraction engine for text and tables
  • pdfminer.six: Fallback for edge cases

Basic PDF Conversion

Once installed, converting a PDF is straightforward:

basic_conversion.py
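from markitdown import MarkItDown

# One converter instance can be reused for many files
md = MarkItDown()

# convert() returns a result object; the Markdown is in text_content
result = md.convert("report.pdf")
print(result.text_content)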

When I ran this on a sample quarterly report, here’s what I got:

sample_output.md
# Q4 Financial Report
## Executive Summary
Total revenue increased by 15% compared to Q3, driven primarily by
enterprise software sales.
| Metric | Q3 | Q4 | Change |
|----------------|---------|---------|--------|
| Revenue | $2.1M | $2.4M | +15% |
| Customers | 1,200 | 1,450 | +21% |
| Retention | 94% | 96% | +2% |
## Key Highlights
- Launched new API v3.0
- Expanded to 5 new markets
- Reduced infrastructure costs by 30%

The table structure survived. The headings are proper Markdown headers. This is exactly what I needed for my LLM pipeline.
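
Because the headings come out as real Markdown headers, the output can be split into sections before it goes to an LLM. Here is a minimal sketch of that idea. The function name `split_by_heading` and the returned dictionary keys are my own choices for illustration and are not part of MarkItDown. It also assumes the text has no `#` lines inside fenced code blocks, which it would mistake for headings.

```python
import re

# Matches ATX-style Markdown headings such as "## Executive Summary"
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_heading(markdown: str) -> list[dict]:
    """Split Markdown text into sections, one per heading (hypothetical helper)."""
    sections = []
    current = {"level": 0, "title": "", "body": []}

    def flush(section):
        # Keep a section if it has a title or any non-blank body line
        if section["title"] or any(line.strip() for line in section["body"]):
            sections.append({
                "level": section["level"],
                "title": section["title"],
                "text": "\n".join(section["body"]).strip(),
            })

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # A new heading closes the section collected so far
            flush(current)
            current = {
                "level": len(match.group(1)),
                "title": match.group(2).strip(),
                "body": [],
            }
        else:
            current["body"].append(line)

    flush(current)
    return sections
```

Each section keeps its table rows and bullet points together with the heading they belong to, which is usually a more useful chunk boundary than a fixed character count.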

CLI Usage

For quick conversions, use the command line:

cli_usage.sh
# Convert and save to file
markitdown report.pdf -o report.md
# Pipe to another tool
markitdown report.pdf | grep "Summary"
# Preview in terminal
markitdown report.pdf | head -50
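
For more than a handful of files, the same conversion can be driven from Python instead of the shell. The sketch below converts every PDF in a folder and writes one `.md` file per input. The helper names `markitdown_convert` and `convert_directory` are hypothetical, and the converter is passed in as a plain function so the loop can be tried without MarkItDown installed.

```python
from pathlib import Path
from typing import Callable


def markitdown_convert(pdf_path: Path) -> str:
    """Convert one PDF with MarkItDown and return its Markdown text."""
    # Imported lazily so convert_directory() can be used with another converter
    from markitdown import MarkItDown

    return MarkItDown().convert(str(pdf_path)).text_content


def convert_directory(
    src: Path,
    dst: Path,
    convert: Callable[[Path], str] = markitdown_convert,
) -> list[Path]:
    """Convert every *.pdf directly inside src and write <name>.md into dst."""
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(src.glob("*.pdf")):
        out_path = dst / (pdf_path.stem + ".md")
        out_path.write_text(convert(pdf_path), encoding="utf-8")
        written.append(out_path)
    return written
```

Calling `convert_directory(Path("pdfs"), Path("markdown"))` would process a folder named `pdfs`; both folder names are placeholders.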

Comparison with Alternatives

I tried several PDF tools before settling on MarkItDown. Here’s how they compare:

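| Tool       | Output Format   | Tables    | Structure | LLM-Ready |
|------------|-----------------|-----------|-----------|-----------|
| MarkItDown | Markdown        | Preserved | Yes       | Yes       |
| PyPDF2     | Plain text      | Lost      | No        | No        |
| pdfplumber | Structured text | Yes       | Partial   | No        |
| Adobe API  | Various         | Yes       | Yes       | No        |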

PyPDF2 extracts text but loses structure. pdfplumber preserves tables but outputs raw structured text, not Markdown. Adobe’s API works well but adds a paid dependency. MarkItDown gives me Markdown output that’s immediately usable in my RAG pipeline.

Handle Scanned PDFs

Scanned documents have no text layer, so basic extraction returns little or nothing, and PDFs with complex layouts can come out scrambled. For these cases, MarkItDown integrates with Azure Document Intelligence for OCR:

azure_ocr.py
import os
from markitdown import MarkItDown
md = MarkItDown(
    docintel_endpoint=os.environ.get("AZURE_DOCINTEL_ENDPOINT")
)
result = md.convert("scanned-invoice.pdf")
print(result.text_content)

You can also trigger this from the CLI:

azure_cli.sh
markitdown scanned.pdf -d -e "https://your-resource.cognitiveservices.azure.com/"

The -d flag enables Document Intelligence, and -e specifies your Azure endpoint. Note that you’ll need an Azure subscription and a Document Intelligence resource for this to work.

Common Issues

Missing Text After Conversion

If you get empty output or missing text, check your installation:

check_install.sh
# pip show does not list extras, so check for a PDF dependency directly
pip show pdfminer.six
# Reinstall if needed
pip install --force-reinstall 'markitdown[pdf]'
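
The same check can be done from inside Python, which helps when a script runs in a different virtual environment than the shell. This is a small sketch, assuming the import names `pdfminer` (for pdfminer.six) and `pdfplumber`; the function name `missing_modules` is made up for this example.

```python
from importlib.util import find_spec


def missing_modules(names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be imported here."""
    return [name for name in names if find_spec(name) is None]


# Import names assumed for the two PDF libraries mentioned earlier
PDF_MODULES = ["pdfminer", "pdfplumber"]

missing = missing_modules(PDF_MODULES)
if missing:
    print("Missing PDF dependencies:", ", ".join(missing))
else:
    print("PDF dependencies found")
```

If anything is reported missing, the reinstall command above is the first thing to try.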

Tables Not Rendering Correctly

For complex tables, pdfplumber might struggle. Try the Azure Document Intelligence route for better extraction, especially with merged cells or nested tables.

Encoding Issues

Some PDFs use custom encoding. If you see garbled characters:

encoding_fix.py
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("document.pdf")
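# Replace anything that cannot be encoded as UTF-8 (such as lone surrogates)
# so printing or saving does not fail. This does not repair text that was
# extracted incorrectly from fonts with custom encodings.
text = result.text_content.encode('utf-8', errors='replace').decode('utf-8')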
print(text)
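
A related cleanup that sometimes helps is Unicode normalization. Extracted text can contain ligature characters such as "ﬁ" and stray control characters such as form feeds. The sketch below is one possible approach, not something MarkItDown does for you, and the function name `clean_extracted_text` is hypothetical. Note that NFKC normalization is lossy: it also rewrites characters like superscripts and non-breaking spaces.

```python
import unicodedata


def clean_extracted_text(text: str) -> str:
    """Fold compatibility characters and drop stray control characters."""
    # NFKC turns compatibility forms such as the "ﬁ" ligature into plain letters
    normalized = unicodedata.normalize("NFKC", text)
    # Keep newlines and tabs, drop every other control character (category Cc)
    return "".join(
        ch for ch in normalized
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
```

This would be applied to `result.text_content` before the text is saved or embedded.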

When to Use What

  • Standard PDFs with text: Use basic MarkItDown conversion
  • Scanned documents: Use Azure Document Intelligence integration
  • Complex layouts: Consider Azure for better structure detection
  • High-volume processing: The basic conversion is fast and works offline
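
The list above can be turned into a simple routing rule: try the basic conversion first, and only call the Azure-backed converter when the result is nearly empty. This is a rough sketch under my own assumptions. The names `looks_empty` and `convert_with_fallback` and the 20-character threshold are arbitrary choices, and the two converters are passed in as objects with a `convert()` method, which is how `MarkItDown` instances are used earlier in this post.

```python
def looks_empty(text: str, min_chars: int = 20) -> bool:
    """True when the text has fewer than min_chars non-whitespace characters."""
    return len("".join(text.split())) < min_chars


def convert_with_fallback(path: str, basic, ocr=None, min_chars: int = 20) -> str:
    """Use the basic converter, falling back to the OCR one for near-empty output."""
    text = basic.convert(path).text_content or ""
    if ocr is not None and looks_empty(text, min_chars):
        # Likely a scanned PDF with no text layer, so try the OCR route
        text = ocr.convert(path).text_content or ""
    return text


# Example wiring (commented out because it needs markitdown and an Azure endpoint):
# import os
# from markitdown import MarkItDown
# basic = MarkItDown()
# ocr = MarkItDown(docintel_endpoint=os.environ["AZURE_DOCINTEL_ENDPOINT"])
# print(convert_with_fallback("scanned-invoice.pdf", basic, ocr))
```

Keeping the fallback optional means the fast offline path stays the default, and the paid service is only used for documents that need it.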

Final Thoughts

PDF to Markdown conversion shouldn’t be hard. MarkItDown makes it simple - install the PDF extras, call convert(), and you get structured Markdown output. The preserved tables and headings make the output immediately usable in LLM pipelines.

For my RAG application, this was exactly what I needed. No more manual text cleanup, no more lost table structures. Just clean Markdown ready for embedding.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.
