How to Convert PDF to Markdown in Python: MarkItDown Guide
Purpose
PDFs are everywhere. Reports, documentation, research papers - they all come as PDFs. But when I needed to feed PDF content into an LLM for analysis, I hit a wall. LLMs don’t process PDFs directly. I needed to convert them to text, but simple text extraction lost all the structure - tables became meaningless blobs, headings disappeared, and the document hierarchy vanished.
I found MarkItDown. It converts PDFs to Markdown, preserving the document structure while producing output that’s ready for LLM processing. This post shows you how to use it.
Install PDF Support
MarkItDown requires extra dependencies for PDF processing. Install them with:
pip install 'markitdown[pdf]'
This pulls in two key libraries:
- pdfplumber: Primary extraction engine for text and tables
- pdfminer.six: Fallback for edge cases
Basic PDF Conversion
Once installed, converting a PDF is straightforward:
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("report.pdf")
print(result.text_content)
When I ran this on a sample quarterly report, here’s what I got:
# Q4 Financial Report
## Executive Summary
Total revenue increased by 15% compared to Q3, driven primarily by enterprise software sales.
| Metric    | Q3    | Q4    | Change |
|-----------|-------|-------|--------|
| Revenue   | $2.1M | $2.4M | +15%   |
| Customers | 1,200 | 1,450 | +21%   |
| Retention | 94%   | 96%   | +2%    |
## Key Highlights
- Launched new API v3.0
- Expanded to 5 new markets
- Reduced infrastructure costs by 30%
The table structure survived. The headings are proper Markdown headers. This is exactly what I needed for my LLM pipeline.
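To keep the result around rather than just printing it, the conversion can be wrapped in a small helper that writes a .md file next to the PDF. This is a minimal sketch: `pdf_to_markdown_file` and its `converter` parameter are names I made up for illustration; only `MarkItDown().convert()` and `text_content` come from the library.

```python
from pathlib import Path


def pdf_to_markdown_file(pdf_path, converter=None):
    """Convert one PDF and write the Markdown next to it as <name>.md.

    `converter` is a hypothetical injection point: any object with a
    .convert(path) method returning something with .text_content.
    """
    if converter is None:
        # Imported lazily so the helper can be exercised without the library.
        from markitdown import MarkItDown
        converter = MarkItDown()

    pdf_path = Path(pdf_path)
    result = converter.convert(str(pdf_path))
    out_path = pdf_path.with_suffix(".md")
    out_path.write_text(result.text_content, encoding="utf-8")
    return out_path
```

Calling `pdf_to_markdown_file("report.pdf")` should leave a `report.md` in the same folder, assuming the PDF extras are installed.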
CLI Usage
For quick conversions, use the command line:
# Convert and save to file
markitdown report.pdf -o report.md
# Pipe to another tool
markitdown report.pdf | grep "Summary"
# Preview in terminal
markitdown report.pdf | head -50
Comparison with Alternatives
I tried several PDF tools before settling on MarkItDown. Here’s how they compare:
| Tool | Output Format | Tables | Structure | LLM-Ready |
|---|---|---|---|---|
| MarkItDown | Markdown | Preserved | Yes | Yes |
| PyPDF2 | Plain text | Lost | No | No |
| pdfplumber | Structured text | Yes | Partial | No |
| Adobe API | Various | Yes | Yes | No |
PyPDF2 extracts text but loses structure. pdfplumber preserves tables but outputs raw structured text, not Markdown. Adobe’s API works well but adds a paid dependency. MarkItDown gives me Markdown output that’s immediately usable in my RAG pipeline.
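Since the point of Markdown output here is the RAG pipeline, this is one way the headings could be used to split a converted document into chunks before embedding. It is a sketch under my own assumptions: `split_by_headings` is a hypothetical helper, not part of MarkItDown, and it only recognizes `#`-style headings.

```python
import re

# One to six '#' characters, whitespace, then the heading text.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown_text):
    """Split Markdown into (heading, body) chunks at each '#' heading.

    Text before the first heading is returned under an empty heading.
    """
    chunks = []
    heading, body = "", []
    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous chunk, skipping an empty preamble.
            if heading or any(part.strip() for part in body):
                chunks.append((heading, "\n".join(body).strip()))
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    if heading or any(part.strip() for part in body):
        chunks.append((heading, "\n".join(body).strip()))
    return chunks


sample = (
    "# Q4 Financial Report\n"
    "## Executive Summary\n"
    "Revenue grew.\n"
    "## Key Highlights\n"
    "- Launched new API v3.0\n"
)
for title, text in split_by_headings(sample):
    print(repr(title), "->", repr(text))
```

One known gap: a line starting with `#` inside a code block in the converted output would be misread as a heading, so real documents may need a stricter parser.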
Handle Scanned PDFs
For scanned documents or PDFs with complex layouts, basic extraction won’t work. MarkItDown integrates with Azure Document Intelligence for OCR:
import os
from markitdown import MarkItDown
md = MarkItDown(docintel_endpoint=os.environ.get("AZURE_DOCINTEL_ENDPOINT"))
result = md.convert("scanned-invoice.pdf")
print(result.text_content)
You can also trigger this from the CLI:
markitdown scanned.pdf -d -e "https://your-resource.cognitiveservices.azure.com/"
The -d flag enables Document Intelligence, and -e specifies your Azure endpoint. Note that you’ll need an Azure subscription and a Document Intelligence resource for this to work.
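One thing to watch in the Python version: `os.environ.get()` returns `None` when the variable is unset, and as far as I can tell MarkItDown then simply uses its regular converters without complaining. A small guard makes that misconfiguration visible. The helper name `make_docintel_converter` is hypothetical; `AZURE_DOCINTEL_ENDPOINT` is the same variable used in the example above.

```python
import os


def make_docintel_converter(env_var="AZURE_DOCINTEL_ENDPOINT"):
    """Build a MarkItDown instance that uses Document Intelligence.

    Raises RuntimeError if the endpoint variable is missing or empty,
    instead of passing None through to MarkItDown.
    """
    endpoint = os.environ.get(env_var, "").strip()
    if not endpoint:
        raise RuntimeError(
            f"{env_var} is not set; cannot enable Document Intelligence"
        )

    # Imported after the check so a missing endpoint fails fast.
    from markitdown import MarkItDown
    return MarkItDown(docintel_endpoint=endpoint)
```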
Common Issues
Missing Text After Conversion
If you get empty output or missing text, check your installation:
# Verify the PDF libraries are installed (pip show on markitdown itself does not list extras)
pip show pdfminer.six pdfplumber
# Reinstall if needed
pip install --force-reinstall 'markitdown[pdf]'
Tables Not Rendering Correctly
For complex tables, pdfplumber might struggle. Try the Azure Document Intelligence route for better extraction, especially with merged cells or nested tables.
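Before switching to Azure, it can help to check whether the tables in the output are actually malformed. The sketch below flags Markdown table blocks whose rows have inconsistent column counts. `find_ragged_tables` is a hypothetical helper of mine and the check is only a heuristic (it does not handle escaped pipes inside cells).

```python
def _cells(row):
    """Split a pipe-delimited Markdown row into cell strings."""
    row = row.strip()
    # Drop exactly one leading and one trailing pipe, keeping empty cells.
    if row.startswith("|"):
        row = row[1:]
    if row.endswith("|"):
        row = row[:-1]
    return [cell.strip() for cell in row.split("|")]


def find_ragged_tables(markdown_text):
    """Return (start_line, column_counts) for tables with uneven rows.

    A table is taken to be a run of consecutive lines starting with '|'.
    Line numbers are 1-based.
    """
    ragged, block, start = [], [], 0
    lines = markdown_text.splitlines() + [""]  # sentinel flushes the last block
    for number, line in enumerate(lines, start=1):
        if line.lstrip().startswith("|"):
            if not block:
                start = number
            block.append(line)
            continue
        if block:
            counts = [len(_cells(row)) for row in block]
            if len(set(counts)) > 1:
                ragged.append((start, counts))
            block = []
    return ragged
```

An empty result does not prove the tables are right, only that every row in each table has the same number of cells.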
Encoding Issues
Some PDFs use custom encoding. If you see garbled characters:
from markitdown import MarkItDown
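md = MarkItDown()
result = md.convert("document.pdf")
# Handle encoding explicitly: replace anything that can't be encoded as UTF-8.
# This only swaps out unencodable characters (such as lone surrogates); it
# cannot repair glyphs that the PDF itself maps to the wrong characters.
text = result.text_content.encode('utf-8', errors='replace').decode('utf-8')
print(text)
When to Use What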
- Standard PDFs with text: Use basic MarkItDown conversion
- Scanned documents: Use Azure Document Intelligence integration
- Complex layouts: Consider Azure for better structure detection
- High-volume processing: The basic conversion is fast and works offline
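The choice in the list above can also be made at runtime: try the basic conversion first and only go to Azure when almost no text comes back, which is the usual sign of a scanned PDF. This is a sketch with assumptions of mine: `convert_with_fallback`, the `basic` and `ocr` parameters, and the 200-character threshold are all hypothetical, and both converters are expected to behave like `MarkItDown` instances.

```python
def convert_with_fallback(pdf_path, basic, ocr=None, min_chars=200):
    """Try basic extraction; fall back to `ocr` if the text looks empty.

    `basic` and `ocr` are objects with a .convert(path) method, e.g.
    MarkItDown() and MarkItDown(docintel_endpoint=...). Returns a
    (text, used_ocr) tuple.
    """
    text = basic.convert(pdf_path).text_content or ""
    # Count non-whitespace characters so blank pages don't pass the check.
    if len("".join(text.split())) >= min_chars or ocr is None:
        return text, False
    return ocr.convert(pdf_path).text_content or "", True
```

Passing `ocr=None` keeps everything offline, which matches the high-volume case; the threshold would need tuning for short documents such as one-line invoices.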
Final Thoughts
PDF to Markdown conversion shouldn’t be hard. MarkItDown makes it simple - install the PDF extras, call convert(), and you get structured Markdown output. The preserved tables and headings make the output immediately usable in LLM pipelines.
For my RAG application, this was exactly what I needed. No more manual text cleanup, no more lost table structures. Just clean Markdown ready for embedding.
Final Words + More Resources
My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can contact me by email.