How to Create MarkItDown Plugins: Custom Document Converter Guide
Purpose
I had a proprietary file format at work - internal configuration files with a .conf extension that our tools generated. These files contained structured data in a custom format, and I needed to extract their content as Markdown for documentation. MarkItDown didn’t support this format out of the box, but its plugin architecture let me create a custom converter. This post shows how I built a MarkItDown plugin from scratch, registered it properly, and avoided the common pitfalls along the way.
Understanding the Plugin Architecture
MarkItDown uses a plugin system based on Python entry points. When you install a plugin package, MarkItDown automatically discovers and loads it. The core interface is the DocumentConverter class:

    class DocumentConverter:
        """Abstract superclass of all DocumentConverters."""

        def accepts(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs) -> bool:
            """Return True if this converter can handle the document."""
            raise NotImplementedError()

        def convert(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs) -> DocumentConverterResult:
            """Convert a document to Markdown text."""
            raise NotImplementedError()

Your plugin needs to:
- Subclass DocumentConverter
- Implement accepts() to detect supported formats
- Implement convert() to transform content to Markdown
- Register via the markitdown.plugin entry point
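To make the division of labour between accepts() and convert() concrete, here is a toy dispatch loop. It is a simplified sketch, not MarkItDown's actual implementation: ToyConverter, UpperCaseConverter, and dispatch are hypothetical names I made up for illustration, and the real library passes a StreamInfo object rather than a bare extension string.

```python
import io
from typing import BinaryIO, Optional


class ToyConverter:
    """Hypothetical stand-in for DocumentConverter (illustration only)."""

    def accepts(self, file_stream: BinaryIO, extension: str) -> bool:
        raise NotImplementedError()

    def convert(self, file_stream: BinaryIO, extension: str) -> str:
        raise NotImplementedError()


class UpperCaseConverter(ToyConverter):
    """Accepts .txt streams and returns their text upper-cased."""

    def accepts(self, file_stream: BinaryIO, extension: str) -> bool:
        return extension == ".txt"

    def convert(self, file_stream: BinaryIO, extension: str) -> str:
        return file_stream.read().decode("utf-8").upper()


def dispatch(
    converters: list[ToyConverter], file_stream: BinaryIO, extension: str
) -> Optional[str]:
    """Try each converter in order; the first one that accepts converts."""
    for converter in converters:
        start = file_stream.tell()
        accepted = converter.accepts(file_stream, extension)
        # accepts() is expected to leave the stream where it found it
        assert file_stream.tell() == start, "accepts() moved the stream"
        if accepted:
            return converter.convert(file_stream, extension)
    return None  # no converter claimed the stream


print(dispatch([UpperCaseConverter()], io.BytesIO(b"hello"), ".txt"))
```

The point of the sketch is that accepts() only answers a yes/no question, while convert() is the only method that is supposed to consume the stream.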
Step 1: Create the Converter Class
I started by creating a converter for my custom format. Here’s a simplified example:

    from typing import BinaryIO

    from markitdown import DocumentConverter, DocumentConverterResult
    from markitdown._stream_info import StreamInfo

    MAGIC = b"MYFORMAT1"  # Magic bytes at the start of every .myformat file


    class MyCustomConverter(DocumentConverter):
        """Converter for proprietary .myformat files."""

        def accepts(
            self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs
        ) -> bool:
            """Check if this converter can handle the file."""
            # Check by file extension
            if stream_info.extension == ".myformat":
                return True

            # Check by MIME type
            if stream_info.mimetype == "application/x-myformat":
                return True

            # Check file content (magic bytes); read exactly as many bytes
            # as the magic string, otherwise the comparison can never match
            cur_pos = file_stream.tell()
            header = file_stream.read(len(MAGIC))
            file_stream.seek(cur_pos)  # IMPORTANT: Reset stream position

            return header.startswith(MAGIC)

        def convert(
            self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs
        ) -> DocumentConverterResult:
            """Convert the file to Markdown."""
            content = file_stream.read()

            # Parse and convert your format
            markdown = self._parse_to_markdown(content)

            return DocumentConverterResult(markdown=markdown, title="Custom Document")

        def _parse_to_markdown(self, content: bytes) -> str:
            """Convert binary content to Markdown string."""
            # Your custom parsing logic here
            text = content.decode("utf-8")
            lines = text.split("\n")

            result = []
            for line in lines:
                if line.startswith("TITLE:"):
                    result.append(f"# {line[6:].strip()}\n")
                elif line.startswith("SECTION:"):
                    result.append(f"\n## {line[8:].strip()}\n")
                else:
                    result.append(line)

            return "\n".join(result)

The critical parts are:
- accepts() returns True only for files this converter can handle
- convert() returns a DocumentConverterResult with the Markdown output
- Always reset stream position in accepts() after reading
Step 2: Create the Plugin Registration Function
Next, I created the plugin module that registers converters:

    from markitdown import MarkItDown

    from .converter import MyCustomConverter


    def register_converters(markitdown: MarkItDown, **kwargs):
        """Register converters with the MarkItDown instance.

        This function is called by MarkItDown when loading plugins.
        """
        markitdown.register_converter(
            MyCustomConverter(),
            priority=0.0,  # Same level as built-in converters
        )

The priority parameter controls converter order:
- -1.0: Run before built-in converters (useful for OCR plugins)
- 0.0: Same priority as built-ins (last registered wins)
- 10.0: Fallback if no other converter handles the file
Step 3: Configure Package Entry Point
I created a proper Python package with a pyproject.toml file:

    [build-system]
    requires = ["setuptools>=61.0"]
    build-backend = "setuptools.build_meta"

    [project]
    name = "markitdown-myformat"
    version = "0.1.0"
    description = "MarkItDown plugin for .myformat files"
    requires-python = ">=3.10"
    dependencies = [
        "markitdown>=0.1.0",
    ]

    [project.entry-points."markitdown.plugin"]
    my_plugin = "my_plugin"

The entry point format is:
- Group: markitdown.plugin
- Name: Your plugin identifier (e.g., my_plugin)
- Target: The importable module that defines register_converters (e.g., my_plugin). MarkItDown loads this module and then calls register_converters on it.
Step 4: Package Structure
My final package structure looked like this:

    markitdown-myformat/
    ├── pyproject.toml
    ├── README.md
    ├── src/
    │   └── my_plugin/
    │       ├── __init__.py
    │       └── converter.py
    └── tests/
        └── test_converter.py

Step 5: Install and Test
I installed the plugin in development mode:

    pip install -e .

Then verified the plugin was loaded:

    # List all installed plugins
    markitdown --list-plugins

    # Convert a file using plugins
    markitdown --use-plugins document.myformat

From Python, plugins are disabled by default, so enable them when creating the MarkItDown instance:

    from markitdown import MarkItDown

    md = MarkItDown(enable_plugins=True)
    result = md.convert("document.myformat")
    print(result.text_content)

Priority System Deep Dive
The priority system determines which converter handles a file when multiple converters accept it:
| Priority | Use Case |
|---|---|
| -1.0 | Pre-processing (e.g., OCR before text extraction) |
| 0.0 | Standard converters (built-ins use this) |
| 10.0 | Fallback converters |
For example, the OCR plugin uses -1.0 to process images before other converters:

    # OCR plugin runs first to extract text from images
    markitdown.register_converter(
        OCRConverter(),
        priority=-1.0,
    )

If you want your converter to override built-ins for certain formats, register it with priority 0.0 and make sure it is registered after the built-in. MarkItDown uses “last registered wins” among converters at the same priority, so load order matters.
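The ordering rule can be sketched as a toy model: lower priority values are tried first, and among equal priorities the most recently registered converter comes first. This is only my illustration of the rule as described here, not MarkItDown's internal code; Registration and resolution_order are hypothetical names, and the converter names are placeholders.

```python
from typing import NamedTuple


class Registration(NamedTuple):
    name: str
    priority: float


def resolution_order(registrations: list[Registration]) -> list[str]:
    """Ascending priority; at equal priority, later registrations come first."""
    # Reverse so later registrations lead, then rely on sorted() being stable
    newest_first = list(reversed(registrations))
    return [r.name for r in sorted(newest_first, key=lambda r: r.priority)]


registered = [
    Registration("builtin_text", 10.0),  # fallback
    Registration("builtin_pdf", 0.0),    # standard built-in
    Registration("my_plugin", 0.0),      # registered after the built-ins
    Registration("ocr_plugin", -1.0),    # pre-processing
]
print(resolution_order(registered))
```

In this model my_plugin is consulted before builtin_pdf even though both use 0.0, which is the "last registered wins" behavior in miniature.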
Common Mistakes
I made several mistakes while developing my plugin. Here’s how to avoid them:
1. Forgetting to reset stream position in accepts()

    # WRONG: Stream position not reset
    def accepts(self, file_stream, stream_info, **kwargs):
        header = file_stream.read(8)
        return header.startswith(b"MYFORMAT")  # Bug: convert() gets stream at wrong position!

    # RIGHT: Always reset after reading
    def accepts(self, file_stream, stream_info, **kwargs):
        cur_pos = file_stream.tell()
        header = file_stream.read(8)
        file_stream.seek(cur_pos)  # Reset to original position
        return header.startswith(b"MYFORMAT")

2. Not handling missing dependencies gracefully
If your converter needs optional libraries, handle the import error:

    # WRONG: Hard crash if dependency missing
    import proprietary_lib

    def convert(self, file_stream, stream_info, **kwargs):
        return proprietary_lib.parse(file_stream)

    # RIGHT: Helpful error message
    from markitdown import MissingDependencyException

    def convert(self, file_stream, stream_info, **kwargs):
        try:
            # Import lazily so the plugin still loads without the dependency
            from proprietary_lib import parse
        except ImportError as exc:
            raise MissingDependencyException(
                "Install with: pip install markitdown-myformat[full]"
            ) from exc
        return parse(file_stream)

3. Accepting files you cannot actually convert
Be precise in your accepts() method. Don’t return True for files you can’t handle:

    # WRONG: Too broad, accepts all .json files
    def accepts(self, file_stream, stream_info, **kwargs):
        return stream_info.extension == ".json"

    # RIGHT: Check actual content
    import json

    def accepts(self, file_stream, stream_info, **kwargs):
        if stream_info.extension != ".json":
            return False

        cur_pos = file_stream.tell()
        try:
            data = json.load(file_stream)
        except ValueError:
            # Covers invalid JSON and undecodable bytes
            return False
        finally:
            file_stream.seek(cur_pos)  # Always reset, even on failure

        # Only accept if it has our required structure
        return isinstance(data, dict) and "myformat_version" in data

4. Not returning a proper title
The DocumentConverterResult should include a meaningful title:

    # WRONG: Generic or missing title
    return DocumentConverterResult(markdown=markdown)

    # RIGHT: Extract or generate a title
    return DocumentConverterResult(
        markdown=markdown,
        title=extracted_title or "Custom Document",
    )

Testing Your Plugin
Write tests that use binary streams, not just file paths:

    import io

    from markitdown._stream_info import StreamInfo
    from my_plugin.converter import MyCustomConverter


    def test_converter_accepts():
        converter = MyCustomConverter()
        stream_info = StreamInfo(extension=".myformat", mimetype=None)

        content = b"MYFORMAT1\ntest content"
        stream = io.BytesIO(content)

        assert converter.accepts(stream, stream_info) is True


    def test_converter_rejects():
        converter = MyCustomConverter()
        stream_info = StreamInfo(extension=".pdf", mimetype=None)

        content = b"%PDF-1.4"  # PDF header
        stream = io.BytesIO(content)

        assert converter.accepts(stream, stream_info) is False


    def test_converter_output():
        converter = MyCustomConverter()
        stream_info = StreamInfo(extension=".myformat", mimetype=None)

        content = b"TITLE:Test\nSECTION:Introduction\nHello world"
        stream = io.BytesIO(content)

        result = converter.convert(stream, stream_info)
        assert "# Test" in result.markdown
        assert "## Introduction" in result.markdown

Summary
Creating a MarkItDown plugin involves three main steps: implement the DocumentConverter interface with accepts() and convert() methods, create a registration function, and configure the entry point. The key details are resetting stream positions in accepts(), handling dependencies gracefully, and choosing the right priority for your use case. With this architecture, you can extend MarkItDown to handle any proprietary or custom file format while benefiting from its unified conversion pipeline.
Final Words + More Resources
My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can reach me by email.