How to Create MarkItDown Plugins: Custom Document Converter Guide

Purpose

I had a proprietary file format at work - internal configuration files with a .conf extension that our tools generated. These files contained structured data in a custom format, and I needed to extract their content as Markdown for documentation. MarkItDown didn’t support this format out of the box, but its plugin architecture let me create a custom converter. This post shows how I built a MarkItDown plugin from scratch, registered it properly, and avoided the common pitfalls along the way.

Understanding the Plugin Architecture

MarkItDown uses a plugin system based on Python entry points. When you install a plugin package, MarkItDown discovers it through the markitdown.plugin entry point group and loads it once plugins are enabled. The core interface is the DocumentConverter class:

base-converter.py
class DocumentConverter:
    """Abstract superclass of all DocumentConverters."""

    def accepts(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs) -> bool:
        """Return True if this converter can handle the document."""
        raise NotImplementedError()

    def convert(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs) -> DocumentConverterResult:
        """Convert a document to Markdown text."""
        raise NotImplementedError()

Your plugin needs to:

  1. Subclass DocumentConverter
  2. Implement accepts() to detect supported formats
  3. Implement convert() to transform content to Markdown
  4. Register via the markitdown.plugin entry point

Step 1: Create the Converter Class

I started by creating a converter for my custom format. Here’s a simplified example:

my_plugin/converter.py
from typing import BinaryIO

from markitdown import DocumentConverter, DocumentConverterResult, StreamInfo

MAGIC = b"MYFORMAT1"  # 9-byte signature expected at the start of the file


class MyCustomConverter(DocumentConverter):
    """Converter for proprietary .myformat files."""

    def accepts(
        self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs
    ) -> bool:
        """Check if this converter can handle the file."""
        # Check by file extension
        if stream_info.extension == ".myformat":
            return True
        # Check by MIME type
        if stream_info.mimetype == "application/x-myformat":
            return True
        # Check file content (magic bytes)
        cur_pos = file_stream.tell()
        header = file_stream.read(len(MAGIC))  # read as many bytes as we compare
        file_stream.seek(cur_pos)  # IMPORTANT: Reset stream position
        return header.startswith(MAGIC)

    def convert(
        self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs
    ) -> DocumentConverterResult:
        """Convert the file to Markdown."""
        content = file_stream.read()
        # Parse and convert your format
        markdown = self._parse_to_markdown(content)
        return DocumentConverterResult(
            markdown=markdown,
            title="Custom Document",
        )

    def _parse_to_markdown(self, content: bytes) -> str:
        """Convert binary content to Markdown string."""
        # Your custom parsing logic here
        text = content.decode("utf-8")
        lines = text.split("\n")
        result = []
        for line in lines:
            if line.startswith("TITLE:"):
                result.append(f"# {line[6:].strip()}\n")
            elif line.startswith("SECTION:"):
                result.append(f"\n## {line[8:].strip()}\n")
            else:
                result.append(line)
        return "\n".join(result)

The critical parts are:

  • accepts() returns True only for files this converter can handle
  • convert() returns a DocumentConverterResult with the Markdown output
  • Always reset stream position in accepts() after reading
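To make the parsing step concrete, here is the same TITLE:/SECTION: logic as a standalone function, run on a small sample. The sample input is made up for illustration, and the function is only a copy of the simplified parser above, not a real format specification.

```python
def parse_to_markdown(content: bytes) -> str:
    """Standalone copy of the simplified TITLE:/SECTION: parser."""
    lines = content.decode("utf-8").split("\n")
    result = []
    for line in lines:
        if line.startswith("TITLE:"):
            # Strip the prefix and emit a level-1 heading
            result.append(f"# {line[len('TITLE:'):].strip()}\n")
        elif line.startswith("SECTION:"):
            # Strip the prefix and emit a level-2 heading
            result.append(f"\n## {line[len('SECTION:'):].strip()}\n")
        else:
            # Everything else passes through unchanged
            result.append(line)
    return "\n".join(result)


# Hypothetical sample document
sample = b"TITLE:Test\nSECTION:Introduction\nHello world"
markdown = parse_to_markdown(sample)
print(markdown)
```

Each heading carries its own newline and the pieces are then joined with "\n", so headings end up surrounded by blank lines. Markdown renderers collapse consecutive blank lines, so this is harmless, but keep it in mind when writing exact-match tests.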

Step 2: Create the Plugin Registration Function

Next, I created the plugin module that registers converters:

my_plugin/__init__.py
from markitdown import MarkItDown

from .converter import MyCustomConverter


def register_converters(markitdown: MarkItDown, **kwargs):
    """Register converters with the MarkItDown instance.

    This function is called by MarkItDown when loading plugins.
    """
    markitdown.register_converter(
        MyCustomConverter(),
        priority=0.0,  # Same level as built-in converters
    )

The priority parameter controls the order in which converters are tried. Lower values are tried first:

  • -1.0: Run before built-in converters (useful for OCR plugins)
  • 0.0: Same priority as built-ins (at equal priority, the last registered converter wins)
  • 10.0: Fallback if no other converter handles the file

Step 3: Configure Package Entry Point

I created a proper Python package with a pyproject.toml file:

pyproject.toml
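[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"

[project]
name = "markitdown-myformat"
version = "0.1.0"
description = "MarkItDown plugin for .myformat files"
requires-python = ">=3.10"
dependencies = [
    "markitdown>=0.1.0",
]

[project.entry-points."markitdown.plugin"]
my_plugin = "my_plugin"

The dependency is pinned to markitdown 0.1.0 or later because the stream-based accepts()/convert() interface used here was introduced in that release.

The entry point format is:

  • Group: markitdown.plugin
  • Name: Your plugin identifier (e.g., my_plugin)
  • Target: The importable module that defines register_converters() (e.g., my_plugin). MarkItDown loads this module and calls its register_converters() function, so point the entry at the module, not at module:function.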

Step 4: Package Structure

My final package structure looked like this:

package-structure.txt
markitdown-myformat/
├── pyproject.toml
├── README.md
├── src/
│   └── my_plugin/
│       ├── __init__.py
│       └── converter.py
└── tests/
    └── test_converter.py

Step 5: Install and Test

I installed the plugin in development mode:

install-dev.sh
pip install -e .

Then verified the plugin was loaded:

test-plugin.sh
# List all installed plugins
markitdown --list-plugins
# Convert a file using plugins
markitdown --use-plugins document.myformat

From Python, plugins are disabled by default, so enable them when creating the instance:

test-python.py
from markitdown import MarkItDown

md = MarkItDown(enable_plugins=True)  # plugins stay off unless enabled
result = md.convert("document.myformat")
print(result.text_content)

Priority System Deep Dive

The priority system determines which converter handles a file when multiple converters accept it:

Priority   Use Case
-1.0       Pre-processing (e.g., OCR before text extraction)
0.0        Standard converters (built-ins use this)
10.0       Fallback converters

For example, the OCR plugin uses -1.0 to process images before other converters:

ocr-priority.py
# OCR plugin runs first to extract text from images
markitdown.register_converter(
    OCRConverter(),
    priority=-1.0,
)

If you want your converter to override built-ins for certain formats, register it with priority 0.0 and make sure it is registered after the built-in. At equal priority MarkItDown follows “last registered wins”, so registration order matters.
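The ordering rule is easier to reason about with a toy model. The sketch below is not MarkItDown's implementation. It only reproduces the rule described here (lower priority first, last registered first among ties), and the converter names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Registration:
    name: str
    priority: float


def try_order(registrations: list[Registration]) -> list[str]:
    """Order converters by priority, newest first among equal priorities."""
    # Reverse so later registrations come first, then rely on Python's
    # stable sort to preserve that order within each priority level.
    newest_first = list(reversed(registrations))
    return [r.name for r in sorted(newest_first, key=lambda r: r.priority)]


# Hypothetical registration sequence, in the order the calls were made
registered = [
    Registration("builtin_fallback", 10.0),
    Registration("builtin_docx", 0.0),
    Registration("my_custom", 0.0),
    Registration("ocr", -1.0),
]
order = try_order(registered)
print(order)  # ['ocr', 'my_custom', 'builtin_docx', 'builtin_fallback']
```

In this model my_custom is tried before builtin_docx only because it was registered later at the same priority. To stop depending on registration order, give the converter a slightly lower priority value instead.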

Common Mistakes

I made several mistakes while developing my plugin. Here’s how to avoid them:

1. Forgetting to reset stream position in accepts()

stream-reset.py
# WRONG: Stream position not reset
def accepts(self, file_stream, stream_info, **kwargs):
    header = file_stream.read(8)
    return header.startswith(b"MYFORMAT")
    # Bug: convert() gets stream at wrong position!

# RIGHT: Always reset after reading
def accepts(self, file_stream, stream_info, **kwargs):
    cur_pos = file_stream.tell()
    header = file_stream.read(8)
    file_stream.seek(cur_pos)  # Reset to original position
    return header.startswith(b"MYFORMAT")

2. Not handling missing dependencies gracefully

If your converter needs optional libraries, handle the import error:

optional-deps.py
# WRONG: Hard crash if dependency missing
import proprietary_lib

def convert(self, file_stream, stream_info, **kwargs):
    return proprietary_lib.parse(file_stream)

# RIGHT: Helpful error message
from markitdown import MissingDependencyException

def convert(self, file_stream, stream_info, **kwargs):
    try:
        from proprietary_lib import parse
    except ImportError as exc:
        raise MissingDependencyException(
            "Install with: pip install markitdown-myformat[full]"
        ) from exc  # keep the original ImportError in the traceback
    return parse(file_stream)
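If several converters in the same plugin share the optional library, the lazy import can live in one helper. The sketch below uses only the standard library so it runs without MarkItDown installed. MissingDependencyError and load_parser are hypothetical names, and proprietary_lib stands in for whatever optional library the format needs.

```python
import importlib


class MissingDependencyError(RuntimeError):
    """Hypothetical stand-in for the plugin's 'dependency missing' error."""


def load_parser(module_name: str = "proprietary_lib"):
    """Import the optional parser lazily and explain how to install it."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        raise MissingDependencyError(
            f"{module_name} is required to convert .myformat files. "
            "Install with: pip install markitdown-myformat[full]"
        ) from exc
    return module.parse


message = None
try:
    load_parser()
except MissingDependencyError as err:
    message = str(err)
    print(message)
```

Because the import happens inside the helper, the plugin module itself still imports cleanly when the library is absent, and the user only sees the error when they actually try to convert a file that needs it.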

3. Accepting files you cannot actually convert

Be precise in your accepts() method. Don’t return True for files you can’t handle:

precise-accepts.py
import json

# WRONG: Too broad, accepts all .json files
def accepts(self, file_stream, stream_info, **kwargs):
    return stream_info.extension == ".json"

# RIGHT: Check actual content
def accepts(self, file_stream, stream_info, **kwargs):
    if stream_info.extension != ".json":
        return False
    cur_pos = file_stream.tell()
    try:
        data = json.load(file_stream)
    except ValueError:
        # Covers json.JSONDecodeError and UnicodeDecodeError: not ours
        return False
    finally:
        file_stream.seek(cur_pos)  # Reset on every path
    # Only accept if it has our required structure
    return isinstance(data, dict) and "myformat_version" in data

4. Not returning a proper title

The DocumentConverterResult should include a meaningful title:

title-handling.py
# WRONG: Generic or missing title
return DocumentConverterResult(markdown=markdown)

# RIGHT: Extract or generate a title
return DocumentConverterResult(
    markdown=markdown,
    title=extracted_title or "Custom Document",
)
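For the TITLE:/SECTION: format used in this post, extracting the title can be as simple as scanning for the first TITLE: line. This is a minimal sketch under that assumption. The extract_title helper is hypothetical, and a real format may need something more robust.

```python
from typing import Optional


def extract_title(text: str) -> Optional[str]:
    """Return the text of the first non-empty TITLE: line, or None."""
    for line in text.splitlines():
        if line.startswith("TITLE:"):
            title = line[len("TITLE:"):].strip()
            if title:
                return title
    return None


extracted_title = extract_title("TITLE: Quarterly Config\nSECTION:Intro")
fallback_title = extract_title("SECTION:Intro") or "Custom Document"
print(extracted_title, "|", fallback_title)  # Quarterly Config | Custom Document
```

Returning None when no title is found keeps the `extracted_title or "Custom Document"` fallback working as written.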

Testing Your Plugin

Write tests that use binary streams, not just file paths:

tests/test_converter.py
import io

from markitdown import StreamInfo
from my_plugin.converter import MyCustomConverter


def test_converter_accepts():
    converter = MyCustomConverter()
    stream_info = StreamInfo(extension=".myformat", mimetype=None)
    content = b"MYFORMAT1\ntest content"
    stream = io.BytesIO(content)
    assert converter.accepts(stream, stream_info) is True


def test_converter_rejects():
    converter = MyCustomConverter()
    stream_info = StreamInfo(extension=".pdf", mimetype=None)
    content = b"%PDF-1.4"  # PDF header
    stream = io.BytesIO(content)
    assert converter.accepts(stream, stream_info) is False


def test_converter_output():
    converter = MyCustomConverter()
    stream_info = StreamInfo(extension=".myformat", mimetype=None)
    content = b"TITLE:Test\nSECTION:Introduction\nHello world"
    stream = io.BytesIO(content)
    result = converter.convert(stream, stream_info)
    assert "# Test" in result.markdown
    assert "## Introduction" in result.markdown
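It is also worth testing that accepts() leaves the stream where it found it, since that bug only shows up later in convert(). The sketch below shows the idea with two stand-in check functions so it runs without MarkItDown installed. All names here are hypothetical. In a real test you would pass something like `lambda s: converter.accepts(s, stream_info)`.

```python
import io
from typing import BinaryIO, Callable


def leaves_stream_untouched(check: Callable[[BinaryIO], bool], data: bytes) -> bool:
    """Return True if calling `check` does not move the stream position."""
    stream = io.BytesIO(data)
    stream.seek(3)  # start away from 0 to catch a hard-coded seek(0)
    before = stream.tell()
    check(stream)
    return stream.tell() == before


def good_check(stream: BinaryIO) -> bool:
    """Stand-in for an accepts() that resets the position."""
    pos = stream.tell()
    header = stream.read(9)
    stream.seek(pos)
    return header.startswith(b"MYFORMAT1")


def bad_check(stream: BinaryIO) -> bool:
    """Stand-in for an accepts() that forgets to reset."""
    return stream.read(9).startswith(b"MYFORMAT1")


good_result = leaves_stream_untouched(good_check, b"MYFORMAT1\ndata")
bad_result = leaves_stream_untouched(bad_check, b"MYFORMAT1\ndata")
print(good_result, bad_result)  # True False
```

Starting the stream at a non-zero offset matters: an implementation that calls seek(0) instead of restoring the saved position would pass a test that always starts at 0.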

Summary

Creating a MarkItDown plugin involves three main steps: implement the DocumentConverter interface with accepts() and convert() methods, create a registration function, and configure the entry point. The key details are resetting stream positions in accepts(), handling dependencies gracefully, and choosing the right priority for your use case. With this architecture, you can extend MarkItDown to handle any proprietary or custom file format while benefiting from its unified conversion pipeline.

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can contact me by email.
