What is Docling Agent for Agentic Document Operations?
The Problem
I needed to build a document processing pipeline for invoice extraction. The traditional approach looked like this:
PDF Input → Parse with fixed rules → Regex for fields → Template matching → Manual edge case handling → Separate tool for each task → Fragile, breaks when format changes
Every time a vendor changed their invoice format, my extraction rules broke. Adding new document types meant writing new parsers from scratch. Editing documents required separate tools. Generating reports needed template systems.
The Reddit community confirmed this frustration:
- Hardcoded extraction rules that break on format changes
- Template-based generation that lacks flexibility
- Manual editing workflows with no automation
- Separate tools for each task (extract, edit, generate)
- No unified document representation across operations
Then I found Docling Agent. It promised something different: natural language-driven document operations.
What is Docling Agent?
Docling Agent is an AI-powered framework that enables agentic document operations. Instead of fixed rules, you use natural language prompts to tell an AI agent what to do with documents.
TRADITIONAL: "Extract invoice_number using regex pattern [A-Z]{2}-[0-9]{6}" → Breaks when vendor changes format to INV-123456
AGENTIC: "Extract the invoice number from this document" → Agent finds it regardless of format, position, or style
The key innovation: DoclingDocument, a unified format that all operations share.
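To make the brittleness of the TRADITIONAL rule concrete, here is a minimal sketch using Python's standard `re` module and the pattern quoted above. The helper name `extract_invoice_number` and the sample strings are illustrations of mine, not part of Docling.

```python
import re
from typing import Optional

# The hardcoded pattern from the "traditional" example:
# two uppercase letters, a dash, six digits.
INVOICE_PATTERN = re.compile(r"[A-Z]{2}-[0-9]{6}")


def extract_invoice_number(field: str) -> Optional[str]:
    """Return the field if it matches the fixed pattern exactly, else None."""
    match = INVOICE_PATTERN.fullmatch(field.strip())
    return match.group(0) if match else None


print(extract_invoice_number("AB-123456"))   # old vendor format: AB-123456
print(extract_invoice_number("INV-123456"))  # three-letter prefix: None
```

A looser `search` instead of `fullmatch` would not fail loudly here. It would return the truncated `NV-123456`, which is arguably worse than no result at all.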
The DoclingDocument Format
Before understanding the agents, I needed to understand DoclingDocument. It’s the foundation.
┌─────────────────────────────────────────────────────────────┐
│                       DoclingDocument                       │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │   Headers   │  │    Text     │  │   Tables    │          │
│  │   (h1-h6)   │  │ (paragraphs)│  │ (with cells)│          │
│  └─────────────┘  └─────────────┘  └─────────────┘          │
│                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │  Pictures   │  │    Lists    │  │  Footnotes  │          │
│  │  (images)   │  │  (ordered/  │  │             │          │
│  │             │  │ unordered)  │  │             │          │
│  └─────────────┘  └─────────────┘  └─────────────┘          │
│                                                             │
│  Serialization: JSON | Export: Markdown, HTML, JSON         │
└─────────────────────────────────────────────────────────────┘
This unified format means:
+ Preserves hierarchy and structure
+ All operations work on same representation
+ Can save to multiple formats from one source
+ Agent operations don't lose document context
The basic usage:
from docling_core.types.doc import DoclingDocument
# Load existing document
doc = DoclingDocument.load_from_json("document.json")
# Iterate through elements
for element, level in doc.iterate_items():
    # element can be: text, table, picture, header, list
    print(f"Level {level}: {element}")
# Export to multiple formats
doc.save_as_html("output.html")
doc.save_as_markdown("output.md")
doc.save_as_json("output.json")
Four Agent Types
Docling Agent provides four specialized agents. Each handles a different document operation.
1. DoclingWritingAgent
Creates new documents from natural language prompts.
Natural Language Prompt → Agent interprets intent → Generates DoclingDocument → Structured content with hierarchy → Export to any format
Example use:
from docling_agent import DoclingWritingAgent
agent = DoclingWritingAgent(model_id="granite-7b")
result = agent.run(
    prompt="Create a quarterly sales report with sections for revenue, costs, and projections"
)
# Export the generated document
result.document.save_as_markdown("report.md")
When to use: Generate reports from scratch, create documentation templates, produce structured content from outlines.
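If the outline already lives in code, the prompt itself can be assembled from it. The sketch below shows one way to do that. `build_report_prompt` is a hypothetical helper of mine, and the wording of the prompt simply mirrors the example above.

```python
def build_report_prompt(title: str, sections: list[str]) -> str:
    """Turn a title and an outline into a single natural-language prompt."""
    if not sections:
        raise ValueError("at least one section is required")
    if len(sections) <= 2:
        listed = " and ".join(sections)
    else:
        # Join with commas and put "and" before the last section.
        listed = ", ".join(sections[:-1]) + f", and {sections[-1]}"
    return f"Create a {title} with sections for {listed}"


prompt = build_report_prompt(
    "quarterly sales report", ["revenue", "costs", "projections"]
)
print(prompt)
```

The resulting string could then be passed as the `prompt` argument in the call shown above.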
2. DoclingEditingAgent
Applies targeted modifications to existing documents.
Existing DoclingDocument + Natural Language Task → Agent identifies targets → Applies modifications → Returns modified document
Example:
from docling_agent import DoclingEditingAgent
from docling_core.types.doc import DoclingDocument
doc = DoclingDocument.load_from_json("report.json")
agent = DoclingEditingAgent()
result = agent.run(
    document=doc,
    task="Add a summary section at the beginning and improve table formatting"
)
result.document.save_as_markdown("report_improved.md")
When to use: Refine table structures, add missing sections, fix formatting inconsistencies.
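Because an editing agent rewrites the document for you, it can help to review what actually changed before accepting the result. The sketch below is one way to do that with the standard library's `difflib`, assuming you have Markdown exports from before and after the edit. The function name `summarize_changes` and the sample strings are hypothetical.

```python
import difflib


def summarize_changes(before_md: str, after_md: str) -> list[str]:
    """Return unified-diff lines between two Markdown exports."""
    return list(
        difflib.unified_diff(
            before_md.splitlines(),
            after_md.splitlines(),
            fromfile="report.md",
            tofile="report_improved.md",
            lineterm="",
        )
    )


# Hypothetical exports standing in for the original and edited documents.
before = "# Report\n\nRevenue grew."
after = "# Report\n\n## Summary\n\nRevenue grew."

for line in summarize_changes(before, after):
    print(line)
```

Lines starting with `+` are additions made by the edit, and an empty result means nothing changed.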
3. DoclingExtractingAgent
Extracts structured data using schema definitions.
PDF/Image → Convert to DoclingDocument → Define schema with field types → Agent extracts matching fields → Returns typed data
Example:
from docling_agent import DoclingExtractingAgent
from docling.document_converter import DocumentConverter
# Convert PDF first
converter = DocumentConverter()
doc = converter.convert("invoice.pdf").document
# Define what to extract
schema = {
    "invoice_number": "string",
    "vendor": "string",
    "total": "number",
    "date": "date",
    "line_items": "array"
}
# Extract
agent = DoclingExtractingAgent()
extracted = agent.run(document=doc, schema=schema)
print(f"Invoice #{extracted.invoice_number}")
print(f"Total: {extracted.total}")
When to use: Invoice data extraction, resume parsing, form field extraction, research paper metadata.
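Since the extracted values come from a model rather than a fixed parser, a sanity check against the schema is cheap insurance. Here is a minimal sketch, assuming the agent's result can be turned into a plain dictionary (the exact return type may differ). The type names mirror the schema above. The `validate_extracted` helper and the ISO date assumption are mine, not part of Docling Agent.

```python
from datetime import date


def _is_iso_date(value: object) -> bool:
    """True if value is a string that parses as an ISO 8601 date."""
    if not isinstance(value, str):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


# Maps the schema's type names to simple checks (assumed semantics).
CHECKS = {
    "string": lambda v: isinstance(v, str),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "date": _is_iso_date,
    "array": lambda v: isinstance(v, list),
}


def validate_extracted(data: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the data fits the schema."""
    problems = []
    for field, type_name in schema.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not CHECKS[type_name](data[field]):
            problems.append(f"{field}: expected {type_name}, got {data[field]!r}")
    return problems


schema = {"invoice_number": "string", "total": "number", "date": "date"}
sample = {"invoice_number": "INV-123456", "total": "1,250.00", "date": "2024-03-01"}
print(validate_extracted(sample, schema))
# → ["total: expected number, got '1,250.00'"]
```

Here the total came back as a formatted string instead of a number, which is the kind of slip worth catching before the data reaches an accounting system.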
4. DoclingEnrichingAgent
Adds metadata and annotations to documents.
DoclingDocument → Agent analyzes content → Adds summaries, keywords, entities → Returns enriched document → Search-ready, classified
Example:
from docling_agent import DoclingEnrichingAgent
# research_paper_doc is assumed to be a DoclingDocument loaded or converted earlier
agent = DoclingEnrichingAgent()
result = agent.run(
    document=research_paper_doc,
    tasks=["summarize", "extract_keywords", "identify_entities", "classify_sections"]
)
# Now the document has:
# - Summary in metadata
# - Search keywords attached
# - Key entities identified
# - Sections classified by type
When to use: Add document summaries, generate search keywords, identify key entities, classify content.
Why Local Execution Matters
One of the key advantages: it can run completely locally.
CLOUD PROCESSING: Document → Upload to API → Processed on remote servers → Download result → Data leaves your infrastructure → Privacy concerns → API costs
LOCAL PROCESSING: Document → Process on your machine → Result stays local → Data never leaves → Privacy preserved → No API costs
This matters for:
+ Privacy-sensitive documents (contracts, medical records)
+ Air-gapped environments (military, healthcare)
+ Compliance with data regulations (GDPR, HIPAA)
+ No per-document API costs
+ Faster batch processing (no network latency)
Local setup:
from docling.datamodel.pipeline_options import PdfPipelineOptions
# Point to locally downloaded models
pipeline_options = PdfPipelineOptions(
    artifacts_path="/local/path/to/models"
)
# No network calls needed
Model support: OpenAI GPT variants, IBM Granite, model-agnostic via Mellea integration.
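For the local setup, a wrong models path can surface late, in the middle of a conversion. A small pre-flight check can catch it earlier. This is only a sketch: `require_local_models` is a hypothetical helper, and the directory in the usage is the placeholder path from the snippet above.

```python
from pathlib import Path


def require_local_models(artifacts_path: str) -> Path:
    """Return the resolved models directory, or raise if it is missing or empty."""
    path = Path(artifacts_path).expanduser()
    if not path.is_dir():
        raise FileNotFoundError(f"Models directory not found: {path}")
    if not any(path.iterdir()):
        raise FileNotFoundError(f"Models directory is empty: {path}")
    return path.resolve()


try:
    models_dir = require_local_models("/local/path/to/models")
    print(f"Using local models from {models_dir}")
except FileNotFoundError as err:
    print(f"Cannot run offline: {err}")
```

The returned path could then be passed as `artifacts_path` when building the pipeline options.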
Combining Agents in a Pipeline
The real power comes from chaining agents.
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ PDF Invoice  │────▶│   Convert    │────▶│  DoclingDoc  │
└──────────────┘     └──────────────┘     └──────────────┘
                                                 │
                                                 ▼
                                          ┌──────────────┐     ┌──────────────┐
                                          │   Extract    │────▶│  Structured  │
                                          │    Agent     │     │     Data     │
                                          └──────────────┘     └──────────────┘
                                                 │
                                                 ▼
                                          ┌──────────────┐     ┌──────────────┐
                                          │    Enrich    │────▶│  Searchable  │
                                          │    Agent     │     │   Invoice    │
                                          └──────────────┘     └──────────────┘
Complete workflow:
from docling.document_converter import DocumentConverter
from docling_agent import DoclingExtractingAgent, DoclingEnrichingAgent
# Step 1: Convert PDF
converter = DocumentConverter()
doc = converter.convert("invoice.pdf").document
# Step 2: Extract data
schema = {"invoice_number": "string", "total": "number", "vendor": "string"}
extract_agent = DoclingExtractingAgent()
data = extract_agent.run(document=doc, schema=schema)
# Step 3: Enrich for search
enrich_agent = DoclingEnrichingAgent()
enriched = enrich_agent.run(
    document=doc,
    tasks=["summarize", "extract_keywords"]
)
# Save both outputs (the document is read from the result, as in the earlier agent examples)
enriched.document.save_as_json("invoice_enriched.json")
# data contains extracted fields
Current Status and Limitations
Important caveat: the project is still early-stage.
Status: "still immature and work-in-progress"
Availability: Public repository on GitHub
Direction: Docling moving beyond conversion to full document operations
What this means:
+ Core functionality works (convert, basic extraction)
- API examples may change
- Documentation still evolving
- Chunkless RAG mentioned but not documented yet
- Some conceptual examples based on documented patterns
The Reddit discussion captured this:
“Still early stage but the direction is clear, Docling is moving beyond conversion.”
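With an API that may still change, one defensive option is to look up agent classes at runtime and fail with a clear message instead of an `ImportError` deep inside a pipeline. Here is a minimal sketch using the standard `importlib`. The helper `load_agent_class` is hypothetical, and the module and class names are simply the ones used in the examples in this article, which may not match the current release.

```python
import importlib
from typing import Optional


def load_agent_class(class_name: str, module_name: str = "docling_agent") -> Optional[type]:
    """Return the named class from the module, or None if either is unavailable."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    candidate = getattr(module, class_name, None)
    # Only accept real classes, not functions or constants with the same name.
    return candidate if isinstance(candidate, type) else None


agent_cls = load_agent_class("DoclingExtractingAgent")
if agent_cls is None:
    print("DoclingExtractingAgent not available; check the installed version.")
else:
    print(f"Found {agent_cls.__name__}")
```

This keeps the version check in one place, so a renamed or relocated class shows up as one readable message at startup.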
Summary
Docling Agent represents a shift from rigid document processing to natural language-driven operations:
1. DoclingDocument: Unified format for all operations
2. Four agents: Write, Edit, Extract, Enrich
3. Natural language: Define tasks in prompts, not code
4. Local execution: Privacy-preserving, no API costs
5. Composable: Chain agents for complex pipelines
The traditional approach required separate tools for each task, hardcoded rules that broke on format changes, and no unified document representation. Docling Agent solves this with natural language prompts and a single document format that flows through all operations.
Final Words + More Resources
My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can contact me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!