
What is Docling Agent for Agentic Document Operations?

The Problem

I needed to build a document processing pipeline for invoice extraction. The traditional approach looked like this:

Traditional Document Processing Pipeline
PDF Input → Parse with fixed rules → Regex for fields → Template matching
→ Manual edge case handling → Separate tool for each task
→ Fragile, breaks when format changes

Every time a vendor changed their invoice format, my extraction rules broke. Adding new document types meant writing new parsers from scratch. Editing documents required separate tools. Generating reports needed template systems.

The Reddit community confirmed this frustration:

Common Document Processing Pain Points
- Hardcoded extraction rules that break on format changes
- Template-based generation that lacks flexibility
- Manual editing workflows with no automation
- Separate tools for each task (extract, edit, generate)
- No unified document representation across operations

Then I found Docling Agent. It promised something different: natural language-driven document operations.

What is Docling Agent?

Docling Agent is an AI-powered framework that enables agentic document operations. Instead of fixed rules, you use natural language prompts to tell an AI agent what to do with documents.

Agentic vs Traditional Approach
TRADITIONAL:
"Extract invoice_number using regex pattern [A-Z]{2}-[0-9]{6}"
→ Breaks when vendor changes format to INV-123456
AGENTIC:
"Extract the invoice number from this document"
→ Agent finds it regardless of format, position, or style
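To make the brittleness concrete, here is a minimal sketch using Python's standard `re` module. The pattern is the one quoted above; the helper name `extract_invoice_number` and the sample invoice numbers are made up for illustration.

```python
import re
from typing import Optional

# The fixed rule from the traditional approach: two letters, a dash, six digits.
PATTERN = re.compile(r"[A-Z]{2}-[0-9]{6}")


def extract_invoice_number(field: str) -> Optional[str]:
    """Return the field if it matches the expected format exactly, else None."""
    return field if PATTERN.fullmatch(field) else None


# Hypothetical sample values, not taken from real invoices.
print(extract_invoice_number("AB-123456"))   # AB-123456
print(extract_invoice_number("INV-123456"))  # None

# An unanchored search does not fail; it returns a truncated value instead.
print(PATTERN.search("INV-123456").group())  # NV-123456
```

The last line shows the quieter failure mode: a rule that still "matches" after a format change, but returns the wrong value.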

The key innovation: DoclingDocument, a unified format that all operations share.

The DoclingDocument Format

Before understanding the agents, I needed to understand DoclingDocument. It’s the foundation.

DoclingDocument Structure
┌─────────────────────────────────────────────────────────────┐
│                       DoclingDocument                       │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │   Headers   │    │    Text     │    │   Tables    │      │
│  │   (h1-h6)   │    │ (paragraphs)│    │ (with cells)│      │
│  └─────────────┘    └─────────────┘    └─────────────┘      │
│                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │  Pictures   │    │    Lists    │    │  Footnotes  │      │
│  │  (images)   │    │  (ordered/  │    │             │      │
│  │             │    │  unordered) │    │             │      │
│  └─────────────┘    └─────────────┘    └─────────────┘      │
│                                                             │
│  Serialization: JSON | Export: Markdown, HTML, JSON         │
└─────────────────────────────────────────────────────────────┘

This unified format means:

Format Benefits
+ Preserves hierarchy and structure
+ All operations work on same representation
+ Can save to multiple formats from one source
+ Agent operations don't lose document context
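The idea of one representation feeding every export can be illustrated with a deliberately simplified stand-in. The `MiniDoc` and `Item` classes below are hypothetical and are not the real `DoclingDocument` API; they only show how a single list of typed items can be walked once and serialized to more than one format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Item:
    kind: str       # "header" or "text" in this toy model
    text: str
    level: int = 0  # header depth; unused for plain text


@dataclass
class MiniDoc:
    items: List[Item] = field(default_factory=list)

    def iterate_items(self) -> Iterator[Tuple[Item, int]]:
        """Yield (item, level) pairs in reading order."""
        for item in self.items:
            yield item, item.level

    def to_markdown(self) -> str:
        lines = []
        for item, level in self.iterate_items():
            if item.kind == "header":
                lines.append("#" * level + " " + item.text)
            else:
                lines.append(item.text)
        return "\n\n".join(lines)

    def to_json(self) -> str:
        return json.dumps(asdict(self))


# Made-up content for illustration.
doc = MiniDoc([Item("header", "Invoice", level=1), Item("text", "Total due: 42.00")])
print(doc.to_markdown())
print(doc.to_json())
```

Both exports read from the same `items` list, so a change made by one operation is visible to every later export without any conversion step in between.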

The basic usage:

DoclingDocument Basics
from docling_core.types.doc import DoclingDocument
# Load existing document
doc = DoclingDocument.load_from_json("document.json")
# Iterate through elements
for element, level in doc.iterate_items():
    # element can be: text, table, picture, header, list
    print(f"Level {level}: {element}")
# Export to multiple formats
doc.save_as_html("output.html")
doc.save_as_markdown("output.md")
doc.save_as_json("output.json")

Four Agent Types

Docling Agent provides four specialized agents. Each handles a different document operation.

1. DoclingWritingAgent

Creates new documents from natural language prompts.

Writing Agent Workflow
Natural Language Prompt → Agent interprets intent → Generates DoclingDocument
→ Structured content with hierarchy → Export to any format

Example use:

Writing Agent Example
from docling_agent import DoclingWritingAgent
agent = DoclingWritingAgent(model_id="granite-7b")
result = agent.run(
    prompt="Create a quarterly sales report with sections for revenue, costs, and projections"
)
# Export the generated document
result.document.save_as_markdown("report.md")

When to use: Generate reports from scratch, create documentation templates, produce structured content from outlines.

2. DoclingEditingAgent

Applies targeted modifications to existing documents.

Editing Agent Workflow
Existing DoclingDocument + Natural Language Task → Agent identifies targets
→ Applies modifications
→ Returns modified document

Example:

Editing Agent Example
from docling_agent import DoclingEditingAgent
from docling_core.types.doc import DoclingDocument
doc = DoclingDocument.load_from_json("report.json")
agent = DoclingEditingAgent()
result = agent.run(
    document=doc,
    task="Add a summary section at the beginning and improve table formatting"
)
result.document.save_as_markdown("report_improved.md")

When to use: Refine table structures, add missing sections, fix formatting inconsistencies.

3. DoclingExtractingAgent

Extracts structured data using schema definitions.

Extraction Agent Workflow
PDF/Image → Convert to DoclingDocument → Define schema with field types
→ Agent extracts matching fields → Returns typed data

Example:

Extraction Agent Example
from docling_agent import DoclingExtractingAgent
from docling.document_converter import DocumentConverter
# Convert PDF first
converter = DocumentConverter()
doc = converter.convert("invoice.pdf").document
# Define what to extract
schema = {
    "invoice_number": "string",
    "vendor": "string",
    "total": "number",
    "date": "date",
    "line_items": "array"
}
# Extract
agent = DoclingExtractingAgent()
extracted = agent.run(document=doc, schema=schema)
print(f"Invoice #{extracted.invoice_number}")
print(f"Total: {extracted.total}")

When to use: Invoice data extraction, resume parsing, form field extraction, research paper metadata.
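Model output is not guaranteed to match the declared types, so it can help to validate extracted values before using them downstream. The sketch below is a hypothetical post-processing step in plain Python, not part of Docling Agent: `coerce_fields`, `_as_list`, and the sample `raw` dictionary are assumptions, and the type names mirror the schema shown above.

```python
from datetime import date
from typing import Any, Dict, List, Tuple


def _as_list(value: Any) -> list:
    """Accept only real lists, so a stray string is not split into characters."""
    if not isinstance(value, list):
        raise TypeError("expected a list")
    return value


# Converters keyed by the type names used in the schema above.
CONVERTERS = {
    "string": str,
    "number": float,
    "date": date.fromisoformat,  # expects ISO dates such as "2024-03-01"
    "array": _as_list,
}


def coerce_fields(raw: Dict[str, Any], schema: Dict[str, str]) -> Tuple[Dict[str, Any], List[str]]:
    """Coerce raw values to the schema types; return (clean_values, error_messages)."""
    clean: Dict[str, Any] = {}
    errors: List[str] = []
    for name, type_name in schema.items():
        if name not in raw:
            errors.append(f"{name}: missing")
            continue
        try:
            clean[name] = CONVERTERS[type_name](raw[name])
        except (KeyError, TypeError, ValueError) as exc:
            errors.append(f"{name}: {exc}")
    return clean, errors


schema = {"invoice_number": "string", "total": "number", "date": "date"}
# Made-up values; the thousands separator in "total" is deliberately invalid for float().
raw = {"invoice_number": "INV-123456", "total": "1,250.00", "date": "2024-03-01"}

clean, errors = coerce_fields(raw, schema)
print(clean)
print(errors)
```

Here `total` ends up in `errors` rather than `clean`, which gives the pipeline a chance to retry or flag the document instead of passing a bad value along.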

4. DoclingEnrichingAgent

Adds metadata and annotations to documents.

Enrichment Agent Workflow
DoclingDocument → Agent analyzes content → Adds summaries, keywords, entities
→ Returns enriched document → Search-ready, classified

Example:

Enrichment Agent Example
from docling_agent import DoclingEnrichingAgent
agent = DoclingEnrichingAgent()
# research_paper_doc is a DoclingDocument that was loaded or converted earlier
result = agent.run(
    document=research_paper_doc,
    tasks=["summarize", "extract_keywords", "identify_entities", "classify_sections"]
)
# Now the document has:
# - Summary in metadata
# - Search keywords attached
# - Key entities identified
# - Sections classified by type

When to use: Add document summaries, generate search keywords, identify key entities, classify content.

Why Local Execution Matters

One of the key advantages: Docling Agent can run completely locally.

Local vs Cloud Processing
CLOUD PROCESSING:
Document → Upload to API → Processed on remote servers → Download result
→ Data leaves your infrastructure → Privacy concerns → API costs
LOCAL PROCESSING:
Document → Process on your machine → Result stays local
→ Data never leaves → Privacy preserved → No API costs

This matters for:

Local Execution Benefits
+ Privacy-sensitive documents (contracts, medical records)
+ Air-gapped environments (military, healthcare)
+ Compliance with data regulations (GDPR, HIPAA)
+ No per-document API costs
+ Faster batch processing (no network latency)

Local setup:

Local Model Configuration
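from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
# Point to locally downloaded models
pipeline_options = PdfPipelineOptions(
    artifacts_path="/local/path/to/models"
)
# Pass the options to the converter so PDF conversion loads models from disk
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
# No network calls needed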

Model support: OpenAI GPT variants, IBM Granite, model-agnostic via Mellea integration.

Combining Agents in a Pipeline

The real power comes from chaining agents.

Invoice Processing Pipeline
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ PDF Invoice  │────▶│   Convert    │────▶│  DoclingDoc  │
└──────────────┘     └──────────────┘     └──────────────┘
┌──────────────┐     ┌──────────────┐
│   Extract    │────▶│  Structured  │
│    Agent     │     │     Data     │
└──────────────┘     └──────────────┘
┌──────────────┐     ┌──────────────┐
│    Enrich    │────▶│  Searchable  │
│    Agent     │     │   Invoice    │
└──────────────┘     └──────────────┘

Complete workflow:

Complete Pipeline Example
from docling.document_converter import DocumentConverter
from docling_agent import DoclingExtractingAgent, DoclingEnrichingAgent
# Step 1: Convert PDF
converter = DocumentConverter()
doc = converter.convert("invoice.pdf").document
# Step 2: Extract data
schema = {"invoice_number": "string", "total": "number", "vendor": "string"}
extract_agent = DoclingExtractingAgent()
data = extract_agent.run(document=doc, schema=schema)
# Step 3: Enrich for search
enrich_agent = DoclingEnrichingAgent()
enriched = enrich_agent.run(
    document=doc,
    tasks=["summarize", "extract_keywords"]
)
# Save the enriched document (the result wraps it, as in the earlier examples)
enriched.document.save_as_json("invoice_enriched.json")
# data contains the extracted fields

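Because every step reads the same document object, chaining reduces to calling steps in order and collecting their outputs. The sketch below shows that pattern with plain functions standing in for the agents. `run_pipeline`, `fake_extract`, and `fake_enrich` are hypothetical names, and the dictionary is only a placeholder for a real document.

```python
from typing import Any, Callable, Dict

Step = Callable[[Dict[str, Any]], Any]


def run_pipeline(doc: Dict[str, Any], steps: Dict[str, Step]) -> Dict[str, Any]:
    """Run each named step on the same document and collect the results."""
    results: Dict[str, Any] = {}
    for name, step in steps.items():
        try:
            results[name] = step(doc)
        except Exception as exc:
            # Record the failure and continue, so one bad step does not lose the rest.
            results[name] = exc
    return results


# Stand-ins for agent calls; real agents would call a model here.
def fake_extract(doc: Dict[str, Any]) -> Dict[str, str]:
    return {"vendor": doc["text"].split(":")[1].strip()}


def fake_enrich(doc: Dict[str, Any]) -> Dict[str, Any]:
    words = doc["text"].lower().replace(":", "").split()
    return {**doc, "keywords": sorted(set(words))}


doc = {"text": "Vendor: Acme"}  # placeholder document
results = run_pipeline(doc, {"extract": fake_extract, "enrich": fake_enrich})
print(results["extract"])
print(results["enrich"]["keywords"])
```

Keeping the steps in a named mapping makes it easy to add, remove, or reorder operations without touching the loop itself.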
Current Status and Limitations

Important caveat: the project is still early-stage.

Current Status
Status: "still immature and work-in-progress"
Availability: Public repository on GitHub
Direction: Docling moving beyond conversion to full document operations

What this means:

What to Expect
+ Core functionality works (convert, basic extraction)
- API examples may change
- Documentation still evolving
- Chunkless RAG mentioned but not documented yet
- Some examples in this article are conceptual, based on documented patterns

The Reddit discussion captured this:

“Still early stage but the direction is clear, Docling is moving beyond conversion.”

Summary

Docling Agent represents a shift from rigid document processing to natural language-driven operations:

Key Takeaways
1. DoclingDocument: Unified format for all operations
2. Four agents: Write, Edit, Extract, Enrich
3. Natural language: Define tasks in prompts, not code
4. Local execution: Privacy-preserving, no API costs
5. Composable: Chain agents for complex pipelines

The traditional approach required separate tools for each task, hardcoded rules that broke on format changes, and no unified document representation. Docling Agent solves this with natural language prompts and a single document format that flows through all operations.

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can contact me by email.


Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
