Should I Use Events or Messages for Multi-Agent LLM Communication? A Practical Guide

Mar 22, 2026

My multi-agent LLM system kept corrupting data. Agents were reading partial writes, overwriting each other’s changes, and producing inconsistent results. The culprit? A shared message file that multiple agents were reading and writing concurrently.

Here’s what went wrong and how I fixed it with event-driven architecture.

The Problem: Shared State in Multi-Agent Systems

I started building a 10-agent system for life management with Obsidian. The architecture seemed simple enough: each agent would read from and write to a shared agent-messages.md file. Agent coordination through a single message board.

┌─────────────┐
│   Agent A   │──┐
└─────────────┘  │
┌─────────────┐  │    ┌──────────────────┐
│   Agent B   │──┼───▶│ agent-messages.md │
└─────────────┘  │    └──────────────────┘
┌─────────────┐  │
│   Agent C   │──┘
└─────────────┘

Then the race conditions started.

Agent A would start writing a research task while Agent B was halfway through reading the file. Agent B would get corrupted state. Agent C would overwrite Agent A’s message before Agent D had a chance to read it. The more agents I added, the worse it got.

Why Message Files Break at Scale

The fundamental issue: concurrent file access is hard.

When Agent A writes to a file at the same time Agent B reads it:

Agent B might read partial data (write in progress)
Agent B might read stale data (write not yet flushed)
Agent A’s write might get lost (another agent wrote simultaneously)

File locking adds complexity and kills performance. Each agent has to wait for locks, and deadlock becomes a real risk.
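The lost-write case is easy to reproduce without any threads. The sketch below replays one unlucky interleaving step by step. It assumes each agent updates the file by reading the whole thing and writing it back, which is an assumption about the agents for illustration; the function name is hypothetical.

```python
import tempfile
from pathlib import Path


def read_modify_write_interleaved(path: Path) -> str:
    """Replay one unlucky interleaving of two agents, step by step."""
    path.write_text("# messages\n")

    # Both agents read the file before either one writes.
    snapshot_a = path.read_text()
    snapshot_b = path.read_text()

    # Agent A writes its snapshot plus a new message.
    path.write_text(snapshot_a + "A: research task\n")

    # Agent B does the same from its stale snapshot, erasing A's message.
    path.write_text(snapshot_b + "B: planning task\n")

    return path.read_text()


with tempfile.TemporaryDirectory() as tmp:
    final = read_modify_write_interleaved(Path(tmp) / "agent-messages.md")

print(final)
```

Only Agent B's message survives. Neither agent did anything wrong on its own; the ordering alone loses the write.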

The Solution: Event-Driven Architecture with Pub/Sub

I replaced the shared message file with an event-driven model using a pub/sub pattern:

┌─────────┐            ┌──────────────────┐            ┌─────────┐
│ Agent A │ ──publish─▶│   Event Broker   │◀─subscribe─│ Agent B │
└─────────┘            │                  │            └─────────┘
                       │  Event Types:    │
┌─────────┐            │  - task.created  │            ┌─────────┐
│ Agent C │◀─subscribe │  - task.completed│◀─subscribe─│ Agent D │
└─────────┘            │  - error.occurred│            └─────────┘
                       └──────────────────┘

The key insight: agents don’t communicate with each other directly. They publish events to a broker and subscribe to event types they care about. The broker handles concurrent access safely.

How I Implemented the Event Bus

I started with a simple file-based event bus for persistence and audit trails:

from typing import Callable, Dict, List, Any
from dataclasses import dataclass
from datetime import datetime, timezone
import json
from pathlib import Path

@dataclass
class Event:
    event_type: str
    payload: Dict[str, Any]
    timestamp: datetime
    source: str

class EventBus:
    """Simple event bus with file-based persistence for audit trail."""

    def __init__(self, storage_dir: Path):
        self.storage_dir = storage_dir
        self.subscribers: Dict[str, List[Callable]] = {}
        self.storage_dir.mkdir(parents=True, exist_ok=True)

    def subscribe(self, event_type: str, handler: Callable):
        """Register a handler for an event type."""
        if event_type not in self.subscribers:
            self.subscribers[event_type] = []
        self.subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: Dict[str, Any], source: str):
        """Publish an event to all subscribers."""
        event = Event(
            event_type=event_type,
            payload=payload,
            # Timezone-aware UTC; datetime.utcnow() is deprecated since Python 3.12
            timestamp=datetime.now(timezone.utc),
            source=source
        )

        # Persist to event log (append-only: existing records are never rewritten)
        self._append_to_log(event)

        # Notify subscribers
        handlers = self.subscribers.get(event_type, [])
        for handler in handlers:
            try:
                handler(event)
            except Exception as e:
                print(f"Handler error for {event_type}: {e}")

    def _append_to_log(self, event: Event):
        """Append event to date-partitioned log file."""
        date_str = event.timestamp.strftime("%Y-%m-%d")
        log_file = self.storage_dir / f"events-{date_str}.jsonl"

        with open(log_file, "a") as f:
            f.write(json.dumps({
                "type": event.event_type,
                "payload": event.payload,
                "timestamp": event.timestamp.isoformat(),
                "source": event.source
            }) + "\n")

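The append-only log is the secret sauce. Each event is appended as one line to a JSONL file. On local POSIX filesystems, small appends to a file opened in append mode generally land without interleaving, though this is not guaranteed on network filesystems such as NFS or for very large writes. Agents never rewrite existing records, so a reader never sees a record change underneath it.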
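If you want each record to go out as exactly one write call, one option is to open the log with O_APPEND and call os.write once per line. This is a sketch under that assumption, not what the EventBus above does (it uses a buffered open(..., "a")). The names append_record and read_records are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict, Iterator


def append_record(log_file: Path, record: Dict[str, Any]) -> None:
    """Append one JSON line using a single write on an O_APPEND descriptor."""
    line = (json.dumps(record) + "\n").encode("utf-8")
    fd = os.open(log_file, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        # One write call per record; a short write is possible in principle,
        # so a stricter version would check the returned byte count.
        os.write(fd, line)
    finally:
        os.close(fd)


def read_records(log_file: Path) -> Iterator[Dict[str, Any]]:
    """Yield records in append order, stopping at an unfinished last line."""
    with open(log_file, "r", encoding="utf-8") as f:
        for line in f:
            if not line.endswith("\n"):
                break  # a writer may be mid-append; pick it up on the next pass
            yield json.loads(line)


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "events-2026-03-22.jsonl"
    append_record(log, {"type": "task.created", "source": "agent_a"})
    append_record(log, {"type": "task.completed", "source": "agent_b"})
    replayed = [r["type"] for r in read_records(log)]

print(replayed)
```

Reading the log back in order is also what gives you the audit trail: the file is the history.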

Converting Agents to Event-Driven

Here’s how I converted my PlanningAgent to use events instead of message files:

class PlanningAgent:
    """Plans content based on research events."""

    def __init__(self, event_bus: EventBus):
        self.event_bus = event_bus
        # Only subscribe to events this agent needs
        event_bus.subscribe("research.completed", self.on_research_done)
        event_bus.subscribe("content.rejected", self.on_content_rejected)

    def on_research_done(self, event: Event):
        """Handle completed research by creating a plan."""
        research_id = event.payload["research_id"]

        # Do planning work
        plan = self.create_plan(research_id)

        # Publish result - don't know or care who consumes
        self.event_bus.publish(
            "plan.created",
            {"plan_id": plan.id, "research_id": research_id},
            source="planning_agent"
        )

    def on_content_rejected(self, event: Event):
        """Handle rejected content (re-planning logic omitted here)."""
        raise NotImplementedError

    def create_plan(self, research_id: str) -> "Plan":
        # Implementation details... (Plan is the agent's own plan type, with an .id)
        raise NotImplementedError

And the WriterAgent that consumes planning events:

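class WriterAgent:
    """Writes content based on plan events."""

    def __init__(self, event_bus: EventBus):
        self.event_bus = event_bus  # kept so handlers can publish results
        event_bus.subscribe("plan.created", self.on_plan_created)
        event_bus.subscribe("edit.requested", self.on_edit_requested)

    def on_plan_created(self, event: Event):
        plan_id = event.payload["plan_id"]
        content = self.write_content(plan_id)

        self.event_bus.publish(
            "content.drafted",
            {"content_id": content.id, "plan_id": plan_id},
            source="writer_agent"
        )

    def on_edit_requested(self, event: Event):
        """Handle edit requests (revision logic omitted here)."""
        raise NotImplementedError

    def write_content(self, plan_id: str):
        # Implementation details...
        raise NotImplementedError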

Notice how WriterAgent doesn’t know about PlanningAgent. It only knows about the plan.created event. This decoupling is the key benefit.

Why Decoupling Matters

After switching to events, I could add a monitoring agent without touching any existing code:

class MonitoringAgent:
    """Tracks metrics across all agent activities."""

    def __init__(self, event_bus: EventBus):
        # Subscribe to multiple event types
        for event_type in ["research.completed", "plan.created", "content.drafted"]:
            event_bus.subscribe(event_type, self.track_event)

    def track_event(self, event: Event):
        # Record metrics, no other agents need to know this exists
        # (metrics is whatever metrics client you use, defined elsewhere)
        metrics.record(event.event_type, event.source, event.timestamp)

This is much harder with direct message passing, where adding a new agent means modifying every agent that needs to send to it.

Mistakes I Made Along the Way

Mistake 1: Treating Events as Synchronous RPC

At first, I tried to make events behave like function calls:

# WRONG: Treating events like remote procedure calls
result = event_bus.publish_and_wait("get_data", payload)  # Blocks!

This defeats the entire purpose. Events are asynchronous by design. If you need a response, use a different pattern or make the request/response explicit with correlation IDs.
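Here is a minimal sketch of the correlation ID approach. Everything in it is hypothetical: MiniBus is a small in-memory stand-in for the EventBus above (no persistence, no source argument), and the data.requested and data.provided event names are made up for the example.

```python
import uuid
from collections import defaultdict
from typing import Any, Callable, Dict, List

Handler = Callable[[Dict[str, Any]], None]


class MiniBus:
    """In-memory stand-in for the article's EventBus (hypothetical)."""

    def __init__(self) -> None:
        self.subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self.subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: Dict[str, Any]) -> None:
        for handler in list(self.subscribers[event_type]):
            handler(payload)


class DataService:
    """Answers data.requested events with data.provided events."""

    def __init__(self, bus: MiniBus) -> None:
        self.bus = bus
        bus.subscribe("data.requested", self.on_request)

    def on_request(self, payload: Dict[str, Any]) -> None:
        self.bus.publish("data.provided", {
            "correlation_id": payload["correlation_id"],  # echoed back unchanged
            "data": {"key": payload["key"], "value": 42},
        })


class Requester:
    """Publishes a request and matches the reply by correlation ID."""

    def __init__(self, bus: MiniBus) -> None:
        self.bus = bus
        self.pending: Dict[str, str] = {}  # correlation_id -> requested key
        self.results: Dict[str, Any] = {}
        bus.subscribe("data.provided", self.on_data)

    def request(self, key: str) -> str:
        correlation_id = str(uuid.uuid4())
        self.pending[correlation_id] = key
        self.bus.publish("data.requested", {"correlation_id": correlation_id, "key": key})
        return correlation_id  # no return value is awaited; the reply is its own event

    def on_data(self, payload: Dict[str, Any]) -> None:
        key = self.pending.pop(payload["correlation_id"], None)
        if key is not None:  # ignore replies meant for other requesters
            self.results[key] = payload["data"]["value"]


bus = MiniBus()
DataService(bus)
requester = Requester(bus)
requester.request("task_count")
print(requester.results)
```

Because MiniBus dispatches synchronously, the reply happens to arrive before request() returns. The requester does not rely on that: the reply is handled in on_data whenever it shows up, which is what keeps the pattern valid on an asynchronous broker.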

Mistake 2: Bloated Event Payloads

I initially stuffed everything into events:

# WRONG: Including everything
event_bus.publish("task.created", {
    "task": task,
    "user": user,
    "history": all_history,  # Too much data!
    "related_tasks": related  # Unnecessary coupling
}, source="task_agent")  # source is required by publish()

This couples the producer to the consumer’s needs. If the consumer starts needing more data, the producer has to change. Instead, include only identifiers:

# RIGHT: Minimal event with reference
event_bus.publish("task.created", {
    "task_id": task.id,
    "user_id": user.id
}, source="task_agent")  # source is required by publish()

Consumers can fetch the data they need from their own stores.

Mistake 3: Event Type Explosion

I created separate event types for every state change:

user.created
user.updated
user.deleted
user.email_changed
user.password_reset
user.profile_updated
...

This becomes unmaintainable. Use a naming convention and group related events:

user.created
user.updated
user.deleted

The user.updated event includes what changed, and consumers decide if they care.
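As a rough illustration, the payload can carry the names of the fields that changed, and each consumer filters on those. The changed_fields key and the handler below are assumptions for the example, not part of my actual event schema.

```python
from typing import Any, Dict, List

verification_queue: List[str] = []


def on_user_updated(payload: Dict[str, Any]) -> None:
    """Hypothetical consumer that only reacts to email changes."""
    if "email" not in payload.get("changed_fields", []):
        return  # some other field changed; not relevant to this consumer
    verification_queue.append(payload["user_id"])


# One coarse event type; the detail travels in the payload.
on_user_updated({"user_id": "u1", "changed_fields": ["email"]})
on_user_updated({"user_id": "u2", "changed_fields": ["display_name"]})

print(verification_queue)
```

The payload still holds only identifiers and field names, so it stays small while letting consumers skip updates they do not care about.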

Mistake 4: No Event Versioning

Events change. I didn’t plan for this, and when I needed to add fields, all my consumers broke. Now I include event versions:

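event_bus.publish("task.created", {
    "version": 2,  # Schema version; bump it when the payload shape changes
    "task_id": task.id,
    "user_id": user.id,
    "priority": task.priority  # New field, added in version 2
}, source="task_agent")  # source is required by publish()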

Consumers check the version and handle missing fields gracefully.
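On the consumer side, that check can look something like the sketch below. The default priority and the rule that unversioned events count as version 1 are assumptions I am making for the example.

```python
from typing import Any, Dict

DEFAULT_PRIORITY = "normal"  # assumed fallback for events that predate the field


def parse_task_created(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Normalize a task.created payload regardless of its schema version."""
    version = payload.get("version", 1)  # treat unversioned events as version 1
    task = {"task_id": payload["task_id"], "user_id": payload["user_id"]}
    if version >= 2:
        # priority exists from version 2 on, but tolerate it being absent
        task["priority"] = payload.get("priority", DEFAULT_PRIORITY)
    else:
        task["priority"] = DEFAULT_PRIORITY
    return task


old = parse_task_created({"task_id": "t1", "user_id": "u1"})
new = parse_task_created(
    {"version": 2, "task_id": "t2", "user_id": "u1", "priority": "high"}
)
print(old["priority"], new["priority"])
```

Old events in the log stay readable after a schema change, which matters if you ever replay the audit trail.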

The Results

After migrating to event-driven architecture:

Zero race conditions from shared file access
Added 3 new agents without modifying existing ones
Horizontal scaling - run multiple instances of bottleneck agents
Full audit trail - every event logged with timestamp and source
Easier testing - test agents in isolation with mock events
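To make the testing point concrete, here is a sketch of a test double that records published events instead of dispatching them. RecordingBus and EchoPlanner are hypothetical and simplified; the handler takes a plain dict instead of the Event class used earlier.

```python
from typing import Any, Callable, Dict, List, Tuple


class RecordingBus:
    """Test double with the same subscribe/publish shape as the article's EventBus."""

    def __init__(self) -> None:
        self.handlers: Dict[str, List[Callable[[Any], None]]] = {}
        self.published: List[Tuple[str, Dict[str, Any], str]] = []

    def subscribe(self, event_type: str, handler: Callable[[Any], None]) -> None:
        self.handlers.setdefault(event_type, []).append(handler)

    def publish(self, event_type: str, payload: Dict[str, Any], source: str) -> None:
        self.published.append((event_type, payload, source))  # record, do not dispatch


class EchoPlanner:
    """Tiny hypothetical agent: turns research.completed into plan.created."""

    def __init__(self, bus: RecordingBus) -> None:
        self.bus = bus
        bus.subscribe("research.completed", self.on_research_done)

    def on_research_done(self, payload: Dict[str, Any]) -> None:
        research_id = payload["research_id"]
        self.bus.publish(
            "plan.created",
            {"plan_id": f"plan-{research_id}", "research_id": research_id},
            source="planning_agent",
        )


def test_planner_publishes_plan() -> None:
    bus = RecordingBus()
    planner = EchoPlanner(bus)
    planner.on_research_done({"research_id": "r1"})  # feed a mock event directly
    assert bus.published == [
        ("plan.created", {"plan_id": "plan-r1", "research_id": "r1"}, "planning_agent")
    ]


test_planner_publishes_plan()
```

No other agent has to exist for this test to run. The agent's whole contract is the events it receives and the events it publishes.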

When to Use Events vs Messages

Events are the right choice for most multi-agent LLM systems because:

Agents are inherently asynchronous - they process at different speeds
Loose coupling enables iteration - add/remove agents without refactoring
Audit trails matter - understand what happened and when
Race conditions are inevitable with shared state

Use direct messages only when you need synchronous request/response semantics and the coupling is acceptable.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Martin Fowler - Event-Driven Architecture Patterns

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!