
Should I Use Events or Messages for Multi-Agent LLM Communication? A Practical Guide

My multi-agent LLM system kept corrupting data. Agents were reading partial writes, overwriting each other’s changes, and producing inconsistent results. The culprit? A shared message file that multiple agents were reading and writing concurrently.

Here’s what went wrong and how I fixed it with event-driven architecture.

The Problem: Shared State in Multi-Agent Systems

I started building a 10-agent system for life management with Obsidian. The architecture seemed simple enough: each agent would read from and write to a shared agent-messages.md file. Agent coordination through a single message board.

Initial (Broken) Architecture
┌─────────────┐
│   Agent A   │──┐
└─────────────┘  │
┌─────────────┐  │    ┌───────────────────┐
│   Agent B   │──┼───▶│ agent-messages.md │
└─────────────┘  │    └───────────────────┘
┌─────────────┐  │
│   Agent C   │──┘
└─────────────┘

Then the race conditions started.

Agent A would start writing a research task while Agent B was halfway through reading the file. Agent B would get corrupted state. Agent C would overwrite Agent A’s message before Agent D had a chance to read it. The more agents I added, the worse it got.

Why Message Files Break at Scale

The fundamental issue: concurrent file access is hard.

When Agent A writes to a file at the same time Agent B reads it:

  • Agent B might read partial data (write in progress)
  • Agent B might read stale data (write not yet flushed)
  • Agent A’s write might get lost (another agent wrote simultaneously)

File locking adds complexity and kills performance. Each agent has to wait for locks, and deadlock becomes a real risk.
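To make the lost-write case concrete, here is a minimal sketch that replays the interleaving step by step in a single process, so the outcome is deterministic. The helper names (`read_board`, `write_board`, `simulate_lost_update`) are assumptions for illustration, not code from the original system.

```python
import tempfile
from pathlib import Path


def read_board(path: Path) -> str:
    """Read the whole message board, as each agent did."""
    return path.read_text(encoding="utf-8")


def write_board(path: Path, content: str) -> None:
    """Rewrite the whole message board (read-modify-write style)."""
    path.write_text(content, encoding="utf-8")


def simulate_lost_update(board: Path) -> str:
    """Replay two agents doing read-modify-write with overlapping timing."""
    write_board(board, "# Messages\n")

    # Both agents read the same starting state before either one writes.
    seen_by_a = read_board(board)
    seen_by_b = read_board(board)

    # Agent A writes its message on top of what it read.
    write_board(board, seen_by_a + "- A: research task\n")
    # Agent B writes on top of its own stale copy and wipes out A's line.
    write_board(board, seen_by_b + "- B: planning task\n")

    return read_board(board)


with tempfile.TemporaryDirectory() as tmp:
    final_state = simulate_lost_update(Path(tmp) / "agent-messages.md")
    print(final_state)
```

Only Agent B’s line survives. Neither agent did anything wrong on its own; the damage comes purely from the ordering of reads and writes.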

The Solution: Event-Driven Architecture with Pub/Sub

I replaced direct message passing with an event-driven model using a pub/sub pattern:

Event-Driven Architecture
┌─────────┐            ┌──────────────────┐            ┌─────────┐
│ Agent A │ ──publish─▶│   Event Broker   │◀─subscribe─│ Agent B │
└─────────┘            │                  │            └─────────┘
                       │  Event Types:    │
┌─────────┐            │  - task.created  │            ┌─────────┐
│ Agent C │◀─subscribe │  - task.completed│◀─subscribe─│ Agent D │
└─────────┘            │  - error.occurred│            └─────────┘
                       └──────────────────┘

The key insight: agents don’t communicate with each other directly. They publish events to a broker and subscribe to event types they care about. The broker handles concurrent access safely.

How I Implemented the Event Bus

I started with a simple file-based event bus for persistence and audit trails:

event_bus.py
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Callable, Dict, List


@dataclass
class Event:
    event_type: str
    payload: Dict[str, Any]
    timestamp: datetime
    source: str


class EventBus:
    """Simple event bus with file-based persistence for audit trail."""

    def __init__(self, storage_dir: Path):
        self.storage_dir = storage_dir
        self.subscribers: Dict[str, List[Callable]] = {}
        self.storage_dir.mkdir(parents=True, exist_ok=True)

    def subscribe(self, event_type: str, handler: Callable):
        """Register a handler for an event type."""
        if event_type not in self.subscribers:
            self.subscribers[event_type] = []
        self.subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: Dict[str, Any], source: str):
        """Publish an event to all subscribers."""
        event = Event(
            event_type=event_type,
            payload=payload,
            # Timezone-aware UTC; datetime.utcnow() is deprecated since Python 3.12
            timestamp=datetime.now(timezone.utc),
            source=source
        )
        # Persist to event log (append-only, existing records are never rewritten)
        self._append_to_log(event)
        # Notify subscribers; one failing handler must not block the others
        handlers = self.subscribers.get(event_type, [])
        for handler in handlers:
            try:
                handler(event)
            except Exception as e:
                print(f"Handler error for {event_type}: {e}")

    def _append_to_log(self, event: Event):
        """Append event to date-partitioned log file."""
        date_str = event.timestamp.strftime("%Y-%m-%d")
        log_file = self.storage_dir / f"events-{date_str}.jsonl"
        with open(log_file, "a", encoding="utf-8") as f:
            f.write(json.dumps({
                "type": event.event_type,
                "payload": event.payload,
                "timestamp": event.timestamp.isoformat(),
                "source": event.source
            }) + "\n")

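The append-only log is the secret sauce. Each event is appended to a JSONL file as one line, and nothing ever rewrites an existing record. On a local POSIX filesystem, a file opened in append mode has each write positioned at the current end of the file, so small appends do not overwrite each other; network filesystems such as NFS do not give the same guarantee. The race conditions go away because agents never read and write the same records simultaneously.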
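One way to lean on append semantics more explicitly is to open the log with `O_APPEND` and emit each record in a single `os.write` call, then read the file back line by line for replay or auditing. The sketch below is a variant under stated assumptions, not the code used in this article: `append_event_line` and `read_events` are hypothetical helper names, and the behavior it relies on applies to local filesystems rather than network mounts.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def append_event_line(log_file: Path, record: Dict[str, Any]) -> None:
    """Append one JSON record as a single line using one write call."""
    line = (json.dumps(record) + "\n").encode("utf-8")
    # O_APPEND makes the OS position every write at the current end of file.
    fd = os.open(log_file, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, line)
    finally:
        os.close(fd)


def read_events(log_file: Path) -> List[Dict[str, Any]]:
    """Read the log back in order, skipping blank lines."""
    events = []
    with open(log_file, encoding="utf-8") as f:
        for raw in f:
            if raw.strip():
                events.append(json.loads(raw))
    return events


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "events-example.jsonl"
    append_event_line(log, {"type": "task.created", "source": "agent_a"})
    append_event_line(log, {"type": "task.completed", "source": "agent_b"})
    replayed = read_events(log)
    print([e["type"] for e in replayed])
```

Because every record is one self-contained line, a reader can replay the file from the top at any time without coordinating with writers.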

Converting Agents to Event-Driven

Here’s how I converted my PlanningAgent to use events instead of message files:

planning_agent.py
from event_bus import Event, EventBus


class PlanningAgent:
    """Plans content based on research events."""

    def __init__(self, event_bus: EventBus):
        self.event_bus = event_bus
        # Only subscribe to events this agent needs
        event_bus.subscribe("research.completed", self.on_research_done)
        event_bus.subscribe("content.rejected", self.on_content_rejected)

    def on_research_done(self, event: Event):
        """Handle completed research by creating a plan."""
        research_id = event.payload["research_id"]
        # Do planning work
        plan = self.create_plan(research_id)
        # Publish result - don't know or care who consumes
        self.event_bus.publish(
            "plan.created",
            {"plan_id": plan.id, "research_id": research_id},
            source="planning_agent"
        )

    def on_content_rejected(self, event: Event):
        # Body not shown here; the method must exist for subscribe() above to resolve
        pass

    def create_plan(self, research_id: str) -> "Plan":
        # `Plan` is defined elsewhere; it needs at least an `id` attribute
        # Implementation details...
        pass

And the WriterAgent that consumes planning events:

writer_agent.py
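from event_bus import Event, EventBus


class WriterAgent:
    """Writes content based on plan events."""

    def __init__(self, event_bus: EventBus):
        # Keep a reference: on_plan_created publishes through self.event_bus
        self.event_bus = event_bus
        event_bus.subscribe("plan.created", self.on_plan_created)
        event_bus.subscribe("edit.requested", self.on_edit_requested)

    def on_plan_created(self, event: Event):
        plan_id = event.payload["plan_id"]
        content = self.write_content(plan_id)
        self.event_bus.publish(
            "content.drafted",
            {"content_id": content.id, "plan_id": plan_id},
            source="writer_agent"
        )

    def on_edit_requested(self, event: Event):
        pass  # body not shown here

    def write_content(self, plan_id: str):
        pass  # body not shown here; should return an object with an `id`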

Notice how WriterAgent doesn’t know about PlanningAgent. It only knows about the plan.created event. This decoupling is the key benefit.

Why Decoupling Matters

After switching to events, I could add a monitoring agent without touching any existing code:

monitoring_agent.py
from event_bus import Event, EventBus


class MonitoringAgent:
    """Tracks metrics across all agent activities."""

    def __init__(self, event_bus: EventBus):
        # Subscribe to multiple event types
        for event_type in ["research.completed", "plan.created", "content.drafted"]:
            event_bus.subscribe(event_type, self.track_event)

    def track_event(self, event: Event):
        # Record metrics, no other agents need to know this exists.
        # `metrics` is a metrics client defined elsewhere (not shown here).
        metrics.record(event.event_type, event.source, event.timestamp)

This is much harder with direct message passing, where adding a new agent means modifying every agent that needs to communicate with it.

Mistakes I Made Along the Way

Mistake 1: Treating Events as Synchronous RPC

At first, I tried to make events behave like function calls:

Wrong: Blocking on events
# WRONG: Treating events like remote procedure calls
result = event_bus.publish_and_wait("get_data", payload) # Blocks!

This defeats the entire purpose. Events are asynchronous by design. If you need a response, use a different pattern or make the request/response explicit with correlation IDs.
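A rough sketch of the correlation-ID approach follows. Everything in it is hypothetical: `TinyBus` is a stand-in in-memory bus, and the event names `data.requested` and `data.provided` are made up for the example. The stand-in delivers handlers immediately to keep the sketch short; with a truly asynchronous bus the reply would simply arrive later, and the requester would still match it by ID instead of blocking.

```python
import uuid
from collections import defaultdict
from typing import Any, Callable, Dict, List

Handler = Callable[[Dict[str, Any]], None]


class TinyBus:
    """Hypothetical in-memory bus, just enough to show the pattern."""

    def __init__(self) -> None:
        self.handlers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self.handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: Dict[str, Any]) -> None:
        for handler in list(self.handlers[event_type]):
            handler(payload)


bus = TinyBus()
pending: Dict[str, str] = {}    # correlation_id -> what the request asked for
responses: Dict[str, Any] = {}  # correlation_id -> value from the reply


def request_data(key: str) -> str:
    """Publish a request and return right away with its correlation ID."""
    correlation_id = str(uuid.uuid4())
    pending[correlation_id] = key
    bus.publish("data.requested", {"correlation_id": correlation_id, "key": key})
    return correlation_id


def on_data_requested(payload: Dict[str, Any]) -> None:
    """Responder: reply with a separate event that echoes the correlation ID."""
    bus.publish("data.provided", {
        "correlation_id": payload["correlation_id"],
        "value": payload["key"].upper(),
    })


def on_data_provided(payload: Dict[str, Any]) -> None:
    """Requester: match the reply to the outstanding request."""
    correlation_id = payload["correlation_id"]
    if correlation_id in pending:
        del pending[correlation_id]
        responses[correlation_id] = payload["value"]


bus.subscribe("data.requested", on_data_requested)
bus.subscribe("data.provided", on_data_provided)

request_id = request_data("status")
print(responses.get(request_id))
```

The request and the reply are both ordinary events, so a monitoring agent can observe them like any other traffic.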

Mistake 2: Bloated Event Payloads

I initially stuffed everything into events:

Wrong: Over-including data
# WRONG: Including everything
event_bus.publish("task.created", {
    "task": task,
    "user": user,
    "history": all_history,  # Too much data!
    "related_tasks": related  # Unnecessary coupling
}, source="task_agent")  # source name is illustrative

This couples the producer to the consumer’s needs. If the consumer starts needing more data, the producer has to change. Instead, include only identifiers:

Right: Minimal event payload
# RIGHT: Minimal event with reference
event_bus.publish("task.created", {
    "task_id": task.id,
    "user_id": user.id
}, source="task_agent")  # source name is illustrative

Consumers can fetch the data they need from their own stores.
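On the consumer side, that lookup might look like the sketch below. `TASK_STORE`, the `Task` fields, and the handler name are all assumptions for illustration; in a real system the store could be a database or the Obsidian vault itself.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Task:
    id: str
    title: str
    priority: str


# Hypothetical consumer-side store, keyed by task ID.
TASK_STORE: Dict[str, Task] = {
    "t-1": Task(id="t-1", title="Summarize inbox", priority="high"),
}

handled: List[str] = []


def on_task_created(payload: Dict[str, Any]) -> None:
    """Resolve the ID from the event into the fields this consumer needs."""
    task = TASK_STORE.get(payload["task_id"])
    if task is None:
        # The store may lag behind the event; skip instead of crashing.
        return
    handled.append(f"{task.title} ({task.priority})")


on_task_created({"task_id": "t-1", "user_id": "u-9"})
on_task_created({"task_id": "t-404", "user_id": "u-9"})
print(handled)
```

The producer never learns which fields this consumer reads, so the consumer can start using more of them without any change to the event.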

Mistake 3: Event Type Explosion

I created separate event types for every state change:

user.created
user.updated
user.deleted
user.email_changed
user.password_reset
user.profile_updated
...

This becomes unmaintainable. Use a naming convention and group related events:

user.created
user.updated
user.deleted

The user.updated event includes what changed, and consumers decide if they care.
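As a sketch of that idea, the payload below carries a list of changed field names, and a consumer that only cares about email changes ignores everything else. The `changed_fields` key and both function names are assumptions, not part of the system described here.

```python
from typing import Any, Dict, List

needs_verification: List[str] = []


def make_user_updated(user_id: str, changed_fields: List[str]) -> Dict[str, Any]:
    """Build a user.updated payload that names the fields that changed."""
    return {"user_id": user_id, "changed_fields": sorted(changed_fields)}


def on_user_updated(payload: Dict[str, Any]) -> None:
    """Email-verification consumer: reacts only when the email changed."""
    if "email" in payload.get("changed_fields", []):
        needs_verification.append(payload["user_id"])


on_user_updated(make_user_updated("u-1", ["display_name"]))
on_user_updated(make_user_updated("u-2", ["email", "display_name"]))
print(needs_verification)
```

Listing field names rather than full values keeps the payload small, in line with the minimal-payload advice above.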

Mistake 4: No Event Versioning

Events change. I didn’t plan for this, and when I needed to add fields, all my consumers broke. Now I include event versions:

Versioned events
event_bus.publish("task.created", {
    "version": 2,  # Schema version of this payload
    "task_id": task.id,
    "user_id": user.id,
    "priority": task.priority  # New field, added in version 2
}, source="task_agent")  # source name is illustrative

Consumers check the version and handle missing fields gracefully.
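A consumer-side version check could be sketched like this. The fallback value `DEFAULT_PRIORITY`, the helper name `parse_task_created`, and the choice to treat a payload without a `version` key as version 1 are assumptions made for the example.

```python
from typing import Any, Dict

DEFAULT_PRIORITY = "normal"  # assumed fallback for payloads that predate the field


def parse_task_created(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Normalize task.created payloads from version 1 and version 2."""
    version = payload.get("version", 1)  # no version key is treated as v1
    if version > 2:
        raise ValueError(f"unsupported task.created version: {version}")
    return {
        "task_id": payload["task_id"],
        "user_id": payload["user_id"],
        # `priority` only exists from version 2 onward.
        "priority": payload.get("priority", DEFAULT_PRIORITY),
    }


v1 = parse_task_created({"task_id": "t-1", "user_id": "u-1"})
v2 = parse_task_created(
    {"version": 2, "task_id": "t-2", "user_id": "u-1", "priority": "high"}
)
print(v1["priority"], v2["priority"])
```

Rejecting unknown future versions is one possible policy; a more tolerant consumer could instead log a warning and read only the fields it recognizes.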

The Results

After migrating to event-driven architecture:

  • Zero race conditions from shared file access
  • Added 3 new agents without modifying existing ones
  • Horizontal scaling - run multiple instances of bottleneck agents
  • Full audit trail - every event logged with timestamp and source
  • Easier testing - test agents in isolation with mock events
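To illustrate the testing point, here is a small sketch of exercising an agent with a mock event and a recording bus. `RecordingBus` and `EchoPlanner` are hypothetical stand-ins with the same shape as the bus and agents in this article; for brevity the handlers take a plain payload dict instead of an `Event` object.

```python
from typing import Any, Callable, Dict, List, Tuple

Handler = Callable[[Dict[str, Any]], None]


class RecordingBus:
    """Hypothetical test double: records publishes, never touches disk."""

    def __init__(self) -> None:
        self.handlers: Dict[str, List[Handler]] = {}
        self.published: List[Tuple[str, Dict[str, Any], str]] = []

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self.handlers.setdefault(event_type, []).append(handler)

    def publish(self, event_type: str, payload: Dict[str, Any], source: str) -> None:
        self.published.append((event_type, payload, source))


class EchoPlanner:
    """Stand-in agent that turns research events into plan events."""

    def __init__(self, bus: RecordingBus) -> None:
        self.bus = bus
        bus.subscribe("research.completed", self.on_research_done)

    def on_research_done(self, payload: Dict[str, Any]) -> None:
        research_id = payload["research_id"]
        self.bus.publish(
            "plan.created",
            {"plan_id": f"plan-for-{research_id}", "research_id": research_id},
            source="planning_agent",
        )


def test_planner_publishes_plan() -> None:
    bus = RecordingBus()
    EchoPlanner(bus)
    # Feed a mock event straight to the subscribed handler.
    for handler in bus.handlers["research.completed"]:
        handler({"research_id": "r-7"})
    assert bus.published == [
        ("plan.created", {"plan_id": "plan-for-r-7", "research_id": "r-7"}, "planning_agent")
    ]


test_planner_publishes_plan()
```

No other agent has to exist for this test to run, which is the practical payoff of agents knowing only event types.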

When to Use Events vs Messages

Events are the right choice for most multi-agent LLM systems because:

  1. Agents are inherently asynchronous - they process at different speeds
  2. Loose coupling enables iteration - add/remove agents without refactoring
  3. Audit trails matter - understand what happened and when
  4. Race conditions are inevitable with shared state

Use direct messages only when you need synchronous request/response semantics and the coupling is acceptable.

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can contact me by email.

Oh, and if you found this article useful, don’t forget to support me by starring the repo on GitHub!
