Should I Use Events or Messages for Multi-Agent LLM Communication? A Practical Guide
My multi-agent LLM system kept corrupting data. Agents were reading partial writes, overwriting each other’s changes, and producing inconsistent results. The culprit? A shared message file that multiple agents were reading and writing concurrently.
Here’s what went wrong and how I fixed it with event-driven architecture.
The Problem: Shared State in Multi-Agent Systems
I started building a 10-agent system for life management with Obsidian. The architecture seemed simple enough: each agent would read from and write to a shared agent-messages.md file. Agent coordination through a single message board.
    ┌─────────────┐
    │   Agent A   │──┐
    └─────────────┘  │
    ┌─────────────┐  │    ┌───────────────────┐
    │   Agent B   │──┼───▶│ agent-messages.md │
    └─────────────┘  │    └───────────────────┘
    ┌─────────────┐  │
    │   Agent C   │──┘
    └─────────────┘

Then the race conditions started.
Agent A would start writing a research task while Agent B was halfway through reading the file. Agent B would get corrupted state. Agent C would overwrite Agent A’s message before Agent D had a chance to read it. The more agents I added, the worse it got.
Why Message Files Break at Scale
The fundamental issue: concurrent file access is hard.
When Agent A writes to a file at the same time Agent B reads it:
- Agent B might read partial data (write in progress)
- Agent B might read stale data (write not yet flushed)
- Agent A’s write might get lost (another agent wrote simultaneously)
File locking adds complexity and kills performance. Each agent has to wait for locks, and deadlock becomes a real risk.
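To make the lost-write case concrete, here is a minimal sketch that replays the interleaving by hand rather than with real concurrency. The helper names and the message text are made up for illustration; only the agent-messages.md file name comes from the setup described above.

```python
import tempfile
from pathlib import Path


def read_messages(path: Path) -> list[str]:
    """Return the message board as a list of lines."""
    return path.read_text().splitlines()


def write_messages(path: Path, messages: list[str]) -> None:
    """Overwrite the whole message board with the given lines."""
    path.write_text("\n".join(messages) + "\n")


def simulate_lost_update() -> list[str]:
    """Interleave two read-modify-write cycles the way two agents might."""
    with tempfile.TemporaryDirectory() as tmp:
        board = Path(tmp) / "agent-messages.md"
        write_messages(board, ["initial message"])

        # Both agents read the board before either one writes back.
        seen_by_a = read_messages(board)
        seen_by_b = read_messages(board)

        # Agent A writes its update first.
        write_messages(board, seen_by_a + ["A: research task"])
        # Agent B writes from its stale copy and silently drops A's message.
        write_messages(board, seen_by_b + ["B: planning task"])

        return read_messages(board)


final_board = simulate_lost_update()
print(final_board)  # ['initial message', 'B: planning task']
```

Nothing fails and nothing is logged: Agent A's message is simply gone, which is why this kind of bug is hard to spot in a running system.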
The Solution: Event-Driven Architecture with Pub/Sub
I replaced direct message passing with an event-driven model using a pub/sub pattern:
    ┌─────────┐            ┌──────────────────┐            ┌─────────┐
    │ Agent A │ ──publish─▶│   Event Broker   │◀─subscribe─│ Agent B │
    └─────────┘            │                  │            └─────────┘
                           │  Event Types:    │
    ┌─────────┐            │  - task.created  │            ┌─────────┐
    │ Agent C │◀─subscribe │  - task.completed│◀─subscribe─│ Agent D │
    └─────────┘            │  - error.occurred│            └─────────┘
                           └──────────────────┘

The key insight: agents don’t communicate with each other directly. They publish events to a broker and subscribe to event types they care about. The broker handles concurrent access safely.
How I Implemented the Event Bus
I started with a simple file-based event bus for persistence and audit trails:
    from dataclasses import dataclass
    from datetime import datetime, timezone
    import json
    from pathlib import Path
    from typing import Any, Callable, Dict, List


    @dataclass
    class Event:
        event_type: str
        payload: Dict[str, Any]
        timestamp: datetime
        source: str


    class EventBus:
        """Simple event bus with file-based persistence for audit trail."""

        def __init__(self, storage_dir: Path):
            self.storage_dir = storage_dir
            self.subscribers: Dict[str, List[Callable]] = {}
            self.storage_dir.mkdir(parents=True, exist_ok=True)

        def subscribe(self, event_type: str, handler: Callable):
            """Register a handler for an event type."""
            if event_type not in self.subscribers:
                self.subscribers[event_type] = []
            self.subscribers[event_type].append(handler)

        def publish(self, event_type: str, payload: Dict[str, Any], source: str):
            """Publish an event to all subscribers."""
            event = Event(
                event_type=event_type,
                payload=payload,
                # Timezone-aware UTC; datetime.utcnow() is deprecated since Python 3.12
                timestamp=datetime.now(timezone.utc),
                source=source
            )

            # Persist to event log (append-only; existing records are never rewritten)
            self._append_to_log(event)

            # Notify subscribers; one failing handler must not block the others
            handlers = self.subscribers.get(event_type, [])
            for handler in handlers:
                try:
                    handler(event)
                except Exception as e:
                    print(f"Handler error for {event_type}: {e}")

        def _append_to_log(self, event: Event):
            """Append event to date-partitioned log file."""
            date_str = event.timestamp.strftime("%Y-%m-%d")
            log_file = self.storage_dir / f"events-{date_str}.jsonl"

            # Each event is written as one JSON line in a single write call
            with open(log_file, "a") as f:
                f.write(json.dumps({
                    "type": event.event_type,
                    "payload": event.payload,
                    "timestamp": event.timestamp.isoformat(),
                    "source": event.source
                }) + "\n")

The append-only log is the secret sauce. Each event is appended to a JSONL file as one short line. On local POSIX filesystems, a small write to a file opened in append mode generally lands intact at the end of the file, although that is not guaranteed everywhere (network filesystems and very large records are the usual exceptions). Agents never rewrite existing records, so a reader never sees an entry change underneath it.
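Since the log doubles as an audit trail, it helps to be able to read it back. The sketch below assumes the file naming (events-YYYY-MM-DD.jsonl) and record keys used by the logging code above; the replay_events helper, the research_agent source name, and the two sample records are hypothetical and exist only to make the example runnable.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Iterator, Optional


def replay_events(
    storage_dir: Path, event_type: Optional[str] = None
) -> Iterator[Dict[str, Any]]:
    """Yield logged events in file order, optionally filtered by type."""
    # Date-stamped file names sort chronologically when sorted as strings.
    for log_file in sorted(storage_dir.glob("events-*.jsonl")):
        with open(log_file) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines
                record = json.loads(line)
                if event_type is None or record["type"] == event_type:
                    yield record


# Usage with two hand-written log lines standing in for real events.
with tempfile.TemporaryDirectory() as tmp:
    log_dir = Path(tmp)
    sample = [
        {"type": "research.completed", "payload": {"research_id": "r1"},
         "timestamp": "2024-01-01T10:00:00", "source": "research_agent"},
        {"type": "plan.created", "payload": {"plan_id": "p1", "research_id": "r1"},
         "timestamp": "2024-01-01T10:05:00", "source": "planning_agent"},
    ]
    with open(log_dir / "events-2024-01-01.jsonl", "a") as f:
        for record in sample:
            f.write(json.dumps(record) + "\n")

    plans = list(replay_events(log_dir, "plan.created"))
    all_events = list(replay_events(log_dir))
```

Because the reader only ever consumes complete lines, it can run while agents keep appending, as long as each record is written in one piece.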
Converting Agents to Event-Driven
Here’s how I converted my PlanningAgent to use events instead of message files:
    class PlanningAgent:
        """Plans content based on research events."""

        def __init__(self, event_bus: EventBus):
            self.event_bus = event_bus
            # Only subscribe to events this agent needs
            event_bus.subscribe("research.completed", self.on_research_done)
            event_bus.subscribe("content.rejected", self.on_content_rejected)

        def on_research_done(self, event: Event):
            """Handle completed research by creating a plan."""
            research_id = event.payload["research_id"]

            # Do planning work
            plan = self.create_plan(research_id)

            # Publish the result without knowing or caring who consumes it
            self.event_bus.publish(
                "plan.created",
                {"plan_id": plan.id, "research_id": research_id},
                source="planning_agent"
            )

        def on_content_rejected(self, event: Event):
            # Handler body not shown in this article
            ...

        def create_plan(self, research_id: str) -> "Plan":
            # Implementation details... (Plan is defined elsewhere)
            ...

And the WriterAgent that consumes planning events:
    class WriterAgent:
        """Writes content based on plan events."""

        def __init__(self, event_bus: EventBus):
            # Keep a reference so handlers can publish their results
            self.event_bus = event_bus
            event_bus.subscribe("plan.created", self.on_plan_created)
            event_bus.subscribe("edit.requested", self.on_edit_requested)

        def on_plan_created(self, event: Event):
            plan_id = event.payload["plan_id"]
            content = self.write_content(plan_id)

            self.event_bus.publish(
                "content.drafted",
                {"content_id": content.id, "plan_id": plan_id},
                source="writer_agent"
            )

        def on_edit_requested(self, event: Event):
            # Handler body not shown in this article
            ...

        def write_content(self, plan_id: str):
            # Implementation details not shown in this article
            ...

Notice how WriterAgent doesn’t know about PlanningAgent. It only knows about the plan.created event. This decoupling is the key benefit.
Why Decoupling Matters
After switching to events, I could add a monitoring agent without touching any existing code:
    class MonitoringAgent:
        """Tracks metrics across all agent activities."""

        def __init__(self, event_bus: EventBus):
            # Subscribe to multiple event types
            for event_type in ["research.completed", "plan.created", "content.drafted"]:
                event_bus.subscribe(event_type, self.track_event)

        def track_event(self, event: Event):
            # Record metrics; no other agent needs to know this one exists.
            # `metrics` is a metrics client defined elsewhere.
            metrics.record(event.event_type, event.source, event.timestamp)

This is much harder with direct message passing, where adding a new agent means modifying every agent that needs to communicate with it.
Mistakes I Made Along the Way
Mistake 1: Treating Events as Synchronous RPC
At first, I tried to make events behave like function calls:
    # WRONG: Treating events like remote procedure calls
    result = event_bus.publish_and_wait("get_data", payload)  # Blocks!

This defeats the entire purpose. Events are asynchronous by design. If you need a response, use a different pattern or make the request/response explicit with correlation IDs.
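A correlation ID is just a unique token that the requester attaches to its request and the responder copies into its reply. Here is a minimal sketch of that idea. TinyBus is a stripped-down stand-in for the event bus (handlers receive plain payload dicts), and the data.requested and data.provided event names, the DataProvider and Requester classes, and the uppercase "work" are all assumptions made for illustration.

```python
import uuid
from collections import defaultdict
from typing import Any, Callable, Dict, List

Handler = Callable[[Dict[str, Any]], None]


class TinyBus:
    """Stand-in for the article's EventBus, reduced to subscribe/publish."""

    def __init__(self) -> None:
        self.subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self.subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: Dict[str, Any]) -> None:
        for handler in list(self.subscribers[event_type]):
            handler(payload)


class DataProvider:
    """Answers data.requested events and echoes the correlation ID."""

    def __init__(self, bus: TinyBus) -> None:
        self.bus = bus
        bus.subscribe("data.requested", self.on_request)

    def on_request(self, payload: Dict[str, Any]) -> None:
        self.bus.publish("data.provided", {
            "correlation_id": payload["correlation_id"],
            "value": payload["key"].upper(),  # placeholder for real work
        })


class Requester:
    """Publishes requests and matches replies by correlation ID."""

    def __init__(self, bus: TinyBus) -> None:
        self.bus = bus
        self.pending: Dict[str, str] = {}  # correlation_id -> requested key
        self.results: Dict[str, Any] = {}  # requested key -> reply value
        bus.subscribe("data.provided", self.on_reply)

    def request(self, key: str) -> str:
        correlation_id = str(uuid.uuid4())
        self.pending[correlation_id] = key
        self.bus.publish(
            "data.requested", {"correlation_id": correlation_id, "key": key}
        )
        return correlation_id

    def on_reply(self, payload: Dict[str, Any]) -> None:
        # Ignore replies that belong to some other requester.
        key = self.pending.pop(payload["correlation_id"], None)
        if key is not None:
            self.results[key] = payload["value"]


bus = TinyBus()
provider = DataProvider(bus)
requester = Requester(bus)
requester.request("alpha")
print(requester.results)  # {'alpha': 'ALPHA'}
```

The requester never blocks on the bus. It records what it is waiting for and reacts when a matching reply shows up, which keeps the flow event-driven even though it looks like request/response.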
Mistake 2: Bloated Event Payloads
I initially stuffed everything into events:
    # WRONG: Including everything
    event_bus.publish("task.created", {
        "task": task,
        "user": user,
        "history": all_history,    # Too much data!
        "related_tasks": related   # Unnecessary coupling
    })

This couples the producer to the consumer’s needs. If the consumer starts needing more data, the producer has to change. Instead, include only identifiers:
    # RIGHT: Minimal event with reference
    event_bus.publish("task.created", {
        "task_id": task.id,
        "user_id": user.id
    })

Consumers can fetch the data they need from their own stores.
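On the consuming side, that lookup might look like the sketch below. The Task fields, InMemoryTaskStore, and NotifierAgent are hypothetical stand-ins for whatever store and agent you actually have; the point is only that the handler receives IDs and resolves the rest itself.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Task:
    id: str
    title: str
    priority: str


class InMemoryTaskStore:
    """Hypothetical stand-in for whatever store the consumer reads from."""

    def __init__(self) -> None:
        self._tasks: Dict[str, Task] = {}

    def save(self, task: Task) -> None:
        self._tasks[task.id] = task

    def get(self, task_id: str) -> Task:
        return self._tasks[task_id]


class NotifierAgent:
    """Consumer that looks up only the fields it needs."""

    def __init__(self, store: InMemoryTaskStore) -> None:
        self.store = store
        self.notifications: List[str] = []

    def on_task_created(self, payload: Dict[str, Any]) -> None:
        # The event carries IDs only; details come from the store.
        task = self.store.get(payload["task_id"])
        self.notifications.append(
            f"New task for {payload['user_id']}: {task.title}"
        )


store = InMemoryTaskStore()
store.save(Task(id="t1", title="Draft outline", priority="high"))
notifier = NotifierAgent(store)
notifier.on_task_created({"task_id": "t1", "user_id": "u1"})
print(notifier.notifications)  # ['New task for u1: Draft outline']
```

If this consumer later needs the priority as well, only its own handler changes; the producer and the event stay the same.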
Mistake 3: Event Type Explosion
I created separate event types for every state change:
    user.created
    user.updated
    user.deleted
    user.email_changed
    user.password_reset
    user.profile_updated
    ...

This becomes unmaintainable. Use a naming convention and group related events:
    user.created
    user.updated
    user.deleted

The user.updated event includes what changed, and consumers decide if they care.
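One way a consumer can "decide if it cares" is to filter on the list of changed fields. This is a sketch under assumptions: the "changed" key holding field names, the make_field_filter helper, and the re-verification handler are all invented for illustration, and handlers take plain payload dicts to keep the example short.

```python
from typing import Any, Callable, Dict, List, Set

Handler = Callable[[Dict[str, Any]], None]


def make_field_filter(fields: Set[str], handler: Handler) -> Handler:
    """Wrap a handler so it only runs when one of `fields` changed."""

    def filtered(payload: Dict[str, Any]) -> None:
        changed = set(payload.get("changed", []))
        if changed & fields:
            handler(payload)

    return filtered


users_to_reverify: List[str] = []


def on_email_changed(payload: Dict[str, Any]) -> None:
    users_to_reverify.append(payload["user_id"])


handle_user_updated = make_field_filter({"email"}, on_email_changed)

# One coarse event type; the payload names the fields that changed.
handle_user_updated({"user_id": "u1", "changed": ["email"]})
handle_user_updated({"user_id": "u2", "changed": ["display_name"]})
print(users_to_reverify)  # ['u1']
```

The wrapped handler would be what gets registered for user.updated, so the event catalogue stays small while each consumer keeps its own notion of relevance.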
Mistake 4: No Event Versioning
Events change. I didn’t plan for this, and when I needed to add fields, all my consumers broke. Now I include event versions:
    event_bus.publish("task.created", {
        "version": 2,               # Added in version 2
        "task_id": task.id,
        "user_id": user.id,
        "priority": task.priority   # New field
    })

Consumers check the version and handle missing fields gracefully.
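On the consumer side, "checking the version" can be a small normalization step that upgrades older payloads before the real handler sees them. The sketch assumes that events published before versioning carry no "version" key and that "normal" is an acceptable default priority; both are illustrative choices, not something the original system is known to do.

```python
from typing import Any, Dict

DEFAULT_PRIORITY = "normal"  # assumed fallback for version 1 events


def normalize_task_created(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Upgrade any known task.created payload to the version 2 shape."""
    version = payload.get("version", 1)  # events without a version count as v1
    if version > 2:
        # Fail loudly on versions this consumer has never seen.
        raise ValueError(f"Unsupported task.created version: {version}")
    return {
        "version": 2,
        "task_id": payload["task_id"],
        "user_id": payload["user_id"],
        "priority": payload.get("priority", DEFAULT_PRIORITY),
    }


old_event = normalize_task_created({"task_id": "t1", "user_id": "u1"})
new_event = normalize_task_created(
    {"version": 2, "task_id": "t2", "user_id": "u1", "priority": "high"}
)
print(old_event["priority"], new_event["priority"])  # normal high
```

Keeping the upgrade logic in one function means the rest of the consumer only ever deals with the latest shape.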
The Results
After migrating to event-driven architecture:
- Zero race conditions from shared file access
- Added 3 new agents without modifying existing ones
- Horizontal scaling - run multiple instances of bottleneck agents
- Full audit trail - every event logged with timestamp and source
- Easier testing - test agents in isolation with mock events
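As an illustration of the testing point, an agent can be exercised against a recording test double instead of the real bus. Everything here is a sketch: RecordingBus and EchoPlanner are hypothetical, handlers take plain payload dicts rather than Event objects, and the plan ID format is invented.

```python
from typing import Any, Callable, Dict, List, Tuple

Handler = Callable[[Dict[str, Any]], None]


class RecordingBus:
    """Test double that records publishes and delivers events by hand."""

    def __init__(self) -> None:
        self.handlers: Dict[str, List[Handler]] = {}
        self.published: List[Tuple[str, Dict[str, Any], str]] = []

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self.handlers.setdefault(event_type, []).append(handler)

    def publish(self, event_type: str, payload: Dict[str, Any], source: str) -> None:
        # Record only; nothing is dispatched, so the agent stays isolated.
        self.published.append((event_type, payload, source))

    def deliver(self, event_type: str, payload: Dict[str, Any]) -> None:
        for handler in self.handlers.get(event_type, []):
            handler(payload)


class EchoPlanner:
    """Hypothetical agent under test: turns research into a plan event."""

    def __init__(self, bus: RecordingBus) -> None:
        self.bus = bus
        bus.subscribe("research.completed", self.on_research_done)

    def on_research_done(self, payload: Dict[str, Any]) -> None:
        research_id = payload["research_id"]
        self.bus.publish(
            "plan.created",
            {"plan_id": f"plan-{research_id}", "research_id": research_id},
            source="planning_agent",
        )


def test_planner_publishes_plan() -> None:
    bus = RecordingBus()
    EchoPlanner(bus)
    bus.deliver("research.completed", {"research_id": "r1"})
    assert bus.published == [
        ("plan.created", {"plan_id": "plan-r1", "research_id": "r1"}, "planning_agent")
    ]


test_planner_publishes_plan()
```

No other agent, file, or LLM call is involved: the test feeds in one event and checks what comes out.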
When to Use Events vs Messages
Events are the right choice for most multi-agent LLM systems because:
- Agents are inherently asynchronous - they process at different speeds
- Loose coupling enables iteration - add/remove agents without refactoring
- Audit trails matter - understand what happened and when
- Race conditions are inevitable with shared state
Use direct messages only when you need synchronous request/response semantics and the coupling is acceptable.
Final Words + More Resources
My goal with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can reach me by email.