Multi-Agent Systems Explained: How Teams of AI Agents Collaborate
Mei-Lin Zhang
ML researcher focused on autonomous agents and multi-agent systems.
Multi-Agent AI Systems: Architectures, Protocols, and Production Realities
The single-agent paradigm is hitting a ceiling. When you ask one LLM to plan, research, write, review, and debug simultaneously, you're fighting against context window limits, role confusion, and the fundamental constraint that one model can't effectively challenge its own assumptions. Multi-agent systems address this by decomposing cognitive work across specialized agents — but the engineering challenges are substantial, and the hype-to-reality ratio remains uncomfortably high.
This is a technical deep dive into how multi-agent systems actually work, where the frameworks stand today, and what you should know before building one.
Why Multi-Agent, Not Just a Bigger Prompt?
Before diving into architectures, it's worth establishing when multi-agent systems actually make sense versus when you're adding complexity for its own sake.
Multi-agent systems provide genuine value when:
- Tasks require different reasoning strategies (e.g., creative generation followed by rigorous validation)
- You need adversarial checks — one agent's output audited by another with different incentives
- Work can be parallelized across independent subtasks
- Domain expertise needs to be isolated to prevent context contamination
- The problem has natural role boundaries (researcher, writer, editor, fact-checker)
They're overkill when:
- A single well-prompted agent with good tooling handles the task
- Latency and cost are primary constraints (each agent call multiplies both)
- The coordination overhead exceeds the task complexity
- You're simulating collaboration that doesn't add epistemic value
The honest truth: many production multi-agent systems are just single-agent workflows with extra steps and extra tokens. The architecture has to earn its complexity.
Core Architectures
1. Debate / Adversarial Architecture
In this model, agents take opposing positions and a judge (or consensus mechanism) evaluates the arguments. The key insight is that LLMs are better at critiquing than at generating perfect output on the first try.
```
┌─────────────┐       ┌─────────────┐
│   Agent A   │       │   Agent B   │
│ (Position)  │◄─────►│  (Counter)  │
└──────┬──────┘       └──────┬──────┘
       │                     │
       └──────────┬──────────┘
                  ▼
          ┌────────────────┐
          │ Judge/Verifier │
          └────────────────┘
```
Where it works well: Factual claims, code review, strategic decision-making, legal reasoning. The OpenAI debate research (Irving et al.) showed that adversarial setups can surface errors that single-pass generation misses entirely.
The catch: Debate only works when agents can meaningfully disagree. If they share the same model, same system prompt, and same temperature, you often get performative disagreement — agents that argue superficially but converge on the same blind spots. Real adversarial value requires either different models, different knowledge bases, or explicitly constrained perspectives.
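To make the loop concrete, here is a minimal sketch of a two-agent debate with a judge. `call_llm` is a hypothetical helper standing in for whatever chat-completion client you use, and the prompts are purely illustrative:

```python
# Minimal debate sketch. call_llm() is a hypothetical stand-in for your
# actual LLM client; replace its body with a real chat-completion call.
def call_llm(system: str, prompt: str) -> str:
    ...  # wrap your provider's API here

def run_debate(question: str, rounds: int = 2) -> str:
    position = call_llm("Argue FOR the claim, citing concrete evidence.", question)
    counter = call_llm("Argue AGAINST the claim, attacking weak points.",
                       f"{question}\n\nOpposing argument:\n{position}")
    for _ in range(rounds - 1):
        # Each side rebuts the other's latest argument
        position = call_llm("Rebut the counterargument; concede valid points.",
                            f"{question}\n\nCounterargument:\n{counter}")
        counter = call_llm("Rebut the argument; concede valid points.",
                           f"{question}\n\nArgument:\n{position}")
    # A separate judge sees both sides and rules
    return call_llm(
        "You are an impartial judge. Decide which side argued better and why.",
        f"Question: {question}\n\nFor:\n{position}\n\nAgainst:\n{counter}",
    )
```

In practice you would give the two debaters different models or knowledge bases, per the caveat above; otherwise the loop tends to produce the performative disagreement described here.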
2. Hierarchical Architecture
A manager agent decomposes tasks, delegates to specialist agents, and synthesizes results. This mirrors organizational structures and is the most common pattern in production.
```
       ┌─────────────────────┐
       │    Manager Agent    │
       │(Planning/Delegation)│
       └──────────┬──────────┘
        ┌─────────┼─────────┐
        ▼         ▼         ▼
    ┌────────┐┌────────┐┌────────┐
    │Research││  Code  ││ Review │
    │ Agent  ││ Agent  ││ Agent  │
    └────────┘└────────┘└────────┘
```
Advantages: Clear accountability, natural task decomposition, easier to debug (you can trace which agent produced which output).
Disadvantages: The manager becomes a bottleneck and single point of failure. If the manager's planning is poor, no amount of specialist competence will save the workflow. Manager agents also consume significant tokens on orchestration logic that doesn't directly contribute to the final output.
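A minimal sketch of the same shape in code, reusing the hypothetical `call_llm` helper from the debate sketch; the specialist roster and the JSON plan format are assumptions for illustration:

```python
# Hierarchical sketch: manager plans, specialists execute, manager
# synthesizes. call_llm() is the hypothetical helper from the debate sketch.
import json

SPECIALISTS = {
    "research": "You are a research agent. Return sourced findings.",
    "code": "You are a coding agent. Return working code only.",
    "review": "You are a review agent. List concrete defects.",
}

def run_hierarchy(task: str) -> str:
    # Manager decomposes the task into (agent, subtask) steps as JSON
    plan = json.loads(call_llm(
        'Decompose the task into steps. Respond with JSON only: '
        '[{"agent": "research|code|review", "subtask": "..."}]',
        task,
    ))
    results = [call_llm(SPECIALISTS[step["agent"]], step["subtask"])
               for step in plan]
    # Manager synthesizes specialist outputs into one deliverable
    return call_llm("Synthesize these results into a final answer.",
                    "\n\n".join(results))
```

Note how every failure path runs through the manager: a bad plan or a bad synthesis poisons the whole run, which is exactly the bottleneck described above.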
3. Swarm / Peer-to-Peer Architecture
Agents communicate laterally without a central coordinator. Each agent has local knowledge and local objectives, and emergent behavior arises from their interactions. This is inspired by ant colonies, flocking algorithms, and distributed systems more broadly.
```
┌────────┐◄──►┌────────┐
│Agent A │    │Agent B │
└───┬────┘    └────┬───┘
    │  ╲           │
    │    ╲         │
    ▼      ▼       ▼
┌────────┐◄──►┌────────┐
│Agent C │    │Agent D │
└────────┘    └────────┘
```
Where it shines: Exploration tasks where you want coverage rather than convergence — multiple research agents exploring different aspects of a problem, coding agents trying different approaches to the same bug.
The reality check: True swarm intelligence requires careful mechanism design. Without it, you get chaotic token-burning where agents talk past each other. Most "swarm" implementations in practice are actually hierarchical systems with the hierarchy abstracted away. The OpenAI Swarm framework, for instance, is really about agent handoffs, not emergent swarm behavior.
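To see the distinction, here is a generic handoff loop. This is not the OpenAI Swarm API; it is a sketch of the handoff idea, with a made-up `HANDOFF:` convention and the hypothetical `call_llm` helper from earlier:

```python
# Generic handoff loop (not the OpenAI Swarm API). Each agent either
# answers or names the peer that should take over via a HANDOFF: line.
AGENTS = {
    "triage": "Route the request: reply HANDOFF:researcher or HANDOFF:writer.",
    "researcher": "Gather facts, then reply HANDOFF:writer with your notes.",
    "writer": "Write the final answer from the notes you were given.",
}

def run_handoffs(request: str, start: str = "triage", max_hops: int = 5) -> str:
    current, message = start, request
    for _ in range(max_hops):
        reply = call_llm(AGENTS[current], message)
        if reply.startswith("HANDOFF:"):
            header, _, payload = reply.partition("\n")
            current = header.removeprefix("HANDOFF:").strip()
            message = payload or message  # carry context to the next agent
        else:
            return reply  # no handoff requested: final answer
    return message  # hop budget exhausted
```

There is nothing emergent here: control passes along a chain, one agent at a time, which is why calling such systems "swarms" overstates what they do.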
Communication Protocols
The architecture defines who talks to whom. The communication protocol defines how.
Message Passing
The simplest approach: agents send structured messages to each other, either directly or through a router. Each message has a sender, recipient, content, and metadata.
```python
from pydantic import BaseModel
from typing import Literal

class AgentMessage(BaseModel):
    sender: str
    recipient: str
    content: str
    message_type: Literal["task", "result", "question", "feedback"]
    context: dict = {}
    parent_id: str | None = None  # For tracking conversation threads
```
Advantage: Simple, debuggable, each agent maintains its own context.
Disadvantage: Context fragmentation — agents may make decisions based on incomplete information because they didn't receive relevant messages. The more agents you have, the harder it is to ensure information flows to where it's needed.
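For illustration, here is a minimal in-memory router over the `AgentMessage` model above; the `Router` class is a hypothetical sketch, not part of any framework:

```python
# Minimal router over AgentMessage: delivers each message to the
# recipient's inbox and keeps a global log for tracing threads.
from collections import defaultdict

class Router:
    def __init__(self):
        self.inboxes: dict[str, list[AgentMessage]] = defaultdict(list)
        self.log: list[AgentMessage] = []

    def send(self, msg: AgentMessage):
        self.log.append(msg)                     # global trace
        self.inboxes[msg.recipient].append(msg)  # per-agent delivery

    def drain(self, agent: str) -> list[AgentMessage]:
        # Hand an agent everything addressed to it since its last turn
        msgs, self.inboxes[agent] = self.inboxes[agent], []
        return msgs
```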
Shared Blackboard
A common workspace where agents read and write information. Think of it as a shared document or database that all agents can access.
```python
import time
from typing import Any

class Blackboard:
    def __init__(self):
        self.state: dict = {}
        self.history: list[dict] = []
        self.locks: dict[str, str] = {}  # resource key -> agent_id holding it

    def write(self, agent_id: str, key: str, value: Any):
        self.state[key] = value
        self.history.append({
            "agent": agent_id,
            "action": "write",
            "key": key,
            "timestamp": time.time()
        })

    def read(self, key: str) -> Any:
        return self.state.get(key)

    def claim(self, agent_id: str, key: str) -> bool:
        """Prevent other agents from modifying a resource."""
        if key in self.locks and self.locks[key] != agent_id:
            return False
        self.locks[key] = agent_id
        return True
```
Advantage: Agents have access to all relevant information. Easier to maintain consistency.
Disadvantage: Becomes a coordination nightmare at scale. Locking mechanisms add complexity. Agents can overwrite each other's work. Context windows still limit how much of the blackboard any single agent can process.
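Usage of the sketch above is deliberately simple: claim a key, write under it, read it elsewhere. Agent IDs and keys here are illustrative:

```python
# Illustrative usage: a researcher claims a key, a writer reads it later.
board = Blackboard()
if board.claim("researcher-1", "findings"):
    board.write("researcher-1", "findings", ["debate surfaces hidden errors"])
notes = board.read("findings")  # the writer consumes the researcher's output
```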
Event-Driven / Pub-Sub
Agents publish events to channels, and other agents subscribe to channels they care about. This decouples producers from consumers.
```python
import asyncio
from collections import defaultdict
from typing import Callable

class EventBus:
    def __init__(self):
        self.subscribers: dict[str, list[Callable]] = defaultdict(list)
        self.event_log: list[dict] = []

    def subscribe(self, event_type: str, handler: Callable):
        self.subscribers[event_type].append(handler)

    async def publish(self, event_type: str, data: dict, source: str):
        self.event_log.append({
            "type": event_type,
            "data": data,
            "source": source,
            "timestamp": asyncio.get_running_loop().time()
        })
        handlers = self.subscribers.get(event_type, [])
        await asyncio.gather(*[h(data) for h in handlers])
```
Advantage: Highly scalable, naturally supports parallel execution, easy to add new agents without modifying existing ones.
Disadvantage: Harder to reason about system behavior. Debugging requires tracing events across multiple subscribers. No guaranteed ordering.
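A short usage sketch for the bus above; the `draft.ready` channel name is made up for the example:

```python
# Illustrative usage: a reviewer subscribes, a writer publishes.
import asyncio

async def main():
    bus = EventBus()

    async def review(data: dict):
        print("reviewing:", data["title"])

    bus.subscribe("draft.ready", review)
    await bus.publish("draft.ready", {"title": "Multi-agent systems"},
                      source="writer")

asyncio.run(main())
```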
Framework Deep Dives
CrewAI: Role-Based Orchestration
CrewAI models multi-agent collaboration as a "crew" of agents with defined roles, goals, and backstories. It's opinionated toward hierarchical workflows and is the most accessible framework for developers new to multi-agent systems.
```python
from crewai import Agent, Task, Crew, Process

# Define agents with distinct roles
# (assumes search_tool and scrape_tool are defined elsewhere)
researcher = Agent(
    role="Senior Research Analyst",
    goal="Find comprehensive, accurate information on the given topic",
    backstory="""You are an experienced research analyst with a
    background in technology journalism. You prioritize primary
    sources and verify claims across multiple references.""",
    tools=[search_tool, scrape_tool],
    verbose=True,
    allow_delegation=False
)

writer = Agent(
    role="Technical Writer",
    goal="Write clear, engaging content based on research provided",
    backstory="""You are a technical writer who excels at making
    complex topics accessible. You use concrete examples and
    avoid jargon unless necessary.""",
    verbose=True,
    allow_delegation=False
)

editor = Agent(
    role="Senior Editor",
    goal="Ensure accuracy, clarity, and consistency",
    backstory="""You are a meticulous editor with 15 years of
    experience. You check factual claims, improve flow, and
    ensure the piece serves the target audience.""",
    verbose=True,
    allow_delegation=True  # Editor can send work back
)

# Define tasks with dependencies
research_task = Task(
    description="Research the current state of multi-agent AI systems",
    expected_output="A structured research brief with key findings and sources",
    agent=researcher
)

writing_task = Task(
    description="Write a 1500-word article based on the research",
    expected_output="A complete draft article in markdown",
    agent=writer,
    context=[research_task]  # Depends on research output
)

editing_task = Task(
    description="Review and edit the article for publication",
    expected_output="A polished, publication-ready article",
    agent=editor,
    context=[writing_task]
)

# Assemble and run the crew
crew = Crew(
    agents=[researcher, writer, editor],
    tasks=[research_task, writing_task, editing_task],
    process=Process.sequential,  # Tasks execute in order
    verbose=True
)

result = crew.kickoff()
```
What CrewAI gets right:
- The role/goal/backstory abstraction is intuitive and maps well to how we think about team composition
- Task dependencies with the `context` parameter create clean data flow
- The `allow_delegation` flag gives fine-grained control over agent autonomy
- Built-in memory and planning features reduce boilerplate
Where CrewAI struggles:
- Sequential processes are well-supported; parallel execution is less mature
- Error handling is brittle — if one agent fails, the whole crew can stall
- Token costs escalate quickly because each agent maintains its own conversation history
- The "backstory" approach to agent behavior is surprisingly effective but also surprisingly unpredictable — small wording changes in backstories can dramatically alter agent behavior
- Limited support for truly dynamic workflows where the task graph changes at runtime
CrewAI's sequential nature makes it best suited for pipeline-style workflows: research → draft → review → publish. It's less suited for iterative refinement loops or competitive evaluation.
AutoGen: Conversational Multi-Agent Framework
Microsoft's AutoGen takes a fundamentally different approach. Agents communicate through conversational turns, and the framework supports complex conversation patterns including group chats, nested conversations, and conditional flows.
```python
from autogen import AssistantAgent, UserProxyAgent, GroupChat, GroupChatManager

# Create specialized agents
planner = AssistantAgent(
    name="Planner",
    system_message="""You are a project planner. Given a task, break it
    down into concrete steps. Output a numbered plan. Be specific about
    what each step requires.""",
    llm_config={"model": "gpt-4", "temperature": 0.3}
)

coder = AssistantAgent(
    name="Coder",
    system_message="""You are a Python developer. Write clean, tested code.
    When you receive a plan step, implement it. Always include type hints
    and docstrings. If you need clarification, ask.""",
    llm_config={"model": "gpt-4", "temperature": 0.2}
)

reviewer = AssistantAgent(
    name="Reviewer",
    system_message="""You are a code reviewer. Check for bugs, security
    issues, performance problems, and style violations. Be specific about
    issues and suggest fixes. If the code looks good, say APPROVED.""",
    llm_config={"model": "gpt-4", "temperature": 0.1}
)

# UserProxy executes code and provides human feedback
executor = UserProxyAgent(
    name="Executor",
    human_input_mode="NEVER",  # Auto-approve for demo
    code_execution_config={
        "work_dir": "workspace",
        "use_docker": True,  # Sandboxed execution
    }
)

# Group chat with all agents
group_chat = GroupChat(
    agents=[planner, coder, reviewer, executor],
    messages=[],
    max_round=15,
    speaker_selection_method="auto"  # Manager decides who speaks next
)

manager = GroupChatManager(
    groupchat=group_chat,
    llm_config={"model": "gpt-4"}
)

# Initiate the conversation
executor.initiate_chat(
    manager,
    message="""Build a CLI tool that monitors a directory for new files
    and logs their metadata (size, creation time, file type) to a SQLite
    database. Include a query mode to search the database."""
)
```
What AutoGen gets right:
- The conversational model is natural and flexible — agents can interrupt, ask questions, and iterate
- `GroupChatManager` with `speaker_selection_method="auto"` lets the LLM decide conversation flow, which is powerful for dynamic tasks
- Code execution is first-class — the `UserProxyAgent` can actually run code, see errors, and feed them back
- Nested conversations allow sub-discussions without polluting the main thread
- The `speaker_selection_method` options (`auto`, `round_robin`, `random`, or a custom function) give real control over conversation dynamics
Where AutoGen struggles:
- Group chats can become disorganized — agents repeat themselves, go off-topic, or fail to converge
- The `max_round` parameter is a blunt instrument for controlling conversation length
- Token costs in group chats are brutal because every agent sees every message
- Speaker selection with `auto` mode sometimes leads to the wrong agent responding
- Error recovery is limited — if an agent produces bad output, there's no built-in mechanism to detect and correct it beyond other agents noticing
AutoGen's strength is in open-ended, exploratory tasks where the conversation needs to evolve. It's particularly good for code generation tasks because of the integrated execution environment.
Framework Comparison
| Feature | CrewAI | AutoGen |
|---|---|---|
| Architecture | Hierarchical/pipeline | Conversational/group |
| Best for | Structured workflows | Exploratory tasks |
| Code execution | Via tools | Built-in (UserProxy) |
| Parallel agents | Limited | Supported via GroupChat |
| Learning curve | Low | Medium |
| Cost control | Better (sequential) | Harder (group chat) |
| Dynamic routing | Limited | Strong (speaker selection) |
| Production readiness | Moderate | Moderate |
| Community size | Large, growing | Large (Microsoft backing) |
Real-World Applications
Software Development Pipelines
The most mature application of multi-agent systems. A planning agent decomposes requirements, a coding agent implements, a testing agent writes and runs tests, and a review agent checks quality. This isn't hypothetical — GitHub Copilot Workspace and Cursor are moving in this direction.
The key insight from production deployments: the review agent matters more than the coding agent. An excellent coder paired with a mediocre reviewer produces worse output than a mediocre coder paired with an excellent reviewer. Invest your best prompts and models in the verification layer.
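One way to act on that: make the reviewer a gate rather than a peer. A minimal sketch, again assuming the hypothetical `call_llm` helper from earlier; the prompts and the APPROVED convention are illustrative:

```python
# Verification gate sketch: generation is cheap and retried; the reviewer
# (your strongest model and prompt) decides what passes.
def generate_with_review(task: str, max_attempts: int = 3) -> str:
    feedback = ""
    for _ in range(max_attempts):
        draft = call_llm("You are a Python developer. Implement the task.",
                         f"{task}\n\nPrior review feedback:\n{feedback}")
        verdict = call_llm(
            "You are a strict code reviewer. Reply APPROVED or list defects.",
            f"Task: {task}\n\nCode:\n{draft}",
        )
        if verdict.strip().startswith("APPROVED"):
            return draft
        feedback = verdict  # feed defects back into the next attempt
    return draft  # best effort after exhausting the retry budget
```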
Research and Analysis
Multiple research agents explore different facets of a question, then a synthesis agent combines findings. This is effective when the research space is broad and you want coverage. Perplexity and similar tools are essentially single-agent systems, but the multi-agent approach can catch contradictions and fill gaps that a single pass misses.
Customer Support Escalation
A triage agent classifies incoming requests, routes to specialized agents (billing, technical, sales), and a quality agent monitors conversations for escalation signals. This is one of the few areas where multi-agent systems are genuinely in production at scale, though most implementations use traditional routing with LLM augmentation rather than fully autonomous agent collaboration.
Financial Analysis
Competing analyst agents evaluate a trade thesis from bull and bear perspectives, a risk agent quantifies exposure, and a compliance agent checks regulatory constraints. The adversarial structure here is natural and valuable — forcing explicit counterarguments improves decision quality.
The Hard Problems Nobody Talks About
Token Cost Multiplication
A three-agent system doesn't cost 3x a single agent — it often costs 8-15x because of repeated context, coordination messages, and the fact that agents tend to be verbose. A CrewAI crew with three agents running a research-write-edit pipeline can easily consume 50,000+ tokens for a task that a single agent could handle in 8,000 tokens. You need to be certain the quality improvement justifies the cost.
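The arithmetic is worth doing explicitly. Using this section's illustrative token counts and a placeholder blended price (substitute your provider's real rates):

```python
# Illustrative cost comparison using this section's example token counts.
# PRICE_PER_1K is a placeholder, not any provider's actual rate.
PRICE_PER_1K = 0.01             # $ per 1,000 tokens, blended in/out

single_agent_tokens = 8_000     # one well-prompted agent
crew_tokens = 50_000            # research -> write -> edit pipeline

single_cost = single_agent_tokens / 1_000 * PRICE_PER_1K
crew_cost = crew_tokens / 1_000 * PRICE_PER_1K
print(f"single: ${single_cost:.2f}, crew: ${crew_cost:.2f}, "
      f"multiplier: {crew_cost / single_cost:.1f}x")  # ~6.2x on tokens alone
```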
Failure Mode Cascading
When one agent produces bad output, downstream agents often amplify the error rather than catching it. This is especially problematic in sequential pipelines where later agents trust earlier outputs. The mitigation is explicit verification steps, but these add latency and cost.
Non-Determinism Compounds
A single LLM call is non-deterministic. A multi-agent system with five calls is wildly non-deterministic. The same input can produce dramatically different outputs across runs. This makes debugging, testing, and quality assurance extremely difficult. There's no good solution yet — you're essentially accepting that multi-agent systems are probabilistic and building evaluation frameworks around that reality.
Observability Is an Afterthought
When your system has five agents exchanging twenty messages, understanding why a particular output was produced requires tracing the entire conversation graph. Most frameworks have rudimentary logging at best. You'll likely need to build custom observability tooling — structured logging of every agent interaction, conversation graph visualization, and token usage tracking per agent.
```python
# What you'll probably end up building yourself
import time

class AgentTracer:
    def __init__(self):
        self.traces: list[dict] = []

    def log_interaction(self,
                        agent_id: str,
                        input_messages: list[dict],
                        output: str,
                        tokens_used: int,
                        latency_ms: float,
                        model: str = "gpt-4",
                        tool_calls: list[dict] | None = None):
        self.traces.append({
            "agent_id": agent_id,
            "model": model,  # recorded so per-model pricing works below
            "input_summary": self._summarize_inputs(input_messages),
            "output": output,
            "tokens_used": tokens_used,
            "latency_ms": latency_ms,
            "tool_calls": tool_calls or [],
            "timestamp": time.time()
        })

    def _summarize_inputs(self, messages: list[dict]) -> str:
        # Keep traces readable: truncate each message to its first 100 chars
        return " | ".join(m.get("content", "")[:100] for m in messages)

    def get_total_cost(self, model_pricing: dict) -> float:
        total = 0.0
        for trace in self.traces:
            pricing = model_pricing.get(trace.get("model", "gpt-4"), {})
            total += trace["tokens_used"] * pricing.get("per_token", 0)
        return total

    def visualize_flow(self) -> str:
        """Generate a Mermaid diagram of agent interactions."""
        # Implementation for debugging
        pass
```
Practical Recommendations
Start with a single agent and tools. Only add agents when you've confirmed that a single agent genuinely can't handle the task. The overhead is real.
Invest in your review/verification agent. It has the highest ROI of any agent in the system.
Set hard token budgets per agent and per workflow. Without budgets, multi-agent systems will happily spend $50 on a task worth $0.50 of value.
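A hard cap can be a few lines; this hypothetical guard raises as soon as cumulative spend crosses the limit, which is crude but effective:

```python
# Hard budget guard: charge every agent call against a workflow-level cap.
class TokenBudget:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens: int):
        self.used += tokens
        if self.used > self.max_tokens:
            raise RuntimeError(
                f"budget exceeded: {self.used}/{self.max_tokens} tokens")
```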
Use structured output (JSON schemas) for inter-agent communication. Free-text messages between agents create parsing failures and ambiguity.
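As a sketch of that recommendation, assuming pydantic v2 (`model_validate_json`): define the schema once, validate at the boundary, and re-prompt on failure instead of guessing:

```python
# Validate inter-agent payloads at the boundary; reject what doesn't parse.
from pydantic import BaseModel, ValidationError

class ReviewResult(BaseModel):
    approved: bool
    issues: list[str]

def parse_review(raw_json: str) -> ReviewResult | None:
    try:
        return ReviewResult.model_validate_json(raw_json)
    except ValidationError:
        return None  # caller can re-prompt the agent instead of guessing
```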
Build observability from day one. You cannot debug what you cannot trace.
Test with adversarial inputs. Multi-agent systems fail in interesting ways — one agent's hallucination becomes another agent's "fact." Build evaluation sets that specifically target these failure modes.
Where This Is Heading
The current generation of multi-agent frameworks is infrastructure — they handle message routing, conversation management, and tool integration. The next generation will need to address:
- Dynamic team composition — agents joining and leaving based on task requirements
- Learning from collaboration — agents improving their coordination patterns over time
- Cost-aware routing — automatically selecting cheaper models for tasks that don't require top-tier reasoning
- Standardized inter-agent protocols — something like A2A (Agent-to-Agent protocol) or MCP (Model Context Protocol) becoming universal
- Formal verification — proving properties about multi-agent system behavior rather than just testing
The multi-agent paradigm isn't going away. But we're in the "assembly language" era — the abstractions are primitive, the debugging tools are rudimentary, and the failure modes are poorly understood. Build with caution, measure everything, and be honest about whether the complexity is earning its keep.