Coordination Patterns: Choreography vs Orchestration to Tame Exponential Complexity
Scaling from one agent to five takes coordination points from zero to at least ten (one per agent pair), each a potential race condition or failure mode; complexity grows roughly with the square of the agent count, making the system on the order of 25x harder to reason about. A production credit decisioning system with five agents (credit score, income verification, risk assessment, fraud detection, approval), shipped in three days, produced incorrect risk ratings on 20% of applications because the risk agent read stale cache data: a 750 score was written, but a failed invalidation meant 680 was read 500ms later. The fix is to choose coordination patterns deliberately.
Choreography suits simple, event-driven workflows where agents need high autonomy: agents publish and subscribe to events (e.g., research-completed triggers analysis-ready) over a message bus, which keeps coupling loose and makes adding agents easy. The cost is silent failure: without bulletproof observability and end-to-end event tracing, debugging is guesswork, so avoid choreography unless you can trace every event.
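To make the event flow concrete, here is a minimal in-process sketch of the publish/subscribe pattern. The EventBus class and event names are illustrative assumptions; a production system would use a real message bus (Kafka, a cloud pub/sub service, etc.):

```python
from collections import defaultdict

class EventBus:
    """Toy in-process message bus: agents subscribe to event types and
    publish events without knowing who consumes them (loose coupling)."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self._subscribers[event_type]:
            handler(payload)

bus = EventBus()

# The analysis agent reacts to research-completed; research never calls it directly.
def analysis_agent(payload):
    print(f"analysis starting on: {payload['findings']}")
    bus.publish("analysis-ready", {"summary": "draft"})

bus.subscribe("research-completed", analysis_agent)
bus.publish("research-completed", {"findings": "Q3 revenue up 12%"})
```

Note what is missing: nothing in this code tells you which handler dropped an event or why, which is exactly the observability gap described above.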
Orchestration fits complex dependencies, stable workflows, and rollback requirements (e.g., financial services): a central orchestrator (such as LangGraph, as used with the Databricks Mosaic AI Agent Framework) manages the DAG, state, parallelism, and retries. Agents stay dumb: input in, output out, no direct agent-to-agent calls. You get full execution-graph visibility, a single dashboard, and easy tracing.
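In production this role falls to LangGraph; the sketch below shows the same idea with nothing but the standard library's graphlib. The credit-pipeline agents, DAG shape, and retry count are illustrative assumptions:

```python
from graphlib import TopologicalSorter

# Illustrative "dumb" agents: pure functions, input in, output out, no direct calls.
def credit_score(state):  return {"score": 720}
def income_check(state):  return {"income_verified": True}
def risk_assess(state):   return {"risk": "low" if state["score"] > 700 else "high"}

class Orchestrator:
    """Owns the DAG, the shared state, and retries; agents never talk to
    each other. LangGraph plays this role in production."""
    def __init__(self, dag, agents, max_retries=2):
        self.dag, self.agents, self.max_retries = dag, agents, max_retries

    def run(self):
        state = {}
        for name in TopologicalSorter(self.dag).static_order():
            for attempt in range(self.max_retries + 1):
                try:
                    state.update(self.agents[name](state))
                    break
                except Exception:
                    if attempt == self.max_retries:
                        raise  # surface to the dashboard / alerting
        return state

# The DAG maps each agent to its prerequisites; risk waits on both upstream agents.
dag = {"credit_score": set(), "income_check": set(),
       "risk_assess": {"credit_score", "income_check"}}
agents = {"credit_score": credit_score, "income_check": income_check,
          "risk_assess": risk_assess}
print(Orchestrator(dag, agents).run())
```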
Decision framework: simple workflow plus high autonomy calls for choreography; complex dependencies plus low autonomy calls for orchestration; complex dependencies plus high autonomy calls for a hybrid with sagas. Apply this matrix before building; teams that default to choreography because it sounds more 'agentic' routinely burn months firefighting.
Immutable State Snapshots and Data Contracts Eliminate Races
Shared mutable state causes lost updates: two agents both read a score of 680, one writes 750, the other writes 720, and the last write wins. Databases offer row locks and isolation levels, but the defaults ship races to production.
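The failure mode in miniature (shown sequentially for determinism; in production the interleaving is concurrent):

```python
# Both agents read the same value, then overwrite each other blindly.
state = {"score": 680}
read_a = state["score"]       # agent A reads 680
read_b = state["score"]       # agent B reads 680
state["score"] = read_a + 70  # A writes 750
state["score"] = read_b + 40  # B writes 720: A's update is silently lost
assert state["score"] == 720
```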
Instead, use immutable snapshots with versioning: agents produce sealed state objects (a frozen Python dataclass carrying version, payload, and creator). Each handoff validates the schema against a data contract, increments the version, and appends the result as a new row in the orchestrator's append-only log (e.g., a Delta Lake table). No updates, only inserts. Lineage is traceable: if version 7 is bad, binary-search backwards through the versions (check 6, then 5) to find where the state went wrong.
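A minimal sketch of the snapshot and log, assuming an in-memory list stands in for the Delta table (in production, append would be an INSERT into Delta):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)  # frozen: any field mutation raises FrozenInstanceError
class StateSnapshot:
    version: int
    payload: dict  # a production version would freeze this too (e.g., MappingProxyType)
    creator: str

class AppendOnlyLog:
    """Stand-in for the orchestrator's append-only state log.
    In production these are inserts into a Delta Lake table; never UPDATEs."""
    def __init__(self):
        self._rows = []

    def append(self, prev: Optional[StateSnapshot], payload: dict, creator: str) -> StateSnapshot:
        version = 1 if prev is None else prev.version + 1
        snap = StateSnapshot(version=version, payload=payload, creator=creator)
        self._rows.append(snap)
        return snap

    def at(self, version: int) -> StateSnapshot:
        return self._rows[version - 1]  # versions are monotonic, so lookup is direct

log = AppendOnlyLog()
v1 = log.append(None, {"score": 680}, creator="credit_score")
v2 = log.append(v1, {"score": 750}, creator="credit_score")
assert log.at(1).payload["score"] == 680  # v1 is untouched: no stale in-place overwrite
```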
Contracts enforce boundaries: the research agent outputs {findings, confidence, sources}, and the analysis agent rejects any handoff with confidence below 0.7. Register contracts in Unity Catalog for versioning and governance. The outcome: no stale reads, clear schema evolution, audit replay, and race conditions become structurally impossible.
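A sketch of that contract as a frozen dataclass plus a boundary check (field names come from the text; the helper name validate_handoff and the non-empty sources check are illustrative extras):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ResearchOutput:
    findings: str
    confidence: float
    sources: list

def validate_handoff(output: ResearchOutput) -> ResearchOutput:
    """Boundary check run by the analysis agent before accepting input:
    reject low-confidence work instead of letting it skew downstream results."""
    if not output.sources:
        raise ValueError("contract violation: at least one source required")
    if output.confidence < 0.7:
        raise ValueError(f"contract violation: confidence {output.confidence} < 0.7")
    return output

ok = validate_handoff(ResearchOutput("Q3 revenue up 12%", 0.85, ["10-K filing"]))
```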
Failure Recovery: Circuit Breakers and Saga Compensation for 24/7 Reliability
Agents fail (LLM timeouts, rate limits), so design for it. Put a circuit breaker on every agent call: five consecutive failures open the circuit (fail fast, stop bombarding the downstream service); after a 60s timeout it goes half-open and tests a single request; success closes it, failure reopens it. Log state transitions in MLflow and enforce the breaker at the serving layer (Databricks Model Serving / AI Gateway). This prevents cascades: when one agent is down, degrade gracefully (skip it, serve cached output, alert).
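A self-contained sketch of that state machine, with the thresholds from the text (the class itself is illustrative; in production the serving layer enforces this):

```python
import time

class CircuitBreaker:
    """Per-agent breaker: 5 consecutive failures open the circuit;
    after 60s, go half-open and allow one probe request."""
    def __init__(self, failure_threshold=5, reset_timeout=60.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, agent_fn, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast, agent not called")
            self.state = "half-open"  # timeout elapsed: allow one probe request
        try:
            result = agent_fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"  # probe failed or threshold hit: reopen
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.state = "closed"  # success closes the circuit
        return result
```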
Use saga compensation for partial failures: each agent implements execute() and compensate(), where compensate() reverses the executed operation. The orchestrator tracks the list of executed steps; on failure it calls compensate() in reverse order (e.g., the analysis agent deletes its draft, the research agent clears its cache). This restores the initial state and provides transaction-like semantics over distributed agents, which is essential for finance.
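A minimal saga sketch, assuming each step is a (name, execute, compensate) triple of callables over a shared state dict:

```python
class SagaOrchestrator:
    """On any failure, run compensations for completed steps in reverse
    order to restore the initial state."""
    def __init__(self, steps):
        self.steps = steps  # list of (name, execute_fn, compensate_fn)

    def run(self, state: dict) -> dict:
        executed = []
        try:
            for name, execute, compensate in self.steps:
                execute(state)
                executed.append((name, compensate))
        except Exception:
            for name, compensate in reversed(executed):
                compensate(state)  # e.g. analysis deletes its draft, research clears its cache
            raise
        return state

saga = SagaOrchestrator([
    ("research", lambda s: s.update(cache="warm"), lambda s: s.pop("cache", None)),
    ("analysis", lambda s: s.update(draft="v1"),   lambda s: s.pop("draft", None)),
])
print(saga.run({}))
```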
Production Architecture: Databricks Stack for Billions of Transactions
The orchestrator (LangGraph) owns the workflow engine and state store and queries the observability layer. It calls agents (Unity Catalog functions and models, in SQL or Python) through the serving layer, which applies circuit breakers and retries. Every state version lands as an immutable Delta row tied to an MLflow trace (latency, inputs/outputs, token counts, LLM-judge scores). Unity Catalog governs and tracks lineage for both data and agents, and Agent Bricks packages these patterns. The result: 24/7 operation across billions of transactions with full control over parallelism, rollbacks, and debugging.
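To show how a state version ties to a trace, here is a hedged sketch of the persistence step. The helper record_state_version, the table name agents.credit.state_log, and its column layout are assumptions, not Databricks APIs; mlflow.start_run and the Delta append are standard calls:

```python
import mlflow

def record_state_version(spark, snapshot, table="agents.credit.state_log"):
    """Persist one immutable state version as a Delta row, linked to an
    MLflow run so traces and state share a join key. Illustrative layout."""
    with mlflow.start_run(run_name=f"state_v{snapshot.version}") as run:
        mlflow.log_param("creator", snapshot.creator)
        mlflow.log_param("version", snapshot.version)
        row = [(snapshot.version, snapshot.creator, str(snapshot.payload), run.info.run_id)]
        schema = "version INT, creator STRING, payload STRING, mlflow_run_id STRING"
        # Insert-only: the state log is append-only by construction, never updated.
        spark.createDataFrame(row, schema).write.format("delta") \
            .mode("append").saveAsTable(table)
```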