Connection Ownership Mismatch Causes Silent Failures

In single-instance deployments, callbacks from async workflows reach the same process that holds the client's SSE or WebSocket connection, so updates arrive instantly. Horizontal scaling with Kubernetes replicas behind a load balancer breaks this: a client connects to one pod (say, Pod A), but the callback lands on another (Pod B). Pod B processes correctly (it validates, logs, persists state, and returns 200 OK) but cannot deliver the update, since it lacks the in-memory connection. Users see no updates even though every metric looks healthy: low CPU, low latency, no errors. This "distributed client-context problem" emerges because stateless services scale execution but not long-lived connections, which remain process-local state.
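A minimal simulation makes the mismatch concrete. The `Pod` class and its methods below are illustrative, not a real API: each pod keeps a process-local connection map, so a callback arriving at the wrong pod processes successfully yet delivers nothing.

```python
class Pod:
    def __init__(self, name):
        self.name = name
        self.connections = {}  # client_id -> in-memory connection (process-local)
        self.delivered = []

    def connect(self, client_id):
        # The SSE/WebSocket handshake keeps the socket in this process only.
        self.connections[client_id] = f"socket-{client_id}"

    def handle_callback(self, client_id, event):
        # Processing always succeeds: validate, persist, return 200 OK.
        processed = True
        # Delivery only works if this pod happens to own the connection.
        if client_id in self.connections:
            self.delivered.append((client_id, event))
            return processed, True
        return processed, False  # 200 OK, but the user sees nothing


pod_a, pod_b = Pod("A"), Pod("B")
pod_a.connect("client-1")                                  # client connects to Pod A
ok, delivered = pod_b.handle_callback("client-1", "done")  # callback lands on Pod B
print(ok, delivered)  # True False: healthy metrics, silent delivery failure
```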

Cloud-native statelessness excels at scaling and recovery but ignores that connections bind to specific replicas. Async webhooks and background jobs can land on any replica, so execution is decoupled from delivery unless the system coordinates them explicitly.

Decouple Processing from Delivery Using Pub/Sub

Sticky sessions and switching from SSE to WebSockets both fail because they pin the client, not the callback: external systems still reach arbitrary replicas. Instead, add a broadcast layer: the receiving replica publishes events to a shared channel (Redis Pub/Sub fits for low-latency fan-out). All replicas subscribe; only the one owning the connection forwards to the client.
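A sketch of the pattern, using an in-memory `Broker` as a stand-in for Redis Pub/Sub (class names and the channel format are assumptions). Every replica subscribes; a publish fans out to all of them, but only the connection owner acts:

```python
class Broker:
    """In-memory fan-out: every subscriber sees every published message."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self, handler):
        self.subscribers.append(handler)

    def publish(self, channel, message):
        for handler in self.subscribers:
            handler(channel, message)


class Replica:
    def __init__(self, broker):
        self.connections = {}              # channel -> local client connection
        self.delivered = []
        broker.subscribe(self.on_message)  # every replica subscribes

    def on_message(self, channel, message):
        # Only the replica holding the connection forwards to the client.
        if channel in self.connections:
            self.delivered.append((channel, message))


broker = Broker()
pod_a, pod_b = Replica(broker), Replica(broker)
pod_a.connections["user:42"] = "sse-connection"  # client connected to Pod A
broker.publish("user:42", "workflow-complete")   # Pod B publishes after processing
# pod_a forwards the event; pod_b receives it and drops it silently.
```

The receiving replica never needs to know who owns the connection; it publishes and moves on.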

Derive stable channel IDs from user/request IDs. Each pod maps these to active in-memory connections via a shared subscriber, avoiding per-client subscriptions that don't scale. Clean up mappings on disconnect to prevent stale references, memory leaks, or race conditions during reconnects. This makes delivery predictable without routing callbacks to specific pods.
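One way to sketch this bookkeeping, with a `channel_for` helper and a `ConnectionRegistry` (both hypothetical names): the channel name is derived deterministically so any replica can compute it, and disconnect cleanup removes the mapping:

```python
def channel_for(user_id: str) -> str:
    # Deterministic: any replica can compute the same channel name.
    return f"updates:user:{user_id}"


class ConnectionRegistry:
    """One shared subscriber per pod; map channels to local connections."""
    def __init__(self):
        self._by_channel = {}

    def register(self, user_id, connection):
        self._by_channel[channel_for(user_id)] = connection

    def unregister(self, user_id):
        # Clean up on disconnect to avoid stale references and leaks.
        self._by_channel.pop(channel_for(user_id), None)

    def dispatch(self, channel, message):
        # Called by the pod's single shared subscriber for every message.
        conn = self._by_channel.get(channel)
        if conn is not None:
            conn.send(message)
        return conn is not None
```

Because one subscriber serves the whole pod, the subscription count stays constant as clients come and go.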

Stateless services don't eliminate state—they relocate it (e.g., to Redis). Coordination treats delivery as a separate concern from processing, enabling clean horizontal scaling.

Monitor End-to-End Delivery, Not Just Processing

Dashboards miss this: processing succeeds (green metrics), but delivery fails silently. Propagate correlation IDs across initiation, callback, publication, and client receipt to trace divergences. Alert on coordination health—e.g., published events without deliveries—beyond infrastructure metrics.
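A toy tracker sketches the alerting idea: record which stage each correlation ID reached, then flag IDs that were published but never delivered (the stage names are illustrative):

```python
from collections import defaultdict


class DeliveryTracker:
    STAGES = ("initiated", "callback", "published", "delivered")

    def __init__(self):
        self.stages = defaultdict(set)  # correlation_id -> stages reached

    def record(self, correlation_id, stage):
        self.stages[correlation_id].add(stage)

    def undelivered(self):
        # Coordination health: processing succeeded, delivery did not.
        return [cid for cid, seen in self.stages.items()
                if "published" in seen and "delivered" not in seen]


tracker = DeliveryTracker()
for stage in ("initiated", "callback", "published", "delivered"):
    tracker.record("req-1", stage)
for stage in ("initiated", "callback", "published"):
    tracker.record("req-2", stage)
print(tracker.undelivered())  # ['req-2']
```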

Make updates idempotent: duplicates are harmless, and misses are recoverable because clients can poll authoritative backend state. Streaming enhances UX but is not a correctness mechanism; backend state remains the source of truth. Redis Pub/Sub's transience (messages are lost on restarts) reinforces this discipline.
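A minimal sketch of that client-side discipline (names are illustrative): updates are deduplicated by event ID, and a reconcile step polls the authoritative backend when the stream may have dropped something:

```python
class ClientState:
    def __init__(self):
        self.seen = set()
        self.status = None

    def apply(self, event_id, status):
        if event_id in self.seen:  # duplicate delivery: harmless no-op
            return False
        self.seen.add(event_id)
        self.status = status
        return True

    def reconcile(self, fetch_status):
        # Missed events are recoverable: backend state is the source of truth.
        self.status = fetch_status()


client = ClientState()
client.apply("evt-1", "running")
client.apply("evt-1", "running")   # duplicate is ignored
client.reconcile(lambda: "done")   # stream missed the final event; poll recovers it
print(client.status)  # done
```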

Design Rules Prevent Recurrence

  • Treat connections as local state, not shared.
  • Broadcast for any-node completion.
  • Track full-path delivery with correlation IDs.
  • Ensure idempotency and authoritative state.

Ask upfront: which replica owns the connection, and how does the system find it? This beats transport tweaks. Modern Kubernetes dynamism, webhook reliance, and real-time UIs amplify the issue in event-driven SaaS.