Rainbow Deploys: Git SHA-Named Kubernetes Deployments for Stateful Drains
For stateful services such as websocket backends that need hours to drain connections, create a Kubernetes Deployment named after each git SHA, switch the Service selector to the new Deployment, and delete the old one manually once its traffic has burned down. Unlike a rolling update, this avoids mass reconnects.
Zero-Downtime Deploys for Stateful Services
Stateful services like Olark's chat backend hold websocket-to-XMPP connections in each pod, so a sudden pod restart forces every user on that pod to reconnect and spikes load. A traditional Kubernetes rolling deploy kills old pods as soon as new ones start, disrupting everyone. Instead, keep several Deployments running in parallel until the connections on the old ones drain naturally over 24-48 hours. Each Deployment needs 16 pods (2GB RAM, 1 CPU each). This preserves user sessions without hacks like hot code reloading, which containers are meant to avoid.
Previous approaches fell short. Porting the 'up' tool (which forks new workers and drains old ones over days) required session stickiness via service-loadbalancer, which proved unreliable, and an hours-long terminationGracePeriodSeconds, yet connections were still dropped prematurely. Blue/Green with two Deployments limited deploys to once daily because of the drain time; scaling to 8 colors for 4 deploys a day meant 128 pods (8 × 16) provisioned at all times, most of them idle, wasting resources.
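The cost of pre-provisioned colors can be checked with a few lines of arithmetic. The sketch below uses the per-pod figures from the text; the function names are hypothetical, and the rule "colors needed = deploys per day × drain time in days" is an assumed reading of why 4 deploys a day implies 8 colors (it uses the 48-hour upper bound on drain time).

```python
import math

# Figures from the text: 16 pods per Deployment, 2 GB RAM and 1 CPU per pod.
PODS_PER_DEPLOYMENT = 16
RAM_GB_PER_POD = 2
CPU_PER_POD = 1


def colors_needed(deploys_per_day: int, drain_hours: float) -> int:
    """Assumed rule: one color per deploy that can still be draining."""
    return math.ceil(deploys_per_day * drain_hours / 24)


def standing_cost(colors: int) -> dict:
    """Resources held when every color is kept fully scaled."""
    pods = colors * PODS_PER_DEPLOYMENT
    return {
        "pods": pods,
        "ram_gb": pods * RAM_GB_PER_POD,
        "cpus": pods * CPU_PER_POD,
    }


colors = colors_needed(deploys_per_day=4, drain_hours=48)
print(colors, standing_cost(colors))
```

Under these assumptions, 4 deploys a day with a 48-hour drain gives 8 colors and 128 pods, matching the figure above.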
Git SHA Rainbow Deployments
Name Deployments after git commit SHAs (first 6 chars double as hex colors): chat-olark-com-<SHA>. Deploy process:
- Create a new Deployment, chat-olark-com-<NEW_SHA>.
- Scale it to 16 ready pods.
- Update the Service selector to match the new Deployment's labels, routing all new connections there immediately.
- Roll back, if needed, by switching the Service back to <OLD_SHA>.
- After 24-48 hours, when connections have burned down (the few remaining users will reconnect to the newer Deployment), delete the old Deployment.
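The naming scheme and the selector switch in the steps above can be sketched as plain manifest-building functions. This is a minimal illustration, not the pipeline from the demo repository: the label keys (`app`, `sha`), the container name, and the image reference are assumptions, and the output is JSON, which `kubectl` accepts in place of YAML.

```python
import json

APP = "chat-olark-com"  # name prefix from the text


def deployment_name(sha: str) -> str:
    """Deployment name for a given commit SHA."""
    return f"{APP}-{sha}"


def hex_color(sha: str) -> str:
    """The first six hex characters of a SHA double as a color."""
    return "#" + sha[:6]


def deployment_manifest(sha: str, image: str, replicas: int = 16) -> dict:
    """Build a Deployment whose pods carry a SHA label (label keys assumed)."""
    labels = {"app": APP, "sha": sha}
    container = {
        "name": "chat",  # hypothetical container name
        "image": image,
        "resources": {"requests": {"memory": "2G", "cpu": "1"}},
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": deployment_name(sha), "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


def service_selector_patch(sha: str) -> dict:
    """Merge patch that points the Service at one SHA's pods."""
    return {"spec": {"selector": {"app": APP, "sha": sha}}}


new_sha = "a1b2c3d"  # made-up SHA for illustration
print(json.dumps(deployment_manifest(new_sha, "registry.example.com/chat:" + new_sha)))
print(json.dumps(service_selector_patch(new_sha)))
```

A rollback is the same patch built with the old SHA. One way to apply it, assuming a Service named after the app, would be `kubectl patch service chat-olark-com --type merge -p '<patch JSON>'`.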
The demo at https://github.com/bdimcheff/rainbow-deploys shows the YAML and GitLab CI pipelines. The approach has been in use since June 2017 and is simpler and more reliable than the alternatives: there is no production downtime, and deploys can be as frequent as needed, with no fixed limit on the number of colors.
Cleanup Challenges and Future Ideas
Cleanup is manual: an operator inspects connection counts before deleting an old Deployment, so that active users are not disrupted. Automating this is hard because simple metrics do not reliably show when traffic is low enough. An ideal Kubernetes evolution would be a native 'Immutable' strategy that creates new pods without automatically killing old ones, plus lifecycle hooks that signal pods to shut themselves down once they are deselected from the Service. Until then, rainbow SHAs scale indefinitely without the resource bloat of pre-provisioned colors.
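The manual check can be supported by a small helper that lists old Deployments worth a human look; it does not delete anything. This is a sketch under stated assumptions: the `OldDeployment` record, the thresholds, and the sample numbers are all hypothetical, and the connection counts would have to come from whatever metrics the service already exposes.

```python
from dataclasses import dataclass


@dataclass
class OldDeployment:
    name: str
    age_hours: float       # time since the Service stopped selecting it
    open_connections: int  # connections still held by its pods


def cleanup_candidates(deployments, min_age_hours=24.0, max_connections=10):
    """Return names that are old enough and nearly drained.

    Both thresholds are illustrative defaults, not values from the text.
    """
    return [
        d.name
        for d in deployments
        if d.age_hours >= min_age_hours and d.open_connections <= max_connections
    ]


# Made-up sample data for illustration only.
fleet = [
    OldDeployment("chat-olark-com-a1b2c3d", age_hours=50, open_connections=3),
    OldDeployment("chat-olark-com-d4e5f6a", age_hours=30, open_connections=400),
    OldDeployment("chat-olark-com-0f9e8d7", age_hours=2, open_connections=0),
]
print(cleanup_candidates(fleet))
```

Keeping the final deletion as a human decision reflects the point above that a simple threshold is not a reliable signal on its own.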