The Coordination Boundary

Coordination primitives such as lease stores provide specific, limited guarantees: they use fencing counters to reject writes from stale holders, and they manage state handoff between workers. These guarantees stop at the lease store and do not extend to external systems. A worker that has been fenced out may already have triggered external side effects (e.g., API calls, webhooks, or database mutations) before it learns that it lost the lease.

Attempting to handle these external side effects inside a coordination library is a design error. Libraries that try to own recovery logic or external state management become too opinionated and brittle. Instead, the library should draw a hard line at the lease store, leaving the application to handle idempotency and outbox patterns based on its specific business requirements.
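As one illustration of what the application side might look like, the sketch below uses an idempotency key so that a re-executed step does not apply twice downstream. Everything here is hypothetical: externalAPI stands in for a downstream service that deduplicates on a caller-supplied key, and idempotencyKey is an assumed helper, not part of any library described in this text.

```go
package main

import (
	"fmt"
	"sync"
)

// externalAPI is a hypothetical stand-in for a downstream service that
// deduplicates requests on a caller-supplied idempotency key.
type externalAPI struct {
	mu      sync.Mutex
	seen    map[string]bool
	applied int // number of requests that actually took effect
}

func newExternalAPI() *externalAPI {
	return &externalAPI{seen: make(map[string]bool)}
}

// Send applies the request once per key and reports whether it took effect.
func (a *externalAPI) Send(key string) bool {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.seen[key] {
		return false // duplicate: acknowledged, but not applied again
	}
	a.seen[key] = true
	a.applied++
	return true
}

// idempotencyKey derives the key from the task and step, never from the
// worker, so a replacement worker re-running the step produces the same key.
func idempotencyKey(taskID string, step int) string {
	return fmt.Sprintf("%s/step-%d", taskID, step)
}

func main() {
	api := newExternalAPI()
	key := idempotencyKey("invoice-42", 3)

	api.Send(key) // worker A sends, then is fenced before it can checkpoint
	api.Send(key) // worker B resumes from the old checkpoint and re-sends

	fmt.Println("requests applied:", api.applied)
}
```

Whether a key like this is enough, or whether an outbox table is needed, depends on what the downstream system supports, which is why the decision sits with the application.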

Managing the At-Least-Once Window

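Distributed systems using lease handoffs inherently operate with an "at-least-once" execution window: any work done after the last checkpoint may be executed again by the next lease holder. This is not a bug. The size of the window is a trade-off between throughput and the amount of work repeated during recovery.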

  • Frequent Checkpointing: Reduces the amount of work re-executed during recovery but increases write overhead to the lease store.
  • Infrequent Checkpointing: Improves throughput but increases the volume of work that must be re-run if a worker crashes.

The library provides the mechanism for checkpointing, but the caller must define how often to checkpoint and what constitutes a "checkpoint" (e.g., partial state validation). Only the caller understands the cost of re-execution and the business tolerance for duplicates.
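The trade-off can be shown with a small simulation. This is a sketch under stated assumptions, not a model of any real lease store: the simulate function is hypothetical, it treats a checkpoint as saving the index of the next item to process, and it assumes a single crash followed by a clean resume.

```go
package main

import "fmt"

// simulate processes n items, checkpointing after every `every` items.
// The first worker crashes once it has processed crashAt items; a
// replacement then resumes from the last checkpoint and finishes.
// It returns the number of checkpoint writes and the number of items
// that were executed more than once.
func simulate(n, every, crashAt int) (writes, duplicates int) {
	executed := make([]int, n)
	saved := 0 // last checkpointed position (index of the next item)

	process := func(from, stopAt int) {
		for i := from; i < stopAt; i++ {
			executed[i]++
			if (i+1)%every == 0 {
				saved = i + 1
				writes++
			}
		}
	}

	process(0, crashAt) // first worker dies here
	process(saved, n)   // replacement resumes from the last checkpoint

	for _, count := range executed {
		if count > 1 {
			duplicates++
		}
	}
	return writes, duplicates
}

func main() {
	for _, every := range []int{1, 10, 50} {
		writes, dups := simulate(100, every, 49)
		fmt.Printf("every=%d writes=%d re-executed=%d\n", every, writes, dups)
	}
}
```

With 100 items and a crash after 49, checkpointing every item costs 100 writes and repeats nothing, every 10 items costs 10 writes and repeats 9 items, and every 50 items costs 2 writes and repeats 49. Which point on that curve is acceptable depends on how expensive each write and each repeated item is for the workload.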

Observability as a Verification Tool

Coordination guarantees are claims until they have been verified under load. The purpose of instrumentation here is not to debug the library but to tune its configuration against the actual workload.

A LeaseObserver interface lets developers inject observability without forcing specific framework dependencies (like OpenTelemetry) onto the library. By tracking metrics such as ErrFenced frequency, renewal failures, and checkpoint duration, developers can tell whether their TTLs are too aggressive or their workers are under-provisioned. This approach keeps the library lightweight while providing the seams needed for production-grade monitoring.