Solving Race Conditions in Voice AI with Deferred Dispatch

The Failure of Synchronous Assumptions

In voice AI pipelines, developers often pair a fast, cheap detector (e.g., Voice Activity Detection, or VAD) with a slower, deeper recognizer (e.g., ASR with Vosk). A race condition arises because the VAD fires its 'end-of-speech' event well before the ASR finishes decoding (344ms earlier in this case), so the routing decision is made before the ASR result exists. A boolean flag does not fix this: the ASR sets the flag only after the routing code has already read it. Blocking the audio thread on a CountDownLatch until the ASR result arrives is no better, because the stalled thread drops audio buffers, which degrades recognition quality.

The Deferred Dispatch Pattern

Instead of deciding immediately, the system should defer the routing decision. The implementation follows a three-step pattern:

Buffer: When the VAD fires, hold the audio data in a slot rather than sending it to the server.
Race: Start a timer (e.g., 400ms) on the main thread.
Resolve: If the ASR returns a command first, cancel the timer and execute the command. If the timer expires first, assume no command was detected and send the buffered audio as a new question.
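
The three steps above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the original implementation: a single-thread ScheduledExecutorService stands in for the main thread, the class name DeferredDispatcher and the runCommand and sendQuestion callbacks are hypothetical, and the choice to ignore an ASR command that arrives when nothing is deferred is a simplification made for the sketch.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Sketch of deferred dispatch: buffer on VAD, then race a deadline against the ASR. */
final class DeferredDispatcher {
    // Single thread standing in for the main thread; the state below is touched only here.
    private final ScheduledExecutorService mainThread = Executors.newSingleThreadScheduledExecutor();
    private final long deadlineMs;
    private final Consumer<String> runCommand;    // hypothetical command handler
    private final Consumer<byte[]> sendQuestion;  // hypothetical upload to the server

    private byte[] pendingAudio;          // the buffer slot; null when nothing is deferred
    private ScheduledFuture<?> deadline;  // the running timer, if any

    DeferredDispatcher(long deadlineMs, Consumer<String> runCommand, Consumer<byte[]> sendQuestion) {
        this.deadlineMs = deadlineMs;
        this.runCommand = runCommand;
        this.sendQuestion = sendQuestion;
    }

    /** Buffer + Race: called from the audio thread; only enqueues work, never waits. */
    void onVadEndOfSpeech(byte[] audio) {
        mainThread.execute(() -> {
            cancelPending();              // a newer utterance replaces an older deferred one
            pendingAudio = audio;
            deadline = mainThread.schedule(this::onDeadline, deadlineMs, TimeUnit.MILLISECONDS);
        });
    }

    /** Resolve, ASR wins: cancel the timer, drop the buffered audio, run the command. */
    void onAsrCommand(String command) {
        mainThread.execute(() -> {
            if (pendingAudio == null) {
                return;                   // nothing deferred (e.g. deadline already fired): ignored here
            }
            cancelPending();
            runCommand.accept(command);
        });
    }

    /** Resolve, timer wins: no command arrived in time, so send the audio as a question. */
    private void onDeadline() {
        byte[] audio = pendingAudio;
        pendingAudio = null;
        deadline = null;
        if (audio != null) {
            sendQuestion.accept(audio);
        }
    }

    private void cancelPending() {
        if (deadline != null) {
            deadline.cancel(false);
            deadline = null;
        }
        pendingAudio = null;
    }

    /** Stops the timer thread and discards any deferred work. */
    void close() {
        mainThread.shutdownNow();
    }
}
```

Because both callbacks and the timer run on the same single thread, the two outcomes cannot interleave: whichever task runs first clears the slot, and the loser finds nothing left to do. On Android the same shape would typically use a main-thread Handler with a delayed Runnable in place of the executor.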

This approach avoids blocking latency-critical threads and lets the decision wait until either the ASR result is in or the deadline has passed. The 400ms deadline is not arbitrary: it is calibrated against the measured distribution of VAD-to-ASR latency (250–400ms), so it covers the slowest observed ASR results while keeping the system responsive.
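
One way to turn measured latencies into a deadline is to take a high percentile of the samples and add a small margin. The sketch below is an assumption about how such a calibration could be done, not a description of how the 400ms figure was derived; the class name, the nearest-rank percentile method, the margin parameter, and the sample values are all illustrative.

```java
import java.util.Arrays;

/** Sketch: derive a dispatch deadline from measured VAD-to-ASR latencies. */
final class DeadlineCalibration {
    /** Nearest-rank percentile: the smallest sample with at least p percent of samples at or below it. */
    static long percentile(long[] samplesMs, double p) {
        if (samplesMs.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need at least one sample and 0 < p <= 100");
        }
        long[] sorted = samplesMs.clone();   // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }

    /** Deadline = chosen percentile plus a margin for scheduling jitter. */
    static long deadlineMs(long[] samplesMs, double p, long marginMs) {
        return percentile(samplesMs, p) + marginMs;
    }

    public static void main(String[] args) {
        // Made-up samples inside the 250–400ms range quoted above; substitute real measurements.
        long[] measured = {250, 265, 290, 310, 344, 350, 372, 388};
        System.out.println(deadlineMs(measured, 100, 12) + " ms");   // prints "400 ms"
    }
}
```

A lower percentile shortens the wait at the cost of occasionally sending a slow command to the server as a question, so the percentile is the knob that trades responsiveness against missed commands.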

Critical Implementation Details

Symmetric Cleanup: Deferred work must be explicitly canceled at every possible state transition (e.g., returning to wake-word mode, re-entering follow-up mode, or service destruction). A pending Runnable that is never canceled can leak memory or corrupt logic by processing old audio in a new context.
Avoid Thread Blocking: Never block the audio capture thread to wait for secondary components. The audio thread's sole responsibility is maintaining the microphone buffer.
Generalization: This pattern applies to any multi-modal or multi-component pipeline where a routing decision requires inputs from both a fast detector and a slow classifier, such as sensor fusion (GPS vs. IMU) or multi-modal LLM inputs (image vs. text classification).
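
The symmetric-cleanup rule is easier to enforce when every state change passes through a single method that cancels deferred work before doing anything else. The following is a rough sketch of that idea; the ListeningSession class, its Mode names, and the send callback are hypothetical, and synchronized methods are used here only to keep the example short.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Sketch: every state transition goes through one method that cancels deferred work first. */
final class ListeningSession {
    enum Mode { WAKE_WORD, FOLLOW_UP, DESTROYED }   // hypothetical mode names

    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    private Mode mode = Mode.WAKE_WORD;
    private byte[] pendingAudio;
    private ScheduledFuture<?> pendingDispatch;

    /** Buffers the audio and schedules its dispatch; ignored once the session is destroyed. */
    synchronized void defer(byte[] audio, long deadlineMs, Consumer<byte[]> send) {
        if (mode == Mode.DESTROYED) {
            return;
        }
        cancelPending();
        pendingAudio = audio;
        pendingDispatch = timer.schedule(() -> fire(send), deadlineMs, TimeUnit.MILLISECONDS);
    }

    /** The single choke point: no transition can skip the cleanup. */
    synchronized void transitionTo(Mode next) {
        cancelPending();
        mode = next;
        if (next == Mode.DESTROYED) {
            timer.shutdownNow();   // service destruction: release the timer thread too
        }
    }

    synchronized boolean hasPending() {
        return pendingDispatch != null && !pendingDispatch.isDone();
    }

    /** Runs on the timer thread when the deadline expires. */
    private synchronized void fire(Consumer<byte[]> send) {
        byte[] audio = pendingAudio;
        pendingAudio = null;
        pendingDispatch = null;
        if (audio != null) {       // null if a transition cleared the slot just before we ran
            send.accept(audio);
        }
    }

    private void cancelPending() {
        if (pendingDispatch != null) {
            pendingDispatch.cancel(false);
            pendingDispatch = null;
        }
        pendingAudio = null;       // drop old audio so it cannot leak into the new context
    }
}
```

Clearing the audio slot as well as canceling the future covers the narrow window in which the timer task has already started when a transition happens: the task then finds an empty slot and sends nothing.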
