The Failure of Synchronous Assumptions
In voice AI pipelines, developers often compose a fast, cheap detector (e.g., Voice Activity Detection, or VAD) with a slower, deeper recognizer (e.g., an ASR engine such as Vosk). A race condition arises because the VAD fires its 'end-of-speech' event well before the ASR finishes decoding (344ms earlier in this case). Guarding the routing with a boolean flag does not help: the flag is set only after the routing decision has already been made. Blocking the audio thread on a CountDownLatch until the ASR result arrives is no better, because the stalled thread drops audio buffers and degrades recognition quality.
The Deferred Dispatch Pattern
Instead of deciding immediately, the system should defer the routing decision. The implementation follows a three-step pattern:
- Buffer: When the VAD fires, hold the audio data in a slot rather than sending it to the server.
- Race: Start a timer (e.g., 400ms) on the main thread.
- Resolve: If the ASR returns a command first, cancel the timer and execute the command. If the timer expires first, assume no command was detected and send the buffered audio as a new question.
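The three steps above can be sketched in plain Java as follows. This is a minimal illustration under stated assumptions, not the original implementation: the class DeferredDispatcher, its method names, and the two callbacks are hypothetical, and a single-threaded ScheduledExecutorService stands in for the main thread.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical sketch of deferred dispatch; all names are illustrative. */
final class DeferredDispatcher {
    // Stand-in for the main thread: all state below is touched only from
    // this single thread, so no locking is needed.
    private final ScheduledExecutorService main =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "main-stand-in");
                t.setDaemon(true);
                return t;
            });
    private final long deadlineMs;
    private final Consumer<String> runCommand;
    private final Consumer<byte[]> sendAsQuestion;

    private byte[] pendingAudio;              // the buffered "slot"
    private ScheduledFuture<?> pendingTimer;  // the deadline racing the ASR

    DeferredDispatcher(long deadlineMs,
                       Consumer<String> runCommand,
                       Consumer<byte[]> sendAsQuestion) {
        this.deadlineMs = deadlineMs;
        this.runCommand = runCommand;
        this.sendAsQuestion = sendAsQuestion;
    }

    /** Buffer + Race: called on VAD end-of-speech. Returns immediately. */
    void onVadEndOfSpeech(byte[] audio) {
        main.execute(() -> {
            cancelPending();                  // drop any older deferred work
            pendingAudio = audio;
            pendingTimer = main.schedule(
                    this::onDeadline, deadlineMs, TimeUnit.MILLISECONDS);
        });
    }

    /** Resolve (ASR wins): called when the recognizer reports a command. */
    void onAsrCommand(String command) {
        main.execute(() -> {
            if (pendingAudio == null) {
                return;                       // deadline already fired; result is late
            }
            cancelPending();
            runCommand.accept(command);
        });
    }

    /** Resolve (timer wins): no command arrived, so treat audio as a question. */
    private void onDeadline() {
        byte[] audio = pendingAudio;
        pendingAudio = null;
        pendingTimer = null;
        if (audio != null) {
            sendAsQuestion.accept(audio);
        }
    }

    private void cancelPending() {
        if (pendingTimer != null) {
            pendingTimer.cancel(false);
        }
        pendingTimer = null;
        pendingAudio = null;
    }

    void shutdown() {
        main.shutdownNow();
    }
}
```

Because both callbacks only enqueue work and return, neither the audio thread nor the recognizer thread ever waits on the other. Ignoring a command that arrives after the deadline is a design choice made for this sketch; the source does not say how late results are handled.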
This approach avoids blocking latency-critical threads and ensures the decision is made with full information. The 400ms deadline is not arbitrary; it is calibrated based on the measured distribution of the VAD-to-ASR latency (250–400ms) to ensure the system remains responsive while capturing all valid commands.
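One way to derive such a deadline from measurements is to take a high percentile of the observed VAD-to-ASR gaps. The sketch below uses the nearest-rank method; the helper name and the choice of percentile are assumptions, since the text only states that the deadline was calibrated against the measured 250–400ms distribution.

```java
import java.util.Arrays;

/** Hypothetical helper: picks a dispatch deadline from measured latency gaps. */
final class DeadlineCalibration {
    private DeadlineCalibration() {}

    /**
     * Returns the nearest-rank percentile of the measured VAD-to-ASR gaps.
     * percentile must be in (0, 100]; 100 yields the slowest observed gap.
     */
    static long deadlineMs(long[] gapsMs, double percentile) {
        if (gapsMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (percentile <= 0 || percentile > 100) {
            throw new IllegalArgumentException("percentile must be in (0, 100]");
        }
        long[] sorted = gapsMs.clone();   // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percentile / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

Passing 100 as the percentile returns the slowest observed gap, which corresponds to a deadline at the top of the measured range. A lower percentile would trade a few missed commands for a faster response.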
Critical Implementation Details
- Symmetric Cleanup: Deferred work must be explicitly canceled at every possible state transition (e.g., returning to wake-word mode, re-entering follow-up, or service destruction). Failing to clean up the Runnable can lead to memory leaks or logic corruption where old audio is processed in a new context.
- Avoid Thread Blocking: Never block the audio capture thread to wait for secondary components. The audio thread's sole responsibility is maintaining the microphone buffer.
- Generalization: This pattern applies to any multi-modal or multi-component pipeline where a routing decision requires inputs from both a fast detector and a slow classifier, such as sensor fusion (GPS vs. IMU) or multi-modal LLM inputs (image vs. text classification).
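A minimal sketch of symmetric cleanup, assuming a hypothetical PipelineState class: every mode change goes through a single transitionTo method that cancels pending deferred work first, so no path can forget it. The mode names mirror the transitions listed above. The scheduler is a plain-Java stand-in, and on Android the analogous cancellation would likely be Handler.removeCallbacks.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical state holder; names are illustrative, not from the source. */
final class PipelineState {
    enum Mode { WAKE_WORD, FOLLOW_UP, DESTROYED }

    private final ScheduledExecutorService main =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "main-stand-in");
                t.setDaemon(true);
                return t;
            });
    private Mode mode = Mode.WAKE_WORD;
    private ScheduledFuture<?> pending;

    /** Schedules deferred work, replacing any work that is still pending. */
    synchronized void defer(Runnable work, long delayMs) {
        if (mode == Mode.DESTROYED) {
            return;                        // nothing may be scheduled after teardown
        }
        cancelPending();
        pending = main.schedule(work, delayMs, TimeUnit.MILLISECONDS);
    }

    /** The only way to change mode, so cleanup runs on every transition. */
    synchronized void transitionTo(Mode next) {
        cancelPending();                   // stale work must not run in the new context
        mode = next;
        if (next == Mode.DESTROYED) {
            main.shutdownNow();
        }
    }

    synchronized Mode mode() {
        return mode;
    }

    // Note: cancel(false) does not interrupt work that has already started.
    private void cancelPending() {
        if (pending != null) {
            pending.cancel(false);
        }
        pending = null;
    }
}
```

Funnelling transitions through one method keeps cancellation next to the state change itself, which is easier to audit than repeating the cancel call at each call site.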