The Split-Brain Architecture
Ezzi, an interview assistant, uses a two-stage AI pipeline to process screen captures. First, a low-cost, high-speed model (Claude Sonnet) transcribes the screen and classifies the content. Second, a high-effort reasoning model (Claude Opus) generates the solution based solely on the text transcript.
This architecture is cost-effective and fast, but it creates a critical failure point: the reasoning model never sees the original image. If the transcription stage makes a subtle error—such as misreading a constraint or a variable—the reasoning model will confidently solve the wrong problem. Because the output is polished and well-formatted, these errors are difficult for users to detect.
Engineering Reliability Through Constraints
Rather than relying on complex computer vision preprocessing, the author focuses on deterministic prompt-level controls:
- Classification Gates: Before triggering the expensive reasoning model, the system forces the transcription model to categorize the screenshot (e.g., DSA, SQL, System Design, or Not-a-Problem). If the input is a billing page or Slack window, the system rejects it immediately, saving costs and preventing hallucinations.
- Debug Loops: When a solution fails, the system reconstructs the conversation history, allowing the model to see the original problem, its previous attempt, and the specific error message. This context-aware approach prevents the model from repeating the same mistakes.
- Lossy Compression Management: The system captures full-resolution screenshots but compresses them to fit API limits. While this can smear small text, the author notes that modern vision models are robust enough that theme-aware preprocessing (light vs. dark mode) is unnecessary.
The "Confident Failure" Problem
The most significant challenge in production is the lack of a confidence signal. The current pipeline reports success or failure but never expresses uncertainty. The author argues that the next evolution of AI-powered tools must move beyond simple success/failure flags and implement mechanisms to signal when a model is unsure of its transcription. Without this, users are left with a dangerous "silent failure" mode where the system provides a perfectly formatted, yet factually incorrect, answer to a problem that was misread at the pixel level.