The Computer Use Loop
Gemini 3.5 Flash implements 'Computer Use' as a native tool, enabling the model to interact with graphical interfaces by observing screenshots and outputting structured function calls. The interaction follows a continuous feedback loop: the model receives a screenshot and a goal, returns a function call (e.g., click, type, swipe), the bridge executes that action on the device via Android Debug Bridge (ADB), and the updated state is captured in a new screenshot to be sent back to the model.
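The loop described above can be sketched in a few lines. This is a minimal illustration, not the author's code: the Action dataclass, run_agent_loop, and the three callables are hypothetical names, and the model call is left as an injected function so that no particular SDK signature is assumed.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Action:
    """A function call returned by the model (hypothetical shape)."""
    name: str
    args: dict = field(default_factory=dict)


def run_agent_loop(
    goal: str,
    take_screenshot: Callable[[], bytes],
    ask_model: Callable[[str, bytes], Optional[Action]],
    execute: Callable[[Action], None],
    max_steps: int = 20,
) -> int:
    """Run observe -> decide -> act until the model stops or max_steps is hit.

    Returns the number of actions that were executed.
    """
    steps = 0
    while steps < max_steps:
        screenshot = take_screenshot()        # observe the current screen
        action = ask_model(goal, screenshot)  # model proposes the next action
        if action is None:                    # no function call: treat as done
            break
        execute(action)                       # bridge performs it on the device
        steps += 1
    return steps
```

The max_steps cap is an added safeguard (an assumption, not something the source specifies) so that a confused model cannot loop forever.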
Implementing the Bridge
To enable this, you must build a bridge that translates the model's normalized coordinate system (a 0-999 grid) into the specific pixel resolution of the target device. The implementation requires handling several core actions:
- Navigation: click, long_press, scroll, and go_back.
- Input: type (text input) and press_key (system keys like home or back).
- State Management: open_app and list_apps to manage the device environment.
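The coordinate translation is the part most likely to go wrong, so a small sketch may help. It assumes the 0-999 grid is scaled by dividing by 1000 (some implementations divide by 999 instead), and the helper names denormalize, tap_command, and long_press_command are hypothetical. The adb commands themselves (input tap, and input swipe with a duration) are standard Android shell tools; emulating a long press as a zero-distance swipe is a common workaround, not necessarily what the author's bridge does.

```python
def denormalize(value: int, size_px: int, grid: int = 1000) -> int:
    """Map a normalized coordinate in [0, grid - 1] to a pixel in [0, size_px - 1]."""
    if not 0 <= value < grid:
        raise ValueError(f"normalized coordinate out of range: {value}")
    # Integer scaling, clamped so the result always lands on the screen.
    return min(size_px - 1, value * size_px // grid)


def tap_command(x: int, y: int, width: int, height: int) -> list[str]:
    """Build the adb argv for a click at normalized (x, y)."""
    px, py = denormalize(x, width), denormalize(y, height)
    return ["adb", "shell", "input", "tap", str(px), str(py)]


def long_press_command(
    x: int, y: int, width: int, height: int, duration_ms: int = 800
) -> list[str]:
    """Build the adb argv for a long press, emulated as a swipe that does not move."""
    px, py = str(denormalize(x, width)), str(denormalize(y, height))
    return ["adb", "shell", "input", "swipe", px, py, px, py, str(duration_ms)]
```

For a 1080x2400 device, tap_command(500, 500, 1080, 2400) targets pixel (540, 1200). The resulting list can be passed directly to subprocess.run.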
The provided implementation uses a Python ADBBridge class that wraps adb shell commands. The author notes that this synchronous version is not production-ready: a production bridge should add robust error handling, asynchronous execution, and specific logic for safety_decision flags, which the model may attach to sensitive actions such as payments or state changes.
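One way those production concerns might look is sketched below. It is an illustration under stated assumptions, not the author's ADBBridge: run_adb and execute_action are hypothetical names, actions are modeled as plain dicts, and the presence of a "safety_decision" key on a flagged action is an assumed payload shape that should be checked against the actual API response. Only two actions are dispatched to keep the sketch short.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

AdbRunner = Callable[[Sequence[str]], Awaitable[str]]


async def run_adb(args: Sequence[str], timeout: float = 10.0) -> str:
    """Run `adb <args>` without blocking the event loop; raise on failure."""
    proc = await asyncio.create_subprocess_exec(
        "adb", *args,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        out, err = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        proc.kill()        # do not leave a hung adb process behind
        await proc.wait()
        raise
    if proc.returncode != 0:
        detail = err.decode(errors="replace").strip()
        raise RuntimeError(f"adb {' '.join(args)} failed: {detail}")
    return out.decode(errors="replace")


async def execute_action(
    action: dict,
    run: AdbRunner = run_adb,
    confirm: Callable[[dict], bool] = lambda action: False,
) -> bool:
    """Execute one action; return False if the safety gate blocked it."""
    # ASSUMPTION: a flagged action carries a "safety_decision" key.
    # Flagged actions are denied unless `confirm` explicitly approves them.
    if action.get("safety_decision") and not confirm(action):
        return False
    name = action["name"]
    if name == "go_back":
        await run(["shell", "input", "keyevent", "KEYCODE_BACK"])
    elif name == "list_apps":
        await run(["shell", "pm", "list", "packages"])
    else:
        raise ValueError(f"unsupported action: {name}")
    return True
```

Denying by default means a flagged action never runs unless a human (or policy) callback approves it. Injecting the runner also lets the dispatch logic be tested without a connected device.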
Platform-Agnostic Control
While the current implementation focuses on Android via ADB, the Gemini API's mobile environment is platform-agnostic. The model's output remains consistent regardless of the underlying OS. To extend this to iOS, developers can replace the ADBBridge with tools like simctl for simulators or go-ios for physical devices, maintaining the same core agent loop logic.
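To make the swap concrete, the agent loop can depend on a small interface rather than on ADB directly. The sketch below is an assumption about how one might structure this: DeviceBridge, RecordingBridge, and step are hypothetical names, and no iOS tooling is invoked. An Android or iOS bridge would implement the same two methods using its own tools.

```python
from typing import Protocol


class DeviceBridge(Protocol):
    """The only surface the agent loop needs from any platform."""

    def screenshot(self) -> bytes: ...

    def execute(self, name: str, args: dict) -> None: ...


class RecordingBridge:
    """In-memory stand-in that records actions instead of touching a device."""

    def __init__(self) -> None:
        self.log: list[tuple[str, dict]] = []

    def screenshot(self) -> bytes:
        return b""  # a real bridge would return PNG bytes of the screen

    def execute(self, name: str, args: dict) -> None:
        self.log.append((name, args))


def step(bridge: DeviceBridge, name: str, args: dict) -> bytes:
    """Perform one model-issued action and return the follow-up screenshot."""
    bridge.execute(name, args)
    return bridge.screenshot()
```

Because typing.Protocol uses structural typing, any class with matching screenshot and execute methods satisfies DeviceBridge without inheriting from it, so the loop logic stays unchanged across platforms.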