Step 3.7 Flash: A 198B MoE Model for Agentic Workflows

Architecture and Efficiency

Step 3.7 Flash is a sparse Mixture-of-Experts (MoE) vision-language model with 198B total parameters (196B language backbone + 1.8B ViT encoder). Because it activates only ~11B parameters per token, it delivers high-performance reasoning with the inference compute profile of a much smaller dense model. The model supports a 256k-token context window and offers three selectable reasoning depths, letting developers trade latency for depth depending on the task.
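
To make the sparsity figures concrete, the short calculation below works through the counts quoted above. It is a back-of-the-envelope sketch only: the ~11B active figure is approximate, the article does not say whether it includes the ViT encoder, and the variable names are illustrative.

```python
# Parameter counts quoted in the article, in billions.
LANGUAGE_BACKBONE_B = 196.0
VIT_ENCODER_B = 1.8
ACTIVE_PER_TOKEN_B = 11.0  # approximate ("~11B") per the article

# Total size: language backbone plus vision encoder.
total_b = LANGUAGE_BACKBONE_B + VIT_ENCODER_B

# Share of all parameters that participate in one token's forward pass.
# Assumption: the ~11B figure is compared against the full 197.8B total.
active_fraction = ACTIVE_PER_TOKEN_B / total_b

print(f"total parameters: {total_b:.1f}B")
print(f"active per token: {active_fraction:.1%} of total")
```

Under these assumptions the total comes to 197.8B (rounded to 198B in the title), with roughly 5.6% of parameters active for any given token.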

Agentic Performance and Advisor Mode

Step 3.7 Flash shows significant gains on coding benchmarks, scoring 56.26% on SWE-Bench Pro and 59.55% on Terminal-Bench 2.1. Its standout feature is "Advisor Mode," an agentic strategy in which Step 3.7 Flash runs the full execution loop itself and escalates to a larger advisor model only for critical planning or failure recovery. This approach reportedly achieves 97% of Claude Opus 4.6's performance at roughly one-ninth the cost ($0.19 vs. $1.76 per task).
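
The escalation pattern described above can be sketched as a small control loop. This is a hypothetical illustration, not the actual Advisor Mode implementation: `AdvisorLoop`, `StepResult`, and the stub executor and advisor are invented names, and real escalation criteria are likely more nuanced than "plan once, then consult again after each failure."

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StepResult:
    done: bool    # task finished successfully
    failed: bool  # this step hit an error that needs recovery
    note: str     # short description of what happened


@dataclass
class AdvisorLoop:
    # Hypothetical interfaces: the small model executes, the large model advises.
    executor: Callable[[str, str], StepResult]  # (task, guidance) -> result
    advisor: Callable[[str, str], str]          # (task, context) -> guidance
    max_steps: int = 10
    advisor_calls: int = 0
    log: List[str] = field(default_factory=list)

    def _ask_advisor(self, task: str, context: str) -> str:
        self.advisor_calls += 1
        return self.advisor(task, context)

    def run(self, task: str) -> bool:
        # Escalate once up front for planning.
        guidance = self._ask_advisor(task, "initial plan")
        for _ in range(self.max_steps):
            result = self.executor(task, guidance)
            self.log.append(result.note)
            if result.done:
                return True
            if result.failed:
                # Escalate again only for failure recovery.
                guidance = self._ask_advisor(task, result.note)
        return False


def make_stub_executor() -> Callable[[str, str], StepResult]:
    """Stub standing in for the small model: fails once, then succeeds."""
    calls = {"n": 0}

    def executor(task: str, guidance: str) -> StepResult:
        calls["n"] += 1
        if calls["n"] == 1:
            return StepResult(done=False, failed=True, note="tests failed")
        return StepResult(done=True, failed=False, note=f"applied: {guidance}")

    return executor


def stub_advisor(task: str, context: str) -> str:
    """Stub standing in for the larger advisor model."""
    return f"advice for '{context}'"


loop = AdvisorLoop(executor=make_stub_executor(), advisor=stub_advisor)
succeeded = loop.run("fix the bug")
print(succeeded, loop.advisor_calls, loop.log)
```

In this toy run the advisor is consulted twice (once to plan, once after the failed step) while the executor handles every step itself, which is the mechanism behind the cost saving the article reports.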

Multimodal and Tool-Use Capabilities

Unlike its predecessor, Step 3.7 Flash includes native vision support via two distinct pathways:

Visual Search Tool: Used for long-tail or recently emerged concepts where parametric knowledge is insufficient.
Python Tool: Enables fine-grained visual analysis (e.g., cropping, zooming, bounding-box analysis) by allowing the model to interact with images via code.
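
As a rough idea of what the Python Tool pathway might involve, the sketch below implements the three operations named above (cropping, zooming, bounding-box analysis) on a tiny grayscale image stored as nested lists. It is an assumption-laden illustration: the helper names are hypothetical, and a real deployment would presumably work on actual image files with an imaging library rather than hand-rolled functions.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]          # grayscale pixels, row-major
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by box."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def zoom(image: Image, factor: int) -> Image:
    """Enlarge by an integer factor using nearest-neighbour repetition."""
    out: Image = []
    for row in image:
        wide = [px for px in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def bounding_box(image: Image, threshold: int) -> Optional[Box]:
    """Tight box around all pixels brighter than threshold, or None."""
    if not image:
        return None
    rows = [y for y, row in enumerate(image) if any(px > threshold for px in row)]
    if not rows:
        return None
    cols = [x for x in range(len(image[0]))
            if any(row[x] > threshold for row in image)]
    return (min(cols), min(rows), max(cols) + 1, max(rows) + 1)


# Toy 5x4 image with a bright 2x2 region.
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 9, 8, 0],
    [0, 0, 7, 9, 0],
    [0, 0, 0, 0, 0],
]

box = bounding_box(image, threshold=5)  # locate the region of interest
patch = crop(image, box)                # cut it out
enlarged = zoom(patch, 2)               # magnify for closer inspection

print(box)
print(patch)
print(enlarged)
```

The locate, crop, and magnify sequence mirrors how a model could inspect a small detail it cannot resolve at the image's native scale.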

Testing revealed emergent compositional tool use: the model combined visual and non-visual tools without being explicitly trained to do so, for example by rendering frontend code in a GUI to inspect the result before iterating. It also performs well on long-horizon UI tasks, scoring 61.87% on the Android Daily benchmark.
