Architecture and Efficiency
Step 3.7 Flash is a sparse Mixture-of-Experts (MoE) vision-language model with 198B total parameters (196B language backbone + 1.8B ViT encoder). Because only ~11B parameters (roughly 5.6% of the total) are active per token, its per-token inference compute is comparable to that of a much smaller dense model. The model supports a 256k-token context window and offers three selectable reasoning depths, allowing developers to trade response latency against reasoning depth depending on the task.
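The sparsity figures above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the numbers quoted in the text; the constant and function names are illustrative, and it assumes the active fraction is measured against the combined backbone and vision encoder.

```python
# Figures quoted in the text, in billions of parameters.
LANGUAGE_BACKBONE_B = 196.0
VIT_ENCODER_B = 1.8
ACTIVE_PER_TOKEN_B = 11.0  # approximate ("~11B")


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the total parameters that is used for a single token."""
    return active_b / total_b


# Assumption: the total is the backbone plus the vision encoder.
total_b = LANGUAGE_BACKBONE_B + VIT_ENCODER_B
fraction = active_fraction(ACTIVE_PER_TOKEN_B, total_b)

print(f"total: {total_b:.1f}B, active per token: {fraction:.1%}")
# prints "total: 197.8B, active per token: 5.6%"
```

The fraction describes per-token compute only; the text does not state memory requirements, so none are estimated here.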
Agentic Performance and Advisor Mode
Step 3.7 Flash shows significant gains on coding benchmarks, scoring 56.26% on SWE-Bench Pro and 59.55% on Terminal-Bench 2.1. A standout feature is "Advisor Mode," an agentic strategy in which Step 3.7 Flash runs the full execution loop itself and escalates to a larger advisor model only for critical planning or failure recovery. This setup reportedly reaches 97% of Claude Opus 4.6's performance at roughly one-ninth the cost ($0.19 vs. $1.76 per task).
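The escalation pattern described above might be sketched as follows. This is not the actual Advisor Mode implementation, which the text does not detail. `AdvisorLoop`, `StepResult`, `execute`, and `advise` are hypothetical names, and treating the initial plan as the "critical planning" escalation is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class StepResult:
    ok: bool            # did this step succeed?
    done: bool = False  # is the whole task finished?


@dataclass
class AdvisorLoop:
    # Hypothetical callables standing in for the two models:
    # execute(task, hint) runs one step on the smaller executor model,
    # advise(task, history) asks the larger advisor model for guidance.
    execute: Callable[[str, Optional[str]], StepResult]
    advise: Callable[[str, List[str]], str]
    max_steps: int = 10
    advisor_calls: int = 0
    history: List[str] = field(default_factory=list)

    def _escalate(self, task: str) -> str:
        """Consult the advisor and count the call."""
        self.advisor_calls += 1
        return self.advise(task, list(self.history))

    def run(self, task: str) -> bool:
        # Assumption: the initial plan counts as "critical planning".
        hint: Optional[str] = self._escalate(task)
        for _ in range(self.max_steps):
            result = self.execute(task, hint)
            self.history.append("ok" if result.ok else "fail")
            if result.done:
                return True
            # Escalate again only for failure recovery; otherwise the
            # executor continues on its own without a hint.
            hint = self._escalate(task) if not result.ok else None
        return False
```

Because the advisor is consulted only at the start and after failed steps, `advisor_calls` stays small relative to the number of executor steps, which is the mechanism behind the cost difference the text reports.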
Multimodal and Tool-Use Capabilities
Unlike its predecessor, Step 3.7 Flash includes native vision support via two distinct pathways:
- Visual Search Tool: Used for long-tail or recently emerged concepts where parametric knowledge is insufficient.
- Python Tool: Enables fine-grained visual analysis (e.g., cropping, zooming, bounding-box analysis) by allowing the model to interact with images via code.
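As an illustration of the second pathway, the kind of code a model could emit to crop and zoom into an image region might look like the sketch below. The text does not say which libraries the Python tool exposes, so this sketch stays in pure Python and represents an image as rows of grayscale values. `crop`, `zoom`, and the box convention are assumptions chosen for the example.

```python
from typing import List, Tuple

Image = List[List[int]]          # rows of grayscale pixel values
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def crop(image: Image, box: Box) -> Image:
    """Return the pixels inside the bounding box."""
    left, top, right, bottom = box
    height, width = len(image), len(image[0])
    if not (0 <= left < right <= width and 0 <= top < bottom <= height):
        raise ValueError(f"box {box} is outside the {width}x{height} image")
    return [row[left:right] for row in image[top:bottom]]


def zoom(image: Image, factor: int) -> Image:
    """Nearest-neighbour upscale: repeat each pixel `factor` times per axis."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return [
        [pixel for pixel in row for _ in range(factor)]
        for row in image
        for _ in range(factor)
    ]


image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
]
patch = crop(image, (1, 0, 3, 2))  # [[1, 2], [5, 6]]
enlarged = zoom(patch, 2)          # 4x4 grid with each pixel doubled
```

Cropping before zooming keeps the enlarged region small, which matters when the result is fed back to the model for another look.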
Testing revealed emergent compositional tool use: the model combined visual and non-visual tools without being explicitly trained to do so. For example, it rendered frontend code in a GUI to inspect the result before iterating. It also performs well on long-horizon UI tasks, scoring 61.87% on the Android Daily benchmark.