GLM-5.1 Excels in Long-Horizon Agentic Coding

GLM-5.1 tops SWE-Bench Pro at 58.4% and sustains gains over 600+ iterations on VectorDBBench (21.5k QPS, 6x the prior best) and 1,000+ turns on KernelBench (3.6x speedup). These long horizons enable complex builds such as a full Linux desktop web app in 8 hours.

Sustains Optimization Over Extended Horizons for Superior Results

GLM-5.1 breaks through short-session limits, staying productive across hundreds of iterations and thousands of tool calls while revising its strategy based on benchmark feedback and self-analysis. On VectorDBBench, it optimizes a Rust vector database for SIFT-1M (Recall ≥95%), starting from a skeleton with HTTP endpoints. Over 655 iterations and 6,000+ tool calls, it reaches 21.5k QPS, 6x the prior best of 3.5k QPS set by Claude Opus 4.6 in 50 turns. Progress follows a staircase pattern: incremental tuning plateaus, then structural shifts such as IVF probing with f16 compression (iteration ~90, 6.4k QPS) or u8 prescoring + f16 reranking (iteration ~240, 13.4k QPS) unlock jumps, with temporary Recall dips during exploration.
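The u8-prescore + high-precision-rerank shift can be illustrated with a minimal two-stage search sketch (illustrative only, not GLM-5.1's actual Rust implementation; the quantization scheme, candidate count, and corpus here are hypothetical, and the rerank stage uses Python floats in place of f16):

```python
import random

def quantize_u8(vec, lo, hi):
    """Map each float to 0..255 over the corpus value range."""
    scale = 255.0 / (hi - lo)
    return bytes(min(255, max(0, int((x - lo) * scale))) for x in vec)

def l2(a, b):
    """Squared L2 distance; works on byte strings and float lists alike."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def search(query, corpus, corpus_u8, lo, hi, k=10, rerank=50):
    q8 = quantize_u8(query, lo, hi)
    # Stage 1: cheap u8 prescoring over the whole corpus to build a shortlist.
    cand = sorted(range(len(corpus)), key=lambda i: l2(q8, corpus_u8[i]))[:rerank]
    # Stage 2: exact reranking of the shortlist (f16 in the described setup).
    return sorted(cand, key=lambda i: l2(query, corpus[i]))[:k]

random.seed(0)
dim, n = 32, 500
corpus = [[random.random() for _ in range(dim)] for _ in range(n)]
lo, hi = 0.0, 1.0
corpus_u8 = [quantize_u8(v, lo, hi) for v in corpus]
query = [random.random() for _ in range(dim)]
top = search(query, corpus, corpus_u8, lo, hi)
```

The point of the split is that the prescoring pass touches every vector but does only byte arithmetic, while the expensive precise distances are computed for a small shortlist, which is how such a change can multiply QPS at a small Recall cost.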

On KernelBench Level 3 (50 full-model optimizations such as MobileNet and VGG), GLM-5.1 delivers a 3.6x geometric-mean speedup over the PyTorch baseline (torch.compile max-autotune: 1.49x), sustaining gains over 1,200 turns per problem in isolated H100 Docker containers. It outlasts GLM-5, which plateaus early, and Claude Opus 4.5, trailing only Opus 4.6 (4.2x) while showing more remaining headroom. Audits confirm no benchmark exploits.
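The headline KernelBench figure is a geometric mean of per-problem speedups, which rewards consistent gains across all 50 problems rather than a few outliers. A quick sketch with hypothetical per-problem numbers:

```python
import math

def geomean(speedups):
    # Geometric mean: the n-th root of the product, computed in log space
    # for numerical stability.
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

# Hypothetical per-problem speedups vs. the PyTorch eager baseline.
speedups = [1.2, 2.5, 4.0, 8.0, 3.1]
print(round(geomean(speedups), 2))
```

Note how the single 8x result lifts the geometric mean far less than it would an arithmetic mean (3.76x here), which is why the metric is standard for reporting speedup suites.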

For open-ended tasks without metrics, GLM-5.1 builds a Linux desktop web app from scratch over 8 hours: it starts with a taskbar and windows, then iterates to add a file browser, terminal, editor, system monitor, calculator, and games, polishing the UI, interactions, and edge cases through self-review loops.

Tops Coding and Agentic Benchmarks with Precise Judgment

GLM-5.1 leads SWE-Bench Pro at 58.4% (vs. GLM-5 55.1%, GPT-5.4 57.7%), NL2Repo at 42.7% (vs. GLM-5 35.9%), Terminal-Bench 2.0 Terminus-2 at 63.5%, and CyberGym at 68.7% (Claude Code harness). In agentic evals: BrowseComp w/ Context Manage 79.3%, τ³-Bench 70.6%, MCP-Atlas 71.8%, Tool-Decathlon 40.7%, Vending Bench 2 $5,634 revenue. Reasoning holds strong: HLE w/ Tools 52.3%, AIME 2026 95.3%, GPQA-Diamond 86.2%. Evaluation settings use long contexts (up to 202k tokens) and heavy tool use, with rule-based and model-based checks to rule out benchmark hacking.

Trade-offs: quota use runs 2-3x higher, and challenges remain in escaping local optima, keeping traces coherent over thousands of calls, and self-evaluation on metric-free tasks.

Deploy Immediately for Agentic Workflows

Open-source (MIT) on GitHub/HuggingFace/ModelScope; run inference with vLLM/SGLang. API on api.z.ai/BigModel.cn, compatible with Claude Code/OpenClaw. GLM Coding Plan: set the model to "GLM-5.1" (1-3x quota; promotional 1x off-peak); GUI via Z Code for multi-agent SSH/phone tasks. Rollout to Z.ai chat coming soon.
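As a rough sketch, self-hosted inference with vLLM's OpenAI-compatible server would look something like the following (the repo id `zai-org/GLM-5.1` and the parallelism degree are assumptions; check the published model card for the actual name and recommended serving flags):

```shell
# Install vLLM, then serve the checkpoint behind an OpenAI-compatible API.
pip install vllm
vllm serve zai-org/GLM-5.1 --tensor-parallel-size 8
```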

Summarized by x-ai/grok-4.1-fast via openrouter


© 2026 Edge