GLM-5.1 Tops Agentic Leaderboards as Cheap Open Coder

The GLM-5.1 post-training update excels at long-running agentic tasks and coding (2nd on the agentic leaderboard, 5th overall) and feels snappier because it skips unnecessary reasoning, but it regresses in general chat and math.

Agentic and Coding Strengths Outpace Prior Versions

GLM-5.1, a post-training update to GLM-5 with an unchanged parameter count, prioritizes agentic workflows, making it superior at instruction following, debugging, planning, and staying on task. It handles long-running tasks better than GLM-5 by taking in the full context before acting and by avoiding over-reasoning on simple prompts, which yields snappier responses well suited to production coding pipelines.

In agent setups like OpenClaw or the kilo CLI, it self-fixes errors (e.g., running lint and iterating until the code works) and produces functional apps from a single prompt: a Go terminal calculator built with Bubble Tea, a kanban app in Svelte with a working database, and a movie-tracker UI more complete than what Claude 3.5 Sonnet or o1 produced. For creative coding, it generates working floor plans, SVG pandas holding burgers, Three.js Pokeballs, Kandinsky-style Minecraft clones, flying butterflies in gardens, Rust CLI tools, and Blender scripts, all more focused and effective than GLM-5 or competitors like Codex.
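The lint-and-iterate behavior described above can be sketched as a simple loop. This is a minimal illustration, not any framework's actual implementation: `run_lint` and `generate_fix` are hypothetical stand-ins for a real linter invocation and a model call.

```python
def self_fix_loop(code, run_lint, generate_fix, max_rounds=5):
    """Repeatedly lint `code`; on errors, feed them back for a fix.

    run_lint(code) -> list of error strings (empty means clean).
    generate_fix(code, errors) -> revised code (a model call in practice).
    """
    for _ in range(max_rounds):
        errors = run_lint(code)
        if not errors:
            return code          # lint passes: done
        code = generate_fix(code, errors)
    return code                  # give up after max_rounds, return best effort
```

The cap on rounds matters in practice: an agent that loops on lint output without a budget can stall a pipeline on an unfixable error.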

This coding bias stems from heavy RLHF on code data, which leads it to output HTML/code even for riddles (e.g., building a page to present the 'smoke' answer) but enhances reliability in tool-using agents.

Regressions Limit Non-Agentic Use

General chat is weaker than in GLM-5: it fails math questions, overuses code/HTML unnecessarily (triggered by system prompts like 'use codeblocks where necessary'), and feels less natural without agent scaffolding. Avoid it for pure conversation or non-coding queries; instead, pair it with tools like OpenClaw so math and tool calls are routed externally.
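Routing math away from the model can be done with a tiny dispatcher. The sketch below is a toy illustration under stated assumptions: real agent frameworks use tool-calling for this, and `call_model` here is a hypothetical stand-in for the LLM. Pure arithmetic is evaluated safely via the AST instead of being sent to the model.

```python
import ast
import operator
import re

# Supported binary operators for safe arithmetic evaluation.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def eval_arith(expr):
    """Safely evaluate a +-*/ expression by walking its AST."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("not plain arithmetic")
    return walk(ast.parse(expr, mode="eval"))

def route(query, call_model):
    """Send pure arithmetic to the evaluator; everything else to the model."""
    if re.fullmatch(r"[\d\s+\-*/().]+", query):
        try:
            return eval_arith(query)
        except (ValueError, SyntaxError):
            pass  # malformed arithmetic: fall through to the model
    return call_model(query)
```

Routing like this sidesteps the model's math weakness entirely for simple cases while leaving open-ended queries to the LLM.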

The team has acknowledged the code-overuse issue and may patch it before release.

Benchmark Ranks and Cost Edge

Kingbench scores: 5th overall (the regression in general tasks offsets the agentic gains) and 2nd on the agentic leaderboard, a remarkable result for an open model challenging closed giants like Opus or Codex at a fraction of the cost.

Availability: coding plan/API first (live roughly 12 hours after the video), with weights to follow soon and no big launch event. For cheap, production-grade coding agents, it is worth switching to over pricier options.

Video description
In this video, I'll be sharing my early thoughts on GLM-5.1 after getting early access from Z AI. I'll talk about how it improves long-running and agentic tasks, where it regresses in general chat, and why it might be one of the best cheap open models for coding right now.

Key Takeaways:
🚀 GLM-5.1 is mostly a post-train update to GLM-5, and it is noticeably better at long-running and agentic tasks.
🤖 The model is much more coding-focused now, sometimes using code or HTML even when a simple text answer would be better.
🛠️ Instruction following, debugging, and staying focused on the main objective are all much better than before.
⚡ GLM-5.1 feels snappier than GLM-5 because it does less unnecessary reasoning on simple tasks.
📉 General chat performance seems weaker now, especially for math and non-agentic use cases.
🏆 In my tests, it ranks 5th on the overall leaderboard and 2nd on the agentic leaderboard, which is very impressive.
💸 For the price, GLM-5.1 feels extremely competitive and could be a serious alternative to models like Codex and Opus for coding workflows.

Summarized by x-ai/grok-4.1-fast via openrouter

5175 input / 1092 output tokens in 8036ms

© 2026 Edge