GLM-5.1 Climbs Agentic Leaderboards as a Cheap Open Coder
GLM-5.1, a post-training update, excels at long-running agentic tasks and coding (2nd on the agentic leaderboard, 5th overall) and feels snappier by skipping unnecessary reasoning, but regresses in general chat and math.
Agentic and Coding Strengths Outpace Prior Versions
GLM-5.1, a post-training update to GLM-5 with an unchanged parameter count, prioritizes agentic workflows, making it stronger at instruction following, debugging, planning, and staying on task. It handles long-running tasks better than GLM-5 by taking in the full context before acting and by avoiding over-reasoning on simple prompts, which makes responses feel snappier and suits production coding pipelines.
In agent setups like OpenClaw or the Kilo CLI, it self-fixes errors (e.g., running lint and iterating until the code passes), producing functional apps from a single prompt: a Go terminal calculator built with Bubble Tea, a Kanban app in Svelte with a working database, and a movie-tracker UI more complete than what Claude 3.5 Sonnet or o1 produce. For creative coding, it generates working floor plans, SVG pandas holding burgers, Three.js Pokeballs, Kandinsky-style Minecraft clones, flying butterflies in gardens, Rust CLI tools, and Blender scripts, all more focused and effective than GLM-5 or competitors like Codex.
This coding bias stems from heavy RLHF on code data: it will output HTML/code even for riddles (e.g., building a web page to present the answer 'smoke'), but the same bias makes it more reliable inside tool-using agents.
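The self-fixing loop described above (generate, lint, feed errors back, retry) can be sketched in a few lines. This is a hedged illustration, not a real GLM-5.1 integration: `call_model` and `run_lint` are hypothetical stand-ins for the LLM call and the linter invocation an agent harness would actually make.

```python
from dataclasses import dataclass, field

@dataclass
class LintResult:
    ok: bool
    errors: list = field(default_factory=list)

def call_model(prompt: str, feedback: list) -> str:
    # Stand-in for an LLM API call; a real harness would send `feedback`
    # (prior lint errors) back to the model alongside the original prompt.
    return "fixed code" if feedback else "draft code"

def run_lint(code: str) -> LintResult:
    # Stand-in for running e.g. eslint or golangci-lint on the generated file.
    if code == "fixed code":
        return LintResult(ok=True)
    return LintResult(ok=False, errors=["E001: unused variable"])

def self_fix(prompt: str, max_iters: int = 5) -> str:
    """Generate code, lint it, and iterate until the linter passes."""
    feedback: list = []
    for _ in range(max_iters):
        code = call_model(prompt, feedback)
        result = run_lint(code)
        if result.ok:
            return code
        feedback = result.errors  # hand lint output back to the model
    raise RuntimeError("lint never passed within the iteration budget")
```

The point of the design is that the model never ships its first draft: the loop only terminates on a clean lint run or an explicit iteration cap.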
Regressions Limit Non-Agentic Use
General chat is weaker than GLM-5's: it misses math questions, overuses code/HTML (easily triggered by system prompts like 'use codeblocks where necessary'), and feels less natural outside agent scaffolding. Avoid it for pure conversation or non-coding queries, or pair it with a harness like OpenClaw that routes math and tool calls externally.
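The routing workaround above can be sketched as a small dispatcher that keeps GLM-5.1 on coding tasks and sends math or general queries elsewhere. The keyword heuristics and the target names (`math_tool`, `glm51_coder`, `general_model`) are toy assumptions for illustration, not part of any real scaffold.

```python
import re

def route(query: str) -> str:
    # Arithmetic or calculus-looking queries go to an external tool,
    # not the coding-biased model.
    if re.search(r"\d+\s*[-+*/^]\s*\d+|integral|derivative", query):
        return "math_tool"
    # Coding-flavored queries stay with GLM-5.1 in the agent harness.
    if re.search(r"\b(code|bug|refactor|implement|debug)\b", query, re.I):
        return "glm51_coder"
    # Everything else falls back to a general chat model.
    return "general_model"
```

A real harness would do this with a classifier or tool-calling rather than regexes, but the shape is the same: the coding model only ever sees coding work.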
The team has acknowledged the code-overuse issue and may patch it before release.
Benchmark Ranks and Cost Edge
Kingbench scores: 5th overall (the regression in general tasks offsets the agentic gains) and 2nd on the agentic leaderboard, insane for an open model challenging closed giants like Opus or Codex at a fraction of the cost.
Availability: coding plan/API first (live roughly 12 hours after the video), open weights to follow, no big launch event. Switch to it for cheap, production-grade coding agents over pricier options.