GLM-5.1 Ranks 2nd on Agentic Leaderboard as Cheap Open Coder

Video description

In this video, I'll be sharing my early thoughts on GLM-5.1 after getting early access from Z AI. I'll talk about how it improves long-running and agentic tasks, where it regresses in general chat, and why it might be one of the best cheap open models for coding right now.

Key Takeaways:
🚀 GLM-5.1 is mostly a post-train update to GLM-5, and it is noticeably better at long-running and agentic tasks.
🤖 The model is much more coding-focused now, sometimes using code or HTML even when a simple text answer would be better.
🛠️ Instruction following, debugging, and staying focused on the main objective are all much better than before.
⚡ GLM-5.1 feels snappier than GLM-5 because it does less unnecessary reasoning on simple tasks.
📉 General chat performance seems weaker now, especially for math and non-agentic use cases.
🏆 In my tests, it ranks 5th on the overall leaderboard and 2nd on the agentic leaderboard, which is very impressive.
💸 For the price, GLM-5.1 feels extremely competitive and could be a serious alternative to models like Codex and Opus for coding workflows.

Agentic and Coding Strengths Outpace Prior Versions

GLM-5.1 is a post-training update to GLM-5 with the same parameter count. It prioritizes agentic workflows and is better at instruction following, debugging, planning, and staying on task. It handles long-running tasks better than GLM-5 because it takes in the full context before acting, and it avoids over-reasoning on simple prompts, so responses feel snappier. Both traits suit production coding pipelines.

In agent setups like OpenClaw or Kilo CLI, it fixes its own errors (for example, running lint and iterating until the code works) and produces functional apps from a single prompt: a Go terminal calculator built with Bubble Tea, a Kanban app in Svelte with a working database, and a movie tracker UI that outperforms Claude 3.5 Sonnet or o1 in completeness. For creative coding, it generates working floor plans, SVG pandas holding burgers, Three.js Pokeballs, Kandinsky-style Minecraft clones, flying butterflies in gardens, Rust CLI tools, and Blender scripts, all more focused and effective than the output of GLM-5 or competitors like Codex.
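
The "run lint and iterate until the code works" behavior can be pictured as a small check-and-revise loop. The following is only a minimal sketch of that general pattern, not how OpenClaw, Kilo CLI, or GLM-5.1 implement it. The names `run_check`, `fix_until_clean`, and `fake_model` are hypothetical, Python's built-in `compile()` stands in for a real linter, and a hard-coded function stands in for the model call.

```python
from typing import Callable, List, Tuple


def run_check(source: str) -> List[str]:
    """Toy stand-in for a linter: report syntax errors found by compile()."""
    try:
        compile(source, "<candidate>", "exec")
        return []
    except SyntaxError as exc:
        return [f"line {exc.lineno}: {exc.msg}"]


def fix_until_clean(
    source: str,
    revise: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> Tuple[str, int, List[str]]:
    """Check the code, hand any errors to `revise`, and repeat.

    Stops when the check passes or after `max_rounds` revisions.
    Returns the final source, the number of revisions, and remaining errors.
    """
    errors = run_check(source)
    rounds = 0
    while errors and rounds < max_rounds:
        source = revise(source, errors)  # hypothetical model call
        rounds += 1
        errors = run_check(source)
    return source, rounds, errors


def fake_model(source: str, errors: List[str]) -> str:
    """Hypothetical model stand-in: adds the colon missing from the def line."""
    return source.replace("def add(a, b)\n", "def add(a, b):\n")


broken = "def add(a, b)\n    return a + b\n"
fixed, rounds, remaining = fix_until_clean(broken, fake_model)
print(rounds, remaining)
```

The round limit matters in practice: without it, a model that keeps producing the same broken output would loop forever.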

This coding bias appears to stem from heavy RLHF on code data. It leads the model to output HTML or code even for riddles (for example, building a web page to present the answer 'smoke'), but it also makes the model more reliable in tool-using agents.

Regressions Limit Non-Agentic Use

General chat is weaker than in GLM-5: it fails math questions, overuses code and HTML when they are not needed (a behavior triggered by system prompts such as 'use codeblocks where necessary'), and feels less natural without agent scaffolding. Avoid it for pure conversation or non-coding queries, or pair it with tools like OpenClaw so that math and tool calls are routed externally.
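
Routing math externally can be as simple as trying a deterministic evaluator first and falling back to the model only when the query is not plain arithmetic. The sketch below illustrates that idea under stated assumptions. It is not OpenClaw's API: `eval_arithmetic`, `route`, and the `ask_model` callback are hypothetical names, and the evaluator handles only basic arithmetic.

```python
import ast
import operator
from typing import Callable, Union

Number = Union[int, float]

# Only these operators are allowed; anything else is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}


def eval_arithmetic(expr: str) -> Number:
    """Evaluate +, -, *, / on numeric literals without using eval()."""

    def walk(node: ast.AST) -> Number:
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")

    return walk(ast.parse(expr, mode="eval"))


def route(query: str, ask_model: Callable[[str], str]) -> str:
    """Answer plain arithmetic with the evaluator; send everything else to the model."""
    try:
        return str(eval_arithmetic(query))
    except (ValueError, SyntaxError, ZeroDivisionError):
        return ask_model(query)  # hypothetical model call


print(route("2 + 3 * 4", lambda q: "model answer"))
print(route("Explain goroutines", lambda q: "model answer"))
```

A real setup would more likely expose the calculator as a tool the agent can call, but the division of labor is the same: the model handles language and code, and exact arithmetic goes to something deterministic.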

The Z AI team acknowledged the code-overuse issue and may patch it before release.

Benchmark Ranks and Cost Edge

Kingbench scores: 5th overall (the regression in general tasks offsets the agentic gains) and 2nd on the agentic leaderboard. That is a striking result for an open model challenging closed giants like Opus or Codex at a fraction of the cost.

Availability: the coding plan and API come first (expected live about 12 hours after the video), with the weights to follow soon after and no big launch. Consider switching to it for cheap, production-grade coding agents in place of pricier options.
