GLM-5.1 Builds Laravel App in 20 Mins Despite Hiccups

GLM-5.1 generated a full Laravel checklist app with PDF export from a single 20-minute prompt, iteratively fixing its own test failures, but the code was rougher than Opus 4.6's 6-minute version, which also shipped a better UI.

Long-Horizon Task Execution: GLM-5.1's Iterative Delivery

GLM-5.1 handled a simplified Upwork project—build a Laravel app with Livewire for checklists, progress saving, dashboard, and PDF export—using a single prompt listing 16 tasks. Accessed via OpenRouter in VS Code, it ran for 20 minutes, generating migrations, models, seeders, and Livewire 4 components with Flux UI. It used a default Laravel + Livewire starter kit, producing functional features: users select yes/partially/no answers, save progress, view dashboard stats (e.g., 2/9 answered), mark complete, and download a basic PDF report.

The model iterated through failures autonomously: tests initially failed due to incorrect Livewire test syntax, wrong Flux attributes (e.g., 'outlined' instead of 'outline', a missing 'clipboard-check' icon), and non-existent components it invented. It switched from a Flux radio to a select for the answer options and fixed one issue at a time after reading large test outputs, eventually passing all 11 tests; because it never enabled Pest's compact result format, those outputs consumed extra tokens. Despite apparently lacking training data on the latest Livewire/Flux versions, it delivered a working first draft without manual intervention, though skills such as Claude MD or Flux UI (enabled but possibly unused) could have cut the dozens of fix attempts down to a few.
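For context on the test syntax GLM stumbled over, here is a minimal sketch of how a Livewire component is exercised in a Pest test; the component class, property, and method names are assumptions for illustration, not GLM's actual code.

```php
<?php

// tests/Feature/ChecklistTest.php — hedged sketch; ChecklistForm, the
// 'answers' property, and the save() action are hypothetical names.
use App\Livewire\ChecklistForm;
use Livewire\Livewire;

it('saves a partially answer for an item', function () {
    Livewire::test(ChecklistForm::class)
        ->set('answers.1', 'partially') // set a public property on the component
        ->call('save')                  // invoke the component action
        ->assertHasNoErrors();          // no validation errors after saving
});
```

Getting this `Livewire::test(...)->set(...)->call(...)` chain wrong is exactly the kind of version-specific detail that caused the repeated failures described above.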

Cost via OpenRouter: the VS Code client displayed $4, but the actual session cost was $0.215, a discrepancy apparently caused by the client's pricing display.

Comparison: Opus 4.6 Wins on Speed, UI, and Structure

Opus 4.6 (via Claude Code) completed the identical prompt in 6 minutes, yielding superior results. It used radio buttons instead of dropdowns for better UX, produced a more styled PDF with tables, and incorporated controllers (e.g., ChecklistPdfController for downloads, DashboardController for stats) rather than inline route closures—aligning with best practices for maintainability.
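The controller-vs-closure difference can be sketched in a few lines of routing code; this is a hedged illustration using the controller names mentioned above, with the routes and method names assumed rather than taken from Opus's actual output.

```php
<?php

// routes/web.php — sketch of the controller-based style Opus chose.
// ChecklistPdfController and DashboardController come from the review;
// the URIs and method names here are assumptions.
use App\Http\Controllers\ChecklistPdfController;
use App\Http\Controllers\DashboardController;
use Illuminate\Support\Facades\Route;

// GLM-style inline closure (harder to test and maintain):
// Route::get('/dashboard', fn () => view('dashboard', [/* stats */]));

// Opus-style controller routes, keeping logic out of the route file:
Route::get('/dashboard', [DashboardController::class, 'index']);
Route::get('/checklists/{checklist}/pdf', [ChecklistPdfController::class, 'download']);
```

Controllers keep route files declarative and make the handlers individually testable, which is the maintainability point the comparison is making.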

Opus avoided GLM's trial-and-error loops by generating cleaner code upfront, though it required an NPM build for full styling. Both used single-file Livewire components with shared app layouts, but Opus prioritized user-friendly interactions, such as a prominent 'Mark as Completed' confirmation.

Code Quality Gaps Exposed by Opus Review

Opus reviewed GLM's code in 3 minutes, identifying 15 issues:

  • Architecture: inline route closures instead of controllers (partly personal preference, but controllers give better separation).
  • Performance: N+1 queries in the dashboard (it loads full sections/items; fix with withCount to fetch item counts only).
  • Validation/Security: no input validation on save (add max-length and enum checks for yes/partially/no); hardcoded checklist ID (acceptable for a demo); missing factories and policies (minor for a demo).
  • Efficiency: the PDF download re-queries data (cache it or reload from the session); response constants are repeated (extract them to config).
  • Testing: the database refresh strategy could be lazier; tests pass but setup could be optimized.
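Two of these fixes are small enough to sketch directly; the model name, relation name, and request field below are assumptions for illustration, not GLM's actual code.

```php
<?php

// Hedged sketch of the review's N+1 and validation fixes.
// ChecklistSection, its 'items' relation, and the 'answer' field are assumed names.
use App\Models\ChecklistSection;

// N+1 fix: withCount computes items_count in SQL instead of
// loading every related item just to count it in the dashboard.
$sections = ChecklistSection::withCount('items')->get();

// Input validation on save: constrain the answer to the three allowed values.
$validated = request()->validate([
    'answer' => ['required', 'string', 'in:yes,partially,no'],
]);
```

Each `$section->items_count` is then available without extra queries, and the `in:` rule rejects anything outside yes/partially/no before it reaches the database.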

GLM's code worked for the demo (functional saves, PDF generation) but needs refinement for production: multi-model reviews, validation, and optimizations. Neither model's output is production-ready in one shot; treat it as a draft, then iterate.

Trade-offs: GLM-5.1 shines for endurance on complex, multi-step tasks (20+ minute runs are viable), but lags Opus in precision and speed, especially on niche stacks like Livewire v4/Flux without skills activation.

Video description
New GLM-5.1 is out, and I decided to try it out right away.
Link to the official announcement: https://x.com/Zai_org/status/2041550153354519022?s=20
More of my AI Coding experiments on my website: https://aicodingdaily.com?mtm_campaign=youtube-channel-default-link

Summarized by x-ai/grok-4.1-fast via openrouter

5933 input / 1571 output tokens in 15121ms

© 2026 Edge