Top Models Excel on Filament Enum Integration, Others Falter

To correctly render PHP enums in Filament forms and tables with automatic colors and labels, implement the HasColor and HasLabel interfaces on the enum (e.g., PostStatus). Filament then renders badges without extra code; you only call badge() on the column. The test prompt targeted exactly this: generate a Filament resource that uses enums properly.
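A minimal sketch of such an enum (the PostStatus cases and the specific color names are my assumptions, not the test prompt's; the Filament contracts are left as comments so the snippet runs without Filament installed):

```php
<?php
// In a real app the enum would also declare:
//   implements Filament\Support\Contracts\HasColor, Filament\Support\Contracts\HasLabel
// (commented out here so the sketch runs standalone).
enum PostStatus: string
{
    case Draft = 'draft';
    case Published = 'published';
    case Archived = 'archived';

    // Filament calls this via the HasLabel contract to render the badge text.
    public function getLabel(): string
    {
        return match ($this) {
            self::Draft => 'Draft',
            self::Published => 'Published',
            self::Archived => 'Archived',
        };
    }

    // Filament calls this via the HasColor contract; these map to its color palette.
    public function getColor(): string
    {
        return match ($this) {
            self::Draft => 'gray',
            self::Published => 'success',
            self::Archived => 'danger',
        };
    }
}
```

With the contracts in place, a `TextColumn::make('status')->badge()` picks up the label and color automatically.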

I tested six LLMs with an identical prompt, three runs each, and validated the output via automated tests:

  • Claude 3 Opus and GPT-4o (latest): 3/3 perfect, no test failures.
  • Qwen2.5 (Kimi) and Gemini 1.5 Pro: 2/3 successes each. Gemini's third failure stemmed from a namespace error (a 500, the page wouldn't load), unrelated to enums.
  • GLM and MiniMax: GLM went 1/3 (it implemented HasColor but missed HasLabel, so the form failed); MiniMax went 0/3 (no awareness of Filament's enum contracts, and the feature broke).

Key lesson: frontier models (Opus, GPT) consistently grasp a niche framework like Filament better than the alternatives, with the exception of Qwen, which punches above its weight and costs less via the OpenRouter API. Opus and GPT ran on subscriptions (GPT used 15% of its 5-hour limit; Opus 28%), while the rest ran via OpenRouter, where Qwen was cheapest. Opus was the fastest but the hungriest for tokens; GPT was slower but more token-efficient.

Intra-Model Variability Requires Manual Review

Even runs that pass every test aren't identical. I used GPT-4o to diff the three runs each from Opus and GPT-4o:

Opus runs: the enum and the model class were identical across runs. Differences:

  • Return types: run 1 added explicit return types; runs 2-3 used string (both valid).
  • Fillable: run 3 used a #[Fillable] PHP attribute; runs 1-2 used the $fillable array. A matter of preference, though core Laravel documents only the array, so the attribute variant deserves a check that a package actually backs it.
  • Form defaults: slight tweaks to default values (Filament is flexible here).
  • Table extras: run 2 added an unrequested filter and a title attribute (a UX win, but optional).

GPT-4o runs: the enum was identical across runs. Differences:

  • A textarea set to rows(8) (a UI choice).
  • A sortable() call on the badge column (a UX decision).
  • Phrasing and finishing details vary.
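For concreteness, the two cosmetic differences above look roughly like this as Filament v3 component calls (the field and column names are my assumptions; this is a resource fragment, not runnable on its own):

```php
// Inside the resource's form() definition:
Forms\Components\Textarea::make('content')
    ->rows(8),          // run-specific UI choice

// Inside the resource's table() definition:
Tables\Columns\TextColumn::make('status')
    ->badge()           // enum color/label rendered automatically via the contracts
    ->sortable(),       // run-specific UX addition
```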

The point: the same prompt yields small but meaningful differences in UX (e.g., sortable tables), defaults, and polish. LLMs make unprompted choices, so review line by line, especially git diffs over details like textarea rows or model attributes. For complex code, build eval tools to scale those checks.
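A toy version of such an eval helper (my sketch, not the tooling used here) that surfaces lines present in one run's output but not the other, as a quick pre-review pass before a proper git diff:

```php
<?php
// Return lines unique to each run: '-' = only in run A, '+' = only in run B.
// Line-set comparison only; order and duplicates are ignored in this toy version.
function diffRuns(string $runA, string $runB): array
{
    $a = array_map('trim', explode("\n", $runA));
    $b = array_map('trim', explode("\n", $runB));

    return [
        '-' => array_values(array_diff($a, $b)), // in A, missing from B
        '+' => array_values(array_diff($b, $a)), // in B, missing from A
    ];
}

$run1 = "TextColumn::make('status')->badge()";
$run2 = "TextColumn::make('status')->badge()->sortable()";
$diff = diffRuns($run1, $run2);
// $diff['+'] now flags the unprompted ->sortable() addition for review.
```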

Practical Takeaways for LLM Coding

Run prompts at least three times and average the results; single runs risk flukes (e.g., GLM's lone 1/3 success). Prioritize Opus/GPT for framework-specific tasks, and test Qwen for cost savings. Costs matter: OpenRouter pricing favors Qwen over GLM. Token usage hints at efficiency (GPT was leaner despite being slower). The hypothesis held: variability persists, so treat LLM code as a first draft; a manual audit catches the devil in the details. Next up: bigger scenarios with automated pipelines.