Top Models Excel on Filament Enum Integration, Others Falter

To correctly render PHP enums in Filament forms and tables with automatic colors and labels, implement the HasColor and HasLabel interfaces on the enum (e.g., PostStatus). Filament then renders badges without extra code; you only call badge() on the column. The test prompt targeted exactly this: generate a Filament resource that uses enums properly.
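A minimal sketch of such an enum (the PostStatus cases and the specific color names are my assumptions, not the test prompt's; the Filament contracts are left as comments so the snippet runs without Filament installed):

```php
<?php
// In a real app the enum would also declare:
//   implements Filament\Support\Contracts\HasColor, Filament\Support\Contracts\HasLabel
// (commented out here so the sketch runs standalone).
enum PostStatus: string
{
    case Draft = 'draft';
    case Published = 'published';
    case Archived = 'archived';

    // Filament calls this via the HasLabel contract to render the badge text.
    public function getLabel(): string
    {
        return match ($this) {
            self::Draft => 'Draft',
            self::Published => 'Published',
            self::Archived => 'Archived',
        };
    }

    // Filament calls this via the HasColor contract; these map to its color palette.
    public function getColor(): string
    {
        return match ($this) {
            self::Draft => 'gray',
            self::Published => 'success',
            self::Archived => 'danger',
        };
    }
}
```

With the contracts in place, a `TextColumn::make('status')->badge()` picks up the label and color automatically.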

I tested six LLMs with an identical prompt, three runs each, and validated the output via automated tests:

  • Claude 3 Opus and GPT-4o (latest): 3/3 perfect, no test failures.
  • Qwen2.5 (Kimi) and Gemini 1.5 Pro: 2/3 successes each. Gemini's third failure stemmed from a namespace error (a 500, the page wouldn't load), unrelated to enums.
  • GLM and MiniMax: GLM went 1/3 (it implemented HasColor but missed HasLabel, so the form failed); MiniMax went 0/3 (no awareness of Filament's enum contracts, and the feature broke).

Key lesson: frontier models (Opus, GPT) consistently grasp a niche framework like Filament better than the alternatives, with the exception of Qwen, which punches above its weight and costs less via the OpenRouter API. Opus and GPT ran on subscriptions (GPT used 15% of its 5-hour limit; Opus 28%), while the rest ran via OpenRouter, where Qwen was cheapest. Opus was the fastest but the hungriest for tokens; GPT was slower but more token-efficient.

Intra-Model Variability Requires Manual Review

Even runs that pass every test aren't identical. I used GPT-4o to diff the three runs each from Opus and GPT-4o:

Opus runs: the enum and the model class were identical across runs. Differences:

  • Return types: run 1 added explicit return types; runs 2-3 used string (both valid).
  • Fillable: run 3 used a #[Fillable] PHP attribute; runs 1-2 used the $fillable array. A matter of preference, though core Laravel documents only the array, so the attribute variant deserves a check that a package actually backs it.
  • Form defaults: slight tweaks to default values (Filament is flexible here).
  • Table extras: run 2 added an unrequested filter and a title attribute (a UX win, but optional).

GPT-4o runs: the enum was identical across runs. Differences:

  • A textarea set to rows(8) (a UI choice).
  • A sortable() call on the badge column (a UX decision).
  • Phrasing and finishing details vary.
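For concreteness, the two cosmetic differences above look roughly like this as Filament v3 component calls (the field and column names are my assumptions; this is a resource fragment, not runnable on its own):

```php
// Inside the resource's form() definition:
Forms\Components\Textarea::make('content')
    ->rows(8),          // run-specific UI choice

// Inside the resource's table() definition:
Tables\Columns\TextColumn::make('status')
    ->badge()           // enum color/label rendered automatically via the contracts
    ->sortable(),       // run-specific UX addition
```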

The point: the same prompt yields small but meaningful differences in UX (e.g., sortable tables), defaults, and polish. LLMs make unprompted choices, so review line by line, especially git diffs over details like textarea rows or model attributes. For complex code, build eval tools to scale those checks.
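A toy version of such an eval helper (my sketch, not the tooling used here) that surfaces lines present in one run's output but not the other, as a quick pre-review pass before a proper git diff:

```php
<?php
// Return lines unique to each run: '-' = only in run A, '+' = only in run B.
// Line-set comparison only; order and duplicates are ignored in this toy version.
function diffRuns(string $runA, string $runB): array
{
    $a = array_map('trim', explode("\n", $runA));
    $b = array_map('trim', explode("\n", $runB));

    return [
        '-' => array_values(array_diff($a, $b)), // in A, missing from B
        '+' => array_values(array_diff($b, $a)), // in B, missing from A
    ];
}

$run1 = "TextColumn::make('status')->badge()";
$run2 = "TextColumn::make('status')->badge()->sortable()";
$diff = diffRuns($run1, $run2);
// $diff['+'] now flags the unprompted ->sortable() addition for review.
```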

Practical Takeaways for LLM Coding

Run prompts at least three times and average the results; single runs risk flukes (e.g., GLM's lone 1/3 success). Prioritize Opus/GPT for framework-specific tasks, and test Qwen for cost savings. Costs matter: OpenRouter pricing favors Qwen over GLM. Token usage hints at efficiency (GPT was leaner despite being slower). The hypothesis held: variability persists, so treat LLM code as a first draft; a manual audit catches the devil in the details. Next up: bigger scenarios with automated pipelines.