Top Models Excel on Filament Enum Integration, Others Falter
To correctly render PHP enums in Filament forms and tables with automatic coloring and labels, implement the `HasColor` and `HasLabel` interfaces on your enum (e.g., `PostStatus`). Filament then renders badges without extra code: just call `badge()` on the column. The test prompt targeted exactly this: generate a Filament resource that uses enums properly.
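A minimal sketch of such an enum follows. The interface declarations are stubbed so the snippet runs standalone; in a real app you would import `HasLabel` and `HasColor` from `Filament\Support\Contracts` instead, and the case names and colors here are illustrative assumptions.

```php
<?php
// Stub declarations so this sketch runs without Filament installed;
// in a real app, import Filament\Support\Contracts\HasLabel and HasColor.
interface HasLabel { public function getLabel(): ?string; }
interface HasColor { public function getColor(): string|array|null; }

enum PostStatus: string implements HasLabel, HasColor
{
    case Draft = 'draft';
    case Published = 'published';
    case Archived = 'archived';

    // Filament calls getLabel() wherever the enum is displayed.
    public function getLabel(): ?string
    {
        return match ($this) {
            self::Draft => 'Draft',
            self::Published => 'Published',
            self::Archived => 'Archived',
        };
    }

    // Filament maps these names onto its badge color palette.
    public function getColor(): string|array|null
    {
        return match ($this) {
            self::Draft => 'gray',
            self::Published => 'success',
            self::Archived => 'danger',
        };
    }
}
```

With this in place, a column like `TextColumn::make('status')->badge()` picks up both the label and the color automatically.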
Tested 6 LLMs with identical prompt, 3 runs each, validated via automated tests:
- Claude 3 Opus and GPT-4o (latest): 3/3 perfect—no test failures.
- Qwen2.5 and Gemini 1.5 Pro: 2/3 successes each. Gemini's third failure stemmed from a namespace error (HTTP 500, page wouldn't load), unrelated to enums.
- GLM and MiniMax: GLM 1/3 (implemented `HasColor` but missed `HasLabel`, causing the form to fail); MiniMax 0/3 (no awareness of Filament's enum support, so the feature broke).
Key lesson: Frontier models (Opus, GPT) grasp niche frameworks like Filament more consistently than the alternatives. Qwen is the exception, punching above its weight while costing less via the OpenRouter API. Opus and GPT ran on subscriptions (GPT used 15% of its 5-hour limit; Opus 28%); the rest ran via OpenRouter, with Qwen cheapest. Opus was fastest but hungriest for tokens; GPT was slower but more token-efficient.
Intra-Model Variability Requires Manual Review
Even runs that pass every test aren't identical. I used GPT-4o to diff the three runs from Opus and the three from GPT-4o:
Opus runs: Enum/model identical. Differences:
- Return types: Run 1 added them; runs 2-3 used `string` (both valid).
- Fillable: Run 3 used a PHP attribute (`#[Fillable]`, Laravel 11+); runs 1-2 used the classic array (personal preference, both work).
- Form defaults: slight value tweaks (Filament is flexible here).
- Table extras: Run 2 added an unrequested filter and a title attribute (a UX win, but optional).
GPT-4o runs: Enum identical. Differences:
- Textarea rows set to 8 (a UI choice).
- Badge `sortable()` (a UX decision).
- Phrasing and finishing details vary.
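To make that kind of divergence concrete, here is the sort of one-liner difference between two passing runs, sketched against Filament's table builder API (the `status` field name is an assumption):

```php
// Run A: plain badge column, exactly what the prompt asked for.
Tables\Columns\TextColumn::make('status')
    ->badge();

// Run B: same column, but the model also made it sortable, unprompted.
Tables\Columns\TextColumn::make('status')
    ->badge()
    ->sortable();
```

Both pass the automated tests; only a diff review reveals that one run shipped an extra UX decision.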
The takeaway: the same prompt yields small but meaningful differences in UX (e.g., sortable tables), defaults, and polish. LLMs make unprompted choices, so review line by line, especially git diffs on details like textarea rows or model attributes. For complex code, build evaluation tooling to scale these checks.
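As a starting point for such tooling, a naive line-by-line comparison that flags where two generated files disagree might look like this (the `diffRuns` helper is hypothetical; a real pipeline would use a proper diff such as `git diff --no-index`):

```php
<?php
// Hypothetical helper: report the lines where two LLM-generated
// files differ. Returns [lineNumber => [lineFromA, lineFromB]].
function diffRuns(string $runA, string $runB): array
{
    $a = explode("\n", $runA);
    $b = explode("\n", $runB);
    $diffs = [];
    for ($i = 0, $n = max(count($a), count($b)); $i < $n; $i++) {
        if (($a[$i] ?? '') !== ($b[$i] ?? '')) {
            // 1-indexed line numbers, matching how editors display them.
            $diffs[$i + 1] = [$a[$i] ?? '', $b[$i] ?? ''];
        }
    }
    return $diffs;
}
```

Feeding it two runs of the same resource file surfaces exactly the `sortable()`-style one-liners that deserve a human look.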
Practical Takeaways for LLM Coding
Run prompts 3+ times and aggregate the results; single runs risk flukes (e.g., GLM's 1/3 win). Prioritize Opus/GPT for framework-specific tasks, and test Qwen for cost savings. Costs matter: OpenRouter API pricing favors Qwen over GLM, and token usage hints at efficiency (GPT leaner despite being slower). The hypothesis held: variability persists, so treat LLM code as a first draft and let a manual audit catch the devil in the details. Next step: test bigger scenarios with automated pipelines.