Test Claude Skills with Skill Creator + Eval Maker

Anthropic's Skill Creator 2.0 automates A/B testing of Claude skills with Grader, Blind Comparator, and Analyzer agents, but weak assertions undermine the results. The fix: the Eval Maker skill, which generates targeted evals grounded in the skill's actual purpose.

Untested Skills Hide Costly Flaws

Claude skills (packaged instruction sets for specific tasks) often ship with vague directions, unreliable triggers, unhelpful examples, redundant text, and wasted tokens, even when outputs seem "good enough." Fire-and-forget creation leaves 20-40% of potential gains on the table; every skill improves at least once under optimization. The trade-off: initial simplicity costs reliability and efficiency, and testing surfaces patterns such as overlapping instructions that confuse the agent.

The author's Workspace Auditor skill, which audits folders for Claude Code setups, exemplifies continuous refinement. Demo: the Tagline Writer skill improved from a baseline of 67% pass rate, 23.6 s, and 29,610 tokens to a 100% pass rate, 20.3 s, and 33,400 tokens (+13% tokens, but faster and with stricter format adherence).

Skill Creator 2.0 Delivers Repeatable A/B Testing

Anthropic's updated Skill Creator (GitHub: anthropics/skills/tree/main/skills/skill-creator) structures skills with SKILL.md (YAML frontmatter + instructions) and optional subfolders (scripts/, references/, assets/). Core agents automate testing:
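The structure described above can be sketched as a minimal SKILL.md. The skill name, description wording, and instruction text here are illustrative, using the article's Tagline Writer example; the YAML frontmatter plus markdown-instructions layout is the format the repo documents:

```markdown
---
name: tagline-writer
description: Writes exactly three short, casual taglines for a product. Use when the user asks for taglines or slogans.
---

# Tagline Writer

Produce exactly 3 taglines, each 100 characters or fewer, each taking a distinct angle.
Never invent facts, use exclamation marks, or add emoji. Keep the tone casual.
```

Optional subfolders (scripts/, references/, assets/) sit alongside SKILL.md when the skill needs supporting files.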

  • Grader: Pass/fail per assertion (e.g., Tagline Writer's 6/6: ≤100 chars/tagline, exactly 3 taglines, distinct angles, no invented facts, no !/emoji, casual tone).
  • Blind Comparator: Ranks outputs blindly to confirm improvements.
  • Analyzer: Aggregates results, flags weaknesses (e.g., baseline over-delivers 5-16 taglines; skill enforces exactly 3).
  • Skill Description Improver: Refines triggers for reliable invocation.
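The Grader's mechanical checks can be sketched in Python. The function below is illustrative, not Skill Creator's implementation; it covers the four Tagline Writer assertions that are programmatically checkable (the remaining two, distinct angles and no invented facts, need an LLM grader):

```python
import re

def grade_taglines(output: str) -> dict:
    """Grade Tagline Writer output against its mechanical assertions (illustrative)."""
    lines = [l.strip() for l in output.splitlines() if l.strip()]
    # Strip optional "1." / "2." numbering before measuring length.
    taglines = [re.sub(r"^\d+\.\s*", "", l) for l in lines]
    return {
        "exactly_3_taglines": len(taglines) == 3,
        "each_le_100_chars": all(len(t) <= 100 for t in taglines),
        "no_exclamation": all("!" not in t for t in taglines),
        "no_emoji": all(t.isascii() for t in taglines),  # crude proxy: emoji are non-ASCII
    }

result = grade_taglines(
    "1. Coffee that works as hard as you do\n"
    "2. Small batch, big mornings\n"
    "3. Your desk's favorite ritual"
)
pass_rate = sum(result.values()) / len(result)
```

Each assertion grades pass/fail independently, so the Analyzer can report which promise the skill breaks rather than a single opaque score.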

Workflow: generate 3 test prompts plus assertions, run the skill against a no-skill baseline, and produce an HTML report with grades, deltas (e.g., +0.333 pass rate), timings, tokens, and a feedback box. Then iterate: review, give feedback (e.g., 'Needs more cowbell'), optimize, retest. Minimal user input yields concrete outputs such as numbered taglines in output.txt.
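The pass-rate delta in the report is simple aggregation. A minimal sketch, with made-up per-run grades chosen so the baseline fails 2 of 6 assertions, reproducing the +0.333 delta from the Tagline Writer demo:

```python
def pass_rate(runs: list[dict[str, bool]]) -> float:
    """Fraction of passed assertions across all test runs (Analyzer-style aggregate)."""
    results = [ok for run in runs for ok in run.values()]
    return sum(results) / len(results)

# Illustrative grades: baseline over-delivers taglines, failing the count assertion.
baseline = [{"length": True, "count": False, "tone": True},
            {"length": True, "count": False, "tone": True}]
with_skill = [{"length": True, "count": True, "tone": True},
              {"length": True, "count": True, "tone": True}]

delta = pass_rate(with_skill) - pass_rate(baseline)  # 1.0 - 0.667 = +0.333
```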

Assertions Are the Bottleneck—Eval Maker Fixes It

Skill Creator's assertion guidance is thin (two paragraphs: 'quantitative, verifiable, descriptive names') and leans on Claude's intuition, risking irrelevant metrics like 'output has letters.' Good assertions must (1) match the skill's explicit and implicit promises, (2) check both quality and avoidance of errors, and (3) enable unambiguous grading.
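The difference is concrete. A weak assertion passes almost anything; a targeted one encodes a promise the skill actually makes. A minimal illustration with hypothetical check functions, reusing the Tagline Writer example:

```python
import re

def has_letters(output: str) -> bool:
    # Weak assertion: nearly any output passes, so it measures nothing.
    return any(c.isalpha() for c in output)

def tagline_under_100_chars(output: str) -> bool:
    # Targeted assertion: checks an explicit promise the skill makes.
    tagline = re.sub(r"^\d+\.\s*", "", output.strip())
    return len(tagline) <= 100

good = "1. Coffee that works as hard as you do"
rambling = "Here are some general thoughts about taglines... " + "x" * 200

# The weak check passes both outputs; the targeted check separates them.
```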

The author's Eval Maker skill analyzes any SKILL.md, extracts its purpose, links it to best practices, and outputs an interactive HTML report:

  • Skill overview + quick fixes (e.g., define 'preserves meaning').
  • 3 test prompts: typical, minimal, stress (e.g., Tweet Trimmer: shorten tweets to <280 chars).
  • High-impact assertions with 'why it matters' (e.g., 'Key meaning preserved' prevents hallucination; 'Voice/tone match' retains casual register).
  • Copy-paste prompt feeds Skill Creator with evals.json.
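The handoff file for the Tweet Trimmer example might look like the sketch below. Field names and schema are illustrative assumptions, not Eval Maker's actual format:

```json
{
  "skill": "tweet-trimmer",
  "prompts": [
    {"type": "typical", "input": "Trim this tweet to fit: <long tweet text>"},
    {"type": "minimal", "input": "Shorten: <short tweet>"},
    {"type": "stress", "input": "Trim this 600-character draft: <text>"}
  ],
  "assertions": [
    {"name": "under_280_chars", "why": "the skill's core promise"},
    {"name": "key_meaning_preserved", "why": "prevents hallucinated rewrites"},
    {"name": "voice_tone_match", "why": "retains the casual register"}
  ]
}
```

Each assertion carries its 'why it matters' rationale, so the Grader's results stay traceable to the skill's purpose.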

Combo: Eval Maker defines metrics; Skill Creator measures. Setup takes minutes; paid bonus includes Claude Code Essentials pack with self-customizing skills.

Summarized by x-ai/grok-4.1-fast via openrouter


© 2026 Edge