Eval-Driven Skills: Boost Agent Performance on Supabase
Use eval-driven development to craft agent skills: define metrics first, structure skill.md for progressive disclosure, test via Braintrust evals on Supabase workflows, and iterate to fix failure modes like ignored skills or misleading instructions.
Agent Skills Structure for Progressive Disclosure
Agent skills are folders containing a required skill.md file plus optional reference files and scripts, designed to provide targeted context without bloating the agent's initial context window. The skill.md uses YAML frontmatter with name and description fields; these load first as an "envelope," enabling progressive disclosure: the agent decides when to fetch the full content based on need.
Inside skill.md, add instructions, workflows, or links to files in a reference/ folder (Markdown for docs, scripts like Bash/Python for actions). This forms a graph (reference files can link to other files), acting like a book's index linking to chapters. Scripts run locally (tied to your OS environment, e.g., Linux/Mac compatible), unlike remote MCP tools.
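For concreteness, a minimal layout might look like this (folder and file names are illustrative, not prescribed by the source):

```
department-stats-skill/
├── skill.md            # frontmatter envelope + concise instructions
└── reference/
    └── dept-stats.sql  # exact SQL template, loaded only on demand
```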
Key principle: Skills deliver custom info/workflows too verbose for MCP tool descriptions. Example structure:
```markdown
---
name: Department Stats Skill
description: Guides creating SQL views for dept salary averages and counts from the profiles table.
---

To compute department stats:

1. Query the `profiles` table.
2. GROUP BY `department`.
3. AVG(`salary`), COUNT(*).

Reference: [exact SQL template](./reference/dept-stats.sql)
```
Reference files are plain Markdown or scripts; for example, `reference/dept-stats.sql` holds the exact view definition (reconstructed below). This setup teaches agents precise patterns instead of hallucinated SQL.
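The template as a standalone file; the column aliases are an assumption added for readability, since the source gives the view unaliased:

```sql
-- reference/dept-stats.sql
-- Exact view template the skill links to.
CREATE OR REPLACE VIEW department_stats AS
SELECT
  department,
  AVG(salary) AS avg_salary,    -- alias assumed for readability
  COUNT(*)    AS employee_count -- alias assumed for readability
FROM profiles
GROUP BY department;
```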
Common mistake: overloading skill.md with content. Keep it concise; offload details to references. Bad pattern: vague descriptions like "DB tools" lead to ignored skills. Good: specific triggers, e.g., "Use when querying aggregates by department." The contrast is sketched below.
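A hypothetical frontmatter contrast (both descriptions invented to illustrate the pattern):

```yaml
# Bad: vague; gives the agent no trigger to load the skill.
name: DB Tools
description: Database tools.
---
# Good: names the table, the aggregates, and when to apply the skill.
name: Department Stats Skill
description: Use when querying salary averages or head counts grouped by
  department from the profiles table; links an exact SQL view template.
```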
Skills vs. MCP Tools: Complementary for Integrations
Skills ≠ MCP tools. MCP (Model Context Protocol) servers expose remote, environment-agnostic tools (e.g., Supabase's 20+ tools: list tables, execute SQL, apply migrations, run the DB advisor). The agent calls them directly; no local setup.
Skills augment with context: define workflows (e.g., "Always test views post-creation"), docs, or local scripts. Use MCP for integrations (it needs no bash access); skills for everything else.
Trade-offs:
| Aspect | Skills | MCP Tools |
|---|---|---|
| Env | Local (OS-specific) | Remote/server-side |
| Purpose | Context/workflows | Actions/tools |
| Loading | Progressive (frontmatter first) | Full desc in context |
In Supabase workflows, combine them: MCP for DB ops, skills for schema-specific guidance (a typical local MCP config is sketched below). Misconception: skills replace MCP. False; they stack, improving the agent developer experience.
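A sketch of the local config, assuming Supabase's published MCP server package; the project ref and access token are placeholders, and exact flags may vary by version:

```json
{
  "mcpServers": {
    "supabase": {
      "command": "npx",
      "args": [
        "-y",
        "@supabase/mcp-server-supabase@latest",
        "--project-ref=<your-project-ref>"
      ],
      "env": {
        "SUPABASE_ACCESS_TOKEN": "<your-personal-access-token>"
      }
    }
  }
}
```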
Pitfall: scripts fail across OSes (e.g., Bash scripts won't run on vanilla Windows). Solution: prefer MCP for portability; reserve scripts for local prototyping.
Eval-Driven Development: Define Metrics, Test, Iterate
Test skills like code: unit (manual runs), integration (evals), E2E (full workflows). With an LLM in the loop, use evals: nondeterministic tests that grade reasoning, tool use, and intermediate steps rather than exact output.
Adopt OpenAI's framework:
- Define metrics: What "good" means, e.g., "Correct SQL syntax (100%), Uses GROUP BY (90%), Calls apply_migration tool (80%)." Tailor to skill: Forwarding to docs? Workflow adherence?
- Build the skill: write skill.md, references, and scripts.
- Run evals: input (task prompt), expected (tools/steps/output). Use Braintrust for observability; it logs agent traces and scores metrics (pass/fail, LLM-as-judge). See the harness sketch after this list.
- Grade/Inspect: Check tool calls, reasoning. Nondeterministic? Run 10-50x, avg scores.
- Iterate: Tweak (e.g., add examples), re-run.
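A minimal harness sketch, assuming a hypothetical runAgent() wrapper around your agent; Eval(), data/task/scores, and trialCount follow Braintrust's TypeScript SDK, but verify option names against the current docs:

```typescript
import { Eval } from "braintrust";

// Hypothetical wrapper: prompt in, final SQL plus recorded tool calls out.
async function runAgent(prompt: string): Promise<{ sql: string; toolCalls: string[] }> {
  throw new Error("TODO: wire this to your agent (e.g., Claude + the Supabase MCP server)");
}

Eval("dept-stats-skill", {
  data: () => [
    {
      input: "Create a department_stats view: avg salary and count by department.",
      expected: "GROUP BY department",
    },
  ],
  task: (input) => runAgent(input),
  scores: [
    // Metric: the generated SQL aggregates by department (pass/fail).
    ({ output }) => ({
      name: "uses_group_by",
      score: /GROUP BY\s+department/i.test(output.sql) ? 1 : 0,
    }),
    // Metric: the agent applied the change via the apply_migration MCP tool.
    ({ output }) => ({
      name: "calls_apply_migration",
      score: output.toolCalls.includes("apply_migration") ? 1 : 0,
    }),
  ],
  trialCount: 10, // rerun each case to average over nondeterminism
});
```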
Braintrust setup: a platform for evals. Define scenarios (input/expected), run the agent, and visualize traces; think Datadog for agents. On a podcast, Braintrust's CEO stressed that evals give the full picture of agent behavior.
Manual testing baseline: prompt an agent (e.g., Claude) on the Supabase demo app: "Create department_stats view: avg salary, count by dept." Without the skill, the agent lists tables, crafts wrong SQL (e.g., joins the wrong table), and applies the migration; the view is created but buggy (it misses the salary average).
With the skill, the agent loads the skill and uses the exact template, producing a correct view. The app query then shows department breakdowns.
Quality criteria:
- Skill used? (Trace shows load).
- Performance delta: baseline 40% success → skill 85% (a quick way to compute this is sketched after this list).
- Holds under variants: bad instructions drop success to 20%; precise ones sustain it.
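Computing that delta is simple arithmetic; a sketch with placeholder run data:

```typescript
// Success rate over repeated runs of the same eval suite.
function successRate(results: boolean[]): number {
  return results.filter(Boolean).length / results.length;
}

// Placeholder outcomes; in practice, collect these from 20+ eval runs.
const baselineRuns = [false, true, false, false, true];
const skillRuns = [true, true, true, false, true];

const delta = successRate(skillRuns) - successRate(baselineRuns);
console.log(`delta: ${(delta * 100).toFixed(0)} points`); // larger is better
```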
Failure modes:
- Unused: vague description, so the agent never loads the skill.
- Misleading: instructions conflict with the MCP tool docs.
- Fragile: no examples, so it fails on edge cases.
Demo repo (hudripppn/improve-skills-workshop-aieurope): a Next.js app (performance reviews on Supabase Postgres), an MCP.json for the local server, and a seeded DB (employees/managers/HR). Setup: npx @supabase/create-supabase, then npm run dev. The eval harness comes at the end.
Exercise: clone the repo, baseline the agent on the reports view, add a skill, and run 20 evals via Braintrust; tune until 90%+.
Prerequisites: agent familiarity (Claude/Cursor) and Supabase basics (a Postgres BaaS: DB/auth/storage/edge functions). This fits mid-workflow: after agent prototyping, before production.
Key Takeaways
- Start every skill with precise frontmatter description triggering use—vague ones get ignored.
- Combine skills (context) + MCP (tools) for Supabase: Skills guide workflows, MCP executes.
- Eval-driven: Define 3-5 metrics upfront (e.g., tool calls, SQL correctness) before writing.
- Use Braintrust for traces: Run 20+ evals/iteration; aim for 80%+ delta over baseline.
- Test bad patterns: overloaded content, poor references; quantify the drops to validate fixes.
- Progressive disclosure principle: Frontmatter envelope + refs = scalable context.
- Local scripts? Prototype only—migrate to MCP for prod portability.
- Iterate cycle: Metrics → Skill → Evals → Grade → Repeat, like TDD for agents.
Notable Quotes:
- "Progressive disclosure is basically when the agent... loads the exact amounts of information that allows the agent to choose to load the rest... once it actually needs it." (Explaining skill.md design for context efficiency.)
- "Skills actually just provide more context to your agent... everything that you don't have space to define on the MCP tools descriptions you can define them on skills." (Clarifying skills' role vs. tools.)
- "You can basically do exactly the same as code testing. ... since we have an LLM in the loop, you'll have something called evaluations." (Mapping traditional testing to agent evals.)
- "The core loop of the workshop is simple: write a Skill, run evals, inspect results, and iterate." (From description; distills the method.)
- "If you're building anything that it's an integration, you should use MCP... skills actually just provide more context." (Practical usage rule.)