Distinguish Harness Evolution from In-Context Learning

Self-evolving agents split into two approaches. The first, auto-agent (e.g., AutoResearch, AutoAgent), improves the agent harness or the model itself via a for-loop: define the vision in PROGRAM.md, generate candidate improvements with Claude/Codex, evaluate against baseline tasks, keep the winners. This requires a task database and programmatic verification, and it outputs a frozen harness, much like fine-tuning. Without eval datasets it is less practical.
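
In pseudocode terms the loop is simple; everything below (helper names, scoring) is an illustrative assumption, since each auto-agent project wires this differently:

```python
# Sketch of the auto-agent improvement loop; every helper below is a
# hypothetical stand-in for whatever the concrete project provides.
import random
from pathlib import Path

def generate_improvement(vision: str) -> str:
    """Placeholder for a Claude/Codex call that proposes a harness patch."""
    return f"patch derived from: {vision[:40]}..."

def evaluate(candidate: str | None = None) -> float:
    """Placeholder for running the baseline task suite and returning a score."""
    return random.random()

def apply_patch(candidate: str) -> None:
    """Placeholder for merging a winning patch into the harness."""
    print("applied:", candidate)

def evolve(iterations: int = 10) -> None:
    vision = Path("PROGRAM.md").read_text()   # the fixed vision/spec
    best = evaluate()                         # score the current harness
    for _ in range(iterations):
        candidate = generate_improvement(vision)
        score = evaluate(candidate)           # programmatic verification
        if score > best:                      # keep only winners
            apply_patch(candidate)
            best = score
```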

The second, in-context learning (Claude Code, OpenClaw, Hermes Agent), grows runtime capability via three pillars: memory (facts and user preferences), skills (domain procedures), and history (conversation logs). Hot memory sits permanently in the system prompt (e.g., an index file under 4k characters total); warm memory loads on demand. Don't reach for full agentic loops by default; match the architecture to the use case: a single LLM call for predictability, chained workflows for determinism, a full agent for adaptability. HubSpot's cheat sheet compares LLM systems by architecture, strengths, and pitfalls.
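
A minimal sketch of the hot/warm split, assuming hypothetical file paths: the hot index is always inlined into the system prompt, warm files load only when the index says they are relevant:

```python
# Hot/warm memory split: the index is always in the prompt (hot),
# everything it points to loads on demand (warm). Paths are assumptions.
from pathlib import Path

HOT_LIMIT = 4_000  # keep the always-loaded index under ~4k characters

def build_system_prompt(base: str, index_path: str = ".claude.md") -> str:
    index = Path(index_path).read_text()
    assert len(index) <= HOT_LIMIT, "hot memory index too large"
    return f"{base}\n\n# Memory index (hot)\n{index}"

def load_warm(path: str) -> str:
    """Called by the agent only when the index flags the file as relevant."""
    return Path(path).read_text()
```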

Claude Code's 3-Layer Memory for Fact Retention

Claude Code's setup evolves from one bloated .claude.md (all hot memory) into a hot/warm split: an index in .claude.md points to files that load conditionally. Enable automemory and the agent auto-saves anything worth keeping (user facts, feedback, project references) to .claude/memory/, with memory.md as a table of contents that is auto-loaded into the prompt.
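
A sketch of what the automemory write path might look like; the layout (.claude/memory/ plus memory.md as TOC) follows the description above, the function itself is hypothetical:

```python
# Save a new memory file and register it in the auto-loaded TOC.
from pathlib import Path

MEMORY_DIR = Path(".claude/memory")

def save_memory(slug: str, content: str, summary: str) -> None:
    MEMORY_DIR.mkdir(parents=True, exist_ok=True)
    (MEMORY_DIR / f"{slug}.md").write_text(content)
    toc = MEMORY_DIR / "memory.md"   # the TOC auto-loaded into the prompt
    with toc.open("a") as f:
        f.write(f"- [{summary}]({slug}.md)\n")
```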

The limitation: this is prompt-reliant, memories go stale fast, and stale entries pollute the context. The fix is autodream, a background job run after each session: a fresh Claude session scans the memories and history, then consolidates them and updates the index asynchronously. The result is reliable fact memory with no explicit forgetting step. It still lacks proactive skills and history search.
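
Conceptually, autodream is just a post-session job; this sketch assumes a `claude -p` style headless invocation and an illustrative consolidation prompt:

```python
# Post-session consolidation: spawn a fresh headless session to scan
# memories + history and rewrite the index. Prompt text is illustrative.
import subprocess

CONSOLIDATE_PROMPT = (
    "Read every file in .claude/memory/ and the latest session history. "
    "Merge duplicates, drop outdated facts, and rewrite memory.md as a "
    "fresh table of contents."
)

def autodream() -> None:
    # Runs in the background after the session ends, so it never blocks.
    subprocess.Popen(["claude", "-p", CONSOLIDATE_PROMPT])
```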

OpenClaw & Hermes: Skills, Search, Autonomous Extraction

OpenClaw adds bootstrap.md (proactive extraction of user info), daily logs, a memory search tool (across files and history), and skill search/add via ClaudeHub. It feels smarter thanks to searchable history and on-demand skills, but it still depends on human prompting.
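
A memory search tool can start as plain substring search across memory files and daily logs; this sketch is an assumption about the shape, not OpenClaw's actual implementation:

```python
# Naive memory search across memory files and daily logs.
from pathlib import Path

SEARCH_ROOTS = [Path(".claude/memory"), Path("logs")]  # assumed layout

def search_memory(query: str) -> list[tuple[str, str]]:
    """Return (file, matching line) pairs for a case-insensitive query."""
    hits = []
    for root in SEARCH_ROOTS:
        for path in root.rglob("*.md"):
            for line in path.read_text().splitlines():
                if query.lower() in line.lower():
                    hits.append((str(path), line.strip()))
    return hits
```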

Hermes Agent automates the learning itself: after 10+ steps without touching a skill, a sub-agent reviews the trajectory for non-trivial trial and error and creates skills via a skill manager tool (add/patch/delete). A proactive prompt tells it to patch outdated skills immediately. A safety scan rejects dangerous skills via patterns in skill_guard.py.
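
A minimal sketch of what such a pattern-based guard might look like; the deny-list entries here are illustrative assumptions, not Hermes' actual rules:

```python
# skill_guard.py: sketch of a pattern-based safety scan for new skills.
import re

# Illustrative deny-list; the real patterns are an assumption here.
DENY_PATTERNS = [
    r"rm\s+-rf\s+/",          # destructive filesystem commands
    r"curl\s+[^|]+\|\s*sh",   # piping remote scripts into a shell
    r"(?i)api[_-]?key\s*=",   # hard-coded credentials
]

def scan_skill(skill_text: str) -> list[str]:
    """Return the deny-list patterns the skill text matches."""
    return [p for p in DENY_PATTERNS if re.search(p, skill_text)]

def guard(skill_text: str) -> None:
    hits = scan_skill(skill_text)
    if hits:
        raise ValueError(f"skill rejected, matched patterns: {hits}")
```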

For memory: user.md (preferences and habits) and memory.md (project facts) serve as hot memory (under 4k characters); skills are warm; a SQLite DB holds raw history for session search; an optional semantic layer (Mem0/Tahome) can sit on top. After 10 turns without an extraction, a memory reviewer sub-agent updates the hot files.
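
The raw-history layer needs nothing fancier than a single SQLite table with LIKE-based search (FTS5 if you want ranking); the schema below is an assumption:

```python
# Raw conversation history in SQLite, searchable across sessions.
import sqlite3

def init_db(path: str = "history.db") -> sqlite3.Connection:
    con = sqlite3.connect(path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS turns ("
        "  session_id TEXT, role TEXT, content TEXT,"
        "  ts DATETIME DEFAULT CURRENT_TIMESTAMP)"
    )
    return con

def log_turn(con: sqlite3.Connection, session_id: str, role: str, content: str) -> None:
    con.execute(
        "INSERT INTO turns (session_id, role, content) VALUES (?, ?, ?)",
        (session_id, role, content),
    )
    con.commit()

def search_history(con: sqlite3.Connection, query: str) -> list[tuple]:
    return con.execute(
        "SELECT session_id, role, content FROM turns WHERE content LIKE ?",
        (f"%{query}%",),
    ).fetchall()
```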

The outcome: an agent that autonomously improves its domain skills, facts, and history recall, outperforming setups that depend on the user to prompt learning.

State-of-the-Art Blueprint & Quick Wins

The core blueprint: a hot memory index, warm skills and docs, a searchable history DB, and async sub-agents that run every 10 turns/steps to extract and update. It plugs into OpenClaw or Claude Code: a self-improving skill uses a learnings/ folder (errors, features), hooks (user-prompt-submit, post-bash for error detection), and a self-improvement reminder.md in the prompt. Migration from OpenClaw is one command.
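
For the post-bash error hook, a pattern-match on command output is enough to start; the stdin contract and error markers below are assumptions, so adapt them to whatever payload your harness's hooks actually deliver:

```python
# post_bash_hook.py: sketch of a post-bash hook that captures failures
# into learnings/errors.md for the self-improving skill to review.
import sys
from datetime import datetime
from pathlib import Path

ERROR_MARKERS = ("Traceback", "error:", "command not found")  # illustrative

def main() -> None:
    output = sys.stdin.read()  # assumed: hook receives command output on stdin
    if any(marker in output for marker in ERROR_MARKERS):
        log = Path("learnings/errors.md")
        log.parent.mkdir(exist_ok=True)
        with log.open("a") as f:
            f.write(f"\n## {datetime.now():%Y-%m-%d %H:%M}\n{output[-500:]}\n")

if __name__ == "__main__":
    main()
```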

Build your harness this way: prioritize skills (procedures) over raw facts, and run learning asynchronously rather than in real time so it never blocks the session or slows it down. This is simpler than auto-agent loops and scales with usage.