The Dual-Lever Improvement Loop

Most AI agents run on a static scaffold (system prompts, tool-dispatch logic, and retry policies) and fixed model weights. Hexo Labs' SIA (Self-Improving AI) framework removes this limitation by treating both the scaffold, also called the harness, and the model weights as variables it can update. Three LLM components manage this process:

  • Meta-Agent: Generates the initial scaffold based on task specifications.
  • Task-Specific Agent: Executes the task and logs the full trajectory.
  • Feedback-Agent: Analyzes the trajectory to decide whether to iterate on the scaffold (external software engineering) or update the model weights (internal domain knowledge).

This lets the system interleave scaffold and weight improvements in any order rather than following a rigid, sequential training phase.
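
The control flow described above can be sketched as a small loop. This is a minimal illustration, not SIA's actual code: the names `Lever`, `State`, `improvement_loop`, `toy_run_task`, and `toy_choose_lever` are all hypothetical, and the three LLM agents are replaced by plain Python callables with made-up scoring rules.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Lever(Enum):
    HARNESS = "harness"  # revise the scaffold (prompts, tools, retries)
    WEIGHTS = "weights"  # update the model weights
    STOP = "stop"        # no further improvement attempted


@dataclass
class State:
    scaffold_version: int = 0
    weights_version: int = 0
    history: List[Lever] = field(default_factory=list)


def improvement_loop(
    run_task: Callable[[State], float],
    choose_lever: Callable[[State, float], Lever],
    max_iters: int = 10,
) -> State:
    """Interleave harness edits and weight updates in whatever order is chosen."""
    state = State()  # stands in for the Meta-Agent's initial scaffold
    for _ in range(max_iters):
        score = run_task(state)             # Task-Specific Agent rollout
        lever = choose_lever(state, score)  # Feedback-Agent decision
        if lever is Lever.STOP:
            break
        if lever is Lever.HARNESS:
            state.scaffold_version += 1
        else:
            state.weights_version += 1
        state.history.append(lever)
    return state


# Toy stand-ins (assumptions for illustration only).
def toy_run_task(state: State) -> float:
    return 0.1 * state.scaffold_version + 0.2 * state.weights_version


def toy_choose_lever(state: State, score: float) -> Lever:
    if score >= 0.5:
        return Lever.STOP
    # Hypothetical rule: pick whichever lever has been used less, harness first.
    if state.scaffold_version <= state.weights_version:
        return Lever.HARNESS
    return Lever.WEIGHTS


final = improvement_loop(toy_run_task, toy_choose_lever)
print(final.history)
```

The point of the sketch is that neither lever is tied to a fixed phase: the decision is made fresh after every trajectory, so harness and weight updates can alternate.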

Adaptive Training Strategies

The Feedback-Agent does not rely on a single reinforcement learning recipe. Instead, it selects an optimization method based on the reward signal and failure mode observed during the task:

  • PPO with GAE: Used for clean, outcome-based scalar rewards (e.g., LawBench).
  • Entropic Advantage Weighting: Used when tasks have high failure rates, such as compilation errors in CUDA kernel generation, by up-weighting rare successful rollouts.
  • GRPO: Used for tasks where the value network can be eliminated entirely (e.g., denoising), because advantages are computed relative to a group of sampled rollouts instead.
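
To make the three options more concrete, here is a rough sketch of the advantage signal each one works with. The GAE and group-relative formulas are the standard textbook forms. The source does not give a formula for Entropic Advantage Weighting, so `exponential_weights` is only one plausible reading (a softmax over rewards with an assumed temperature `beta`) and should not be taken as SIA's method. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def gae_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Generalized Advantage Estimation, as used with PPO.

    `values` must hold len(rewards) + 1 entries; the last is the bootstrap value.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def grpo_advantages(group_rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Group-relative advantages: normalize within the group, no value network."""
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in group_rewards) / n)
    return [(r - mean) / (std + eps) for r in group_rewards]


def exponential_weights(group_rewards: Sequence[float], beta: float = 1.0) -> List[float]:
    """ASSUMED stand-in for entropic weighting: softmax of rewards / beta.

    A smaller beta concentrates more weight on the rare high-reward rollouts.
    """
    top = max(group_rewards)  # subtract the max for numerical stability
    exps = [math.exp((r - top) / beta) for r in group_rewards]
    total = sum(exps)
    return [e / total for e in exps]


# One success among four rollouts, e.g. a single kernel that compiled.
sparse_rewards = [1.0, 0.0, 0.0, 0.0]
print(grpo_advantages(sparse_rewards))
print(exponential_weights(sparse_rewards, beta=0.5))
```

With mostly failing rollouts, the exponential weights put most of the mass on the lone success, which is the up-weighting behavior the list describes for high-failure tasks.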

Performance Gains

Benchmarks covered three domains: LawBench (classification), AlphaEvolve TriMul (CUDA kernel optimization), and RNA denoising. Across them, combining harness and weight updates (SIA-W+H) consistently outperformed harness-only iteration (SIA-H).

For example, on the TriMul task, harness-only updates achieved a 1.14x speedup, while adding weight updates raised the speedup to 14.02x. On LawBench, weight updates added 20.1 percentage points of accuracy over the harness-only plateau. The framework is open source under the MIT license and bundles tasks such as GPQA, LawBench, and LongCOT-Chess.
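
As a quick check on what the TriMul figures imply, the snippet below computes how much faster the combined result is than the harness-only one. It assumes both numbers are speedups measured against the same baseline, which the text suggests but does not state outright; `relative_speedup` is a hypothetical helper.

```python
def relative_speedup(combined: float, harness_only: float) -> float:
    """Ratio of two speedups that share a common baseline."""
    return combined / harness_only


# Figures quoted above for TriMul: 14.02x (SIA-W+H) vs. 1.14x (SIA-H).
ratio = relative_speedup(14.02, 1.14)
print(f"{ratio:.1f}x")  # prints "12.3x"
```

Under that assumption, the weight updates account for roughly a 12.3x gain on top of what harness edits alone achieved.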