Stateful Cognitive Offloading

Most search agents struggle because they manage both high-level search strategy and low-level bookkeeping (evidence tracking, deduplication, and state maintenance) within a single, growing transcript. Harness-1, a 20B-parameter model built on gpt-oss-20b, addresses this with "stateful cognitive offloading": the model acts as a policy that makes semantic decisions, while an external state-machine harness handles working memory and routine operations.
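The division of labor can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the run_episode function, the tool signatures, the "stop" action, and the idea that the policy sees only a compact state summary are hypothetical and are not taken from the Harness-1 implementation.

```python
from typing import Callable, Dict


def fan_out_search(state: dict, queries: list) -> None:
    """Stub search tool (hypothetical): one fake document per query."""
    state["candidates"].update(f"{q}-doc" for q in queries)


def curate(state: dict, doc_id: str) -> None:
    """Stub curate tool (hypothetical): promote a known candidate."""
    if doc_id in state["candidates"]:
        state["curated"].add(doc_id)


def run_episode(policy: Callable[[str], dict],
                tools: Dict[str, Callable],
                max_steps: int = 5) -> dict:
    """The policy picks actions; the harness owns and mutates the state."""
    state = {"candidates": set(), "curated": set()}
    for _ in range(max_steps):
        # Assumption: the policy is shown a short summary, not raw bookkeeping.
        summary = (f"candidates={len(state['candidates'])} "
                   f"curated={len(state['curated'])}")
        action = policy(summary)          # semantic decision by the model
        if action["tool"] == "stop":
            break
        # Routine state updates happen inside the harness-side tool.
        tools[action["tool"]](state, **action.get("args", {}))
    return state
```

In this sketch the policy never touches the sets directly; it only names a tool and its arguments, which is the separation the paragraph above describes.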

The Harness Architecture

The harness maintains a recoverable state that includes:

  • Candidate Pool: A deduplicated set of retrieved documents.
  • Curated Set: A final output set (capped at 30 documents) tagged by importance (very_high, high, fair, low).
  • Evidence Graph: A structure that uses regex extraction to identify frequent entities and bridge documents, helping the agent identify follow-up leads.
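The three state components listed above might be sketched as follows. This is a simplified illustration under stated assumptions, not the actual harness: the HarnessState class, its method names, and the capitalized-phrase regex are hypothetical, and the evidence graph is reduced to counting entities that recur across documents (bridge-document detection is not modeled).

```python
import re
from collections import Counter
from dataclasses import dataclass, field

IMPORTANCE_LEVELS = ("very_high", "high", "fair", "low")
CURATED_CAP = 30  # cap stated in the text

# Hypothetical entity pattern: runs of capitalized words.
ENTITY_RE = re.compile(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\b")


@dataclass
class HarnessState:
    candidates: dict = field(default_factory=dict)  # doc_id -> text
    curated: dict = field(default_factory=dict)     # doc_id -> importance tag

    def add_candidates(self, docs):
        """Add (doc_id, text) pairs, skipping ids already in the pool."""
        added = 0
        for doc_id, text in docs:
            if doc_id not in self.candidates:
                self.candidates[doc_id] = text
                added += 1
        return added

    def curate(self, doc_id, importance):
        """Tag a candidate; refuse new entries once the cap is reached."""
        if importance not in IMPORTANCE_LEVELS:
            raise ValueError(f"unknown importance: {importance}")
        if doc_id not in self.candidates:
            raise KeyError(doc_id)
        if doc_id not in self.curated and len(self.curated) >= CURATED_CAP:
            return False
        self.curated[doc_id] = importance
        return True

    def frequent_entities(self, min_docs=2):
        """Entities that appear in at least min_docs candidate documents."""
        doc_counts = Counter()
        for text in self.candidates.values():
            doc_counts.update(set(ENTITY_RE.findall(text)))
        return {e: n for e, n in doc_counts.items() if n >= min_docs}
```

Entities shared by several documents are natural follow-up leads, which is the role the text assigns to the evidence graph.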

By offloading this state, the agent avoids the performance degradation associated with managing complex bookkeeping inside the prompt. The harness exposes eight tools to the model, including fan_out_search, curate, and verify. A key design choice is the "warm-start" mechanism: the first successful search auto-seeds the curated set, which shifts the agent's task from building a result set from scratch to iteratively refining one.
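A warm-start step could look roughly like the sketch below. The warm_start function is hypothetical, and two details are assumptions that the text does not specify: a "successful" search is taken to mean one that returns at least one document, and seeded documents receive a default "fair" tag.

```python
def warm_start(curated, search_results, default_tag="fair", cap=30):
    """Seed an empty curated set from the first non-empty search result.

    curated: dict mapping doc_id -> importance tag (mutated in place).
    search_results: list of doc ids in ranked order.
    Returns True if seeding happened, False otherwise.
    """
    if curated or not search_results:
        return False  # already seeded, or the search returned nothing
    for doc_id in search_results[:cap]:
        # setdefault keeps the first tag if a doc id repeats in the results.
        curated.setdefault(doc_id, default_tag)
    return True
```

After seeding, later calls are no-ops, so every subsequent step operates on an existing set and can only refine it.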

Training and Performance

Harness-1 uses a two-stage training process:

  1. Supervised Fine-Tuning (SFT): Teaches the model how to operate the harness interface using 899 trajectories generated by a GPT-5.4 teacher.
  2. Reinforcement Learning (RL): Uses on-policy CISPO with a terminal-only reward to optimize search decisions. A critical addition is a "tool-diversity bonus," which prevents the agent from collapsing into repetitive search patterns; without it, curated recall plateaus at a noticeably lower level.
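One way a terminal-only reward with a tool-diversity bonus might be shaped is sketched below. The exact formula is not given in the text, so this is an assumption throughout: curated recall is taken to be the fraction of relevant documents present in the curated set, the bonus is the fraction of the eight tools used at least once, and bonus_weight is a hypothetical hyperparameter.

```python
def terminal_reward(curated_ids, relevant_ids, tool_calls,
                    num_tools=8, bonus_weight=0.1):
    """Reward computed once, at the end of an episode.

    curated_ids: doc ids in the final curated set.
    relevant_ids: ground-truth relevant doc ids.
    tool_calls: names of the tools invoked during the episode, in order.
    """
    relevant = set(relevant_ids)
    # Assumed definition of curated recall.
    recall = len(set(curated_ids) & relevant) / len(relevant) if relevant else 0.0
    # Assumed bonus: share of distinct tools used out of all available tools.
    diversity = len(set(tool_calls)) / num_tools
    return recall + bonus_weight * diversity
```

Under this shaping, two episodes with equal recall are ranked by how varied their tool use was, so repeating a single search tool earns less than mixing search, curation, and verification.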

On benchmarks spanning web, finance, and patent data, Harness-1 achieved an average curated recall of 0.730, outperforming other open models and trailing only frontier-scale models such as Opus-4.6. The model also generalized well: its gain on held-out benchmarks was 2.2x larger than its gain on tasks similar to its training data.