Stateful Cognitive Offloading
Most search agents struggle because they attempt to manage both high-level search strategy and low-level bookkeeping (tracking evidence, deduplication, and state maintenance) within a single, growing transcript. Harness-1, a 20B model built on gpt-oss-20b, addresses this by implementing "stateful cognitive offloading." The model acts as a policy that makes semantic decisions, while an external state-machine harness manages the "working memory" and routine operations.
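The split between policy and harness can be illustrated with a minimal control loop. This is only a sketch under stated assumptions, not Harness-1's actual code: the `run_episode` function, the dictionary-based state, the compact `view` passed to the policy, and the `"stop"` action are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, Tuple

Action = Tuple[str, dict]  # (tool name, keyword arguments)


def run_episode(policy: Callable[[dict], Action],
                tools: Dict[str, Callable[..., None]],
                state: dict,
                max_steps: int = 8) -> dict:
    """Alternate between policy decisions and harness-side state updates."""
    for _ in range(max_steps):
        # The policy sees a compact view of the state, not a growing transcript.
        view = {"n_candidates": len(state["candidates"]),
                "n_curated": len(state["curated"])}
        name, args = policy(view)
        if name == "stop":  # hypothetical terminating action
            break
        # The harness, not the model, mutates the working memory.
        tools[name](state, **args)
    return state
```

The point of the design is that the policy only chooses the next action; all bookkeeping happens inside the tool functions that the harness owns.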
The Harness Architecture
The harness maintains a recoverable state that includes:
- Candidate Pool: A deduplicated set of retrieved documents.
- Curated Set: The final output set, capped at 30 documents, with each document tagged by importance (very_high, high, fair, low).
- Evidence Graph: A structure that uses regex extraction to identify frequent entities and bridge documents, helping the agent identify follow-up leads.
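The three components above could be held in a single state object along the following lines. This is a minimal sketch, not the harness's real data model: the class name `HarnessState`, its method names, the capitalized-word regex used as a stand-in for entity extraction, and the rule that an entity "bridges" documents when it appears in at least two of them are all assumptions made for illustration. Only the 30-document cap and the four importance levels come from the description above.

```python
import re
from collections import Counter
from dataclasses import dataclass, field

IMPORTANCE_LEVELS = ("very_high", "high", "fair", "low")
CURATED_CAP = 30

# Assumed entity pattern: runs of capitalized words (a crude placeholder).
ENTITY_RE = re.compile(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\b")


@dataclass
class HarnessState:
    candidates: dict = field(default_factory=dict)  # doc_id -> text
    curated: dict = field(default_factory=dict)     # doc_id -> importance

    def add_candidates(self, docs):
        """Add (doc_id, text) pairs, skipping ids already in the pool."""
        added = 0
        for doc_id, text in docs:
            if doc_id not in self.candidates:
                self.candidates[doc_id] = text
                added += 1
        return added

    def curate(self, doc_id, importance):
        """Tag a candidate document; refuse new entries once the cap is hit."""
        if importance not in IMPORTANCE_LEVELS:
            raise ValueError(f"unknown importance: {importance}")
        if doc_id not in self.candidates:
            raise KeyError(doc_id)
        if doc_id not in self.curated and len(self.curated) >= CURATED_CAP:
            return False
        self.curated[doc_id] = importance
        return True

    def entity_doc_counts(self):
        """Count, for each extracted entity, how many documents mention it."""
        counts = Counter()
        for text in self.candidates.values():
            counts.update(set(ENTITY_RE.findall(text)))
        return counts

    def bridge_entities(self, min_docs=2):
        """Entities shared by at least min_docs documents (assumed rule)."""
        return {e for e, n in self.entity_doc_counts().items() if n >= min_docs}
```

Keeping deduplication and the cap inside the state object means the model never has to track them in its own context.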
By offloading this state, the agent avoids the performance degradation associated with managing complex bookkeeping inside the prompt. The harness provides eight tools to the model, including fan_out_search, curate, and verify. A key design choice is the "warm-start" mechanism, where the first successful search auto-seeds the curated set, shifting the agent's task from building from scratch to iterative refinement.
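The warm-start idea can be sketched as a small seeding step. The details here are assumptions, since the text above only says that the first successful search auto-seeds the curated set: the function name `warm_start`, the default seed importance of `"fair"`, and the seed size of 10 are hypothetical choices.

```python
CURATED_CAP = 30


def warm_start(curated, results, seed_importance="fair", seed_size=10):
    """Seed an empty curated set from the first non-empty search result.

    curated: dict mapping doc_id -> importance (mutated in place).
    results: ranked list of doc_ids returned by a search.
    Returns True if seeding happened, False otherwise.
    """
    # Only the first successful search seeds; later searches leave it alone.
    if curated or not results:
        return False
    for doc_id in results[:min(seed_size, CURATED_CAP)]:
        curated.setdefault(doc_id, seed_importance)
    return True
```

After this step the agent starts from a non-empty curated set and spends its remaining steps promoting, demoting, or replacing entries.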
Training and Performance
Harness-1 uses a two-stage training process:
- Supervised Fine-Tuning (SFT): Teaches the model how to operate the harness interface using 899 trajectories generated by a GPT-5.4 teacher.
- Reinforcement Learning (RL): Uses on-policy CISPO with a terminal-only reward to optimize search decisions. A critical addition is a "tool-diversity bonus," which prevents the agent from collapsing into repetitive search patterns; without it, curated recall plateaus at a markedly lower level.
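One way a tool-diversity bonus might be combined with a terminal-only reward is shown below. This illustrates reward shaping only, not CISPO itself, and the exact form of the bonus is an assumption: the fraction of distinct tools used, the weight of 0.1, and the function name `shaped_reward` are hypothetical. The tool list contains only the three tools named earlier; the remaining five are not named in this summary.

```python
NAMED_TOOLS = ("fan_out_search", "curate", "verify")


def shaped_reward(terminal_reward, tool_calls, all_tools=NAMED_TOOLS,
                  bonus_weight=0.1):
    """Terminal reward plus a bonus for using a wider range of tools.

    tool_calls: sequence of tool names invoked during the episode.
    The bonus is the share of available tools used at least once
    (an assumed definition), scaled by bonus_weight.
    """
    if not all_tools:
        raise ValueError("all_tools must be non-empty")
    distinct = set(tool_calls) & set(all_tools)
    diversity = len(distinct) / len(all_tools)
    return terminal_reward + bonus_weight * diversity
```

Under this shaping, an episode that repeats one search tool earns less than one that reaches the same terminal reward while also curating and verifying, which is the behavior the bonus is meant to encourage.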
On benchmarks spanning web, finance, and patent data, Harness-1 achieved an average curated recall of 0.730, outperforming other open models and trailing only frontier-scale models such as Opus-4.6. The model also generalized well: its gain on held-out benchmarks was 2.2x larger than its gain on benchmarks resembling its training data.
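For readers unfamiliar with the metric, curated recall can be read as the share of relevant documents that end up in the curated set. The sketch below assumes that definition and an unweighted mean across benchmarks; neither is spelled out in this summary, and the function names are hypothetical.

```python
def curated_recall(curated_ids, relevant_ids):
    """Fraction of relevant documents present in the curated set."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must be non-empty")
    return len(set(curated_ids) & relevant) / len(relevant)


def mean_recall(per_benchmark):
    """Unweighted average of per-benchmark recall values (assumed)."""
    if not per_benchmark:
        raise ValueError("per_benchmark must be non-empty")
    return sum(per_benchmark.values()) / len(per_benchmark)
```

For example, a curated set containing two of four relevant documents would score 0.5 under this definition, regardless of how many irrelevant documents it also holds.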