Recursive Self-Improvement Builds Task-Specific Harnesses

Poetiq’s Meta-System creates optimized inference harnesses (the layers that handle prompting, output structuring, multi-call assembly, and evaluation) without fine-tuning base models or touching their internals; it relies only on standard API calls. For LiveCodeBench Pro (LCB Pro), a contamination-resistant C++ coding benchmark that scores accuracy, runtime, and memory on Easy/Medium/Hard problems drawn from programming competitions, the system starts with Gemini 3.1 Pro and recursively refines its strategies: better question chains, answer assembly, and constraint handling. It builds on prior optimizations for ARC-AGI (reasoning) and Humanity’s Last Exam (retrieval), demonstrating self-improvement across LLM task categories. The resulting harness is model-agnostic and applies unchanged to other providers' models, open-weights or proprietary, meeting three goals at once: efficacy gains without fine-tuning, automatic harness creation, and cross-model portability.
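
To make the harness concept concrete, here is a minimal Python sketch of such a model-agnostic design under one assumption from the text: any provider is reduced to a plain text-completion callable, and the harness layers sit on top of it. Every name here (CompleteFn, Harness, build_prompt, extract_code, solve) is a hypothetical illustration, not Poetiq's published interface, and the prompt text is a stand-in for the strategies the meta-optimizer actually discovers.

```python
from dataclasses import dataclass
from typing import Callable, List

FENCE = "`" * 3  # triple backtick, built here to keep this listing clean

# Any provider is reduced to one function: prompt in, completion out.
# Adapting a new model means wrapping its API call in this signature;
# the harness never touches model weights or internals.
CompleteFn = Callable[[str], str]

@dataclass
class Candidate:
    code: str
    passed: bool

class Harness:
    """Model-agnostic inference harness: prompting, output structuring,
    multi-call assembly, and evaluation, with no fine-tuning."""

    def __init__(self, complete: CompleteFn, n_candidates: int = 3):
        self.complete = complete
        self.n_candidates = n_candidates

    def build_prompt(self, problem: str) -> str:
        # Prompting layer: the strategy text found by meta-optimization
        # would live here; this template is only a placeholder.
        return (
            "Solve this competitive-programming problem in C++17.\n"
            "Respect the stated time and memory limits.\n"
            "Return only a fenced code block.\n\n" + problem
        )

    def extract_code(self, response: str) -> str:
        # Output-structuring layer: pull the C++ source out of the reply.
        if FENCE in response:
            body = response.split(FENCE, 2)[1]
            return body.partition("\n")[2] if body.startswith("cpp") else body
        return response

    def solve(self, problem: str, check: Callable[[str], bool]) -> str:
        # Multi-call assembly: sample several candidates and return the
        # first one the local evaluation layer (e.g. sample tests) accepts.
        candidates: List[Candidate] = []
        for _ in range(self.n_candidates):
            code = self.extract_code(self.complete(self.build_prompt(problem)))
            candidate = Candidate(code, check(code))
            if candidate.passed:
                return candidate.code
            candidates.append(candidate)
        return candidates[0].code  # fall back to the first attempt
```

Because the harness depends only on the CompleteFn signature, swapping Gemini for GPT, Kimi, or an open-weights model would be a one-line adapter change, which is the portability property the section describes.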

Universal Gains Across Models and Difficulty Tiers

Once optimized with Gemini 3.1 Pro (78.6% → 90.9% overall; Hard: 7.7% → 58.3%), the harness improved every LLM tested on LCB Pro (25Q2 leaderboard). GPT 5.5 High hit 93.9% (up 4.3 points from 89.6%; Hard: 50.0% → 75.0%). Smaller models shone: Gemini 3.0 Flash rose 10 points (72.3% → 82.3%), beating the larger Claude Opus 4.7 as well as baseline Gemini 3.1 Pro and GPT 5.2 High. Kimi K2.6 jumped roughly 30 points (50.0% → 79.9%), and Nemotron 3 Super 120B gained 12.8 points. All difficulty tiers benefited, with the largest gaps closing on Hard problems, where procedural logic and constraint handling matter most. Poetiq cross-validated its baselines against the official leaderboard scores at livecodebenchpro.com.

Why Harnesses Outperform Baselines Without Model Changes

Hand-built harnesses demand sustained engineering effort; Poetiq automates that work via meta-optimization, turning coding (a blend of reasoning, retrieval, and logic generation) into a harness-driven strength. The gains stem from tailored orchestration: sequential prompting that pushes complex C++ solutions to meet runtime and memory limits, not merely to produce correct output. This lets economical models punch above their weight, as seen previously on ARC-AGI, and scales to commercial coding applications without provider-specific tweaks.
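
As a rough illustration of that orchestration pattern, the sketch below shows what sequential, constraint-aware prompting could look like: each round runs the candidate through a local judge and feeds the specific failure (time limit, memory limit, or wrong answer) back into the next call. The refine function, the Judge contract, and the feedback wording are assumptions for illustration only; Poetiq has not published its harness internals.

```python
from typing import Callable, Optional

# Hypothetical judge contract: run the candidate against sample tests
# under the problem's time/memory limits and return a short diagnostic
# on failure (e.g. "TLE on test 3", "MLE", "wrong answer on test 1"),
# or None when every sample test passes within limits.
Judge = Callable[[str], Optional[str]]

def refine(complete: Callable[[str], str],
           problem: str,
           judge: Judge,
           max_rounds: int = 4) -> str:
    """Sequential prompting: later calls target the specific constraint
    that failed rather than regenerating a solution from scratch."""
    code = complete(
        "Write a C++17 solution that meets the time and memory limits.\n\n"
        + problem
    )
    for _ in range(max_rounds):
        diagnostic = judge(code)
        if diagnostic is None:
            break  # passes all sample tests within the limits
        # Constraint-handling layer: turn the failure into a focused follow-up.
        code = complete(
            "Your previous solution failed with: " + diagnostic + "\n"
            "Revise it to fix this issue. If the failure was a time or\n"
            "memory limit, change the algorithm or data structures, not\n"
            "just the constants.\n\nProblem:\n" + problem
            + "\n\nPrevious code:\n" + code
        )
    return code
```

Distinguishing a time-limit failure from a wrong answer is the design point: on Hard problems a correct-but-slow solution still scores zero, so a loop that reacts to the specific failure mode plausibly explains why the largest gains appear in that tier.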