Token Consumption Drives Unsustainable Costs

Output tokens cost roughly 5-6x more than input tokens, making them the core of inference expenses. And although per-token prices have dropped about 10x, overall consumption has risen more than 100x, so total spending is up roughly 10x: for example, $50 per 1M tokens at 100M tokens/month ($5k) becomes $5 per 1M tokens at 10B tokens/month ($50k). Complex tasks, non-deterministic agent behavior (loops, hallucinations), and 'tokenmaxxing' (spending more tokens for perceived quality gains) all amplify consumption. Multi-agent systems improve reliability but burn far more tokens than single agents, and performance does not scale linearly with spend: 10x the tokens does not buy 10x the quality. Even Anthropic added caps to its $200/month Max plan after users racked up $5k token bills, a sign that even well-resourced teams struggle to contain this. Without harness-level optimizations that curate context, tools, and models, costs outpace revenue at scale.
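The price-versus-consumption arithmetic above can be checked with a few lines (figures taken directly from the example in the text):

```python
def monthly_cost(price_per_million: float, tokens_per_month: int) -> float:
    """Monthly spend in dollars for a given per-million-token price."""
    return price_per_million * tokens_per_month / 1_000_000

# Prices fell 10x while consumption rose 100x: net spend rises 10x.
before = monthly_cost(50.0, 100_000_000)    # $50/1M tokens x 100M tokens = $5,000
after = monthly_cost(5.0, 10_000_000_000)   # $5/1M tokens x 10B tokens = $50,000

print(before, after, after / before)  # 5000.0 50000.0 10.0
```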

Harness Tactics Slash Usage Without Sacrificing Results

Minimize context by trimming irrelevant details and using memory systems to forget information that is no longer needed; a slim context window improves focus, performance, and speed. Expose only the tools a task needs, so the model doesn't make wrong calls that bloat the context with large outputs (e.g., a full dataset). Route requests statically via evals (Task A always goes to cheap Model A) or dynamically with a router that judges complexity (e.g., 'summarize' goes to a small model), avoiding overkill on easy tasks. In multi-agent setups, an orchestrator delegates to specialized subagents, each with a tailored harness (prompt, tools, model), enabling parallelism for faster responses and a smaller context per agent.
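A dynamic router can be as simple as a heuristic that inspects the request before choosing a model. The sketch below is illustrative only: the model names and the verb list are hypothetical placeholders, and a production router would typically use an eval-calibrated classifier rather than keywords.

```python
CHEAP_MODEL = "small-model"      # hypothetical names: substitute your provider's models
STRONG_MODEL = "frontier-model"

SIMPLE_VERBS = {"summarize", "translate", "classify", "extract"}

def route(task: str) -> str:
    """Crude complexity heuristic: short requests led by a 'simple' verb
    go to the cheap model; everything else goes to the strong one."""
    words = task.lower().split()
    if words and words[0] in SIMPLE_VERBS and len(words) < 50:
        return CHEAP_MODEL
    return STRONG_MODEL

print(route("summarize this meeting transcript"))                 # small-model
print(route("design a migration plan for our sharded database"))  # frontier-model
```

The payoff is that easy, high-volume tasks stop paying frontier-model prices, while complex tasks still get the capability they need.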

Use provider-side caching, which bills repeated input/output tokens at lower rates for similar requests; it is ideal for static or repetitive workloads like repeated analyses, cutting both regeneration costs and latency. Set 'thinking budgets' (e.g., Gemini's thinking_budget, Claude's budget_tokens) to match task complexity, and add output limits (e.g., 10k characters) or stop sequences. Batch non-urgent requests via OpenAI's Batch API, which is 50% cheaper. Offloading work to server-side compute (e.g., DB queries for filtering) replaces model reasoning with deterministic operations, saving hundreds or thousands of tokens per request.
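For exact repeats, the same idea can also be applied client-side: memoize whole completions so a duplicate request never reaches the provider at all. This is a minimal sketch, not provider prompt caching; call_model is a stand-in for whatever SDK call you use.

```python
import hashlib

_cache: dict[str, str] = {}

def cached_complete(prompt: str, call_model) -> str:
    """Memoize completions keyed by a hash of the prompt, so identical
    repeated analyses skip regeneration (and its cost/latency) entirely."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(prompt)
    return _cache[key]

# Demo with a stub model that records how often it is actually called.
calls = []
def fake_model(prompt: str) -> str:
    calls.append(prompt)
    return f"summary of: {prompt}"

cached_complete("analyze Q3 revenue", fake_model)
result = cached_complete("analyze Q3 revenue", fake_model)
print(len(calls))  # 1 -- the second request was served from the cache
```

Client-side memoization only helps for byte-identical prompts; provider-side caching additionally discounts shared prefixes across requests that differ in their tail.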

Trace Monitoring Reveals Optimization Paths

Track per-interaction traces with LangSmith or Langfuse for token, latency, and tool-call insights, spotting high-burn steps that are candidates for rerouting or agent specialization. Granular data flags issues like overthinking and reveals when self-hosting becomes worthwhile: hybrid setups route heavy workflows, especially sensitive ones, to owned or open-source infrastructure for long-term savings. No single tactic yields a huge win on its own, but together they compound into a sustainable cost-performance balance as usage scales.
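Once traces are exported, finding high-burn steps is a simple aggregation. This sketch assumes spans reduced to a step name plus token counts (the field names here are illustrative, not the exact LangSmith/Langfuse schema):

```python
from collections import defaultdict

def token_hotspots(spans: list[dict], top_n: int = 3) -> list[tuple[str, int]]:
    """Sum input+output tokens per step and rank the biggest burners,
    i.e., the first candidates for rerouting or specialization."""
    totals: dict[str, int] = defaultdict(int)
    for span in spans:
        totals[span["step"]] += span["input_tokens"] + span["output_tokens"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

spans = [
    {"step": "plan", "input_tokens": 1200, "output_tokens": 300},
    {"step": "search_tool", "input_tokens": 9000, "output_tokens": 150},
    {"step": "plan", "input_tokens": 800, "output_tokens": 250},
]
print(token_hotspots(spans))  # [('search_tool', 9150), ('plan', 2550)]
```

Here the tool call dominates the bill, which suggests trimming its output before it re-enters the context rather than swapping models.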