Claude Code's Modular Engine Design Enables Free Swaps
Claude Code acts as an agentic harness, a 'car' framework that orchestrates folder organization, tool usage, planning, and project building, while the LLM is the swappable 'engine.' Default engines (Opus, Sonnet, Haiku) incur Anthropic API costs for tokens and context. Swap them for open-source engines via local hosting or free cloud proxies to eliminate ongoing fees. An initial $5 in Anthropic credits is needed for onboarding but goes unused afterward, as requests route to free models. This complies with Anthropic's terms, since only their harness is used.
Open-source models can be downloaded and modified, unlike closed-source ones (Sonnet, o1, Gemini), which are accessible only via paid APIs. Benchmarks like SWE-Bench show the performance gap closing: top open-weight models (e.g., Qwen2.5, Gemma2) outperform Sonnet 3.5 and rival smaller closed models, especially for coding. Google's Gemma2 delivers high ELO scores at minimal size (e.g., 9B parameters, 6.6GB), ideal for local runs on modest hardware.
'Claude Code is the car and the chat model, the AI model is the engine... we basically just open up the hood and we switch out the engine.'
Selecting Open-Source Models by Hardware and Task
Match models to your RAM and CPU/GPU: smaller models (3B-9B params, 2-7GB) for laptops; larger ones for desktops/servers. Use OpenRouter's programming rankings or Ollama's library for benchmarks against closed models. Prioritize 'tools' and 'thinking' badges for agentic compatibility; check context windows (aim for 64k+ for Claude Code prompts) and quantization (q6/q8 for a speed/accuracy balance).
Common pitfalls: models untrained on Claude's tools may mishandle JSON protocols or tool calls; small contexts overflow on project scans; local runs are slow without a GPU. Test with `ollama run <modelname>` and a short chat, as in the check below. Ask Claude Code: 'My hardware: <specs>. Recommend Ollama model sizes.' Gemma2 and Qwen2.5 offer high ELO at small sizes; MiniMax and Mistral suit cloud use.
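A quick pre-flight check, assuming a Linux/macOS shell with Ollama already installed; the model tag is one example from the Ollama library, so substitute whatever fits your hardware:

```sh
# See what the machine can carry before picking a model size
free -h        # available RAM (Linux; use Activity Monitor on macOS)
nvidia-smi     # GPU VRAM, if an NVIDIA card is present

# Inspect a candidate's context length, parameter count, and quantization
ollama show qwen2.5:7b-instruct-q6_K

# Interactive smoke test: ask something small and watch the response
ollama run qwen2.5:7b-instruct-q6_K
```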
Quality criteria: visible step-by-step tool calls (read/write/edit); coherent multi-step plans; handling of 10k+ token projects without hallucination. Before the context tweak: opaque spinning and misspellings (e.g., a 'Quen' file). After: full visibility and accurate file creation, joke included.
'There's always been a gap between the performance of closed source models and the performance of open source models. But that gap is just shrinking and shrinking.'
Local Ollama Setup: Private, Unlimited Runs on Your Machine
- Download Ollama from ollama.com for your OS (Windows/Mac/Linux); install and launch.
- In VS Code terminal (or system terminal), pull a model: `ollama pull qwen2.5:7b-instruct-q6_K` (6.6GB; adjust for hardware: 3B for low RAM).
- Test: `ollama run qwen2.5:7b-instruct-q6_K`, then chat 'hi' to confirm a reasoning response.
- Increase context if needed: create a 64k variant (num_ctx=65536) with `ollama create` and a Modelfile; see the sketch after this list, and ask Claude for the OS-specific command.
- Launch Claude Code: in the Ollama app, copy `ollama launch claude --model qwen2.5-64k`; paste it into the VS Code terminal and select the model when prompted.
- Onboard Claude Code once (dark mode, API key → authorize Anthropic, buy the $5 credits). Switch the model in settings.
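A minimal sketch of the context tweak, using Ollama's documented Modelfile syntax. The derived tag `qwen2.5-64k` is illustrative, and the heredoc assumes a Unix shell (on Windows, create the Modelfile in an editor instead):

```sh
# Write a Modelfile deriving a 64k-context variant of the pulled model
cat > Modelfile <<'EOF'
FROM qwen2.5:7b-instruct-q6_K
PARAMETER num_ctx 65536
EOF

ollama create qwen2.5-64k -f Modelfile
ollama run qwen2.5-64k    # quick smoke test before wiring it into Claude Code
```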
Result: fully local, private execution. Test: 'Analyze my project' → scans folders; 'Create root file quen.txt with a joke' → writes the file accurately, with visible tool calls. Slower (about 4 minutes per project scan on this model) but zero cost and no dependence on Anthropic's API.
For Ollama cloud models (no download): `ollama run mistral-small` (free tier is rate-limited; upgrade for concurrency).
'This is completely free because the model is running right down there on my desktop.'
Cloud OpenRouter Setup: Faster Access Without Local Hardware
- Sign up at openrouter.ai and create a free API key (gives rate-limited access to free-tier models).
- In Claude Code .env or settings:
ANTHROPIC_BASE_URL: "https://openrouter.ai/api"
ANTHROPIC_AUTH_TOKEN: "YOUR_OPENROUTER_API_KEY"
ANTHROPIC_API_KEY: ""
ANTHROPIC_MODEL: "openrouter/free"
ANTHROPIC_DEFAULT_SONNET_MODEL: "openrouter/free"
ANTHROPIC_DEFAULT_OPUS_MODEL: "openrouter/free"
ANTHROPIC_DEFAULT_HAIKU_MODEL: "openrouter/free"
ANTHROPIC_SMALL_FAST_MODEL: "openrouter/free"
CLAUDE_CODE_SUBAGENT_MODEL: "openrouter/free"
- Relaunch Claude Code; it now proxies requests through the 'free' tier, which rotates top open models like Qwen and Mistral (a key sanity check follows below).
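Before relaunching, you can sanity-check the key against OpenRouter's standard completions endpoint. This sketch assumes the key is exported as OPENROUTER_API_KEY; the model slug is an illustrative assumption, so substitute any free-tier model from openrouter.ai/models:

```sh
curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen/qwen-2.5-72b-instruct:free",
       "messages": [{"role": "user", "content": "hi"}]}'
```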
Benefits: near-Sonnet speed, full tool visibility, and support for skills/agents (e.g., the morning-coffee demo spawns four subagents quickly). Drawbacks: not fully private, and the free tier rate-limits heavy use.
'You can see that came back way way quicker... This almost feels like I'm actually using Sonnet in Claude Code.'
Tradeoffs: Balance Cost, Speed, Privacy, and Reliability
Local Ollama: free, private, and unlimited; slow on modest hardware; tool calls opaque without context configuration; best for low-stakes, high-volume work (summarizing files, grepping code, scaffolding, triaging emails/CRM). Cloud OpenRouter/Ollama: faster, stronger models; eventual costs (subscriptions, VPS); suits research, classification, and simple bug fixes.
Avoid open models for high-stakes coding (use Opus); keep them as a fallback for when Anthropic is down (status.anthropic.com). Nothing is truly 'free': scaling means investing in hardware or a VPS. Optimize by chaining: open models for prep work (e.g., filtering context), then a closed model for the final pass, as sketched below.
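A hedged sketch of that chain, using Claude Code's non-interactive print mode (`claude -p`); the file names and prompts are illustrative:

```sh
# Step 1: cheap local pass, where an open model distills the relevant context
ollama run qwen2.5:7b-instruct-q6_K \
  "Summarize the key functions and call sites in this file: $(cat src/auth.py)" > context.md

# Step 2: paid closed-model pass, handing the distilled context to Opus/Sonnet
claude -p "Read context.md and propose a minimal fix for the login bug"
```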
Prerequisites: VS Code, terminal comfort, and basic hardware (8GB+ RAM). This fits early AI-agent workflows: prototype locally, then scale to paid models.
Practice: pull three models (3B/7B/9B); benchmark project-analysis time and accuracy; tweak contexts; compare OpenRouter vs. local on a bug-fix task. A timing sketch follows.
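A rough timing harness for that practice run, assuming a bash shell; the three model tags are illustrative picks from the Ollama library, so swap in whatever fits your RAM:

```sh
# Pull and time each candidate on a small analysis prompt
for m in llama3.2:3b qwen2.5:7b-instruct-q6_K gemma2:9b; do
  ollama pull "$m"
  echo "=== $m ==="
  # A one-shot prompt stands in for a full project scan
  time ollama run "$m" "Which files would you read first to understand this repo? $(ls)"
done
```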
Key Takeaways
- Download Ollama, pull `qwen2.5:7b` (about 6GB), and launch via `ollama launch claude --model <model>` for instant local Claude Code.
- Raise the context window (num_ctx=65536) with `ollama create` and a Modelfile to enable tool visibility and statefulness.
- Use the OpenRouter .env config with 'openrouter/free' for cloud speed without hardware upgrades.
- Select models by SWE-Bench rankings and size: Gemma2/Qwen for efficient coding agents.
- Reserve open-source for low-stakes (summaries, searches, scaffolding); verify high-stakes with Opus.
- Initial $5 Anthropic fee unlocks harness; zero ongoing costs with swaps.
- Test compatibility: Ensure 'tools/thinking' badges and JSON adherence.
- Chain models: Open-source preprocess → closed finalize for cost optimization.
- Monitor: Local slower but private; cloud faster but metered.
- Benchmark your setup: Time project scans, check tool calls for quality.