The Case for Constraint in Small Models
Building a local coding assistant shows that small models (7B-8B parameters) fail at complex, open-ended workflows. Rather than a "God Prompt" that gives one model access to every tool and file, success comes from a strict topology that reduces the decisions each agent has to make. The author implements a three-node architecture:
- The Architect: Handles user intent and task delegation; has no filesystem access.
- The Explorer: A read-only agent that compiles a focused "Context Map" for the task.
- The Coder: A hyper-focused engine that only performs file modifications based on the provided context.
By stripping navigation and search responsibilities from the Coder, the design lets the model spend its limited capacity entirely on code synthesis.
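The division of responsibilities above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the class methods, the `ContextMap` fields, and the `LLM` callable are all hypothetical names chosen for the example. The point is that each node only holds the capabilities its role needs.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable

# Hypothetical stand-in for a call to a local model: prompt in, text out.
LLM = Callable[[str], str]


@dataclass(frozen=True)
class ContextMap:
    """Bundle the Explorer hands to the Coder (field names are assumptions)."""
    task: str
    files: dict[str, str]  # relative path -> file contents


class Explorer:
    """Read-only node: reads files, never writes them."""

    def __init__(self, root: Path) -> None:
        self._root = root

    def build_context(self, task: str, paths: list[str]) -> ContextMap:
        files = {p: (self._root / p).read_text() for p in paths}
        return ContextMap(task=task, files=files)


class Coder:
    """Write node: only edits files present in the Context Map it receives."""

    def __init__(self, root: Path, llm: LLM) -> None:
        self._root = root
        self._llm = llm

    def apply(self, ctx: ContextMap, target: str) -> None:
        if target not in ctx.files:
            raise PermissionError(f"{target} is outside the Context Map")
        prompt = f"Task: {ctx.task}\n\n{ctx.files[target]}"
        (self._root / target).write_text(self._llm(prompt))


class Architect:
    """Delegation node: has no direct filesystem access of its own."""

    def __init__(self, explorer: Explorer, coder: Coder) -> None:
        self._explorer = explorer
        self._coder = coder

    def handle(self, task: str, paths: list[str], target: str) -> None:
        ctx = self._explorer.build_context(task, paths)
        self._coder.apply(ctx, target)
```

In this sketch the caller supplies `paths`; in the described system that selection would come from the deterministic indexer rather than from the model.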
Deterministic Context vs. LLM Search
Allowing local models to wander the filesystem is slow and prone to hallucination. The author replaces agent-driven searching with Context Loom, a deterministic SQLite-backed indexer. Before the agents start, Context Loom scores files based on explicit rules (e.g., +40 for tagged paths, +8 for Git-modified files). This approach is significantly faster than letting a model perform multiple tool roundtrips, and it provides a transparent "Evidence Ledger" that explains why specific files were selected for the context window.
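A rough sketch of rule-based scoring with an evidence trail, using Python's built-in `sqlite3` module. Only the two weights (+40 and +8) come from the text; the table layout, rule names, and function names are assumptions made for illustration and are not Context Loom's actual schema.

```python
import sqlite3

# Weights quoted in the text; the rule names themselves are illustrative.
RULE_WEIGHTS = {"tagged_path": 40, "git_modified": 8}


def build_index(files: list[tuple[str, bool, bool]]) -> sqlite3.Connection:
    """Record one evidence row per (file, matching rule).

    Each input tuple is (path, is_tagged, is_git_modified).
    """
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE evidence (path TEXT, rule TEXT, points INTEGER)")
    for path, tagged, modified in files:
        matched = []
        if tagged:
            matched.append("tagged_path")
        if modified:
            matched.append("git_modified")
        for rule in matched:
            db.execute(
                "INSERT INTO evidence VALUES (?, ?, ?)",
                (path, rule, RULE_WEIGHTS[rule]),
            )
    return db


def rank(db: sqlite3.Connection, limit: int = 5) -> list[tuple[str, int, list[str]]]:
    """Return (path, total score, sorted rule names), highest score first."""
    rows = db.execute(
        "SELECT path, SUM(points) AS score, GROUP_CONCAT(rule) "
        "FROM evidence GROUP BY path ORDER BY score DESC, path LIMIT ?",
        (limit,),
    ).fetchall()
    return [(path, score, sorted(rules.split(","))) for path, score, rules in rows]
```

Because every point is stored as its own row, the list of rules returned with each file doubles as a simple ledger: it shows exactly which rules put a file in the context window, with no model call involved.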
Infrastructure and Engineering Trade-offs
Building a performant local agent requires managing hardware constraints and communication overhead:
- Hybrid Architecture: The project uses a Rust-based CLI for a responsive TUI and a Python-based core for agent logic, connected via PyO3.
- File-Based IPC: Instead of complex gRPC or socket layers, the system uses temporary JSONL files for progress tracking and control. This is robust, easy to debug, and avoids blocking the main thread.
- VRAM Management: The author warns against maximizing context windows. Large windows often trigger CPU offloading, which sharply reduces generation speed. The optimal context window is the largest one that still fits entirely within GPU VRAM alongside the model weights.
- Defensive Runtime: To handle the inconsistent output of smaller models, the system uses Pydantic-based validation. It is lenient about JSON formatting, which it treats as a presentation layer, but strict about schema contracts, so malformed responses cannot derail the execution loop.
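The lenient-format, strict-schema split in the Defensive Runtime bullet can be illustrated as follows. The source uses Pydantic; this sketch swaps in a hand-written check built on the standard library so that it runs without dependencies. The `EditRequest` schema and all function names are hypothetical.

```python
import json
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class EditRequest:
    """Hypothetical schema for a single Coder action."""
    path: str
    new_content: str


def extract_json(raw: str) -> dict:
    """Lenient layer: tolerate chatter or code fences around the JSON object."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    # json.JSONDecodeError is a subclass of ValueError.
    return json.loads(match.group(0))


def validate(data: dict) -> EditRequest:
    """Strict layer: exact keys and exact types, no extras allowed."""
    expected = {"path": str, "new_content": str}
    if set(data) != set(expected):
        diff = sorted(set(data) ^ set(expected))
        raise ValueError(f"unexpected or missing keys: {diff}")
    for key, typ in expected.items():
        if not isinstance(data[key], typ):
            raise ValueError(f"{key} must be {typ.__name__}")
    return EditRequest(**data)


def parse_model_output(raw: str) -> EditRequest:
    """Run the lenient extraction, then the strict validation."""
    return validate(extract_json(raw))
```

A caller can catch `ValueError` in one place and retry or abort the step, so a badly formed reply never reaches the code that modifies files.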