Slash LLM Token Costs 10x by Fixing 6 Bad Habits
Upcoming frontier models like Claude Mythos will cost 10x more—fix habits like raw PDFs, conversation sprawl, and overusing Opus to drop daily costs from $10 to $1 while getting the same output.
File Formats Are Your Biggest Beginner Token Trap
Raw PDFs, images, and screenshots explode token counts because LLMs encode binary structure, headers, footers, fonts, and layout metadata. A newbie drags in three 1,500-word PDFs (4,500 words total) and asks Claude to "Summarize these." What should be ~5,000 tokens balloons to 100,000+ due to formatting overhead. This waste compounds as the bloated context bounces back in every turn, filling your window fast.
Fix: Convert to markdown first. Free web tools or a quick Claude prompt strip the junk, yielding 4,000-6,000 clean tokens—a 20x saving. The speaker built a plugin for the Open Brain ecosystem: ingest a file, hit "transform," get markdown. For 99% of cases you only need the text, not the styling. Trade-off: you lose visual fidelity but gain speed and cost control. He calls file formats "designed to be human readable, not AI readable."
"4500 words of content can become a 100 plus thousand tokens if you're not careful all you have to do to avoid that is just think in terms of markdown... saving you 20x on the memory." (Context: Explaining PDF bloat; this quote shows why rookies hit limits in one chat.)
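Most of the PDF bloat is repeated page furniture: headers, footers, and page numbers that reappear on every page of the extracted text. A minimal cleanup sketch (the repetition threshold and line-length cutoff are heuristic assumptions, not the speaker's plugin):

```python
import re
from collections import Counter

def clean_extracted_text(raw: str) -> str:
    """Strip page numbers and repeated header/footer lines from
    text extracted out of a PDF, keeping only the body copy."""
    lines = raw.splitlines()
    counts = Counter(line.strip() for line in lines if line.strip())
    kept = []
    for line in lines:
        s = line.strip()
        if re.fullmatch(r"(Page\s+)?\d+(\s+of\s+\d+)?", s):
            continue  # bare page number, e.g. "3" or "Page 3 of 12"
        if counts[s] >= 3 and len(s) < 60:
            continue  # short line repeated on most pages: header/footer
        kept.append(line)
    return "\n".join(kept)
```

Run this on the raw extraction before pasting into a chat; the surviving text is what actually carries meaning for the model.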
Conversation Sprawl Wastes More Than You Think
Intermediate users sprawl chats to 20-40 turns, diluting original instructions amid noise. Models compress history but still resend the full context each turn—every reply costs the entire prior exchange. Mixing research, ideation, and execution in one thread confuses the model and burns tokens.
Fix: Separate modes. Use short, focused chats (10-15 turns max) for heavy work: gather intel in dedicated threads (Grok for X sentiment, ChatGPT for earnings, Perplexity for research, Claude for blogs), then synthesize in a final crisp prompt. Mark evolving chats upfront: "Our goal is to evolve and conclude together." End with "Summarize this." Start fresh often—long threads correlate with "LLM psychosis" as models drift.
Trade-off: More chats mean manual synthesis, but you avoid context dilution and get clearer outputs. Every turn resends the full history, so sprawling is like "filling up the context window with cruft."
"Why make them suffer... why not just ask for what you want upfront... your objective... should be to be so clear that the AI needs to do nothing else and it just goes and gets the work done." (Context: Critiquing multi-mode sprawl; highlights single-turn design of LLMs.)
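Because every reply resends the whole prior exchange, the billed input grows roughly quadratically with thread length. A back-of-envelope sketch (the 500-token turn size is a hypothetical number):

```python
def thread_input_tokens(per_turn: int, turns: int) -> int:
    """Total input tokens billed for one thread: every reply resends
    the whole prior exchange, so turn t costs roughly t * per_turn."""
    return sum(per_turn * t for t in range(1, turns + 1))

# One sprawling 30-turn thread vs. three focused 10-turn threads,
# same total number of turns and same tokens per turn.
sprawl = thread_input_tokens(500, 30)       # 232,500 input tokens
focused = 3 * thread_input_tokens(500, 10)  # 82,500 input tokens
```

Same 30 turns of work, but splitting them into three fresh threads cuts resent history by nearly 3x—before counting the quality benefit of undiluted instructions.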
Plugins and Preloads: The Silent Context Tax
Loading 10+ plugins (e.g., Google Drive you never use) adds 50,000+ tokens before you type—every chat. It's like dumping every workshop tool on the bench before picking a hammer. Hype drives additions, but they barnacle on forever.
Fix: Audit ruthlessly. Use /context in Claude Code to check what loads; disable unused connectors. Only equip 3-5 per task. For advanced setups, prune system prompts weekly—ditch lines left over from the Claude 3.5 era.
Trade-off: Lose convenience for rarely used tools, but gain focus. Models pick wrong tools amid clutter.
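The preload tax is easy to tally once you know each connector's overhead (in Claude Code, /context reports the real numbers; the figures below are hypothetical):

```python
# Hypothetical per-chat token overhead of each enabled connector.
PRELOAD_TOKENS = {
    "google_drive": 12_000,
    "github": 9_000,
    "jira": 7_500,
    "calendar": 6_000,
    "web_search": 5_500,
    "slack": 5_000,
    "notion": 4_500,
    "figma": 4_000,
    "linear": 3_500,
    "maps": 3_000,
}

def preload_overhead(enabled: list[str]) -> int:
    """Tokens silently added to every chat before you type a word."""
    return sum(PRELOAD_TOKENS[name] for name in enabled)

everything = preload_overhead(list(PRELOAD_TOKENS))       # 60,000 tokens/chat
task_scoped = preload_overhead(["github", "web_search"])  # 14,500 tokens/chat
```

Ten barnacled connectors cost 60k tokens on every single chat; scoping to the two the task needs drops that to under 15k.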
Model Tiering Delivers 8-10x Savings Without Losing Quality
Using Opus (or GPT-4o) for everything—formatting, proofreading, execution—is overkill. A production pipeline the speaker reviewed analyzes long conversations across dozens of dimensions on frontier models, yet costs <25¢/user because it tiers models: Opus for reasoning, Sonnet for execution, Haiku for polish.
Example math (5-hour session, same output): Sloppy (raw PDFs, 30-turn sprawl, all-Opus): 800k-1M input tokens + 150-200k output = $8-10 ($5/M input, $25/M output). Clean (markdown, fresh chats every 10-15 turns, tiered models, scoped context): 100-150k input + 50-80k output = ~$1, since most of those tokens run at cheaper Sonnet/Haiku rates rather than Opus rates. Scale to a 10-person team on the API: $2,000 vs. $250/month.
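The session math above can be reproduced with a small calculator. The Opus rates match the $5/M in, $25/M out figures in the example; the Sonnet/Haiku rates and the 20/50/30 work split are illustrative assumptions:

```python
RATES = {  # $ per million tokens: (input, output); check current pricing
    "opus":   (5.00, 25.00),
    "sonnet": (3.00, 15.00),
    "haiku":  (1.00, 5.00),
}

def session_cost(usage: dict[str, tuple[int, int]]) -> float:
    """usage maps model -> (input_tokens, output_tokens); returns USD."""
    total = 0.0
    for model, (tokens_in, tokens_out) in usage.items():
        rate_in, rate_out = RATES[model]
        total += tokens_in / 1e6 * rate_in + tokens_out / 1e6 * rate_out
    return total

sloppy = session_cost({"opus": (900_000, 175_000)})  # everything on Opus
clean = session_cost({
    "opus":   (25_000, 13_000),   # reasoning only
    "sonnet": (62_500, 32_500),   # execution
    "haiku":  (37_500, 19_500),   # polish
})
```

With these assumed numbers the sloppy session lands near $8.88 and the clean tiered one near $1.26—roughly a 7x gap from tiering alone, before caching and context hygiene widen it further.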
Trade-off: Test cheaper models per task; Haiku shines on polish but flops on complex reasoning. As models improve, lean out context—trust retrieval over frontloading.
"Don't bring a Ferrari to the grocery store." (Context: Model tiering; punchy metaphor for using Opus everywhere.)
Production Levers: Caching, Search, and Auditing
Advanced users screw up at scale (millions of tokens). Ignore prompt caching? Miss 90% discounts (Opus: $0.50/M cached vs. $5/M). System prompts bloat from unpruned cruft. Web search via native Claude burns 10-50k tokens/query vs. Perplexity (5x faster, structured citations).
Fixes: Cache stable context (prompts, tools, docs). Use MCP connectors for cheap search (e.g., a Perplexity service). For agents/repos, retest context needs with each model generation—older, dumber models needed fat windows; newer ones let you trim.
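Prompt caching pays off whenever a large stable prefix (system prompt, tools, docs) is resent every turn. A sketch of the input-side math, using the $5/M base and $0.50/M cached-read rates from the example and an assumed 1.25x cache-write surcharge:

```python
def thread_input_cost(stable: int, per_turn: int, turns: int,
                      rate: float = 5.0, cached_rate: float = 0.5,
                      write_mult: float = 1.25) -> tuple[float, float]:
    """Input-side cost (USD) of one thread, without and with prompt
    caching on the stable prefix. Rates are $ per million tokens;
    the write surcharge is an illustrative assumption."""
    history = sum(per_turn * t for t in range(1, turns + 1))  # dynamic turns
    uncached = (stable * turns + history) * rate / 1e6
    cached = (stable * rate * write_mult            # turn 1 writes the cache
              + stable * cached_rate * (turns - 1)  # turns 2..N read it
              + history * rate) / 1e6
    return uncached, cached

# 50k-token stable prefix, 500-token turns, 15-turn thread (hypothetical)
uncached, cached = thread_input_cost(50_000, 500, 15)
```

With these assumptions the thread's input cost falls from about $4.05 to about $0.96—and the bigger the stable prefix relative to the conversation, the closer savings get to the full 90% discount.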
Jen-Hsun Huang pegs engineer token spend at $250k/year—don't be that person. With Mythos/GPT-next/Gemini (GB300-trained, 10x Opus pricing rumored: $50/M in, $250/M out), sloppy habits scale painfully.
"The models are not expensive it's your habits that cost a lot... your mistakes scale with the price of intelligence." (Context: Thesis opener and closer; frames costs as behavioral, not inherent.)
Tools to Diagnose and Fix Your Usage
The speaker built a "stupid button" (Open Brain plugin/skill/guardrails):
- Audit prompt: Paste recent chat; flags raw docs, sprawl, model misuse, redundant loads—prioritizes fixes.
- Gas tank skill: Measures per-session overhead (system prompts, plugins); before/after baselines.
- Guardrails: Blocks token-waste on knowledge stores.
Run it: the audit answers six questions (raw files? fresh chats? all-Opus? preloads? caching? cheap search?). The prompt version needs no setup.
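The checklist logic behind those six questions is simple enough to self-administer. A minimal sketch, not the speaker's actual plugin (keys and fix wording are hypothetical):

```python
# The six audit questions as yes/no flags; True means the bad habit
# is present, and maps to its prioritized fix.
AUDIT = {
    "raw_files":    "Convert PDFs/screenshots to markdown before pasting.",
    "sprawl":       "Start a fresh chat; cap threads at 10-15 turns.",
    "all_opus":     "Tier models: Opus reasons, Sonnet executes, Haiku polishes.",
    "preloads":     "Disable unused plugins/connectors; keep 3-5 per task.",
    "no_caching":   "Cache the stable prefix (system prompt, tools, docs).",
    "pricy_search": "Route search through a cheap service (e.g. Perplexity MCP).",
}

def audit_usage(habits: dict[str, bool]) -> list[str]:
    """Return the fixes for every bad habit flagged True, in priority order."""
    return [fix for key, fix in AUDIT.items() if habits.get(key)]

fixes = audit_usage({"raw_files": True, "all_opus": True, "no_caching": False})
```

Answer the six flags honestly after your next session and work the returned list top to bottom.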
Real pipeline proves frontier AI viability: Dozens of analysis dimensions on long convos, personalized output, <25¢/user.
"Frontier AI can be absurdly cheap when you know what you're doing... most of us are spending more than we need to on AI." (Context: Production example; counters "AI is too expensive" narrative.)
Key Takeaways
- Convert all inputs to markdown: 20x token savings on docs/images; use free tools or Claude.
- Cap chats at 10-15 turns; separate research from execution for clarity and cost.
- Audit plugins/preloads weekly: Disable barnacles adding 50k+ tokens/chat.
- Tier models: Opus reasoning, Sonnet execution, Haiku polish—8-10x cheaper same output.
- Cache stable context: 90% off repeated inputs; essential for agents/production.
- Use cheap search (Perplexity/MCP): 10-50k fewer tokens/query, faster results.
- Prune system prompts biweekly; trim context as models smarten.
- Baseline usage with audits: Turn $10/day slop into $1/day efficiency.
- Prep for 10x pricier models: Habits today dictate ROI tomorrow.
- Build token smarts: $250k/year engineer spend is avoidable skill gap.