AI's Jagged Smarts: Verifiability Drives Progress
RL training makes LLMs excel in verifiable domains like code, producing jagged, uneven abilities; embrace Software 3.0 by prompting agents end-to-end instead of hand-coding rules.
Vibe Coding Marks the Agentic Leap
Around December, LLMs crossed a threshold: agents now build entire apps end-to-end without manual fixes. Karpathy calls this 'vibe coding': describe the outcome in natural language and trust the model to handle the implementation. No more snippet-pasting; prompts steer coherent workflows. Berman notes the shift landed first with frontier users, as post-GPT-4 models began delivering flawless chunks that chain into full software.
Example: OpenClaw installation ditched complex bash scripts for a simple agent prompt, a copy-paste text listing the available tools and the desired outcome. The agent inspects the environment, loops through debugging, and installs across platforms. Products like here.now and Journey Kits ship 'agent-native' setups: minimal text like 'Install here.now web hosting for agents via npm, or fetch npm if missing.' Agents figure out the rest, shrinking install files from pages to paragraphs.
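The npm instruction above can be sketched in a few lines. `ensure_tool` is a hypothetical helper standing in for the check-then-act behavior an agent derives from a one-paragraph prompt, not a real API:

```python
import shutil
import subprocess

def ensure_tool(tool: str, install_cmd: list[str]) -> None:
    """Install a CLI tool only if it is missing, the way an agent
    checks its environment before acting (illustrative helper)."""
    if shutil.which(tool) is None:
        subprocess.run(install_cmd, check=True)

# An agent reading "install via npm, or fetch npm if missing" might derive:
#   ensure_tool("npm", platform_specific_install_cmd)
# and then run the actual npm install step.
```

The point of the sketch: the prompt states the goal and the fallback; the conditional logic is recovered by the agent rather than hard-coded in a bash script.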
"I can't remember the last time I corrected it... I trusted the system more and more and then I was vibe coding."
This demands rethinking app dev: describe results, not steps. Traditional code bloats with edge cases; agents leverage trained weights for intelligence.
LLMs as Software 3.0: Prompts Program the New Computer
Karpathy frames LLMs as a paradigm shift, Software 3.0, beyond Software 1.0 (explicit rules) and Software 2.0 (dataset-trained neural nets). Models train on internet-scale data to pick up many tasks implicitly, then get 'programmed' via prompts and context windows. The LLM acts as the CPU (the weights do the processing) with the context window as RAM (holding working state), while peripherals like browsers and files stay unchanged.
Internet data 'programs' the base capabilities; prompts and context then interpret and compute in digital space. Berman highlights Karpathy's 2021 tweet visualizing this: audio and video in and out as peripherals, with an LLM core in place of the OS.
"Software 3.0 now is kind of about your programming now turns to prompting and what's in the context window is your lever over the interpreter that is the LLM."
Build teams pivot: prioritize prompt engineering over rule-writing. Verifiable outputs (code compiles, math checks) amplify this, as RL rewards sharpen peaks there.
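The CPU/RAM analogy above can be made concrete with a toy model. Everything here (`LLMComputer`, `load`, `run`) is illustrative, not a real API:

```python
from dataclasses import dataclass, field

@dataclass
class LLMComputer:
    """Toy model of the Software 3.0 analogy: fixed weights play the CPU,
    the context window plays RAM, and the prompt is the program."""
    context: list[str] = field(default_factory=list)  # "RAM": working state
    context_limit: int = 4                            # context-window size

    def load(self, item: str) -> None:
        # Loading state into the context window is the Software 3.0
        # act of programming: you choose what the interpreter sees.
        self.context.append(item)
        if len(self.context) > self.context_limit:
            self.context.pop(0)  # oldest state falls out, like a sliding window

    def run(self, prompt: str) -> str:
        # A real model would interpret prompt + context here; stubbed out.
        return f"interpret {prompt!r} over {len(self.context)} context items"
```

The design choice the analogy highlights: you never retrain the "CPU" per task; you only change what sits in "RAM" and what program (prompt) runs over it.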
End-to-End Neural Nets Eclipse Traditional Code
Karpathy urges end-to-end nets over hybrid rules-plus-nets stacks. His menu-photo app (OCR the text, generate images, overlay via Vercel) became obsolete. The new way: feed the photo to Gemini with a prompt like 'Use Nanobanana to overlay menu items.' A single multimodal model handles OCR, generation, and compositing in pixel space.
This 'outward creep' of neural nets means rethinking stacks: skip the LLM-for-one-task plus traditional-code pattern. Tesla's Autopilot proves the point: hand-written rules (e.g., 'red stop sign = stop') were scrapped for pure end-to-end nets trained on data. After the switch, performance soared and maintenance simplified. It's the Bitter Lesson: scaling nets with compute and data beats human heuristics.
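The old-versus-new contrast can be sketched side by side. All function names here are hypothetical stand-ins (a real version would call OCR, image-generation, and a multimodal model such as Gemini):

```python
# Hypothetical stand-ins so the sketch runs; real services would go here.
def ocr_extract(photo): return ["pasta", "salad"]
def gen_image(item): return f"img({item})"
def overlay(photo, images): return f"{photo}+{images}"
def multimodal_model(photo, prompt): return f"{photo} edited per {prompt!r}"

def menu_app_v1(photo):
    """Old paradigm: three single-purpose steps glued by traditional code."""
    items = ocr_extract(photo)               # step 1: OCR the menu text
    images = [gen_image(i) for i in items]   # step 2: generate one image per item
    return overlay(photo, images)            # step 3: compositing code

def menu_app_v2(photo):
    """End-to-end paradigm: one prompt to one multimodal net, which does
    OCR, generation, and compositing internally, in pixel space."""
    return multimodal_model(photo, "overlay a generated photo of each menu item")
```

Every line of glue in `menu_app_v1` is a place for edge cases to accumulate; `menu_app_v2` pushes all of that into trained weights.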
"All of my menu gen is spurious. It's working in the old paradigm... your neural network is doing more and more of the work."
The future: no traditional code; vibe-code entire apps. We haven't even fully absorbed Software 2.0, but the trajectory points there.
Verifiability Explains AI's Jagged Edges
AI's 'smart-dumb' duality stems from verifiability: LLMs automate whatever can be checked easily at the output, no full specification needed. Traditional software needs step-by-step rules; LLMs thrive on checkable artifacts (does the code run? does the math balance?). Frontier labs treat training as giant RL environments, rewarding verifiable tasks like code and math.
Code booms for three reasons: it is auto-verifiable (compile, run, read the errors), the economic incentives are huge (enterprises pay for 10-100x developer speed), and training data is abundant. Labs RL-train heavily there, Anthropic earliest. The result: models refactor million-line codebases and find zero-days, yet stumble on 'can I walk 50 meters to the carwash?'
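What "auto-verifiable" means in practice can be shown with a toy reward function of the kind an RL environment for code generation might use. The convention that the candidate defines `f(x)` is an assumption of this sketch, not any lab's actual setup:

```python
def verifiable_reward(candidate_src: str, cases: list[tuple[int, int]]) -> float:
    """Toy RL reward for generated code: run the candidate and score it
    against test cases. No human judge is needed anywhere in the loop."""
    scope: dict = {}
    try:
        exec(candidate_src, scope)            # the "compile/run" check
        f = scope["f"]                        # sketch convention: candidate defines f(x)
        passed = sum(f(x) == y for x, y in cases)
    except Exception:
        return 0.0                            # crashes verify to zero reward
    return passed / len(cases)
```

Because the signal is cheap, exact, and infinitely repeatable, labs can pour compute into it; domains with no such checker get no equivalent lever.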
The 'count the r's in strawberry' failure was patched, but common sense still lags. This jaggedness argues against AGI: code skills don't generalize. Labs chase incentives; unverifiable domains stagnate.
"Traditional computers can easily automate what you can specify in code... LLMs can easily automate what you can verify."
"Show me the incentive and I'll show you the outcome."
Founder Strategy: Target Unverifiable or Fine-Tune Verifiable
Labs dominate the obvious verifiable domains (math, code). Founders: seek verifiable niches for custom RL or fine-tuning with proprietary data, pulling levers the labs ignore. Or chase hard-to-verify, high-value RL environments (Karpathy hints at one but stays coy).
Everything becomes automatable eventually, but unevenly. Build agent-native: ship skills as copy-paste prompts. Matt Schumer's essay flags how this pace is reshaping work and the economy.
"If you are in a verifiable setting where you could create these RL environments... you can use your favorite fine-tuning framework and pull the lever."
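The kind of niche RL environment a founder could build from proprietary data might look like the following sketch. The invoice-totals task, the class, and its methods are all hypothetical illustrations:

```python
import random
from dataclasses import dataclass

@dataclass
class InvoiceTotalsEnv:
    """Hypothetical verifiable niche: extracting totals from invoices,
    scored against a founder's proprietary ground-truth labels."""
    dataset: list[tuple[str, float]]  # (invoice_text, true_total)

    def sample(self) -> str:
        # Serve a task instance to the model being fine-tuned.
        return random.choice(self.dataset)[0]

    def reward(self, invoice_text: str, predicted_total: float) -> float:
        # Auto-verifiable: compare against exact ground truth,
        # no human grading loop required.
        true_total = dict(self.dataset)[invoice_text]
        return 1.0 if abs(predicted_total - true_total) < 0.01 else 0.0
```

The proprietary labels are the moat: labs can't build this reward signal without the data, which is exactly the lever the quote describes.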
Key Takeaways
- Switch to vibe coding: describe outcomes, not steps—agents handle implementation via trained intelligence.
- Install agent-native: ship minimal prompt files (e.g., npm check + install) over bash bloat.
- Go end-to-end: replace code pipelines with single multimodal prompts; heed Bitter Lesson, bet on nets.
- Exploit verifiability: excel where outputs check automatically (code/math); expect jaggedness elsewhere.
- Founders: fine-tune verifiable niches with your data; hunt non-verifiable RL goldmines labs skip.
- Verify before generalizing: AI code/math prowess doesn't imply AGI—skills domain-bound.
- Rethink stacks: LLMs as CPU/RAM; prompts as code in Software 3.0.
- Test the December models: agent workflows have transformed; retry if your last attempt was pre-winter.