Shopify's AI Surge: Custom Tools Beat Hype

Shopify CTO Mikhail Parakhin details near-100% internal AI adoption post-Dec 2024, unlimited Opus-4.6 tokens, and tools like Tangle, Tangent, SimGym that make ML reproducible, auto-optimized, and customer-simulatable—revealing review loops and CI/CD as true agent bottlenecks.

Shopify's Internal AI Explosion and December Inflection

Mikhail Parakhin, Shopify's CTO, shares exclusive data on the company's AI adoption: nearly 100% of employees now use AI tools daily, after only modest growth before 2024. A December 2024 "phase transition"—driven by jumps in model quality—sparked exponential growth in token usage and tool adoption. CLI-based agents like the internal "River" are outpacing IDE tools (e.g., Copilot, Cursor) in growth, as developers favor non-visual, high-speed interactions. Shopify funds unlimited tokens on top models like Claude Opus-4.6 or GPT-5.4, enforcing a floor of high-quality models while observing skewed usage: the top 10% of users consume disproportionately more tokens, hinting that early power users dominate before broader diffusion.

Parakhin attributes this to Shopify's maturity as a $200B, 20-year-old firm going "all-in" on AI, now vocal about internals like Tobi's QMD and his SQLite preferences for agent data storage—echoing Andrej Karpathy's recent agent context queries.

"It approaches really 100% by now. It’s hard to do your job now without interacting deeply, at least with one tool." — Mikhail Parakhin on daily AI active users.
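The interview mentions a preference for SQLite as the agent's local data store but gives no schema. A minimal sketch of what SQLite-backed agent memory could look like (table and field names here are hypothetical, not Shopify's):

```python
import sqlite3

def open_store(path=":memory:"):
    # One table of append-only agent events; SQLite gives durability,
    # queryability, and a single-file store with no server to run.
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS agent_events (
               id      INTEGER PRIMARY KEY,
               session TEXT NOT NULL,
               role    TEXT NOT NULL,      -- 'user' | 'assistant' | 'tool'
               content TEXT NOT NULL,
               ts      REAL DEFAULT (strftime('%s', 'now'))
           )"""
    )
    return conn

def log_event(conn, session, role, content):
    conn.execute(
        "INSERT INTO agent_events (session, role, content) VALUES (?, ?, ?)",
        (session, role, content),
    )
    conn.commit()

def recent_context(conn, session, limit=50):
    # Last `limit` events for a session, returned oldest-first
    # so they can be replayed directly into a model context.
    rows = conn.execute(
        "SELECT role, content FROM agent_events "
        "WHERE session = ? ORDER BY id DESC LIMIT ?",
        (session, limit),
    ).fetchall()
    return rows[::-1]
```

The appeal over ad-hoc JSON files is that context retrieval, filtering, and retention policies all become plain SQL.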

Token Budgets Directionally Right, But Critique Beats Parallel Agents

Jensen Huang's push for engineer token quotas (e.g., 100K tokens/year per $200K engineer) gets Parakhin's endorsement as "directionally correct," countering critics likening it to lines-of-code metrics. However, raw tokens mislead: anti-patterns like parallel non-communicating agents burn tokens inefficiently. True unlocks are critique loops—where one agent generates, another (ideally on a stronger model) critiques and iterates—yielding higher-quality code despite longer latency.
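The critique-loop pattern described above can be sketched in a few lines. The `generate` and `critique` callables stand in for real model calls (the interview names no specific API), with the critic intended to run on the stronger, more expensive model:

```python
def critique_loop(task, generate, critique, max_rounds=3):
    """Generate a draft, have a (stronger) critic review it, and
    iterate until the critic approves or the round budget runs out."""
    draft, feedback = None, None
    for _ in range(max_rounds):
        draft = generate(task, draft, feedback)    # cheap/fast generator model
        verdict, feedback = critique(task, draft)  # expensive pro-model critic
        if verdict == "approve":
            break
    return draft
```

The loop is deliberately sequential: each revision sees the critic's feedback, which is exactly what parallel non-communicating agents lack.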

AI generates cleaner code than the average human but floods production with sheer volume, spiking bug counts. Parakhin stresses spending more on review (e.g., GPT-5.4 Pro, Gemini Deep Think) than on generation: fewer tokens generated slowly via debate beat swarms of cheap generations. Off-the-shelf tools fail here, lacking pro models and sequential depth, so Shopify built custom PR reviewers that prioritize rigor over speed.

PR merges grew 30% MoM (vs. 10% pre-AI), with rising complexity, but test failures, flaky CI/CD, and rollbacks now bottleneck deployments. Humans tolerate week-long reviews; AI can afford hours if it cuts aggregate cycle time.
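As a quick sanity check on what those reported growth rates compound to over a year:

```python
# Compounding the reported month-over-month merge growth over 12 months.
pre_ai = 1.10 ** 12   # 10% MoM baseline -> roughly 3.1x per year
with_ai = 1.30 ** 12  # 30% MoM with AI  -> roughly 23x per year
```

That order-of-magnitude gap in annualized merge volume is why downstream stages (tests, CI/CD, rollbacks) become the binding constraint.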

"The important metric is the ratio of budget spent during code generation versus... expensive tokens... checking on PR reviews." — Mikhail Parakhin on balancing AI coding costs.

CI/CD and Git Need Agent-Era Reinvention

Traditional Git, PRs, and CI/CD—designed for humans—creak under machine-speed code volume. Stack diffs (via Graphite) help Shopify, but Parakhin foresees a paradigm shift: stabilize first (clamp bugs, speed tests), then rethink metaphors entirely. PR volume overwhelms; evicting bad PRs mid-pipeline eats time. No compelling replacements yet, but everyone's racing to adapt.

Tangle: Reproducible ML Beyond Airflow

Tangle addresses ML/data workflow irreproducibility: content-addressed caching creates team-wide network effects, making experiments collaborative and production-ready from day one. Unlike Airflow's DAGs (sequential, brittle), Tangle's idempotent, cache-first design enables reuse across repos—vital for Shopify's explosive AI growth.
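The mechanics of content-addressed caching can be sketched as follows. This is an illustration of the general technique, not Tangle's actual implementation: a step's cache key is derived from its code and inputs, so any teammate (or repo) sharing the store reuses identical work automatically:

```python
import hashlib
import json

def cache_key(fn, args):
    # Key = hash of the step's compiled code plus its canonicalized inputs.
    # Same code + same inputs -> same key, for anyone, in any repo.
    payload = (
        fn.__name__
        + fn.__code__.co_code.hex()
        + json.dumps(args, sort_keys=True)
    )
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_step(store, fn, **args):
    key = cache_key(fn, args)
    if key in store:       # hit: someone already ran this exact step
        return store[key]
    result = fn(**args)    # miss: compute once, share with the whole team
    store[key] = result
    return result
```

With a shared `store` (a dict here; a blob store in practice), caching stops being per-user and becomes a team-wide network effect, which is the collaborative property the section describes.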

Tangent: Auto-Research Democratizes Optimization

Tangent automates research loops for search, themes, prompt compression, and storage. PMs and domain experts now run AutoML-style experiments sans ML engineers, fulfilling long-promised "AutoML that feels real" in the LLM era. Limits persist (e.g., poor hypothesis generation), but paired with Tangle/SimGym, it compounds: reproducible evals on simulated data.

"Tangent is becoming a democratizing tool for PMs and domain experts, not just ML engineers." — Mikhail Parakhin on broadening AI access.

SimGym: Historical Data Moat in Customer Simulation

SimGym simulates buyers and merchants using Shopify's vast store of historical trajectories, a defensible moat versus generic simulators. It evolved from A/B comparisons to live recommendations (e.g., "change this storefront element to lift conversions"). Multimodal models, browser farms, serving, and distillation make it costly; counterfactuals model interventions like discounts or notifications.

Category behaviors vary wildly, which has revived Chinese Restaurant Processes for clustering them. HSTU models track trajectories; real data ensures a fidelity that toy simulators lack.
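For readers unfamiliar with the Chinese Restaurant Process: it is a classic Bayesian nonparametric prior that clusters items without fixing the number of clusters up front, which suits wildly varying category behaviors. A minimal sampler (the use for category clustering is the interview's; this toy implementation is mine):

```python
import random

def crp_assignments(n_items, alpha=1.0, seed=0):
    """Sample cluster assignments from a Chinese Restaurant Process.
    Item i joins existing cluster k with probability size_k / (i + alpha),
    or opens a new cluster with probability alpha / (i + alpha)."""
    rng = random.Random(seed)
    assignments, sizes = [], []          # sizes[k] = items in cluster k
    for i in range(n_items):
        weights = sizes + [alpha]        # existing clusters + new-cluster mass
        r = rng.uniform(0, i + alpha)
        k, acc = 0, 0.0
        for k, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if k == len(sizes):              # opened a new cluster
            sizes.append(0)
        sizes[k] += 1
        assignments.append(k)
    return assignments
```

Larger `alpha` yields more, smaller clusters; the cluster count grows roughly logarithmically with the number of items, so the model adapts as new category behaviors appear.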

UCP, Liquid AI: Low-Latency Non-Transformer Wins

Shopify's UCP unifies catalog intelligence: runtime search, bulk lookups, and identity linking. Liquid AI's non-transformer foundation models deliver sub-20ms latencies for query understanding, Sidekick Pulse, and at-scale workloads: the first non-transformer architecture Parakhin calls "genuinely competitive." Model choice stays pragmatic, merit-based rather than hype-driven; Liquid scales with compute, but Shopify remains open to alternatives.

Bing Sydney Lessons and Hiring Push

Parakhin recounts shaping Bing's Sydney personality deliberately (not accidentally), an early lesson in controlling AI character. Shopify is hiring aggressively across ML, data science, and distributed databases for AI infrastructure.

Key Takeaways

  • Enforce high-model floors (e.g., Opus-4.6+) with unlimited budgets; track critique-to-generation ratios over raw tokens.
  • Build custom PR reviewers with sequential pro-model debates; tolerate review latency if it cuts aggregate cycle time.
  • Use content-addressed caching (Tangle-style) for ML reproducibility; beats Airflow for collaboration.
  • Democratize via auto-research (Tangent): let PMs optimize without ML PhDs.
  • Leverage proprietary data for sims (SimGym): historical trajectories enable counterfactuals and moats.
  • Reinvent CI/CD metaphors for agents; stabilize human-era tools first.
  • Test non-transformers like Liquid AI for latency wins in production.
  • Monitor adoption skew: power users lead; train others to broaden adoption.
  • AI code volume outstrips human output even at comparable quality; review spend must scale accordingly.

"Good model writes code on average with fewer bugs than the average human. But since they write so much more of it... you have to have a very rigorous PR reviews." — Mikhail Parakhin on AI coding pitfalls.

"We probably need a different metaphor or different whole design of how to process CI/CD in new agentic world." — Mikhail Parakhin on dev tooling evolution.

Summarized by x-ai/grok-4.1-fast via openrouter


© 2026 Edge