Microsoft ExP: A/B Tests Expose 1/3 Feature Success Rate
Microsoft's Experimentation Platform (ExP) enabled A/B testing on high-traffic sites, shifting culture from HiPPO to data-driven decisions—yet only 1/3 of tested ideas improved key metrics, humbling preconceptions.
Cultural Barriers to Data-Driven Product Decisions
Microsoft's product teams historically relied on HiPPO (Highest-Paid-Person's Opinion) or gut feeling for feature prioritization, leading to inefficient development. Ronny Kohavi proposed the Experimentation Platform (ExP) in 2005, inspired by Ray Ozzie's memo emphasizing closed-loop measurement for web services. Technical scalability for sites like the MSN homepage was solvable, but cultural resistance proved harder: teams feared failure, preferred analysis over testing, and misunderstood statistics (e.g., assuming that with millions of users, significance testing was unnecessary). ExP's dual mission, to build an easy-to-integrate platform and to foster a data-driven culture, started as a 7-person incubation in 2006. Adoption grew from 2 experiments in FY2007 to 44 in FY2009 across 20 properties, including MSN Homepages, Office Online, and Support.microsoft.com.
Testimonials highlight the shift: one team noted experiments "dispelled long-held assumptions about video advertising" and changed how it prioritized features; MSN UK ditched "opinion, gut feeling" in favor of statistical data; another team called ExP "essential for the future success of all Microsoft online properties." ExP tackled resistance through education (monthly seminars), weekly result emails that built institutional memory, and quick wins that proved value, enabling teams to resolve debates with data rather than deferring to authority.
"We should use the A/B testing methodology a LOT more than we do today" – Bill Gates, 2008. This endorsement from leadership validated ExP, countering inertia where even Search teams underused statistical rigor pre-ExP.
Real-World A/B Tests Validate or Kill Ideas with Hard Metrics
ExP ran controlled experiments (randomized A/B tests) that measured an Overall Evaluation Criterion (OEC) such as revenue, engagement, or clickthrough rate (CTR) to establish causality. Key insight: preconceptions fail; even experts guessed wrong. Examples across MSN properties follow (a minimal sketch of the underlying significance test appears after the list):
- MSN Real Estate Widget: Tested 6 designs for the "Find a home" widget driving referral revenue. Only 3 of 21 ZAAZ designers predicted the winner (Treatment 5, a simpler search-like UI). Result: +10% revenue from higher clickthrough; the flashier variants were rejected.
- MSN UK Hotmail Module: Control opened Hotmail in the same tab (replacing the MSN page); Treatment opened a new tab. Over 16 days on 1M users: +8.9% clicks per user on the MSN homepage, boosting engagement. Rolled out to UK/US. The site manager noted the data reversed the team's initial rejection of the idea.
- MSN Entertainment Video Ads: Pre-roll ads (Control) vs. post-roll (Treatment). OEC: 6-week return rate on a user cohort. The +2% gain in returns was insufficient to offset a -50% drop in ad impressions. Bonus finding: users were insensitive to cutting the ad interval from 180s to 90s, which significantly boosted annual revenue; deployed globally.
- MSN Homepage Ads: Adding 3 below-the-fold ads projected $10k+/day in revenue but risked UX. Page views and clicks were monetized using search-engine-marketing (SEM) costs. On 5% of traffic over 12 days: a -0.35% relative drop in CTR and in page views per user-day. The lost user value exceeded the ad gains; the idea was killed.
- Support.microsoft.com Personalization: Generic top issues (Control) vs. browser/OS-specific issues (Treatment). +50% CTR proved the value of even simple personalization, leading to integration into the core support system.
- MSN US Search Header: Magnifying-glass icon (Control) vs. word labels like "Search" (Treatments). +1.23% searches; actionable labels beat icons, matching Steve Krug's usability advice, which the team had previously ignored.
- Pre-Bing Search Branding: A branding variant increased Search-box clicks, searches, and page clicks, informing the Bing launch design.
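The statistics behind comparisons like these can be illustrated with a minimal sketch: a two-proportion z-test on CTR between control and treatment. All counts below are assumed illustration values, not figures from the experiments above.

```python
# Minimal sketch of an A/B significance test on a binary metric (CTR).
# Counts are made-up illustration values, not data from the actual experiments.
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(clicks_a, users_a, clicks_b, users_b):
    """Return (relative lift, z, p) for treatment B vs. control A."""
    p_a, p_b = clicks_a / users_a, clicks_b / users_b
    pooled = (clicks_a + clicks_b) / (users_a + users_b)
    se = sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
    return (p_b - p_a) / p_a, z, p_value

# Even with ~500k users per arm, a small lift still needs a significance test.
lift, z, p = two_proportion_ztest(clicks_a=25_000, users_a=500_000,
                                  clicks_b=25_600, users_b=500_000)
print(f"relative lift {lift:+.2%}, z={z:.2f}, p={p:.4f}")
```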
These experiments spanned widgets, ads, personalization, and UX, showing that controlled experiments can resolve tradeoffs (e.g., revenue vs. loyalty) at scale. Multi-variant designs and cohort tracking handled the added complexity; a cohort-tracking sketch follows.
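Below is a toy sketch of cohort tracking for a retention OEC like the video-ads test's 6-week return rate. The data shapes and field names are illustrative assumptions, not ExP's actual schema.

```python
# Toy cohort tracking for a retention OEC: fraction of exposed users who
# return within a fixed window after first exposure. Schema is assumed.
from datetime import date, timedelta

def return_rate(exposures, visits, window_days=42):
    """exposures: {user_id: first_exposure_date}; visits: {user_id: [dates]}.
    Returns the fraction of the cohort that came back within the window."""
    returned = 0
    for user, first_seen in exposures.items():
        cutoff = first_seen + timedelta(days=window_days)
        if any(first_seen < v <= cutoff for v in visits.get(user, [])):
            returned += 1
    return returned / len(exposures)

cohort = {"u1": date(2009, 3, 1), "u2": date(2009, 3, 1)}
visits = {"u1": [date(2009, 3, 20)], "u2": []}
print(f"6-week return rate: {return_rate(cohort, visits):.0%}")  # 50%
```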
"Passion is inversely proportional to the amount of real information available" – Gregory Benford, 1980. Authors invoke this to explain heated debates quelled by data.
ROI Reality: Low Success Rates Demand Rigorous Testing
ExP's ROI: accelerated innovation by pruning bad ideas early. The sobering statistic: only about 1/3 of tested ideas improved the metrics they were intended to improve, in line with industry experience (Amazon reports under 50%). Internal reviews pass most ideas, yet experiments reveal failures; a selection bias toward uncertain ideas does not fully explain the gap. Launching without testing hides small effects, because day-to-day external noise dominates sequential before/after observation (simulated below), and backing out a shipped failure costs more than killing it early.
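A small simulation sketch, with assumed magnitudes, of why sequential launch-and-observe hides small effects: day-to-day external noise swamps a before/after comparison, while a concurrent randomized split exposes both arms to the same daily conditions.

```python
# Sequential before/after vs. concurrent A/B on a daily metric.
# TRUE_LIFT, DAY_NOISE, and BASE are assumed illustration values.
import random

random.seed(1)
TRUE_LIFT = 0.01   # a real +1% effect on the metric
DAY_NOISE = 0.05   # +/-5% day-to-day external variation (news, seasonality)
BASE = 100_000.0

def day_metric(day_effect, lift=0.0):
    return BASE * (1 + day_effect) * (1 + lift)

days = [random.gauss(0, DAY_NOISE) for _ in range(14)]

# Sequential: old version in week 1, new version in week 2.
before = sum(day_metric(d) for d in days[:7]) / 7
after = sum(day_metric(d, TRUE_LIFT) for d in days[7:]) / 7
print(f"sequential estimate: {after / before - 1:+.2%}")  # swamped by noise

# Concurrent A/B: both arms share each day's external conditions.
ctrl = sum(day_metric(d) for d in days) / 14
treat = sum(day_metric(d, TRUE_LIFT) for d in days) / 14
print(f"concurrent estimate: {treat / ctrl - 1:+.2%}")    # recovers ~+1.00%
```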
Pre-ExP, Microsoft underused experiments outside Search and MSN, with no consistent statistical practice; ExP centralized the expertise needed for scalability. Humans also predict poorly: in psychology studies, people's pattern-seeking strategies lose to simply always guessing the more frequent outcome. The tradeoff: experiments add upfront time but avoid sunk costs.
"The fascinating thing about intuition is that a fair percentage of the time it's fabulously, gloriously, achingly wrong" – John Quarto-vonTivadar, FutureNow. Underscores why ExP's data trumps HiPPO.
Progress: FY2007: 2 experiments; FY2008: 8; FY2009: 44. Search's experimentation evolved independently, using ExP's statistical methods. Cultural wins: teams now prioritize via data and share learnings.
Key Takeaways
- Run A/B tests on all major features using OEC to causally measure impact—randomization ensures differences stem from changes.
- Define a monetized OEC for tradeoffs (e.g., assign $ values to clicks/page views via SEM costs) to compare revenue vs. UX; see the worked sketch after this list.
- Expect ~1/3 success rate; test uncertain ideas early to kill losers before full rollout.
- Overcome cultural resistance via examples, education, leadership buy-in (Gates/Ozzie), and quick wins; share results widely.
- Use cohorts/long-term tracking for retention; multi-variant for design contests.
- Personalization and UX tweaks (e.g., labels over icons, opening in a new tab over replacing the page) yield outsized gains; test assumptions.
- Centralize platform for stats expertise/scalability; avoid sequential launches (noise hides signals).
- Institutionalize: Weekly emails, seminars build memory/advocacy.
- Simpler often wins (e.g., the search-like widget beat flashier designs; word labels beat icons).
- Stats matter: Millions of users still need significance tests.
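To make the monetized-OEC takeaway concrete, here is a worked sketch in the spirit of the MSN Homepage Ads test. The per-click and per-page-view dollar values and traffic volumes are assumptions for illustration; only the $10k/day ad projection comes from the example above.

```python
# Worked monetized-OEC sketch: assign $ values to clicks and page views
# (e.g., derived from SEM acquisition costs), then compare lost user value
# against projected ad revenue. All figures except AD_REVENUE_PER_DAY are
# assumed for illustration.
VALUE_PER_CLICK = 0.05        # assumed $ value of a homepage click
VALUE_PER_PAGEVIEW = 0.02     # assumed $ value of a page view
AD_REVENUE_PER_DAY = 10_000   # projected gain from the extra below-fold ads

daily_clicks = 40_000_000     # assumed traffic volume
daily_pageviews = 60_000_000  # assumed traffic volume
rel_drop = 0.0035             # -0.35% relative drop in both metrics

lost_value = (daily_clicks * rel_drop * VALUE_PER_CLICK
              + daily_pageviews * rel_drop * VALUE_PER_PAGEVIEW)
print(f"lost user value/day: ${lost_value:,.0f} vs. ad gain ${AD_REVENUE_PER_DAY:,}")
# If the lost user value exceeds the ad gain, the monetized OEC says kill the idea.
```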