long-form-interview · Stefan Thomke

At Booking.com, Innovation Means Constant Failure

In domains with novelty, intuition is wrong most of the time; the only defensible way to make decisions is to run the scientific method at massive scale — high-velocity incrementalism plus democratized, transparent, self-governing experimentation — where failure is the status quo and learning, not winning, is the KPI.

booking-com · hbs · stefan-thomke · cold-call · experimentation · ab-testing · 88% confidence

Why this is in the corpus

Canonical HBS case on experimentation culture: ~1,000 concurrent tests, 20-30K/year, democratized launch + nuclear-button kill rights, and an explicit posture that "failure is the status quo" — a benchmark every operator building an experimentation program should know.

Summary for skimmers

Harvard Business School professor Stefan Thomke presents his case on Booking.com's experimentation engine. The case opens with a director of testing asking CEO Gillian Tans to approve a radical experiment — repaint the booking.com landing page to look like a Google search page, during Christmas peak, on a significant share of real users. Thomke uses the dilemma to unpack Booking's operating model: ~1,000 concurrent tests at any moment producing quadrillions of landing-page variants; anyone can launch without management permission but anyone can also "push the nuclear button" to kill; experiments must be broadcast inside the company before launching so others can criticize; Booking has learned that reasonable-seeming hypotheses are wrong ~9/10 times. He contrasts this with Microsoft Bing's $100M+ headline-text change, with Ron Johnson's untested JCPenney transplant of Apple Store playbooks, and with competitors who admit they cannot compete against this model because their committee cycles are slower than Booking's test cycles. Thomke generalises: big data gives correlations, experiments give causality — the two are complementary; "high-velocity incrementalism" accumulates into large performance changes; and "not winning is not losing" because failures feed the next hypothesis. Brick-and-mortar retailers (Kohl's) are adopting small-sample experimentation techniques that are technically more sophisticated than the large-sample digital kind.

Briefing

What survives the editorial filter

This page should feel like a smart colleague already listened for you and left only the operating logic worth keeping. Not everything said in the episode makes it through.

Trust signal

second_order_reporting

Guest type: theorist.

Best used for

HBS professor Stefan Thomke on Booking.com's experimentation engine — ~1,000 concurrent tests, democratized launch + nuclear-button kill, 9/10 hypotheses wrong, high-velocity incrementalism, big data for hypotheses vs experiments for causality, failure-as-status-quo culture.

Hold lightly

No explicit downgrade reason stored yet for this episode.

Principles

Durable claims that survive beyond the speaker's biography — each with explicit limits, transferability judgment, and evidence.

Principle

Transparency is how an experimentation org self-governs at scale

At ~1,000 concurrent experiments, no committee can gate everything — so transparency + distributed veto becomes the governance model, not a cultural nicety.

Thomke frames this explicitly: "You can't have a committee overlook every single experiment if you want to operate at that scale — you need an organization that has these kinds of checks and balances and self-governance." Anyone launches without management permission. Anyone can push the "nuclear button" to kill. And anyone can critique pre-launch. The result is checks-and-balances without a bottleneck.

Use when: Scaling experimentation orgs above ~100 experiments/month where committees become the bottleneck.

Skip when: Small teams where a single reviewer can still see everything; or safety-critical work where distributed veto is insufficient.

If you want velocity, replace the experiment review committee with (a) mandatory internal broadcast and (b) universal kill rights. Both, not one.

“Anybody can launch an experiment without permission from management, but anybody in the company can also kill an experiment that somebody else launched. They call it pushing the nuclear button.” Stefan Thomke

“When you're trying to launch an experiment, you actually have to broadcast it first inside the organization so other people can look at your experiment, ask questions, criticize it.” Stefan Thomke

Durability: Durable governance pattern for high-throughput test orgs.

Principle

Big data gives correlations; experiments give causality — use both

Big data and experimentation are complementary: big data surfaces testable hypotheses; experimentation delivers causality. Teams that do one without the other ship bad decisions.

Thomke names three failure modes of decision-making under novelty: (1) relying on past experience — fails because novelty breaks priors; (2) relying on correlations in big data — fails because correlation ≠ causality; (3) copying other contexts — fails because playbooks do not transfer automatically between contexts (Ron Johnson, JCPenney). The right move is to use big data for hypothesis generation (patterns → testable hypotheses) and experiments for validation (cause → effect).
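A small simulation may make the second failure mode concrete. The sketch below borrows the shape of Thomke's hand-size example: a hidden group variable drives both hand size and lifespan, while inside each group the two are drawn independently. Every number here (means, spreads, sample size, seed) is an invented assumption for illustration, not demographic data.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def simulate(n=20_000, seed=7):
    """Generate (group, hand_size, lifespan) rows where group drives both.

    Within a group, hand size and lifespan are independent, so any pooled
    correlation comes only from the hidden group variable.
    All parameters are made-up illustrative values.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        group = rng.choice(("a", "b"))
        hand = rng.gauss(17.0 if group == "a" else 19.0, 0.8)
        life = rng.gauss(81.0 if group == "a" else 76.0, 6.0)
        rows.append((group, hand, life))
    return rows

rows = simulate()
pooled = pearson([r[1] for r in rows], [r[2] for r in rows])
within = {
    g: pearson([r[1] for r in rows if r[0] == g],
               [r[2] for r in rows if r[0] == g])
    for g in ("a", "b")
}
print(f"pooled correlation: {pooled:+.2f}")
for g, r in within.items():
    print(f"within group {g}: {r:+.2f}")
```

The pooled correlation should come out clearly non-zero while the within-group correlations sit near zero. A dashboard that only sees the pooled data would suggest a relationship that an experiment (or even a stratified look) would not support.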

Use when: Any data team or product team feeding decisions with dashboard output.

Skip when: Research contexts where causal-inference methods (DiD, IV, RCT) are already the default — the gap does not exist.

Use big data to generate hypotheses; use experiments to test them. Do not let a correlation dashboard close a decision loop on its own.

“We have a lot of examples where things correlate very highly and there is no causal relationship... there's a big correlation, for example, between hand size and life expectancy... the underlying causal variable is gender.” Stefan Thomke

“Big data gives us correlations, but it doesn't give us causality.” Stefan Thomke

Durability: Durable; re-derived in every generation of analytics teams.

Principle

Not winning is not losing — failures feed the next hypothesis

When the base rate of success is ~10%, a culture that punishes failure punishes its own primary information source; failures must be explicitly framed as learning artifacts, not losses.

Thomke is explicit: "Failure is the status quo. Not winning is not losing." The signal this sends operationally: share failures across the organization, let the learning from one failed hypothesis seed the next one, and resist the natural manager instinct to treat a 10% hit rate as a performance problem. It's what makes high-velocity incrementalism compound — the feedback loop runs on the 90%, not the 10%.

Use when: Any organization building a high-volume experimentation program.

Skip when: Safety-critical domains where failure means harm, not information.

Engineer explicit rituals for sharing failed experiments; put the learning yield, not the win rate, on the team dashboard.

“Failure is the status quo. Because when nine out of 10 things fail, you're much more likely to run into a failure than not.” Stefan Thomke

“I always say not winning is not losing. It's not the same thing... there's a learning objective here. You can learn from these failures.” Stefan Thomke

Durability: Durable operating principle paired with an enduring cultural failure mode (punishing failure).

Principle

In domains with novelty, intuition is wrong about nine times out of ten

When a decision involves novelty, the base rate of being correct on first intuition is ~10%; acting on gut without testing therefore destroys expected value.

Thomke attributes the 9/10 wrong rate directly to Booking's internal experimental record. The same pattern shows up at Microsoft (Bing), Google, Netflix, and Amazon, all of which run the same playbook because the same law holds. The practical consequence: shift the default from "ship the intuition" to "test the intuition, and expect failure".
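The arithmetic behind "expect failure" can be sketched in a few lines. This assumes each hypothesis is an independent draw with a 10% chance of being right; the 10% comes from the 9/10 figure above, while independence and the function names are simplifying assumptions for illustration.

```python
def p_at_least_one_win(n_tests: int, hit_rate: float = 0.10) -> float:
    """Chance that at least one of n independent tests is a winner."""
    return 1.0 - (1.0 - hit_rate) ** n_tests

def expected_wins(n_tests: int, hit_rate: float = 0.10) -> float:
    """Average number of winners across n tests."""
    return n_tests * hit_rate

for n in (1, 5, 10, 22, 50):
    print(f"{n:>3} tests: P(at least one win) = {p_at_least_one_win(n):.3f}, "
          f"expected wins = {expected_wins(n):.1f}")
```

Under these assumptions a single idea is a long shot, ten ideas give roughly a two-in-three chance of at least one winner, and it takes about 22 before the chance passes 90%. That is one way to read why volume, rather than better guessing, is the lever.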

Use when: Any team making product/pricing/UX decisions where the change involves novelty.

Skip when: Well-solved engineering or operational decisions with abundant historical data.

Recalibrate your decision process around a 10% base rate — if you're not running tests, you're not making decisions, you're making bets at negative expected value.

“What Booking has learned over the years is that they're wrong about nine out of 10 times. That is, you have a hypothesis and the hypothesis seems very reasonable and you go out and test it, and then something really surprising happens.” Stefan Thomke

“The default is really that you're much more likely to be wrong than to be right.” Stefan Thomke

Durability: Durable — holds anywhere the decision involves novelty.

Frameworks

Reusable systems and operating models — including when they help and when they break.

Framework

High-velocity incrementalism: small changes, fast cadence, massive scale

The conventional "small change = small impact, big change = big impact" mental model is wrong for digital products; high-frequency incrementalism can deliver outsized compounded value, sometimes larger than any single big bet.

Thomke cites the Bing headline-text move: a trivial UI change that produced >$100M in additional annual US revenue and became Bing's largest-ever experiment win. Booking runs ~1,000 concurrent tests and 20-30K/year, producing quadrillions of landing-page variants. The framework is: many small changes × fast cadence × large scale = large compounded performance change. Occasional big bets still happen, but the bulk is incrementalism.

Run many small tests (scope small enough to ship quickly and measure cleanly)
Move fast (cycle time of a week or less)
Operate at scale (traffic volume sufficient for statistical power on small deltas)
Accept a ~10% hit rate but compound the hits
Occasionally make larger bets but do not let them dominate the portfolio
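A rough sketch of how the hits compound. The 10% hit rate is the figure used throughout this briefing; the 0.5% lift per winning test and the idea that wins stack multiplicatively and independently are assumptions chosen only to show the shape of the curve, not numbers from the episode.

```python
def compounded_lift(n_tests: int, hit_rate: float, lift_per_win: float) -> float:
    """Expected cumulative gain if each winning test adds lift_per_win.

    Uses the expected number of wins (n_tests * hit_rate) and assumes wins
    stack multiplicatively and independently, which real tests rarely do.
    """
    wins = n_tests * hit_rate
    return (1.0 + lift_per_win) ** wins - 1.0

for n in (10, 100, 1000):
    gain = compounded_lift(n, hit_rate=0.10, lift_per_win=0.005)
    print(f"{n:>5} tests -> {gain:.1%} cumulative lift")
```

Under these assumptions ten tests buy about half a percent, a hundred buy about five percent, and a thousand buy well over half again on the baseline. The individual wins stay small; the count is what moves the total.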

Use when: Digital products with statistically-meaningful traffic and an existing experimentation infrastructure.
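What "statistically-meaningful traffic" means can be estimated before building anything. The sketch below uses the standard normal-approximation formula for comparing two conversion rates with a two-sided test. The 3% base rate, the lifts, and the function name `users_per_arm` are hypothetical inputs for illustration.

```python
from math import ceil
from statistics import NormalDist

def users_per_arm(base_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed in each arm of a two-arm conversion test.

    Normal-approximation formula for two proportions, two-sided test.
    base_rate and relative_lift are inputs you supply.
    """
    p1 = base_rate
    p2 = base_rate * (1.0 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

for lift in (0.10, 0.02, 0.005):
    print(f"{lift:.1%} relative lift on a 3% base rate: "
          f"{users_per_arm(0.03, lift):,} users per arm")
```

With these inputs a 10% relative lift needs on the order of tens of thousands of users per arm, while a 0.5% lift needs tens of millions. Small deltas are only measurable at very large scale, which is why the framework ties incrementalism to traffic volume.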

Skip when: Markets where one bet materially allocates capital — early-stage startups pre-PMF; offline businesses with sparse traffic.

Stop thinking "big change = big impact." Volume and speed of small experiments compound into breakthrough-level results.

“The game here is really what I call high velocity incrementalism. So what you do is you can make a lot of small changes, but you have to do 'em very fast. And you have to do them at very large scale.” Stefan Thomke

“That change alone resulted in more than a hundred million of additional revenue in the United States alone.” Stefan Thomke, on the Bing headline-text experiment

Durability: Durable framework; the specific volumes scale with business size.

Framework

Three-problem model of online travel: Traffic → Conversion → Experience

The post-traffic experience (conversion + completed trip) is where compounding returns live — pay Google once to get the user; the durable moat is not having to pay Google again.

Problem 1 — Traffic: solvable with money (a large P&L line paid to Google/Bing). Problem 2 — Convert: give the best possible experience; experimentation lives here; "conversion is where the proverbial rubber hits the road." Problem 3 — Travel experience: customer-support teams resolve overbookings and friction in real time, so the user has a great trip and returns organically. Skipping any step breaks the loop.

Problem 1 — Traffic: paid + organic acquisition
Problem 2 — Convert: experimentation-driven optimisation of the on-site funnel
Problem 3 — Deliver the experience: customer support + partner management so the user returns organically
Metric: the long-term KPI is conversion × retention, not session-level revenue

Use when: Any paid-traffic marketplace or consumer product with a re-purchase loop.

Skip when: Enterprise sales, where the three-stage model does not map cleanly.

Don't optimise conversion in isolation — the return on conversion work compounds only if post-purchase experience earns the next session organically.

“Booking really has to solve three big problems. The first problem is they have to get traffic... that problem can be solved with money. The second problem is once you get to their site, they need to convert you... the third problem, of course is after you've booked it, you're actually going to travel.” Stefan Thomke

“Then you come back and they don't have to pay money to Google to get you to come back.” Stefan Thomke

Durability: Durable; specific to marketplaces / consumer businesses but the template applies widely.

Signals

What appears to be shifting, for whom it matters, and what happens if you ignore it.

Signal

Physical retailers are now running experiments using small-sample statistical methods

If you think experimentation is a digital-only discipline, your competitive benchmark is out of date — the technique is now live in retail and needs different (harder) statistics.

Thomke cites Jim Manzi and work he has done with Kohl's-style retailers: they can't run million-store tests, but they use matched-control-store designs and small-sample statistics. These techniques — matched markets, DiD, synthetic controls — are actually more sophisticated than the large-sample A/B tests digital teams run. Thomke notes customers "are probably not aware of it when you go shopping."
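A minimal difference-in-differences sketch shows the basic move behind a matched-control-store design. The store figures are invented, and this is not a description of any retailer's actual method: each treated store is paired with a similar control store, and the effect estimate is the treated store's change minus the control store's change.

```python
from statistics import fmean

# Hypothetical weekly sales (in thousands) before and after a change.
# Each treated store is paired with a similar control store.
pairs = [
    # (treated_before, treated_after, control_before, control_after)
    (100.0, 108.0, 98.0, 101.0),
    (80.0, 85.0, 82.0, 83.0),
    (120.0, 131.0, 118.0, 123.0),
    (95.0, 99.0, 97.0, 98.0),
]

def did_estimates(rows):
    """Per-pair difference-in-differences: treated change minus control change."""
    return [(ta - tb) - (ca - cb) for tb, ta, cb, ca in rows]

effects = did_estimates(pairs)
print("per-pair effects:", effects)
print("average effect:", fmean(effects))
```

Subtracting the control store's change removes whatever moved both stores at once (season, weather, promotions). With only a handful of pairs, deciding whether the average is real needs a small-sample method such as a paired t-test or a permutation test, which is where the extra statistical difficulty comes in.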

Use when: Retail, CPG, franchised services — teams considering whether experimentation is feasible on their footprint.

Skip when: Teams already using matched-control-store designs — signal is already internalised.

Don't assume your low-traffic or offline business is out of reach for experimentation — the techniques exist, they are just statistically more demanding.

“It's not just about companies with digital roots, it's also about companies that don't have digital roots that are growing digital. It's also about companies that are running experiments in brick-and-mortar environments.” Stefan Thomke

“The techniques that you use in small sample environments are much more sophisticated than the kind of statistical methods that you use in large sample environments.” Stefan Thomke

Durability: Signal with long shelf life; the adoption curve is still early.

Lessons still worth keeping

Useful takeaways that did not fully clear the bar for durable principle status.

Lesson

A trivial UI change can produce nine-figure revenue impact — do not pre-prioritise by visible magnitude

Small-looking ideas must be ranked by test cost, not by visible magnitude; a culture that filters by "does this feel important?" kills enormous value.

The idea came from a Bing employee; Bing runs 15,000+ tests a year; the idea sat in the queue for six months because no one thought it was important. The employee eventually wrote the change themselves and shipped it live. Microsoft's "too-good-to-be-true" anomaly alarm fired (the default assumption when a result is shockingly large is a bug). Re-run confirmed the result. It became Bing's single largest-ever experiment win.

Use when: Growth teams building experiment prioritisation queues.

Skip when: Non-digital contexts where experiment iteration cost is large enough that intuition-based prioritisation is forced.

Rank experiments by test cost, not by how "important" they look — the biggest wins historically came from ideas that looked trivial.

“Why don't we actually take some of the text that's below the headlines and just move it up to the headline... That change alone resulted in more than a hundred million of additional revenue in the United States alone. It ended up being actually the largest and most successful experiment that Bing has ever run.” Stefan Thomke

“The idea was essentially lingering for more than six months until finally the employee took matters into his own hands and just decided to make a few changes in the code.” Stefan Thomke

Durability: Durable; the cognitive bias it exploits does not go away.

The Plays

Try these this week

Verb-first executable actions — each one tied to a stated outcome in the episode.

Broadcast-and-kill experiment protocol: democratize launch + veto, replace the review committee

“Anybody can launch an experiment without permission from management, but anybody in the company can also kill an experiment that somebody else launched. They call it pushing the nuclear button.”
— Stefan Thomke

Ongoing. Ramp takes years — Booking built this culture from the mid-1990s.

1. Grant every employee launch rights
No management approval required to start an experiment. Policy owned by the experimentation platform team, not individual managers.
2. Require mandatory internal broadcast before launch
Experiment proposal (hypothesis, target metric, scope, rollout %, timing) goes to a visible internal channel. Any employee can review and ask questions.
3. Grant every employee kill rights (the 'nuclear button')
Any employee can halt any running experiment. Kill events trigger a post-mortem, not a punishment — the default assumption is the killer saw something real.
4. Run a 'too-good-to-be-true' anomaly monitor on every experiment
Flag results outside plausible effect-size range as suspected bugs. Re-run before declaring a win. Modeled on Microsoft's TGTBT alarm, which caught bugs but also surfaced the Bing headline-text $100M result.
5. Publish failures by default
Failed experiments are shared with the same weight as wins; the learning, not the outcome, is the deliverable. Drives the next hypothesis into the pipeline.
6. Reserve 'radical' experiments for sensitive timing and scope conversations
Non-incremental experiments (e.g. Booking's Google-page test during Christmas peak) warrant a CEO-level timing/scope conversation — the only formal check that remains.
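Step 4 can be sketched as a simple triage rule. This is a hypothetical illustration, not Microsoft's or Booking's implementation: the 5% `plausible_max` threshold, the `Result` fields, and the labels are all assumptions, and a real monitor would also check statistical significance and guardrail metrics.

```python
from dataclasses import dataclass

@dataclass
class Result:
    name: str
    control_rate: float   # conversion rate observed in control
    variant_rate: float   # conversion rate observed in the variant

def relative_lift(r: Result) -> float:
    """Variant's change relative to control, e.g. 0.02 means +2%."""
    if r.control_rate <= 0:
        raise ValueError("control_rate must be positive")
    return (r.variant_rate - r.control_rate) / r.control_rate

def triage(r: Result, plausible_max: float = 0.05) -> str:
    """Label a result; anything beyond plausible_max is held for a re-run.

    plausible_max is a hypothetical threshold, not a figure from the episode.
    """
    lift = relative_lift(r)
    if abs(lift) > plausible_max:
        return "rerun"          # too good (or too bad) to be true: suspect a bug
    return "win_candidate" if lift > 0 else "no_win"

for r in (Result("headline", 0.030, 0.0336),
          Result("cta-copy", 0.030, 0.0306),
          Result("banner", 0.030, 0.0297)):
    print(r.name, triage(r))
```

The design choice worth keeping is that a shockingly large result is routed to a re-run rather than to a celebration or a dismissal. In the Bing story the re-run is what separated a real win from a bug.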

Scripts

broadcast-template

Hypothesis: [X]. Target metric: [Y]. Scope: [% of users, geo, segment]. Timing: [start/end]. Risks: [list]. Kill criteria: [thresholds on guardrail metrics]. Questions/objections welcome by [time] before launch.
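If the broadcast is posted by a tool rather than typed by hand, the template could be carried as a small record that refuses to post while fields are empty. The class name, field names, and the completeness rule below are assumptions for illustration; the episode describes the practice, not a schema.

```python
from dataclasses import dataclass, field

@dataclass
class Broadcast:
    hypothesis: str
    target_metric: str
    scope: str            # e.g. "% of users, geo, segment"
    timing: str           # start/end
    risks: list = field(default_factory=list)
    kill_criteria: dict = field(default_factory=dict)  # guardrail -> threshold
    objections_by: str = ""

    def missing_fields(self):
        """Names of template fields that are still empty."""
        return [name for name, value in vars(self).items() if not value]

    def ready_to_post(self):
        """True only when every field in the template has been filled in."""
        return not self.missing_fields()

draft = Broadcast(
    hypothesis="Shorter headline raises click-through",
    target_metric="click-through rate",
    scope="5% of users, one market",
    timing="two weeks",
)
print(draft.ready_to_post(), draft.missing_fields())
```

Forcing the risks, kill criteria, and objection deadline to be filled in before posting keeps the broadcast useful to the people who hold kill rights: they need the guardrail thresholds to know when pushing the button is warranted.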

Before you start

· Experimentation platform capable of 4-digit concurrent tests without interference
· Anomaly / too-good-to-be-true monitor on every experiment
· A cultural posture that failure is the status quo (or you cannot survive a 90% failure rate)
· Guardrail-metric definitions shared across the company

experimentation-ops · growth-infrastructure · org-design · growth-stage · scale · hyper-scale

Data-led geographic expansion: follow where your users already are, not where the market is

“They looked at the German market, which is a very big travel market. And of course the first logical decision that you would make is when you go to Germany and set up somewhere, you'd go to like Berlin or one of the other big cities. But they looked at the data. It turned out that actually the place where a lot of their Dutch travelers go is a small skiing village. So by just looking at the data, they ended up opening the first office in that small skiing village.”
— Stefan Thomke

Months, not quarters — a single data pull + 1-2 supply-side site visits (proposed)

1. Pull your own customer data for the candidate expansion market
Do not start from TAM reports or competitor presence. Start with: where are my existing users spending money / time / nights in the candidate geography?
2. Rank destinations by actual customer volume, not by market-size convention
Small towns with disproportionate inbound from your user base will beat capitals on supply-side relevance. Booking picked a skiing village over Berlin on this basis.
3. Site the first office at the top of that ranked list
Locate in or adjacent to the demand pocket. Supply-side partnerships sign faster because your customer flow is visible to local operators.
4. Iterate the test with on-the-ground supply signups and conversion
If your supply-side sign-up rate in the chosen town exceeds the cohort-average by 2x+, the heuristic worked; if not, redo the data pull — you may have been measuring traffic, not value.
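Steps 1, 2 and 4 reduce to a count, a sort, and a threshold check. The sketch below uses invented booking records and placeholder town names; the record layout and function names are assumptions for illustration.

```python
from collections import Counter

# Hypothetical booking records: (customer_home_country, destination_town).
bookings = [
    ("NL", "ski-village"), ("NL", "ski-village"), ("NL", "ski-village"),
    ("NL", "capital"), ("NL", "capital"),
    ("NL", "port-city"),
    ("BE", "capital"),
]

def rank_destinations(records, home_country):
    """Destinations ordered by how many of our own customers book there."""
    counts = Counter(dest for home, dest in records if home == home_country)
    return counts.most_common()

def heuristic_worked(signup_rate, cohort_average, multiple=2.0):
    """Step 4 check: did the chosen town beat the cohort average by 2x or more?"""
    return signup_rate >= multiple * cohort_average

ranking = rank_destinations(bookings, "NL")
print(ranking)
print(heuristic_worked(signup_rate=0.30, cohort_average=0.12))
```

The ranking is driven by the company's own customer flow, so a small town can outrank the capital. The step 4 check then guards against a town that merely attracts bookings without converting into supply-side signups.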

Before you start

· Cross-border booking/usage data at city or town granularity
· Willingness to contradict conventional-wisdom capital-city strategy
· Small enough team that site selection is reversible within a year

market-expansion · supply-side-strategy · geographic-growth · early-stage · growth-stage

Decision Moments

Actual decisions, real outcomes

Specific decisions narrated in the episode with their outcomes and transferable lessons.

Ron Johnson, having co-created the Apple Store, was hired to transplant his retail playbook to JCPenney. He rolled it out without structured testing of the transfer to a different brand, customer, and channel context.

Did: Applied the Apple playbook (pricing changes, store formats, brand language) at JCPenney without experimentation on whether it would transfer.
Outcome: JCPenney was fighting for its survival; Johnson exited.

Celebrity operator reputation encodes correlation with past outcomes, not causation in the new context. Never transplant a playbook across contexts without the experiment that would prove it transfers.

Part of an emerging decision pattern across multiple episodes

Booking.com needed to pick the location of its first office outside the Netherlands. The obvious choice for the German market was Berlin or another major city.

Did: They pulled their own customer data and opened the first German office in a small skiing village — because that's where their Dutch travelers were actually going.
Outcome: The decision compounded into supply-side relationships and an operating edge that defied conventional-wisdom geography.

Let revealed customer behavior, not market-size conventional wisdom, pick your first expansion location.

Part of an emerging decision pattern across multiple episodes

A Booking.com director of testing wanted to repaint the landing page to look like a Google search page — during the Christmas peak travel season, on a significant share of real users, at a time when Google was moving into travel.

Did: The case dilemma: encourage the director to proceed, intervene to modify scope/timing, or stand aside on principle of the testing culture.
Outcome: Thomke uses this as the teaching question — the implicit resolution is that radical experiments still deserve CEO-level timing/scope conversations even in a democratized test culture.

Radical tests are the one place where "leave the organization to self-govern" breaks; timing and scope of high-blast-radius experiments warrant explicit executive engagement.

Part of an emerging decision pattern across multiple episodes

A Bing employee had an idea to move text from below a headline up into the headline. The idea sat in the experiment queue for six months because no one thought it was important.

Did: The employee wrote the change themselves and shipped it live without waiting further. Microsoft's too-good-to-be-true alarm fired (default assumption: bug). They re-ran; result was real.
Outcome: The experiment produced >$100M in additional US revenue and became the largest, most successful experiment Bing had ever run.

Queue-prioritization by visible magnitude systematically under-weights the experiments that actually deliver outsized wins — give employees the rails to bypass prioritization theatre.

Part of an emerging decision pattern across multiple episodes

Tensions surfaced

Contradictions and trade-offs the episode raises — judgment calls a thoughtful operator has to navigate.

Tension

Tension: Test-everything culture vs strong-product-conviction

Test-everything and conviction-first product cultures both work — but they produce structurally different products.

Mechanism: test-everything optimizes toward the local maximum (every shipped change is an improvement). Conviction-first jumps between maxima (radical changes that test cultures would reject). The two operating modes produce different product trajectories.

“Test everything is the right doctrine for incremental improvement. Conviction-first is the right doctrine for category creation. They produce different companies.” Booking.com narrator, paraphrased

Durability: Durable. The test-vs-conviction pattern is structural to product doctrine.

Productive tension with named opposite-archetypes.

Corpus connection

Where this episode fits for retrieval

What kinds of decisions this briefing is best pulled into.

Primary decisions

• experimentation
• org-design
• decision-making

Temporal flag

timeless