Principles
Durable claims that survive beyond the speaker's biography — each with explicit limits, transferability judgment, and evidence.
Principle
Transparency is how an experimentation org self-governs at scale
At ~1,000 concurrent experiments, no committee can gate everything — so transparency + distributed veto becomes the governance model, not a culture nicety.
Thomke frames this explicitly: "You can't have a committee overlook every single experiment if you want to operate at that scale — you need an organization that has these kinds of checks and balances and self-governance." Anyone launches without management permission. Anyone can push the "nuclear button" to kill. And anyone can critique pre-launch. The result is checks-and-balances without a bottleneck.
Use when: Scaling experimentation orgs above ~100 experiments/month where committees become the bottleneck.
Skip when: Small teams where a single reviewer can still see everything, or safety-critical work where a distributed veto is insufficient.
If you want velocity, replace the experiment review committee with (a) mandatory internal broadcast and (b) universal kill rights. Both, not one.
“Anybody can launch an experiment without permission for management, but anybody in the company can also kill an experiment that somebody else launched. They call it pushing the nuclear button.” Stefan Thomke
“When you're trying to launch an experiment, you actually have to broadcast it first inside the organization so other people can look at your experiment, ask questions, criticize it.” Stefan Thomke
Durability: Durable governance pattern for high-throughput test orgs.
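The broadcast-then-launch flow with universal kill rights can be sketched as a small state machine. This is a minimal illustration, not Booking's actual system: the `Registry` class, its method names, and the experiment and user names are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Status(Enum):
    BROADCAST = "broadcast"  # announced internally, open to critique
    LIVE = "live"
    KILLED = "killed"


@dataclass
class Experiment:
    name: str
    owner: str
    status: Status = Status.BROADCAST
    critiques: list = field(default_factory=list)
    killed_by: Optional[str] = None


class Registry:
    """Hypothetical in-memory registry. Note what is absent: an approver."""

    def __init__(self):
        self.experiments = {}

    def broadcast(self, name, owner):
        # Step (a): every experiment is announced before it can run.
        self.experiments[name] = Experiment(name, owner)

    def critique(self, name, reviewer, note):
        # Anyone may question or criticize a broadcast experiment.
        self.experiments[name].critiques.append((reviewer, note))

    def launch(self, name):
        # No management sign-off; the only precondition is the broadcast.
        exp = self.experiments.get(name)
        if exp is None or exp.status is not Status.BROADCAST:
            raise RuntimeError(f"{name!r} must be broadcast before launch")
        exp.status = Status.LIVE

    def kill(self, name, requester):
        # Step (b): the "nuclear button" is open to everyone, not just the owner.
        exp = self.experiments[name]
        if exp.status is not Status.LIVE:
            raise RuntimeError(f"{name!r} is not live")
        exp.status = Status.KILLED
        exp.killed_by = requester


registry = Registry()
registry.broadcast("checkout-banner", owner="alice")
registry.critique("checkout-banner", "bob", "Overlaps with the pricing test?")
registry.launch("checkout-banner")
registry.kill("checkout-banner", requester="carol")  # carol is not the owner
print(registry.experiments["checkout-banner"].status.name)
```

The design choice worth noticing is that `launch` checks only for a prior broadcast and `kill` checks only that the experiment is live; neither consults a role or an approval list.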
Principle
Big data gives correlations; experiments give causality — use both
Big data and experimentation are complementary: big data surfaces testable hypotheses; experimentation delivers causality. Teams that do one without the other ship bad decisions.
Thomke names three failure modes of decision-making under novelty: (1) relying on past experience, which fails because novelty breaks priors; (2) relying on correlations in big data, which fails because correlation is not causation; (3) copying other contexts, which fails because context does not transfer (Ron Johnson's attempt to bring Apple's retail playbook to JCPenney). The right move is to use big data for hypothesis generation (patterns → testable hypotheses) and experiments for validation (cause → effect).
Use when: Any data team or product team feeding decisions with dashboard output.
Skip when: Research contexts where causal-inference methods (difference-in-differences, instrumental variables, randomized controlled trials) are already the default, so the gap does not exist.
Use big data to generate hypotheses; use experiments to test them. Do not let a correlation dashboard close a decision loop on its own.
“We have a lot of examples where things correlate very highly and there is no causal relationship... there's a big correlation, for example, between hand size and life expectancy... the underlying causal variable is gender.” Stefan Thomke
“Big data gives us correlations, but it doesn't give us causality.” Stefan Thomke
Durability: Durable; re-derived in every generation of analytics teams.
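Thomke's hand-size example can be reproduced with a small simulation. The sketch below uses entirely synthetic data: the means, spreads, and sample size are invented for illustration, and within each gender hand size and lifespan are generated independently, so any pooled correlation comes from gender alone.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic population: gender shifts both variables, but within a gender
# hand size and lifespan are drawn independently (no causal link).
people = []
for _ in range(5000):
    female = rng.random() < 0.5
    hand_cm = rng.gauss(17.0 if female else 19.0, 0.8)
    life_years = rng.gauss(81.0 if female else 76.0, 6.0)
    people.append((female, hand_cm, life_years))

# Pooled correlation, as a dashboard that ignores gender would report it.
overall = pearson([p[1] for p in people], [p[2] for p in people])

# The same correlation computed separately inside each gender.
within = {}
for label, flag in (("women", True), ("men", False)):
    group = [p for p in people if p[0] == flag]
    within[label] = pearson([p[1] for p in group], [p[2] for p in group])

print(f"overall: {overall:+.2f}")
for label, r in within.items():
    print(f"{label}: {r:+.2f}")
```

With these made-up parameters the pooled correlation is clearly negative while the within-group correlations hover near zero. Stratifying exposes this confounder only because it was measured; a randomized experiment is what protects against the ones nobody thought to record.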
Principle
Not winning is not losing — failures feed the next hypothesis
When the base rate of success is ~10%, a culture that punishes failure punishes its own primary information source; failures must be explicitly framed as learning artifacts, not losses.
Thomke is explicit: "Failure is the status quo. Not winning is not losing." Operationally, this means sharing failures across the organization, letting the learning from one failed hypothesis seed the next, and resisting the natural managerial instinct to treat a 10% hit rate as a performance problem. This framing is what makes high-velocity incrementalism compound: the feedback loop runs on the 90% of experiments that fail, not the 10% that win.
Use when: Any organization building a high-volume experimentation program.
Skip when: Safety-critical domains where failure means harm, not information.
Engineer explicit rituals for sharing failed experiments; put the learning yield, not the win rate, on the team dashboard.
“Failure is the status quo. Because when nine out of 10 things fail, you're much more likely to run into a failure than not.” Stefan Thomke
“I always say not winning is not losing. It's not the same thing... there's a learning objective here. You can learn from these failures.” Stefan Thomke
Durability: Durable operating principle paired with an enduring cultural failure mode (punishing failure).
Principle
In domains with novelty, intuition is wrong about nine times out of ten
When a decision involves novelty, the base rate of being correct on first intuition is ~10%; acting on gut without testing therefore destroys expected value.
Thomke attributes the nine-in-ten error rate directly to Booking's internal experimental record. The same pattern shows up at Bing, Microsoft, Google, Netflix, and Amazon, all of which run the same playbook because the same law holds. The practical consequence: shift the default from "ship the intuition" to "test the intuition, and expect failure".
Use when: Any team making product/pricing/UX decisions where the change involves novelty.
Skip when: Well-solved engineering or operational decisions with abundant historical data.
Recalibrate your decision process around a 10% base rate — if you're not running tests, you're not making decisions, you're making bets at negative expected value.
“[What] Booking has learned over the years is that they're wrong about nine out of 10 times. That is, you have a hypothesis and the hypothesis seems very reasonable and you go out and test it, and then something really surprising happens.” Stefan Thomke
“The default is really that you're much more likely to be wrong than to be right.” Stefan Thomke
Durability: Durable — holds anywhere the decision involves novelty.
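The expected-value claim can be made concrete with back-of-envelope arithmetic. The sketch below takes the 10% base rate from the text and, purely for illustration, assumes a symmetric payoff of +1 for shipping a good change and -1 for shipping a bad one, a cost of 0.02 per test, and a test that identifies winners perfectly. None of the payoff figures come from Thomke, and the function names are hypothetical.

```python
def ev_ship_untested(p_right, gain, loss):
    """Expected value per idea when every idea ships on intuition alone."""
    return p_right * gain - (1 - p_right) * loss


def ev_test_first(p_right, gain, test_cost):
    """Expected value per idea when every idea is tested and only winners ship.

    Assumes an idealized test with no false positives or false negatives.
    """
    return p_right * gain - test_cost


P_RIGHT = 0.10    # base rate from the text
GAIN = 1.0        # hypothetical payoff of a good change
LOSS = 1.0        # hypothetical cost of a bad change
TEST_COST = 0.02  # hypothetical cost of running one test

untested = ev_ship_untested(P_RIGHT, GAIN, LOSS)
tested = ev_test_first(P_RIGHT, GAIN, TEST_COST)

# Largest loss per bad change at which untested shipping still breaks even.
break_even_loss = P_RIGHT * GAIN / (1 - P_RIGHT)

print(f"ship untested: {untested:+.2f} per idea")
print(f"test first:    {tested:+.2f} per idea")
print(f"untested shipping loses money once loss > {break_even_loss:.3f}")
```

Under these assumptions shipping untested averages -0.80 per idea and testing first averages +0.08. The untested case turns positive only if a bad change costs less than about a ninth of what a good one earns.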