📊 Reading A/B tests well is 50% discipline, 50% not fooling yourself

PM A/B Test Analysis Guide
(2026 Edition)

7-point checklist for reading results, 5 segmentation lenses, 6 common biases, and 5 decision rules for shipping or killing.

7-Point Reading Checklist

Did we reach pre-committed sample size? If not, it's not done yet.

Is the effect statistically significant? (p-value < 0.05)

Is the effect size meaningful? (Practical significance, not just statistical)

Did guardrail metrics stay healthy? Winning primary + broken guardrail = net loss.

Does the effect hold across segments? If only 1 segment drives it, that's important context.

Are there novelty effects that might fade? (Run at least 2 full weekly cycles to confirm)

Is the A/A check clean? (An A/A test during the run, where both groups get the same experience, should show no significant difference)
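The first three checklist questions can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: it assumes the primary metric is a conversion rate, uses a pooled two-proportion z-test (one common choice among several), and the function names, counts, and the `min_abs_lift` threshold are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


def read_primary_metric(conv_a, n_a, conv_b, n_b,
                        planned_n_per_arm, min_abs_lift, alpha=0.05):
    """Answer checklist questions 1-3 for one conversion metric."""
    abs_lift = conv_b / n_b - conv_a / n_a
    p_value = two_proportion_p_value(conv_a, n_a, conv_b, n_b)
    return {
        "reached_sample_size": min(n_a, n_b) >= planned_n_per_arm,
        "statistically_significant": p_value < alpha,
        "practically_significant": abs(abs_lift) >= min_abs_lift,
        "abs_lift": abs_lift,
        "p_value": p_value,
    }


# Made-up counts: control 1,000/10,000 vs variant 1,100/10,000.
result = read_primary_metric(1000, 10_000, 1100, 10_000,
                             planned_n_per_arm=10_000, min_abs_lift=0.005)
print(result)
```

Keeping the three answers as separate flags mirrors the checklist: a result can be statistically significant and still fail the sample-size or practical-significance question.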

5 Segmentation Lenses

1. New vs existing users: often move in opposite directions

2. Mobile vs web: mobile-first products often ship a different experience to each

3. Geographic: Tier-1 vs Tier-2/3 may behave differently

4. Acquisition channel: organic vs paid users have different baselines

5. Cohort (signup date): recent cohorts can differ from old ones
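A segment breakdown along any of these lenses can be sketched as follows. The row layout `(segment, arm, users, conversions)`, the arm labels, and the numbers are assumptions made for illustration; real experiment exports will look different.

```python
def lift_by_segment(rows):
    """Per-segment conversion rates and absolute lift.

    rows: iterable of (segment, arm, users, conversions),
    where arm is "control" or "variant" (hypothetical labels).
    """
    totals = {}
    for segment, arm, users, conversions in rows:
        seg = totals.setdefault(segment, {"control": [0, 0], "variant": [0, 0]})
        seg[arm][0] += users
        seg[arm][1] += conversions

    report = {}
    for segment, arms in totals.items():
        control_rate = arms["control"][1] / arms["control"][0]
        variant_rate = arms["variant"][1] / arms["variant"][0]
        report[segment] = {
            "control_rate": control_rate,
            "variant_rate": variant_rate,
            "abs_lift": variant_rate - control_rate,
        }
    return report


# Made-up data where new and existing users move in opposite directions.
rows = [
    ("new", "control", 5000, 500),
    ("new", "variant", 5000, 600),
    ("existing", "control", 5000, 750),
    ("existing", "variant", 5000, 700),
]
for segment, stats in lift_by_segment(rows).items():
    print(segment, round(stats["abs_lift"], 4))
```

Each extra slice is another comparison, so a lift that appears in only one segment deserves the same scepticism as any other multiple-comparison result.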

6 Common Biases to Avoid

⚠️ Peeking early and stopping when you see significance: p-hacking

⚠️ Running multiple tests, picking the one that 'won': multiple-comparison problem

⚠️ Attributing lift to the feature when seasonality explains it: correlation vs causation

⚠️ Ignoring guardrails that moved: primary won, but at what cost?

⚠️ Reading a flat test as 'no effect' vs 'effect too small to detect': different conclusions

⚠️ Using the test as confirmation of your hypothesis rather than a test of it
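The multiple-comparison problem is easy to quantify. The sketch below shows how the chance of at least one false positive grows with the number of independent tests, and applies a Bonferroni correction. Bonferroni is one common and conservative fix, chosen here for illustration rather than taken from this guide; the test names and p-values are made up.

```python
def family_wise_error_rate(alpha, num_tests):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_significant(p_values, alpha=0.05):
    """Flag each test as significant only if p < alpha / number of tests."""
    threshold = alpha / len(p_values)
    return {name: p < threshold for name, p in p_values.items()}


# With 10 tests at alpha = 0.05, a spurious "winner" is quite likely.
print(round(family_wise_error_rate(0.05, 10), 3))

# Hypothetical p-values from three variants tested at once.
p_values = {"headline_copy": 0.03, "button_colour": 0.20, "pricing_layout": 0.004}
print(bonferroni_significant(p_values))
```

Here `headline_copy` would pass a naive 0.05 threshold but not the corrected one of roughly 0.0167.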

5 Decision Rules

1. Primary wins significantly + guardrails healthy → ship

2. Primary flat + guardrails healthy → don't ship, but learnings are valuable

3. Primary wins but a guardrail breaks → don't ship, investigate trade-off

4. Primary wins in 1 segment only → ship to that segment if big enough; don't generalise

5. Result is inconclusive (underpowered) → decide: extend the test to a newly pre-committed sample size, rerun at higher N, or call based on judgment
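The five rules can be written down as a small decision function, which forces the edge cases into the open. This is a sketch under assumptions: the function name, the flag names, and the order in which rules are checked are hypothetical, and two cases the rules do not cover (a flat primary with a broken guardrail, and a winning segment that is too small) are mapped to "don't ship" here as a conservative guess.

```python
from enum import Enum


class Decision(Enum):
    SHIP = "ship"
    SHIP_TO_SEGMENT = "ship to the winning segment only"
    DONT_SHIP_LEARN = "don't ship, record the learnings"
    DONT_SHIP_INVESTIGATE = "don't ship, investigate the trade-off"
    INCONCLUSIVE = "extend, rerun at higher N, or make a judgment call"


def decide(primary_wins, guardrails_healthy, adequately_powered,
           single_segment_only=False, segment_big_enough=False):
    """Map experiment outcomes to a decision, following the five rules."""
    if primary_wins and not guardrails_healthy:
        return Decision.DONT_SHIP_INVESTIGATE          # rule 3
    if primary_wins and single_segment_only:           # rule 4
        # Assumption: a winning segment that is too small is not shipped.
        return (Decision.SHIP_TO_SEGMENT if segment_big_enough
                else Decision.DONT_SHIP_LEARN)
    if primary_wins:
        return Decision.SHIP                           # rule 1
    if not adequately_powered:
        return Decision.INCONCLUSIVE                   # rule 5
    # Rule 2; a flat primary with a broken guardrail also lands here.
    return Decision.DONT_SHIP_LEARN


print(decide(primary_wins=True, guardrails_healthy=True,
             adequately_powered=True).value)
```

Checking the guardrail rule first means a broken guardrail can never be overridden by a strong primary result.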

FAQ

What p-value threshold should PMs use for A/B tests?

0.05 is the industry default significance threshold. For high-stakes tests (major redesigns, monetisation changes), use 0.01: you want higher certainty before shipping. For quick iteration on low-risk features, 0.1 is sometimes acceptable. The trade-off: a stricter threshold means fewer false positives but a larger required sample and longer runs.
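The cost of a stricter threshold can be made concrete with the standard normal-approximation sample size formula for comparing two proportions. The 80% power default, the 10% baseline, and the 11% target below are assumptions for illustration; the guide does not specify them.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, target_rate, alpha=0.05, power=0.80):
    """Approximate users needed per arm for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = (baseline_rate * (1 - baseline_rate)
                + target_rate * (1 - target_rate))
    effect = target_rate - baseline_rate
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Same effect (10% -> 11%), three thresholds: stricter alpha needs more users.
for alpha in (0.10, 0.05, 0.01):
    print(alpha, sample_size_per_arm(0.10, 0.11, alpha=alpha))
```

Under these assumptions, moving from 0.05 to 0.01 raises the required sample by roughly half, which is where the longer run time comes from.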

How long should A/B tests run?

The run length should be set in advance by your sample size calculation. For most consumer products, 7–14 days is typical (covers weekday/weekend patterns). Shorter tests miss cyclical effects; longer tests delay decisions unnecessarily. The cardinal sin: extending a test mid-run to 'wait for significance'. That is p-hacking, not patience.
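Turning a required sample size into a planned run length is simple arithmetic. The sketch below rounds up to whole weeks so that every weekday and weekend is covered equally; the function name, the traffic figures, and the whole-week rounding rule are assumptions for illustration.

```python
from math import ceil


def planned_run_days(n_per_arm, daily_eligible_users, arms=2, min_days=7):
    """Days needed to reach the sample size, rounded up to full weeks."""
    raw_days = ceil(n_per_arm * arms / daily_eligible_users)
    weeks = max(ceil(raw_days / 7), ceil(min_days / 7))
    return weeks * 7


# Hypothetical: 14,749 users per arm with 3,000 eligible users per day.
print(planned_run_days(14_749, 3_000))
```

Fixing this number before launch is what makes the stopping point a commitment rather than a judgment made while watching the results.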
