PM A/B Test Analysis Guide
(2026 Edition)
7-point checklist for reading results, 5 segmentation lenses, 6 common biases, and 5 decision rules for shipping or killing.
7-Point Reading Checklist
Did we reach pre-committed sample size? If not, it's not done yet.
Is the effect statistically significant? (p-value < 0.05)
Is the effect size meaningful? (Practical significance, not just statistical)
Did guardrail metrics stay healthy? Winning primary + broken guardrail = net loss.
Does the effect hold across segments? If only 1 segment drives it, that's important context.
Are there novelty effects that might fade? (Run 2 weekly cycles to confirm)
Is the A/A check clean? (An A/A test during the run, comparing two identically treated groups, should show no significant difference)
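Items 2 and 3 of the checklist can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: it assumes the primary metric is a conversion rate, uses a pooled two-proportion z-test, and treats the function names and the min_lift parameter (the smallest absolute lift worth shipping) as hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_test(conv_a, n_a, conv_b, n_b):
    """Two-sided pooled z-test for a difference in conversion rates.

    Returns (absolute lift of B over A, z statistic, p-value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, z, p_value


def read_result(conv_a, n_a, conv_b, n_b, alpha=0.05, min_lift=0.005):
    """Separate statistical significance from practical significance."""
    diff, z, p_value = two_proportion_test(conv_a, n_a, conv_b, n_b)
    return {
        "abs_lift": diff,
        "p_value": p_value,
        "statistically_significant": p_value < alpha,
        # Assumption: only a positive lift of at least min_lift is worth shipping.
        "practically_significant": diff >= min_lift,
    }
```

Calling read_result(1000, 10_000, 1100, 10_000) compares a 10% control rate with an 11% variant rate on made-up counts. Keeping the two flags separate makes it harder to ship a result that is significant but too small to matter.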
5 Segmentation Lenses
New vs existing users — often move opposite directions
Mobile vs web — mobile-first products ship differently to each
Geographic — Tier-1 vs Tier-2/3 may behave differently
Acquisition channel — organic vs paid users have different baselines
Cohort (signup date) — recent cohorts can differ from old ones
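A segment breakdown along any of these lenses might look like the sketch below. The row layout (segment, arm, users, conversions), the arm labels and the function name are assumptions made for illustration, and the numbers in the usage note are invented.

```python
def lift_by_segment(rows):
    """Compute per-segment conversion rates and absolute lift.

    rows: iterable of (segment, arm, users, conversions),
          where arm is "control" or "variant".
    """
    totals = {}
    for segment, arm, users, conversions in rows:
        seg = totals.setdefault(segment, {"control": [0, 0], "variant": [0, 0]})
        seg[arm][0] += users
        seg[arm][1] += conversions

    report = {}
    for segment, arms in totals.items():
        rate_c = arms["control"][1] / arms["control"][0]
        rate_v = arms["variant"][1] / arms["variant"][0]
        report[segment] = {
            "control": rate_c,
            "variant": rate_v,
            "abs_lift": rate_v - rate_c,
        }
    return report


example_rows = [
    ("new", "control", 5000, 500),
    ("new", "variant", 5000, 600),
    ("existing", "control", 5000, 1000),
    ("existing", "variant", 5000, 950),
]
segment_report = lift_by_segment(example_rows)
```

In this invented data, new users move up while existing users move down, which a blended topline number would hide. Each extra slice is another comparison, so a single "winning" segment deserves more scepticism than the overall result.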
6 Common Biases to Avoid
Peeking early and stopping when you see significance — p-hacking
Running multiple tests, picking the one that 'won' — multiple-comparison problem
Attributing lift to the feature when seasonality explains it — correlation vs causation
Ignoring guardrails that moved — primary won, but at what cost?
Reading a flat test as 'no effect' vs 'effect too small to detect' — different conclusions
Using the test as confirmation of your hypothesis rather than a test of it
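The multiple-comparison problem can be made concrete with a short sketch. It assumes the tests are independent and swaps in a Bonferroni correction as one simple, conservative remedy; the source does not prescribe a specific correction, and the test names and p-values below are invented.

```python
def familywise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_significant(p_values, alpha=0.05):
    """Flag each test as significant only if p < alpha / number of tests."""
    threshold = alpha / len(p_values)
    return {name: p < threshold for name, p in p_values.items()}


# Hypothetical p-values from three tests run side by side.
observed = {"cta_colour": 0.03, "new_onboarding": 0.004, "pricing_banner": 0.20}
corrected = bonferroni_significant(observed)
```

With three tests at a 0.05 threshold, the chance of at least one false win is about 14%, not 5%. After correction only new_onboarding still counts as a win, even though cta_colour would have passed an uncorrected 0.05 cut-off.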
5 Decision Rules
Primary wins significantly + guardrails healthy → ship
Primary flat + guardrails healthy → don't ship, but learnings are valuable
Primary wins but a guardrail breaks → don't ship, investigate trade-off
Primary wins in 1 segment only → ship to that segment if big enough; don't generalise
Result is inconclusive (underpowered) → decide: extend the test to a new pre-committed sample size, rerun at higher N, or call based on judgment
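The five rules can be written down as a small decision function so that every readout goes through the same checks in the same order. This is a sketch under assumptions: the Readout fields and return strings are hypothetical, "inconclusive" is taken to mean the pre-committed sample size was not reached, and the final branch (primary flat and a guardrail broken) is not covered by the rules above.

```python
from dataclasses import dataclass


@dataclass
class Readout:
    reached_sample_size: bool          # pre-committed N was reached
    primary_wins: bool                 # primary metric won significantly
    guardrails_healthy: bool           # no guardrail metric degraded
    single_segment_only: bool = False  # win is driven by one segment
    segment_big_enough: bool = False   # that segment justifies a targeted ship


def decide(r: Readout) -> str:
    if not r.reached_sample_size:
        return "inconclusive: extend, rerun at higher N, or make a judgment call"
    if r.primary_wins and not r.guardrails_healthy:
        return "don't ship: investigate the trade-off"
    if r.primary_wins and r.single_segment_only:
        if r.segment_big_enough:
            return "ship to that segment only"
        return "don't ship: segment too small"
    if r.primary_wins:
        return "ship"
    if r.guardrails_healthy:
        return "don't ship: record the learnings"
    # Assumption: this case is not listed in the rules above.
    return "don't ship: primary flat and a guardrail broke"
```

The order of the checks matters: power is checked first and guardrails before segments, so a broken guardrail blocks a ship even when one segment looks like a clear win.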
FAQ
What p-value should PMs use for A/B tests?
0.05 is the industry default significance threshold (the cut-off the p-value is compared against). For high-stakes tests (major redesigns, monetisation changes), use 0.01: you want higher certainty before shipping. For quick iteration on low-risk features, 0.1 is sometimes acceptable. The trade-off: a tighter threshold means fewer false positives but a larger required sample and longer runs.
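The link between the threshold and run length shows up in the standard normal-approximation sample size formula for comparing two proportions. The sketch below assumes a conversion-rate metric, a two-sided test and 80% power. The function name and the example baseline and minimum detectable effect are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate users needed per arm to detect an absolute lift of mde_abs."""
    p1 = baseline
    p2 = baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided threshold
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)


# Same baseline (10%) and target lift (1 point) at three thresholds.
needed = {a: sample_size_per_arm(0.10, 0.01, alpha=a) for a in (0.10, 0.05, 0.01)}
```

The required sample grows as alpha shrinks, which is why a 0.01 threshold is usually reserved for decisions that are expensive to get wrong.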
How long should A/B tests run?
Pre-determined by your sample size calculation. For most consumer products, 7–14 days is typical (covers weekday/weekend patterns). Shorter tests miss cyclical effects; longer tests delay decisions unnecessarily. The cardinal sin: extending tests mid-run to 'wait for significance' — that's p-hacking, not patience.
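One way to fix the duration before launch is to convert the required sample into days and round up to whole weekly cycles. This is a sketch: the function name, the equal traffic split across arms and the traffic figures are assumptions.

```python
from math import ceil


def planned_duration_days(required_per_arm, daily_eligible_users,
                          arms=2, cycle_days=7, min_cycles=1):
    """Days to run, rounded up to full weekly cycles.

    Assumes all eligible daily traffic enters the test and is split
    evenly across arms.
    """
    total_needed = required_per_arm * arms
    raw_days = ceil(total_needed / daily_eligible_users)
    cycles = max(min_cycles, ceil(raw_days / cycle_days))
    return cycles * cycle_days
```

For example, planned_duration_days(14_749, 3_000) needs 10 days of traffic and therefore rounds up to 14, so weekdays and weekends are covered equally. Committing to that number in advance removes the temptation to keep the test running until it reaches significance.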