PM Experiment Design Guide
(2026 Edition)
6 steps to design A/B tests that produce real signal, when NOT to A/B test at all, and the 6 mistakes that make most PM experiments worthless.
The 6-Step Experiment Design Process
Write a Falsifiable Hypothesis
Not 'we think new onboarding will be better.' Try: 'Changing onboarding from 5 steps to 3 steps will increase Day-7 retention from 22% to ≥26%, because we believe friction is the primary driver of drop-off.'
❌ Anti-pattern: Vague hypothesis ('users will like it more') that can never be falsified.
Define Primary and Guardrail Metrics
Primary: one metric that defines success. Guardrails: 2–3 metrics that must not degrade. Example: primary = activation rate; guardrails = D1 uninstall rate, support ticket volume, funnel step-2 completion.
❌ Anti-pattern: Picking only the primary metric. A feature that wins on activation but increases uninstalls by 10% is a net loss.
Calculate Required Sample Size
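Use a sample size calculator. Inputs: baseline metric, minimum detectable effect (MDE), significance (0.05), power (0.8). Example: detecting a 20% → 25% conversion lift needs roughly 1,100 users per variant, which a 5,000/day funnel collects in under a day. Run for at least one full weekly cycle anyway so day-of-week effects average out.
❌ Anti-pattern: Running the test for 'however long feels right.' Underpowered tests miss real effects and overstate the ones they do catch; stopping early on a good-looking number inflates false positives.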
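As a rough illustration of what a sample size calculator does with those inputs, here is a minimal sketch using the standard normal approximation for a two-sided test on two proportions. The function names and the even traffic split are assumptions made for this example; a dedicated calculator may use a slightly different formula and return slightly different numbers.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline: float, target: float,
                            alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per variant (two-sided two-proportion z-test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for power = 0.8
    variance = baseline * (1 - baseline) + target * (1 - target)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (target - baseline) ** 2)


def days_to_reach(n_per_variant: int, daily_users: int, variants: int = 2) -> int:
    """Whole days of traffic needed, assuming an even split across variants."""
    return math.ceil(n_per_variant * variants / daily_users)


n = sample_size_per_variant(0.20, 0.25)
print(n, days_to_reach(n, daily_users=5_000))
```

For the 20% → 25% example this approximation lands at roughly 1,100 users per variant, so on a 5,000/day funnel the sample itself arrives quickly. The run length is then driven by covering full weekly cycles rather than by sample size.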
Randomise Correctly
Users — not sessions — are usually the right unit of randomisation. For marketplaces, geographies or dark stores may be better units to avoid spillover. Test your randomisation with an A/A test before high-stakes launches.
❌ Anti-pattern: Randomising sessions when users visit multiple times, creating contamination across variants.
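One common way to keep a returning user in the same variant is to derive the assignment from a hash of the user ID instead of drawing a fresh random number per session. The sketch below assumes string user IDs and a hypothetical experiment name used as a salt; it is an illustration, not a description of any particular experimentation platform.

```python
import hashlib


def assign_variant(user_id: str, experiment: str,
                   variants: tuple[str, ...] = ("control", "treatment")) -> str:
    """Deterministically map a user to a variant so repeat visits stay consistent."""
    # Salting with the experiment name decorrelates assignments across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % len(variants)
    return variants[bucket]


# The same user lands in the same variant on every call, whatever the session.
first_visit = assign_variant("user-42", "onboarding-3-step")
second_visit = assign_variant("user-42", "onboarding-3-step")
assert first_visit == second_visit
```

Running this assignment with two identical experiences (an A/A test) and checking that the split is close to even and the metrics match is a cheap way to validate the setup.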
Run for the Pre-Determined Window
Don't peek. Don't stop early because the number looks good on Day 3. Pre-commit to a sample size or duration and don't deviate without a strong pre-stated reason.
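❌ Anti-pattern: Stopping when significance is first hit ('peeking'). Repeated looks give chance fluctuations many opportunities to cross the threshold, which makes this one of the most common causes of false positives in product A/B tests.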
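A small simulation can make the cost of peeking concrete. In the sketch below both arms share the same true conversion rate, so every "significant" result is a false positive. It compares stopping at the first significant look against checking once at the end. The traffic numbers, number of looks, and seed are arbitrary assumptions chosen for illustration.

```python
import math
import random


def z_significant(conv_a: int, conv_b: int, n: int, z_crit: float = 1.96) -> bool:
    """Two-sided two-proportion z-test with equal group sizes n."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return False
    return abs(conv_a - conv_b) / n / se > z_crit


def simulate_aa(n_sims: int = 400, n_per_arm: int = 2000, looks: int = 10,
                rate: float = 0.2, seed: int = 0) -> tuple[float, float]:
    """Return (false-positive rate with peeking, false-positive rate at fixed horizon)."""
    rng = random.Random(seed)
    step = n_per_arm // looks
    peek_hits = fixed_hits = 0
    for _ in range(n_sims):
        conv_a = conv_b = 0
        ever_significant = False
        significant_now = False
        for look in range(1, looks + 1):
            # Add one batch of users to each arm, then test on the running totals.
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            significant_now = z_significant(conv_a, conv_b, look * step)
            ever_significant = ever_significant or significant_now
        peek_hits += ever_significant   # would have stopped at some look
        fixed_hits += significant_now   # result at the pre-committed end only
    return peek_hits / n_sims, fixed_hits / n_sims


peek_rate, fixed_rate = simulate_aa()
print(f"peeking: {peek_rate:.1%}, fixed horizon: {fixed_rate:.1%}")
```

The fixed-horizon rate should sit near the nominal 5%, while the peeking rate is typically well above it. If interim looks are genuinely needed, sequential testing methods that adjust the threshold for each look are the usual alternative.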
Analyse and Decide
Check: did primary metric move significantly? Did guardrails stay healthy? Are there user segments with opposite results? Decide: ship, kill, or iterate. Document learnings either way.
❌ Anti-pattern: Shipping a flat test because 'the feature feels right.' A/B tests exist to make the decision — respect the result.
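The checks in this step can be sketched as a two-proportion z-test on the primary metric followed by a simple decision rule. The `decide` function is a hypothetical rule written for this example (it assumes guardrail health has already been evaluated and is passed in as a boolean), and the counts reuse the 22% → 26% retention figures from the hypothesis example purely as illustrative inputs. Segment checks are left out for brevity.

```python
import math
from statistics import NormalDist


def two_proportion_test(conv_c: int, n_c: int, conv_t: int, n_t: int) -> tuple[float, float]:
    """Return (absolute lift, two-sided p-value) for treatment vs control.

    Assumes the pooled rate is strictly between 0 and 1.
    """
    p_c, p_t = conv_c / n_c, conv_t / n_t
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_t - p_c, p_value


def decide(primary_lift: float, primary_p: float, guardrails_ok: bool,
           alpha: float = 0.05) -> str:
    """Hypothetical rule: ship only a significant win with healthy guardrails."""
    if primary_p < alpha and primary_lift > 0 and guardrails_ok:
        return "ship"
    if primary_p < alpha and primary_lift < 0:
        return "kill"
    return "iterate"  # flat result, or a win that broke a guardrail


lift, p = two_proportion_test(conv_c=1100, n_c=5000, conv_t=1300, n_t=5000)
print(decide(lift, p, guardrails_ok=True))
```

Writing the rule down before the test starts makes it harder to rationalise shipping a flat result afterwards.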
When NOT to Run an A/B Test
Obvious bug fixes
If the old behaviour was broken, you don't need to 'test' fixing it.
Irreversible changes (infrastructure migrations, compliance)
Testing has no meaning when you can't choose to revert.
Strong prior + small traffic
With 100 users/day on a funnel, collecting the thousands of users per variant that most tests need takes months. Ship on judgment.
Brand/strategy decisions
A/B tests optimise locally. Brand changes are strategic bets that often show no short-term metric lift.
Cosmetic/copy changes with near-zero risk
Ship and monitor — the cost of running a full experiment exceeds the value.
Features with delayed metrics (yearly retention)
You can't wait 12 months to test. Use proxy metrics or judgment + monitoring.
6 Common A/B Testing Mistakes
Testing too many things at once
→ Isolate one change per test. If you bundle 3 changes and the test wins, you don't know which change drove it.
Measuring the wrong metric
→ The metric should match the hypothesis. If the hypothesis is about activation, don't judge on revenue.
Ignoring novelty effects
→ New features often get a Week 1 bump that fades. Run tests for at least 2 full weekly cycles.
Not segmenting results
→ A flat aggregate test can hide huge wins in one segment and losses in another. Always look at segments.
Declaring winners based on p-value alone
→ A statistically significant 0.2% lift may not be worth shipping. Check effect size AND significance.
Running too many parallel tests
→ Multiple tests in the same funnel interfere. Coordinate with other PMs to sequence or segment traffic.
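To illustrate the point about p-values versus effect size, the sketch below computes a confidence interval for the absolute lift and compares its lower bound with a minimum lift worth shipping. The threshold, the function names, and the example counts (a 0.2 percentage point lift on a very large sample) are assumptions for illustration only.

```python
import math
from statistics import NormalDist


def lift_confidence_interval(conv_c: int, n_c: int, conv_t: int, n_t: int,
                             confidence: float = 0.95) -> tuple[float, float]:
    """Wald confidence interval for the absolute lift (treatment minus control)."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_t - p_c
    return diff - z * se, diff + z * se


def worth_shipping(ci: tuple[float, float], min_worthwhile_lift: float) -> bool:
    """True only if even the pessimistic end of the interval clears the bar."""
    return ci[0] >= min_worthwhile_lift


# Very large sample: the lift is statistically significant but tiny.
ci = lift_confidence_interval(conv_c=200_000, n_c=1_000_000,
                              conv_t=202_000, n_t=1_000_000)
print(ci, worth_shipping(ci, min_worthwhile_lift=0.01))
```

Here the interval excludes zero, so the result is significant, yet it sits far below a one-point bar. The same interval can be computed per segment when checking whether a flat aggregate hides opposing movements.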
FAQ
How long should a typical A/B test run?
Minimum 7 days to cover one weekly cycle. Most tests run 14 days to capture two cycles and mitigate day-of-week effects. For high-traffic consumer products, 7–14 days is typical. For B2B or low-traffic products, tests may need to run 30–60 days — or you may need to rely on judgment instead.
What's the minimum traffic needed to run an A/B test?
It depends on the baseline metric and the minimum detectable effect (MDE) you want. Rule of thumb: for a 10% relative lift on a metric with a 20% baseline (20% → 22%), you need roughly 5,000–10,000 users per variant at 0.05 significance and 0.8 power. Below 1,000 daily active users on the tested surface, most tests won't produce significant results in reasonable time; rely on judgment and qualitative signal.
What's a 'guardrail metric' and why does it matter?
A guardrail metric is something your experiment must NOT break, even if your primary metric wins. Example: your hypothesis is 'bigger CTA buttons will increase click-through.' Primary: CTR. Guardrail: time-on-page (the page shouldn't become less useful). A feature that wins on CTR but destroys time-on-page is not a win — it's a regression. PMs who don't define guardrails regularly ship false positives that look good in isolation but hurt user experience.