Best practices for A/B testing on a mobile app: pre-calculate sample size before the test starts, run for a minimum of 14–21 days to account for day-of-week variance and the novelty effect, segment results by platform and user cohort, and never end a test early based on peeking at p-values. These rules matter because mobile behavior is session-based, retention-driven, and disproportionately shaped by first-week patterns that only stabilize after two to three weeks.
Mobile A/B testing produces unreliable results far more often than web testing. The failure modes are predictable: tests ended too early, metrics measured at the wrong time, samples not segmented by platform, and teams pulling results as soon as they look positive. This guide covers the practices that produce mobile test results you can ship with confidence.
## Why Mobile A/B Testing Fails More Often Than Web
Mobile user behavior has three characteristics that make A/B testing more difficult than web:
Session fragmentation: Mobile users interact in short, frequent sessions. A test result that looks significant after 3 days often reverses by day 14 because early session behavior doesn't represent steady-state usage.
Retention is the real metric: Mobile product success is measured by D7 and D30 retention, not click-through rates. A UI test that improves next-step tap rate but damages retention is a harmful change — and you won't see the damage in a 5-day test.
Platform heterogeneity: An interaction pattern that works on iOS may produce opposite results on Android due to system UI conventions, back button behavior, notification patterns, and screen density differences. Aggregating iOS and Android without segmenting treats two different products as one.
## Best Practice 1: Define the Primary Metric Before Writing the Test
Every mobile A/B test needs exactly one primary metric — the metric the test is designed to move. Secondary metrics inform learning, not decisions.
Matching primary metrics to test types:
| Test type | Primary metric | Why |
|-----------|---------------|-----|
| Onboarding flow | D7 retention | Onboarding quality predicts long-term retention |
| Feature discovery | Feature adoption rate at D14 | Feature awareness ≠ adoption without time |
| Push notification copy | 7-day opt-in rate | Single-session metric is reliable for notifications |
| Paywall design | Revenue per user at D30 | Purchase decisions are influenced by monthly cycles |
| Core navigation | Session depth at D7 | Navigation changes affect long-term engagement |
What not to use as a primary metric:
- "Engagement" (not a specific action)
- CTR without retention context (can improve CTR and harm retention simultaneously)
- D1 retention alone for onboarding tests (D1 is too noisy, D7 is the reliable signal)
## Best Practice 2: Calculate Sample Size Before the Test Starts
The most expensive mobile testing mistake is ending a test when the result looks significant before reaching the pre-calculated sample size. Stopping early on a positive result is p-hacking — it systematically produces false positives.
Minimum sample size formula for binary metrics:
n = 16 × p(1-p) / δ²
Where:

- p = baseline conversion rate
- δ = minimum detectable effect (absolute)
- 16 = factor for 80% power at 95% confidence
- n = required sample per variant
Practical example:
- Baseline D7 retention: 35%
- Minimum detectable effect: 3 percentage points (you won't ship for less)
- n = 16 × (0.35 × 0.65) / (0.03²) = 16 × 0.2275 / 0.0009 = 4,044 per variant
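The formula above can be sketched as a small helper that reproduces the worked example. The function name is illustrative, and the result rounds up to the next whole user:

```python
import math

def sample_size_per_variant(baseline: float, mde: float) -> int:
    """Required users per variant for ~80% power at 95% confidence.

    Uses the rule-of-thumb factor of 16 from n = 16 * p(1-p) / delta^2.
    baseline: baseline conversion rate p (e.g. 0.35 for 35% D7 retention)
    mde: minimum detectable effect delta, in absolute terms (e.g. 0.03)
    """
    return math.ceil(16 * baseline * (1 - baseline) / mde ** 2)

# Worked example from the text: 35% baseline D7 retention, 3pp MDE.
print(sample_size_per_variant(0.35, 0.03))  # → 4045 (the text rounds to ~4,044)
```

Halving the minimum detectable effect roughly quadruples the required sample, which is why the MDE should be the smallest effect you would actually ship, not the smallest effect you can imagine.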
At 500 new users per day per variant, this test reaches the minimum sample in roughly 8 days, but you should still run for at least 14 days to cover two full business cycles.
Rule of thumb for mobile retention tests: 10,000 users per variant with a 14-day minimum. Under-powered tests at 1,000–2,000 per variant are the most common source of false positives in mobile testing programs.
## Best Practice 3: Run Tests for the Full Minimum Duration
Pre-calculate and commit to a minimum duration. Common minimum durations:
| Test type | Minimum duration | Reason |
|-----------|-----------------|--------|
| Onboarding tests | 21 days | Novelty effect takes 2–3 weeks to stabilize |
| Core feature tests | 14 days | Two full business cycles required |
| Notification tests | 7 days | Session-based, stabilizes quickly |
| Monetization tests | 30 days | Purchase cycles run monthly |
| Navigation tests | 14 days | Habituation to new patterns takes 1–2 weeks |
### The Novelty Effect on Mobile
The novelty effect is the tendency for users to engage more with a new experience in the first week simply because it's different. On mobile, where onboarding tests are the most common test type, the novelty effect is particularly pronounced — treatment group users often show better D3 retention than control simply because the new experience prompted more exploration.
By week 3, the novelty wears off and the true treatment effect emerges. Onboarding tests ended at day 7 frequently show false positives that reverse by day 21.
How to control for novelty: Run onboarding tests for 21 days minimum. Check if the treatment effect at day 21 is smaller than the effect at day 7 — a large difference signals novelty inflation.
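The day-21 versus day-7 comparison can be expressed as a simple check. The 0.5 threshold below is an illustrative assumption, not an industry standard:

```python
def novelty_inflation(effect_d7: float, effect_d21: float,
                      threshold: float = 0.5) -> bool:
    """Flag likely novelty inflation: the effect at day 21 is much
    smaller than the early effect observed at day 7.

    effect_d7, effect_d21: treatment-minus-control lift in percentage points.
    threshold: fraction of the day-7 effect that must survive to day 21
               (0.5 is an illustrative default, not a standard value).
    """
    if effect_d7 <= 0:
        return False  # no early positive effect to have been inflated
    return effect_d21 < threshold * effect_d7

# Example: +4.0pp at day 7 collapsing to +1.0pp at day 21 signals novelty.
print(novelty_inflation(4.0, 1.0))  # → True
print(novelty_inflation(4.0, 3.5))  # → False
```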
## Best Practice 4: Segment Results by Platform, Cohort, and Channel
A mobile test that shows no aggregate effect is likely hiding heterogeneous results across segments. Always segment:
Platform segmentation (always required):
- iOS vs. Android results often diverge significantly for UI and interaction tests
- Report separate results for each platform and make separate ship decisions if needed
User cohort segmentation:
- New users vs. returning users (tests on new users are subject to novelty effect; returning user tests are not)
- Activated vs. not activated (an unactivated user will respond to a feature test differently than an activated one)
Acquisition channel segmentation:
- Paid users (price-sensitive, high acquisition cost, behavior differs from organic)
- Organic users (higher intent, better baseline retention)
- Specific campaign cohorts (users from a specific ad have correlated attributes)
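Segment-level lifts are straightforward to compute; the hypothetical numbers below show how a near-flat pooled aggregate can hide opposite per-platform effects:

```python
# Hypothetical per-platform results: (users, conversions) per arm.
results = {
    "ios":     {"control": (5000, 2000), "treatment": (5000, 2200)},  # 40% -> 44%
    "android": {"control": (5000, 1500), "treatment": (5000, 1350)},  # 30% -> 27%
}

def lift_pp(segment: dict) -> float:
    """Treatment-minus-control conversion rate, in percentage points."""
    control_n, control_conv = segment["control"]
    treat_n, treat_conv = segment["treatment"]
    return 100 * (treat_conv / treat_n - control_conv / control_n)

# Per-segment lifts diverge sharply: iOS +4.0pp, Android -3.0pp.
for name, seg in results.items():
    print(name, round(lift_pp(seg), 1))

# ...while the pooled aggregate looks nearly flat (+0.5pp).
pooled = {
    "control":   (10000, 2000 + 1500),
    "treatment": (10000, 2200 + 1350),
}
print(round(lift_pp(pooled), 1))
```

Shipping on the pooled number here would mean shipping a regression to every Android user, which is exactly why platform segmentation is listed as always required.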
According to Shreyas Doshi on Lenny's Podcast, the most common cause of mobile tests being called inconclusive when they should have produced a result is the failure to segment — a flat aggregate result almost always hides a strong positive in one cohort and a strong negative in another.
## Best Practice 5: Never Peek Before Reaching Sample Size
Checking results before the test reaches its pre-calculated sample size inflates your false positive rate. At 95% confidence, if you peek at results 10 times during a test, your effective false positive rate approaches 40% — not 5%.
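The inflation from repeated looks can be demonstrated with a small A/A simulation. The parameters below (1,000 simulated tests, 10 evenly spaced looks) are illustrative; the exact inflation depends on how often and when you peek:

```python
import math
import random

def false_positive_rate(n_looks: int, n_sims: int = 1000,
                        n_per_look: int = 100, z_crit: float = 1.96,
                        seed: int = 42) -> float:
    """Fraction of A/A tests (zero true effect, unit-variance noise) declared
    'significant' when a two-sided z-test is checked after each look."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        for _ in range(n_looks):
            total += sum(rng.gauss(0, 1) for _ in range(n_per_look))
            n += n_per_look
            z = (total / n) * math.sqrt(n)  # z-statistic of the running mean
            if abs(z) > z_crit:
                hits += 1
                break  # the team ships on the first 'significant' peek
    return hits / n_sims

print(false_positive_rate(n_looks=1))   # single look: close to the nominal 5%
print(false_positive_rate(n_looks=10))  # ten peeks: several times higher
```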
Solutions for teams that need faster answers:
Sequential testing (SPRT): Allows valid early stopping when results are conclusive without inflating false positives. Requires analytics platform support (Statsig, Optimizely, Amplitude Experiment all support this).
Bayesian A/B testing: Provides a probability estimate that the treatment is better, which can be updated continuously. More interpretable for product teams; does not require pre-specified sample size.
Use one of these frameworks rather than peeking at frequentist p-values if your team needs to stop tests early.
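A minimal Bayesian comparison for a binary metric can be sketched with Beta posteriors. This is a Monte Carlo illustration under uniform Beta(1, 1) priors, not the API of any particular platform:

```python
import random

def prob_treatment_better(control_conv: int, control_n: int,
                          treatment_conv: int, treatment_n: int,
                          draws: int = 20000, seed: int = 7) -> float:
    """Monte Carlo estimate of P(treatment rate > control rate) under
    independent Beta(1, 1) priors on each arm's conversion rate."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        c = rng.betavariate(1 + control_conv, 1 + control_n - control_conv)
        t = rng.betavariate(1 + treatment_conv, 1 + treatment_n - treatment_conv)
        if t > c:
            wins += 1
    return wins / draws

# Hypothetical D7 retention: control 350/1000 vs treatment 400/1000.
print(prob_treatment_better(350, 1000, 400, 1000))  # ≈ 0.99
```

A statement like "99% probability the treatment is better" is usually easier for product teams to act on than a p-value, which is the interpretability advantage mentioned above.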
## Best Practice 6: Build a Test Log and Review Cadence
According to Lenny Rachitsky on his podcast discussing experimentation culture, the product teams that ship the best mobile apps run 15–20 concurrent experiments — not because they have more ideas, but because they have built the infrastructure to run tests at low marginal cost and the discipline to review them consistently.
Minimum mobile testing infrastructure:
- Test log: Every test documented with hypothesis, primary metric, sample size target, duration, result, and decision
- Experimentation platform: Statsig, Firebase A/B Testing, or Amplitude Experiment integrated with your analytics
- Bi-weekly results review: 30-minute meeting to review concluded tests and make ship/kill decisions
- Pre-test review: PM and data analyst sign off on hypothesis and metric definition before any test starts
Test log template (one row per test):
| Test ID | Hypothesis | Primary Metric | Sample Target | Duration | Result | Decision | Date |
|---------|-----------|---------------|---------------|----------|--------|----------|------|
| MOB-001 | New onboarding reduces friction | D7 retention | 10K/variant | 21 days | +2.3pp | Ship | [Date] |
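One lightweight way to keep such a log is a typed record with a pre-test validation step. The field names mirror the template above; the structure and the 7-day floor are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class TestLogEntry:
    test_id: str
    hypothesis: str
    primary_metric: str
    sample_target_per_variant: int
    min_duration_days: int
    result: str = ""    # filled in at conclusion, e.g. "+2.3pp"
    decision: str = ""  # "Ship" or "Kill", set at the results review

    def ready_to_start(self) -> list[str]:
        """Return the pre-test problems that must be fixed before launch."""
        problems = []
        if not self.hypothesis:
            problems.append("missing hypothesis")
        if not self.primary_metric:
            problems.append("missing primary metric")
        if self.sample_target_per_variant < 1:
            problems.append("sample size not calculated")
        if self.min_duration_days < 7:
            problems.append("duration below the 7-day floor")
        return problems

entry = TestLogEntry("MOB-001", "New onboarding reduces friction",
                     "D7 retention", 10_000, 21)
print(entry.ready_to_start())  # → []
```

Running `ready_to_start` in the pre-test review gives the PM and data analyst a concrete artifact to sign off on.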
## Mobile A/B Testing Checklist
Before starting any mobile A/B test, verify:
- [ ] Primary metric defined and measurable in your analytics platform
- [ ] Sample size calculated with baseline, MDE, and confidence level documented
- [ ] Minimum duration set and committed to in writing
- [ ] Segmentation plan defined (iOS/Android, new/returning, activated/not)
- [ ] Rollback plan if treatment performs worse than expected
- [ ] Sequential testing configured if early stopping is required
## FAQ
Q: What are best practices for A/B testing on a mobile app?
A: Define a single primary metric before starting, calculate minimum sample size, commit to a 14–21 day minimum duration, segment results by platform and user cohort, and never end a test based on peeking at results before reaching your pre-calculated sample size.

Q: How long should you run an A/B test on a mobile app?
A: Minimum 14 days for feature tests and 21 days for onboarding tests. Mobile tests need to cover at least two business cycles for day-of-week variance, and onboarding tests need 3 weeks to account for the novelty effect.

Q: How many users do you need for a valid mobile A/B test?
A: At minimum 10,000 users per variant for retention-focused tests at 80% power and 95% confidence. Calculate the exact sample size from your baseline metric, minimum detectable effect, and confidence requirements before starting.

Q: What is the novelty effect and how does it affect mobile A/B tests?
A: The novelty effect causes users to engage more with a new experience in the first week simply because it's different. For onboarding tests on mobile, it can inflate day-7 treatment results by 30–50% compared to the true effect that stabilizes by day 21.

Q: What primary metric should you use for a mobile onboarding A/B test?
A: D7 retention, not D1 or CTR. D7 retention is the most predictive early signal of long-term retention; D1 is too noisy, and CTR improvements do not guarantee retention improvements.
## HowTo: Run an A/B Test for a Mobile App
- Define a single primary metric the test is designed to move — D7 retention for onboarding tests, feature adoption rate at D14 for feature tests, or revenue per user at D30 for monetization tests
- Calculate the required sample size before starting using 80 percent power at 95 percent confidence based on your baseline metric and the minimum detectable effect worth shipping
- Set a minimum test duration of 14 days for feature tests and 21 days for onboarding tests to control for day-of-week variance and the novelty effect that inflates early results
- Segment results separately for iOS and Android users and for new versus returning users and activated versus unactivated cohorts to detect heterogeneous effects hidden by aggregates
- Never end a test early based on peeking at p-values — use a sequential testing framework such as Statsig or Amplitude Experiment if you need valid early stopping capability
- Log every test with its hypothesis, primary metric, sample size target, duration, result, and ship or kill decision to build an institutional knowledge base of what works on your specific app