Product Management · 7 min read · April 9, 2026

A/B Testing Best Practices for a Mobile App Feature: Test Design and Statistical Rigor

Best practices for conducting A/B testing for a mobile app feature, covering hypothesis design, sample size, test duration, segmentation, and how to avoid common mobile testing failures.

Best practices for conducting A/B testing for a mobile app feature require defining a single primary metric before the test starts, calculating the minimum sample size at 80% power and 95% confidence, running for at least 14 days to cover two business cycles, segmenting results by iOS vs. Android and new vs. returning users, and never stopping the test early based on peeking at p-values. Mobile user behavior is session-based, and early results systematically mislead.

Mobile feature A/B tests fail in three predictable ways: they are ended too early when results look positive, they measure the wrong metric (tap rate instead of retention), or they are not segmented by platform, even though iOS and Android users often respond differently to the same change.

Step 1: Write a Falsifiable Hypothesis

Every mobile feature test should start with a hypothesis that can be proven wrong:

Template: "We believe that [change] will cause [metric] to [increase/decrease] by [minimum amount] for [user segment] because [reasoning]."

Example (good): "We believe that replacing the text-only CTA on the onboarding completion screen with a visual progress indicator will increase D7 retention by at least 2 percentage points for new users because it makes progress tangible and reduces abandonment anxiety."

Example (bad): "We believe that improving the onboarding will increase engagement." (Not measurable, no segment, no minimum effect, no reasoning)

The hypothesis forces you to specify the primary metric and minimum detectable effect before collecting data — preventing post-hoc metric selection.
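The template can be enforced mechanically so a test cannot be configured without every field filled in. This is an illustrative sketch only; the `Hypothesis` class and its field names are hypothetical, not part of any experimentation tool.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    """Falsifiable A/B test hypothesis following the article's template.
    All field names are illustrative."""
    change: str
    metric: str
    direction: str        # "increase" or "decrease"
    minimum_effect: str   # e.g. "at least 2 percentage points"
    segment: str
    reasoning: str

    def __post_init__(self):
        if self.direction not in ("increase", "decrease"):
            raise ValueError("direction must be 'increase' or 'decrease'")

    def statement(self) -> str:
        return (f"We believe that {self.change} will cause {self.metric} "
                f"to {self.direction} by {self.minimum_effect} for "
                f"{self.segment} because {self.reasoning}.")

h = Hypothesis(
    change="replacing the text-only CTA with a visual progress indicator",
    metric="D7 retention",
    direction="increase",
    minimum_effect="at least 2 percentage points",
    segment="new users",
    reasoning="it makes progress tangible and reduces abandonment anxiety",
)
print(h.statement())
```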

Step 2: Define the Primary Metric

Rule: One test, one primary metric. Secondary metrics are for learning, not decision-making.

Matching primary metric to feature type:

| Feature type | Primary metric | Wrong metric |
|--------------|----------------|--------------|
| Onboarding change | D7 retention | D1 retention (too noisy), tap rate |
| Core navigation change | Session depth at D7 | Page views (session-based, not retention) |
| Push notification copy | 7-day opt-in rate | Open rate on day 1 |
| Paywall design | Revenue per user at D30 | Conversion rate alone (ignores refund rate) |
| Feature discovery | Feature adoption rate at D14 | Feature view count |

Why D7 retention over D1 for onboarding tests: D1 retention is influenced by novelty and app store review prompts that fire on day 1. D7 retention reflects genuine habit formation.

Step 3: Calculate Sample Size

Pre-calculate required sample size before the test starts. Running until results look significant is p-hacking.

Formula for binary metrics:

n = 16 × p(1-p) / δ²
Where: p = baseline rate, δ = minimum detectable effect (absolute), n = required sample per variant. The constant 16 ≈ 2(z₀.₉₇₅ + z₀.₈₀)², the standard approximation for 80% power at a two-sided 95% confidence level.

Example: D7 retention baseline 32%, minimum detectable effect 2.5pp

  • n = 16 × (0.32 × 0.68) / 0.025²
  • n = 16 × 0.2176 / 0.000625 ≈ 5,570.6
  • n = 5,571 per variant after rounding up (11,142 total)

At 800 new users per day, this test needs 14 days minimum to reach sample size.

Rule of thumb: For D7 retention tests on mobile, require 10,000 users per variant as a minimum regardless of calculated sample size to account for mobile behavior variance.
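The formula and the 10,000-per-variant floor can be combined in a small calculator. This is a minimal sketch; the function name and `floor` parameter are illustrative, not from any experimentation library.

```python
import math

def required_sample_per_variant(baseline: float, mde: float,
                                floor: int = 0) -> int:
    """n = 16 * p(1-p) / delta^2 — the rule-of-thumb approximation for
    80% power at two-sided 95% confidence. `floor` is an optional
    per-variant minimum (e.g. 10,000 for mobile D7 retention tests)."""
    n = 16 * baseline * (1 - baseline) / mde ** 2
    return max(math.ceil(n), floor)

# Worked example from above: D7 baseline 32%, MDE 2.5pp
print(required_sample_per_variant(0.32, 0.025))               # 5571
print(required_sample_per_variant(0.32, 0.025, floor=10_000)) # 10000
```

Always round up: shipping decisions made on a sample slightly below the calculated n are made at less than the planned power.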

Step 4: Set Minimum Test Duration

The minimum duration is the longer of: (calculated sample size / daily volume) OR 14 days.

Why 14 days minimum:

  • Covers 2 full business cycles (Monday–Sunday behavior differs significantly)
  • Allows novelty effect to partially stabilize (new UX drives higher engagement in week 1)

Duration by feature type:

| Feature | Minimum duration | Why |
|---------|------------------|-----|
| Onboarding flow | 21 days | Novelty effect is strongest; week 3 is the stable signal |
| Core feature | 14 days | Two business cycles required |
| Push notifications | 7 days | Session-based; stabilizes quickly |
| Monetization / paywall | 30 days | Monthly purchase cycle influences behavior |
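The "longer of sample-driven days or the feature-type floor" rule is easy to encode. A minimal sketch; the dictionary keys and function name are illustrative, and the floors are the editorial rules of thumb from the table above, not a library API.

```python
import math

# Duration floors per feature type, taken from the table above
DURATION_FLOOR_DAYS = {
    "onboarding": 21,
    "core_feature": 14,
    "push_notifications": 7,
    "monetization": 30,
}

def minimum_duration_days(total_sample: int, daily_volume: int,
                          feature_type: str) -> int:
    """Longer of (days to reach total sample) and the feature-type floor."""
    sample_days = math.ceil(total_sample / daily_volume)
    return max(sample_days, DURATION_FLOOR_DAYS[feature_type])

# 11,142 total users at 800/day needs 14 sample-driven days,
# but the onboarding floor of 21 days wins.
print(minimum_duration_days(11_142, 800, "onboarding"))  # 21
```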

Step 5: Configure Segmented Analysis

Required segmentation for every mobile feature test:

  1. Platform: iOS vs. Android — file separate results and make separate ship decisions if needed
  2. New vs. returning users — new users are subject to novelty effect; returning users are not
  3. Activation state — activated users respond differently to feature changes than non-activated users
  4. Acquisition channel — paid users behave differently from organic users

A flat aggregate result almost always hides a positive effect in one segment and a negative in another. Segment before reporting.
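A minimal sketch of segment-level analysis, using a hypothetical record schema (`variant`, `retained_d7`, `platform`) rather than any real analytics SDK payload. The toy data deliberately shows an aggregate result that is flat while iOS improves and Android regresses.

```python
from collections import defaultdict

def segment_lift(records, segment_key):
    """Per-segment conversion rates for control vs. treatment.
    `records` is a list of dicts with illustrative keys: 'variant'
    ('control'/'treatment'), 'retained_d7' (bool), and segment fields."""
    counts = defaultdict(lambda: {"control": [0, 0], "treatment": [0, 0]})
    for r in records:
        arm = counts[r[segment_key]][r["variant"]]
        arm[0] += 1                      # users in this arm
        arm[1] += int(r["retained_d7"])  # retained users
    out = {}
    for segment, arms in counts.items():
        c_rate = arms["control"][1] / arms["control"][0]
        t_rate = arms["treatment"][1] / arms["treatment"][0]
        out[segment] = {"control": c_rate, "treatment": t_rate,
                        "lift_pp": (t_rate - c_rate) * 100}
    return out

# Toy data: the aggregate is flat, hiding an iOS win and an Android loss
records = (
    [{"platform": "ios", "variant": "control", "retained_d7": i < 30} for i in range(100)]
    + [{"platform": "ios", "variant": "treatment", "retained_d7": i < 40} for i in range(100)]
    + [{"platform": "android", "variant": "control", "retained_d7": i < 35} for i in range(100)]
    + [{"platform": "android", "variant": "treatment", "retained_d7": i < 25} for i in range(100)]
)
for seg, stats in segment_lift(records, "platform").items():
    print(seg, round(stats["lift_pp"], 1))  # ios +10.0, android -10.0
```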

According to Lenny Rachitsky on his podcast discussing mobile experimentation culture, teams that run 15+ concurrent mobile experiments ship better apps not because they have more ideas but because they built the discipline to segment results — most mobile insights come from segment-level analysis, not aggregate p-values.

Step 6: Analyze and Decide

Decision framework:

| Scenario | Decision |
|----------|----------|
| Primary metric improvement ≥ MDE, p < 0.05, reached sample size | Ship |
| Primary metric improvement ≥ MDE for some segments, negative for others | Ship to winning segments only (feature flag) |
| Primary metric no change, p > 0.05 | Do not ship; revisit hypothesis |
| Primary metric improvement but secondary metric regression | Investigate before shipping |
| Primary metric negative (treatment worse than control) | Do not ship; investigate root cause |

Never ship on: Test ended early, sample below threshold, multiple testing without correction.
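The decision table can be encoded as a guard-clause function so the "never ship on" conditions are checked first. A sketch under the article's thresholds; the function signature is illustrative.

```python
def ship_decision(lift, mde, p_value, n_per_variant, required_n,
                  days_run, min_days, secondary_regression=False):
    """Encodes the decision framework: pre-committed thresholds are
    checked before any significance reasoning."""
    # Never decide before the pre-committed sample and duration are met
    if n_per_variant < required_n or days_run < min_days:
        return "do not decide yet: sample or duration threshold not met"
    if lift < 0:
        return "do not ship: investigate root cause"
    if p_value >= 0.05 or lift < mde:
        return "do not ship: revisit hypothesis"
    if secondary_regression:
        return "investigate secondary metric regression before shipping"
    return "ship"

# 3pp lift vs. a 2.5pp MDE, significant, thresholds met -> ship
print(ship_decision(lift=0.03, mde=0.025, p_value=0.01,
                    n_per_variant=12_000, required_n=10_000,
                    days_run=21, min_days=21))
```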

Common Mobile Feature A/B Testing Mistakes

| Mistake | Why it's wrong | Correct approach |
|---------|----------------|------------------|
| Peeking at results daily and stopping when positive | Inflates false positive rate to 40%+ | Pre-commit to sample size and duration |
| Testing too many variants simultaneously | Requires much larger sample; confounds results | Test one variant at a time unless using factorial design |
| Ignoring iOS/Android difference | iOS and Android users respond differently | Segment and analyze separately |
| Using D1 retention for onboarding tests | Too noisy, influenced by novelty | Use D7 retention |
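The peeking problem is easy to demonstrate with an A/A simulation: both arms draw from the same distribution, so every "significant" result is a false positive. A minimal sketch assuming NumPy is available; the exact inflation depends on how many looks you take, but stopping at the first significant daily peek always inflates the false positive rate well above the nominal 5%.

```python
import math
import numpy as np

def two_sided_p(successes_a, n_a, successes_b, n_b):
    """Two-proportion z-test; two-sided p-value via the normal erfc."""
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (successes_a / n_a - successes_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))

rng = np.random.default_rng(42)
baseline, daily_n, days, sims = 0.32, 400, 14, 2000
peek_fp = final_fp = 0
for _ in range(sims):
    # A/A test: no true effect exists, so any "win" is a false positive
    a = rng.binomial(daily_n, baseline, days).cumsum()
    b = rng.binomial(daily_n, baseline, days).cumsum()
    n = daily_n * np.arange(1, days + 1)
    ps = [two_sided_p(a[d], n[d], b[d], n[d]) for d in range(days)]
    peek_fp += any(p < 0.05 for p in ps)  # stop at first significant peek
    final_fp += ps[-1] < 0.05             # pre-committed single look
print(f"daily peeking FPR: {peek_fp / sims:.2f}")
print(f"single final look FPR: {final_fp / sims:.2f}")
```

The single pre-committed look stays near the nominal 5%; daily peeking multiplies it.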

FAQ

Q: What are best practices for A/B testing a mobile app feature? A: Write a falsifiable hypothesis, define one primary metric, calculate sample size before starting, run for at least 14 days, segment by iOS/Android and new/returning users, and never stop early based on peeking at results.

Q: How long should you run a mobile feature A/B test? A: At minimum 14 days for core features and 21 days for onboarding changes. Mobile tests need two business cycles for day-of-week variance and 3 weeks for onboarding to control for the novelty effect.

Q: What is the minimum sample size for a mobile app A/B test? A: 10,000 users per variant for D7 retention tests as a minimum floor, regardless of the calculated sample size — mobile behavior variance is high enough that smaller samples produce unreliable results.

Q: Why should you segment mobile A/B test results by iOS and Android? A: The same UI or feature change often produces different results on iOS and Android due to system interaction patterns, notification behavior, and screen conventions. Aggregate results hide these differences and can lead to shipping a change that helps one platform and hurts the other.

Q: What is the novelty effect in mobile A/B testing? A: The tendency for users to engage more with any new experience in the first week simply because it is different. For onboarding tests, this inflates week-1 treatment group metrics — running for 21 days lets the effect stabilize before you read results.

HowTo: Conduct an A/B Test for a Mobile App Feature

  1. Write a falsifiable hypothesis specifying the change, primary metric, minimum detectable effect, target user segment, and reasoning before any test configuration
  2. Define one primary metric matched to the feature type — D7 retention for onboarding, feature adoption at D14 for feature tests, revenue per user at D30 for monetization
  3. Calculate required sample size using the binary metric formula and set a minimum of 10,000 users per variant for D7 retention tests
  4. Set minimum test duration at 14 days for feature tests and 21 days for onboarding tests regardless of sample size timing
  5. Pre-configure segmented analysis by iOS vs. Android and new vs. returning users and activation state before the test starts
  6. Make ship or no-ship decisions only after reaching both the pre-calculated sample size and minimum duration — never based on peeking at results before these thresholds are met

Practice what you just learned

PM Streak gives you daily 3-minute lessons with streaks, XP, and a leaderboard.

Start your streak — it's free
