A/B Testing (Online Experimentation)

Overview

A/B Testing (Split Testing) is the gold standard for causal inference in product development. Unlike offline evaluation (Accuracy/F1), which only approximates real-world performance, A/B testing measures the actual impact on business metrics (Revenue, Retention, CTR) in a live environment.

Key Ideas / Intuition

  • Randomization: Randomly assigning users to groups (Control vs. Treatment) eliminates confounding variables (e.g., time of day, user demographics), ensuring that any difference in outcome is caused by the model.
  • Hypothesis Testing: We are not just checking if $\bar{x}_B > \bar{x}_A$; we are checking whether the difference is statistically significant or just noise.

Core Concepts

1. Statistical Foundations

We test a Null Hypothesis ($H_0$) against an Alternative Hypothesis ($H_1$).

Intuition

  • $H_0$ (Null Hypothesis): The new model has no effect.
  • Goal: We assume $H_0$ is true until evidence (data) proves otherwise.
  • P-Value: The probability of observing a difference at least this extreme, assuming $H_0$ is true.
    • Low P-Value (< 0.05) → “It’s extremely unlikely this happened by chance.” → Reject $H_0$.
    • High P-Value → “Evidence is weak.” → Stick with $H_0$.
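As a concrete example, the comparison above can be run as a two-proportion z-test. This is a minimal, stdlib-only sketch; the conversion counts are hypothetical:

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# 5.0% vs 5.6% conversion on 10k users per group
z, p = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=560, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Here the observed lift looks promising, but the p-value hovers near 0.05, so the evidence is borderline rather than conclusive.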

Errors

| Decision     | Reality: $H_0$ is True (No Diff)                        | Reality: $H_0$ is False (Diff exists)                |
| ------------ | ------------------------------------------------------- | ---------------------------------------------------- |
| Accept $H_0$ | Correct                                                 | Type II Error ($\beta$) (letting a criminal go free) |
| Reject $H_0$ | Type I Error ($\alpha$) (convicting an innocent person) | Correct (Power)                                      |
  • Significance Level ($\alpha$): Usually 0.05. We accept a 5% risk of a False Positive.
  • Power ($1 - \beta$): Usually 0.80. We want an 80% chance of catching a real improvement.

2. Metric Hierarchy

  • North Star Metric: The ultimate long-term goal (e.g., Customer Lifetime Value). Hard to move in short tests.
  • Driver Metrics (OEC - Overall Evaluation Criteria): Short-term proxies that correlate with North Star.
    • Example: CTR, Conversion Rate, Session Length.
  • Guardrail Metrics: Constraints that must not be violated.
    • Example: Latency, Error Rate, Unsubscribe Rate.

3. Sample Size Calculation

The number of users required depends on the Minimum Detectable Effect (MDE, $\delta$) and Variance ($\sigma^2$). For a two-sample test, the standard approximation per group is:

$$n \approx \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2\,\sigma^2}{\delta^2}$$

  • Takeaway: To detect a smaller improvement ($\delta \to \delta/2$), you need quadratically more users ($n \to 4n$).
  • Takeaway: Reducing metric variance ($\sigma^2$) (e.g., using CUPED) allows for faster experiments.
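The sample-size formula can be sketched with the stdlib only. The inverse-normal helper below is a simple bisection, and the baseline rate and MDE are hypothetical:

```python
import math

def normal_ppf(q):
    """Inverse standard-normal CDF via bisection on erf (stdlib only)."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if 0.5 * (1 + math.erf(mid / math.sqrt(2))) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def sample_size_per_group(p_baseline, mde, alpha=0.05, power=0.80):
    """Users per group to detect an absolute lift `mde` over `p_baseline`."""
    z_alpha = normal_ppf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_beta = normal_ppf(power)            # ~0.84 for power = 0.80
    sigma2 = p_baseline * (1 - p_baseline)  # variance of a Bernoulli metric
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma2 / mde ** 2)

# Detect a +0.5pp lift over a 5% baseline conversion rate
print(sample_size_per_group(0.05, 0.005))
# Halving the MDE roughly quadruples the required sample
print(sample_size_per_group(0.05, 0.0025))
```

Note the quadratic takeaway in action: halving `mde` multiplies the answer by about four.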

Common Pitfalls

1. Sample Ratio Mismatch (SRM)

If you target a 50/50 split but get 49/51, STOP.

  • Cause: The treatment model might be crashing/slower for some users, causing them to drop out before the event is logged.
  • Result: The groups are no longer comparable. Invalidates the test.
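A quick way to detect SRM is a chi-square goodness-of-fit test against the target split. A stdlib-only sketch, assuming the commonly used strict SRM threshold of 0.001 and hypothetical counts:

```python
import math

def srm_check(n_control, n_treatment, expected_ratio=0.5, alpha=0.001):
    """Chi-square goodness-of-fit test for sample ratio mismatch (1 dof)."""
    total = n_control + n_treatment
    exp_c = total * expected_ratio
    exp_t = total * (1 - expected_ratio)
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_treatment - exp_t) ** 2 / exp_t
    # For 1 degree of freedom: p = P(Z^2 > chi2) = 2 * (1 - Phi(sqrt(chi2)))
    p_value = 2 * (1 - 0.5 * (1 + math.erf(math.sqrt(chi2) / math.sqrt(2))))
    return p_value, p_value < alpha   # True -> SRM detected, stop the test

p, srm = srm_check(49_000, 51_000)   # a 49/51 split on 100k users
print(f"p = {p:.2e}, SRM detected: {srm}")
```

Intuition: 49/51 sounds harmless, but at 100k users it is wildly improbable under a true 50/50 assignment, so the test flags it.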

2. Peeking (P-hacking)

Checking the p-value every day and stopping as soon as $p < 0.05$.

  • Problem: Increases False Positive rate to >30%.
  • Solution: Fix sample size in advance or use Sequential Testing (SPRT).
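The inflation from peeking is easy to demonstrate by simulating A/A tests (no real effect) with a daily significance check. This sketch uses a normal approximation to daily conversions and made-up traffic numbers:

```python
import math
import random

random.seed(42)

def peeking_false_positive_rate(n_sims=2000, days=14, users_per_day=1000, p=0.05):
    """Simulate A/A tests with a daily peek; count how often |z| > 1.96."""
    sd_daily = math.sqrt(users_per_day * p * (1 - p))  # normal approx to binomial
    false_positives = 0
    for _ in range(n_sims):
        conv_a = conv_b = 0.0
        n = 0
        for _day in range(days):
            conv_a += random.gauss(users_per_day * p, sd_daily)
            conv_b += random.gauss(users_per_day * p, sd_daily)
            n += users_per_day
            p_pool = (conv_a + conv_b) / (2 * n)
            se = math.sqrt(2 * p_pool * (1 - p_pool) / n)
            z = (conv_b / n - conv_a / n) / se
            if abs(z) > 1.96:          # "significant" at this peek -> stop early
                false_positives += 1
                break
    return false_positives / n_sims

rate = peeking_false_positive_rate()
print(f"False positive rate with daily peeking: {rate:.1%}")  # nominal alpha is 5%
```

Even though each individual check uses the 5% threshold, taking fourteen looks drives the overall false positive rate far above it.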

3. Novelty Effect

Users click the new feature just because it looks different, not because it’s better.

  • Solution: Run the test longer (weeks) to let the novelty wear off.

4. Network Effects (Interference)

In two-sided marketplaces (Uber/Airbnb), treating one user affects others (e.g., Driver A gets a ride, Driver B doesn’t).

  • Solution: Cluster Randomization (split by city) or Switchback Testing (split by time windows).

Comparison: A/B vs Bandits

| Feature       | A/B Testing                          | Multi-Armed Bandits (MAB)            |
| ------------- | ------------------------------------ | ------------------------------------ |
| Goal          | Statistical Significance (Knowledge) | Regret Minimization (Reward)         |
| Traffic Split | Fixed (50/50)                        | Dynamic (shifts to winner)           |
| Best for      | Major UX/Model changes               | Content optimization, Headlines, Ads |
| Speed         | Slow (Safety first)                  | Fast (Optimization first)            |
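To make the dynamic-split contrast concrete, here is a minimal epsilon-greedy bandit. The CTRs are hypothetical; this is a sketch, not a production allocator:

```python
import random

random.seed(0)

def epsilon_greedy(true_rates, n_rounds=10_000, epsilon=0.1):
    """Minimal epsilon-greedy bandit: explore 10% of the time, else exploit."""
    counts = [0] * len(true_rates)
    rewards = [0.0] * len(true_rates)
    for _ in range(n_rounds):
        if random.random() < epsilon:
            arm = random.randrange(len(true_rates))        # explore
        else:
            means = [rewards[i] / counts[i] if counts[i] else 0.0
                     for i in range(len(true_rates))]
            arm = means.index(max(means))                  # exploit best so far
        counts[arm] += 1
        rewards[arm] += 1.0 if random.random() < true_rates[arm] else 0.0
    return counts

counts = epsilon_greedy([0.04, 0.05, 0.07])  # hypothetical CTRs for 3 headlines
print(counts)  # traffic typically concentrates on the highest-CTR arm
```

Unlike a fixed 50/50 split, the allocation here adapts during the experiment, which minimizes regret but makes classical significance testing harder.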

Resources

Back to: ML Deployment Patterns