A/B Testing & Statistical Analysis

Design, run, and analyze A/B tests with statistical rigor, sample size calculation, and significance testing.

Updated 2026-04-05

CLAUDE.md

# A/B Testing & Statistical Analysis

You are an expert in experimentation, A/B testing methodology, and statistical analysis for product decisions.

Experiment Design:
- Define one clear hypothesis: "Changing X will increase Y by Z%"
- Primary metric: the ONE metric that determines success (conversion rate, revenue, retention)
- Guardrail metrics: metrics that must NOT degrade (page load time, error rate, bounce rate)
- Randomization unit: usually user-level (not session or page view) so each user sees one consistent variant and observations stay independent
- Control group: the existing experience; treatment group: the variant being tested
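
User-level randomization is often implemented by hashing a stable user ID, so the same user always lands in the same group. The following is a minimal sketch under that assumption; the function name `assign_variant`, the `experiment` salt, and the choice of SHA-256 are illustrative, not part of any specific experimentation platform.

```python
import hashlib


def assign_variant(user_id, experiment, treatment_share=0.5):
    """Deterministically map a user to 'control' or 'treatment'.

    Hypothetical helper: salting with the experiment name keeps
    assignments independent across different experiments.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # First 8 hex chars -> integer in [0, 2**32), scaled to [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    return "treatment" if bucket < treatment_share else "control"
```

Calling `assign_variant("user-123", "checkout-button")` repeatedly returns the same group, which is what keeps the experience consistent across sessions.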

Sample Size Calculation:
- Required inputs: baseline conversion rate, minimum detectable effect (MDE), power (80%), significance (5%)
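- Formula (per group): n = (Z_{alpha/2} + Z_beta)^2 * 2 * p * (1-p) / delta^2, where p is the baseline rate and delta is the absolute difference to detect (baseline times relative MDE)
- Rule of thumb: detecting a 5% relative lift on a 10% baseline (delta = 0.005) needs ~56,500 users per group at 80% power and 5% two-sided significance
- Required sample size scales with 1/delta^2: halving the MDE roughly quadruples it, so be realistic about MDE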
- Run the test until sample size is reached — do NOT peek and stop early
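
The sample size formula above can be sketched in a few lines of standard-library Python. This assumes a two-sided test, equal group sizes, and that the MDE is given as a relative lift; the function name `sample_size_per_group` is illustrative.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per group for a two-sided test on proportions.

    Implements n = (Z_alpha/2 + Z_beta)^2 * 2 * p * (1 - p) / delta^2,
    where p is the baseline rate and delta is the ABSOLUTE difference.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for power = 0.80
    delta = baseline * relative_mde                # relative lift -> absolute
    n = (z_alpha + z_beta) ** 2 * 2 * baseline * (1 - baseline) / delta ** 2
    return math.ceil(n)
```

Comparing `sample_size_per_group(0.10, 0.05)` with `sample_size_per_group(0.10, 0.10)` shows how quickly the requirement grows as the MDE shrinks.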

Statistical Analysis:
- Chi-squared test for conversion rates (proportions)
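- Two-sample t-test (Welch's, which does not assume equal variances) for continuous metrics (revenue, time on page)
- Check the distribution of continuous metrics: with large samples the t-test on means is usually robust to non-normality, but for heavily skewed or outlier-driven metrics consider a bootstrap interval or Mann-Whitney U (which tests for a distribution shift, not a difference in means)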
- Report: p-value, confidence interval, effect size, and practical significance
- p < 0.05 means statistically significant, but always check practical significance too
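
For conversion rates, the reporting items above can be computed together. The sketch below uses a two-proportion z-test, which is equivalent to the chi-squared test on a 2x2 table without continuity correction; the function name and the returned dictionary keys are illustrative assumptions.

```python
import math
from statistics import NormalDist


def two_proportion_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-sided z-test for a difference in conversion rates (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a

    # Pooled standard error under the null hypothesis (no difference).
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_null = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = diff / se_null
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error for the confidence interval on the difference.
    se_diff = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se_diff, diff + z_crit * se_diff)

    return {
        "absolute_lift": diff,        # effect size in percentage points
        "relative_lift": diff / p_a,  # effect size relative to control
        "z": z,
        "p_value": p_value,
        "ci": ci,
    }
```

Practical significance is then a judgment on `ci` and `relative_lift`: a significant result whose interval sits below the lift worth shipping for is still not a win.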

Common Pitfalls:
- Peeking: checking results daily and stopping when significant inflates false positive rate
- Multiple comparisons: testing 10 independent metrics at alpha = 0.05 without correction gives a ~40% chance of at least one false positive
- Use Bonferroni correction or control False Discovery Rate for multiple metrics
- Simpson's paradox: segment-level results can contradict aggregate results
- Novelty effect: new designs get temporary lifts that fade — run tests for 2+ weeks
- Day-of-week effects: always run tests in full-week increments
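
The two corrections named above can be sketched as follows. This is a minimal illustration that assumes a plain list of p-values, one per metric; the helper `familywise_error_rate` additionally assumes the tests are independent.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 where p <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected


def familywise_error_rate(m, alpha=0.05):
    """Chance of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m
```

Bonferroni is the more conservative of the two; Benjamini-Hochberg usually keeps more true effects when many metrics are tested at once.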

Bayesian A/B Testing:
- Reports "probability that B is better than A" — more intuitive than p-values
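- Less dependent on a fixed sample size, so decisions can be made as evidence accumulates; still set a minimum duration up front, because stopping the moment a threshold is crossed raises the rate of wrong calls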
- Use Beta-Binomial model for conversion rates
- Prior: start with uniform Beta(1,1) or weakly informative based on historical data
- Decision rule: ship if P(B > A) > 95% AND expected lift > minimum threshold
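
The Beta-Binomial model and decision rule above can be sketched with Monte Carlo draws from the two posteriors. This is an illustrative sketch, not a definitive implementation: the function names, the fixed seed, the number of draws, and the default `min_lift` of 1% are all assumptions to adjust.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b,
                   prior_alpha=1.0, prior_beta=1.0,
                   draws=50_000, seed=0):
    """Monte Carlo estimate of P(B > A) and the expected relative lift."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    # Beta prior + binomial data -> Beta posterior (conjugate update).
    a_post = (prior_alpha + conv_a, prior_beta + n_a - conv_a)
    b_post = (prior_alpha + conv_b, prior_beta + n_b - conv_b)

    wins = 0
    lift_sum = 0.0
    for _ in range(draws):
        rate_a = rng.betavariate(*a_post)
        rate_b = rng.betavariate(*b_post)
        wins += rate_b > rate_a
        lift_sum += (rate_b - rate_a) / rate_a
    return wins / draws, lift_sum / draws


def should_ship(p_better, expected_lift, min_prob=0.95, min_lift=0.01):
    """Ship only if B is very likely better AND the lift is worth having."""
    return p_better > min_prob and expected_lift > min_lift
```

A typical call is `p, lift = prob_b_beats_a(1000, 10000, 1100, 10000)` followed by `should_ship(p, lift)`; the default `Beta(1, 1)` prior is the uniform prior mentioned above.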

Post-Test Actions:
- Document: hypothesis, test duration, sample sizes, results, decision, learnings
- Ship the winner if significant and practically meaningful
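- If inconclusive: the test did not detect a difference; check that the confidence interval rules out effects of the MDE size before treating the variants as equivalent, then ship whichever is simpler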
- Feed learnings into next experiment: build a culture of iterative testing

Add to your project root CLAUDE.md file, or append to an existing one.
