ab-test-planner
$
npx mdskill add mohitagw15856/pm-claude-skills/ab-test-plannerDesign experiments that produce trustworthy results — not just directional signals. Every test output includes hypothesis, success metrics, sample size, duration, and a results interpretation guide.
SKILL.md
.github/skills/ab-test-plannerView on GitHub ↗
--- name: ab-test-planner description: "Design statistically rigorous A/B tests for product features, UI changes, onboarding flows, and pricing experiments. Use when asked to set up an experiment, design an A/B test, calculate sample size, or interpret test results. Produces a complete test plan with hypothesis, variant definitions, sample size, duration estimate, guardrail metrics, and a results interpretation guide." --- # A/B Test Planner Skill Design experiments that produce trustworthy results — not just directional signals. Every test output includes hypothesis, success metrics, sample size, duration, and a results interpretation guide. ## Required Inputs Ask the user for these if not provided: - **What is being tested** (feature, UI change, copy, pricing, onboarding step) - **Hypothesis** (or ask to help formulate one) - **Primary metric** (conversion rate, click-through, completion rate, etc.) - **Baseline rate** and **minimum detectable effect** (MDE) - **Daily eligible users** (to calculate duration) ## Experiment Design Checklist Before running any test, confirm: - [ ] Clear hypothesis with predicted direction - [ ] Single primary metric (plus up to 2 guardrail metrics) - [ ] Minimum detectable effect (MDE) defined - [ ] Sample size calculated - [ ] Test duration estimated - [ ] Segment isolated (no overlap with other running tests) - [ ] Rollback plan defined ## Hypothesis Template > "We believe that [change] will cause [primary metric] to [increase/decrease] by [X%] for [user segment], because [rationale based on data or insight]." Never run a test without a directional hypothesis. "Let's just see what happens" is not a hypothesis. ## Sample Size Calculator Logic Use this formula (provide the output, not the formula, to the user): - **Baseline conversion rate:** Current rate of primary metric - **MDE:** Smallest change worth detecting (recommend 10–20% relative lift for most features) - **Statistical power:** 80% (standard) - **Significance level:** 95% (p < 0.05) For common scenarios, provide pre-calculated estimates: | Baseline Rate | MDE (Relative) | Required Sample per Variant | |---|---|---| | 5% | 20% | ~19,000 | | 10% | 15% | ~14,000 | | 20% | 10% | ~15,000 | | 40% | 10% | ~9,500 | | 60% | 5% | ~42,000 | Always warn: "These are estimates. Use a tool like Evan Miller's calculator or Statsig for precision." ## Test Duration Guidance Minimum: 2 full weeks (to capture weekly seasonality) Maximum: 4 weeks (novelty effect distorts results beyond this) `Duration = Required sample ÷ (Daily traffic × % exposed)` Flag if traffic is too low to reach significance in under 8 weeks — recommend a different approach (e.g., holdout test, qualitative research). ## Output Format ### A/B Test Plan — [Test Name] — [Date] **Hypothesis:** > [Filled hypothesis template] **Variants:** - Control (A): [Current experience] - Treatment (B): [Changed experience — be specific] **Primary Metric:** [Metric name + how measured] **Guardrail Metrics:** [Metrics that must not degrade] **Target Segment:** [Who sees the test — % of traffic, user type] **Traffic Split:** [50/50 recommended unless ramp-up needed] **Sample Size Required:** ~[N] users per variant **Estimated Duration:** [X] weeks (based on [Y] daily eligible users) **Significance Threshold:** 95% confidence, 80% power **Exclusions:** [Any user segments to exclude and why] **Rollback Trigger:** If [guardrail metric] degrades by [X%], stop the test immediately. **Results Interpretation Guide:** - ✅ Ship if: Treatment shows [X%]+ lift on primary metric at 95% confidence AND guardrail metrics are stable - 🔄 Iterate if: Direction is positive but not significant — consider extending or redesigning - ❌ Reject if: No lift or negative direction at significance - ⚠️ Inconclusive: Do not ship. Do not call it a win. --- ## Guidelines - Always recommend against peeking at results before the test reaches planned sample size — explain p-hacking risk - If user wants to test multiple variants, explain the multiple comparisons problem and recommend a Bonferroni correction or a Bayesian approach - If traffic is very low (<1,000 users/day), recommend qualitative alternatives: moderated testing, 5-second tests, or user interviews - Never approve a test with no guardrail metrics — always protect revenue, retention, or core engagement ## Quality Checks - [ ] Hypothesis is directional (predicts a specific direction and magnitude, not "let's see") - [ ] Primary metric is singular (guardrail metrics are secondary) - [ ] Sample size is calculated from actual MDE and baseline (not guessed) - [ ] Test duration accounts for weekly seasonality (minimum 2 weeks) - [ ] Guardrail metrics are defined (at least one to protect revenue or core engagement) - [ ] Rollback trigger is specified with a concrete threshold
More from mohitagw15856/pm-claude-skills
- 360-feedback-templateDesign a 360-degree feedback survey or write a structured 360 feedback report. Use when asked to build a 360 feedback process, write 360 feedback for a colleague, design a feedback survey, or produce a feedback report. Produces either a complete survey instrument with rating scales and open-ended questions, or a structured narrative feedback report with themes, strengths, and development areas.
- accessibility-auditGenerate a WCAG 2.2 accessibility audit checklist and remediation suggestions for any UI or design. Use when asked to audit for accessibility, check WCAG compliance, review a design for a11y issues, or create an accessibility remediation plan. Produces a prioritised checklist with pass/fail assessments and specific fixes.
- account-planBuild a structured account plan for any key customer or target account. Use when asked to create an account plan, key account strategy, strategic account review, or territory plan. Produces a complete account plan with relationship map, growth opportunities, risks, and 90-day action plan.
- aeo-optimizerOptimize an article for Answer Engine Optimization (AEO) — restructuring content so AI engines like ChatGPT, Perplexity, and Claude can extract, quote, and cite it. Rewrites headings as questions, drops 50-80 word answer capsules, audits paragraph length, and flags trust signals. Use when asked to AEO-optimize, make content AI-readable, improve AI citation chances, or adapt an article for answer engines.
- ai-ethics-reviewConduct an ethical review of an AI or ML feature, model, or product. Use when asked to run an AI ethics review, assess AI risks, audit a model for bias, or produce an AI impact assessment. Produces a structured ethics review covering fairness, transparency, privacy, safety, accountability, and societal impact with prioritised mitigations.
- ai-product-canvasStructure AI and ML product decisions with the rigour of any product decision. Use when building AI-powered features, evaluating LLM integrations, designing AI products, or assessing AI readiness. Produces a complete AI product canvas covering problem definition, model approach, data requirements, evaluation framework, UX design, responsible AI checklist, and launch monitoring plan.
- ambiguity-resolverStructure vague opportunities and unclear briefs into actionable one-page problem statements. Use when asked to clarify a vague brief, frame an undefined problem, make sense of an unclear opportunity, or when the user says 'we need to figure out what to do about X' or 'I've been asked to look into Y'. Produces a structured problem brief with reframed questions, scoped boundaries, and a minimum viable research plan.
- api-docs-writerWrite clear, developer-facing API documentation. Use when asked to document an API endpoint, write API reference docs, create a developer guide, or turn a raw spec/Postman collection into documentation. Produces endpoint documentation with descriptions, parameters, request/response examples, and error codes.
- api-versioning-strategyWrite an API versioning strategy document for a service or API platform. Use when asked to define versioning policy, plan API deprecation, classify breaking changes, or document version lifecycle. Produces a complete versioning strategy with breaking-change classification table, deprecation timeline, migration guide template, and client communication template.
- architecture-decision-recordCreate an Architecture Decision Record (ADR) for any technical decision. Use when asked to document a technical decision, write an ADR, record an architecture choice, or capture why a technology or approach was selected. Produces a structured ADR with context, decision, consequences, and tradeoffs.