experiment-tracking

Name: experiment-tracking
Author: elophanto/EloPhanto

$ npx mdskill add elophanto/EloPhanto/experiment-tracking

Design and analyze experiments to validate hypotheses with statistical rigor.

Calculates sample sizes and power for A/B and multivariate tests.
Integrates with monitoring dashboards and engineering instrumentation systems.
Decides actions based on confidence intervals and effect size thresholds.
Delivers actionable insights through structured lifecycle management reports.

SKILL.md

.github/skills/experiment-tracking

---
name: experiment-tracking
description: Expert in experiment design, execution tracking, and data-driven decision making for A/B tests, feature experiments, and hypothesis validation. Adapted from msitarzewski/agency-agents.
---

## Triggers

- experiment tracking
- A/B test
- hypothesis testing
- statistical significance
- experiment design
- feature experiment
- multivariate test
- sample size calculation
- experiment results
- controlled rollout
- experiment portfolio
- data-driven decision
- confidence interval
- effect size
- experiment velocity
- power analysis

## Instructions

When activated, design, execute, and analyze experiments using rigorous scientific methodology and statistical analysis.

### Experiment Design
- Formulate clear, testable hypotheses with measurable outcomes.
- Calculate required sample sizes for a 5% significance level (95% confidence) and 80% power.
- Design control/variant structures with proper randomization.
- Define primary KPIs with success thresholds and guardrail metrics.
- Plan rollback procedures for negative experiment impacts.
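
The sample size bullet above can be sketched for the common case of comparing two conversion rates. This is a minimal illustration, assuming a two-sided two-proportion z-test with equal allocation; the function name `sample_size_per_variant` and its parameters are hypothetical, not part of this skill.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, min_detectable_effect,
                            alpha=0.05, power=0.80):
    """Approximate users needed per variant (two-sided two-proportion z-test).

    baseline_rate: control conversion rate, e.g. 0.10
    min_detectable_effect: absolute lift to detect, e.g. 0.02 (10% -> 12%)
    """
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_effect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # sum of per-arm Bernoulli variances
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return ceil(n)


if __name__ == "__main__":
    # Assumed example: 10% baseline, detect an absolute lift of 2 points.
    print(sample_size_per_variant(0.10, 0.02))
```

With the assumed inputs this comes to roughly 3,800 users per variant. Halving the detectable effect roughly quadruples the requirement, which is why the minimum effect should be fixed before launch.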

### Experiment Lifecycle Management
1. **Hypothesis Development**: Collaborate with product teams to identify experimentation opportunities. Formulate clear hypotheses.
2. **Implementation Preparation**: Work with engineering on technical implementation and instrumentation. Set up monitoring dashboards and alert systems.
3. **Execution and Monitoring**: Launch with a soft rollout to validate implementation. Monitor real-time data quality and experiment health. Track statistical significance progression and early stopping criteria.
4. **Analysis and Decision**: Perform comprehensive statistical analysis. Calculate confidence intervals, effect sizes, and practical significance. Generate clear go/no-go recommendations with supporting evidence.
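
One way to monitor experiment health during step 3 is a sample ratio mismatch (SRM) check, which flags a traffic split that differs from the planned one. The sketch below assumes a two-arm experiment and a chi-square goodness-of-fit test with one degree of freedom; the function name and the 0.001 alert threshold are illustrative assumptions, not something this skill prescribes.

```python
from math import erfc, sqrt

SRM_ALERT_THRESHOLD = 0.001  # assumed alerting threshold, tune to your own tolerance


def sample_ratio_mismatch_p_value(control_users, variant_users,
                                  expected_control_share=0.5):
    """p-value of a chi-square goodness-of-fit test on the observed split."""
    total = control_users + variant_users
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi_square = ((control_users - expected_control) ** 2 / expected_control
                  + (variant_users - expected_variant) ** 2 / expected_variant)
    # For 1 degree of freedom, P(chi2 > x) equals erfc(sqrt(x / 2)).
    return erfc(sqrt(chi_square / 2))


if __name__ == "__main__":
    p = sample_ratio_mismatch_p_value(5000, 5400)
    if p < SRM_ALERT_THRESHOLD:
        print(f"Possible assignment or logging bug, p = {p:.2e}")
```

A very small p-value suggests a problem in assignment or instrumentation rather than a real treatment effect, so results are usually not worth reading until the split is explained.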

### Statistical Rigor
- Always calculate proper sample sizes before launch.
- Ensure random assignment and avoid sampling bias.
- Use appropriate statistical tests for data types and distributions.
- Apply multiple comparison corrections when testing multiple variants.
- Never stop an experiment early unless a pre-specified early stopping rule has been met.
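
As an illustration of the multiple comparison bullet, the sketch below applies the Holm-Bonferroni step-down procedure to the p-values from several variant-versus-control comparisons. Holm-Bonferroni is one possible correction chosen here as an example; the function name is hypothetical.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return one boolean per p-value: True where the null hypothesis is rejected."""
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, idx in enumerate(order):
        # The threshold loosens as more hypotheses are rejected.
        if p_values[idx] <= alpha / (m - rank):
            rejected[idx] = True
        else:
            break  # step-down: every remaining (larger) p-value also fails
    return rejected


if __name__ == "__main__":
    # Assumed p-values for three variants compared against one control.
    print(holm_bonferroni([0.01, 0.04, 0.03]))
```

In the assumed example only the first comparison survives correction, even though all three p-values are below 0.05 on their own.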

### Safety and Ethics
- Implement safety monitoring for user experience degradation.
- Ensure user consent and privacy compliance (GDPR, CCPA).
- Consider ethical implications of experimental design.
- Maintain transparency with stakeholders about experiment risks.

### Portfolio Management
- Coordinate multiple concurrent experiments across product areas.
- Detect and mitigate cross-experiment interference.
- Use risk-adjusted prioritization balancing impact and implementation effort.
- Align experimentation roadmaps with product strategy.
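
A common way to limit cross-experiment interference is to make assignment deterministic and to salt it per experiment, so that a user's bucket in one experiment says nothing about their bucket in another. The following is a minimal sketch under that assumption; `assign_variant` and the key format are hypothetical and not tied to any particular feature-flag system.

```python
import hashlib


def assign_variant(user_id, experiment_key, variants=("control", "variant_a")):
    """Deterministically map a user to a variant for one experiment.

    The experiment key acts as a salt, so the same user hashes to
    unrelated buckets in different experiments.
    """
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % len(variants)  # first 32 bits of the hash
    return variants[bucket]


if __name__ == "__main__":
    for exp in ("checkout-button-color", "onboarding-copy"):
        print(exp, assign_variant("user-42", exp))
```

Because the mapping depends only on the user ID and the experiment key, it can be recomputed at analysis time to audit which users saw which experience.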

### Output
- Use `knowledge_write` to document experiment designs, results, and learnings.
- Use `goal_create` to track experiment lifecycle from hypothesis to implementation.

### Advanced Techniques
- Multi-armed bandits and sequential testing designs.
- Bayesian analysis methods for continuous learning.
- Causal inference techniques for understanding true experimental effects.
- Meta-analysis for combining results across multiple experiments.
- Machine learning model A/B testing for algorithmic improvements.
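
To make the Bayesian bullet concrete, the sketch below estimates the probability that a variant's conversion rate exceeds the control's. It assumes binary outcomes, a uniform Beta(1, 1) prior on each rate, and Monte Carlo sampling from the posteriors; the function name, the prior, and the draw count are illustrative choices.

```python
import random


def probability_variant_beats_control(control_conversions, control_users,
                                      variant_conversions, variant_users,
                                      draws=20000, seed=0):
    """Monte Carlo estimate of P(variant rate > control rate)."""
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    wins = 0
    for _ in range(draws):
        # Beta(1, 1) prior updated with observed successes and failures.
        control_rate = rng.betavariate(1 + control_conversions,
                                       1 + control_users - control_conversions)
        variant_rate = rng.betavariate(1 + variant_conversions,
                                       1 + variant_users - variant_conversions)
        if variant_rate > control_rate:
            wins += 1
    return wins / draws


if __name__ == "__main__":
    # Assumed counts: 100/1000 conversions in control, 150/1000 in the variant.
    print(probability_variant_beats_control(100, 1000, 150, 1000))
```

The same posterior draws could feed a Thompson-sampling bandit, but that extension is not shown here.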

## Deliverables

### Experiment Design Document
```
Experiment: [Hypothesis Name]
Hypothesis: [Testable prediction with measurable outcome]
Success Metrics: [Primary KPI with success threshold]
Secondary Metrics: [Additional measurements and guardrail metrics]
Type: [A/B test, Multivariate, Feature flag rollout]
Population: [Target user segment and criteria]
Sample Size: [Required users per variant for 80% power at a 5% significance level]
Duration: [Minimum runtime needed to reach the required sample size]
Variants:
- Control: [Current experience]
- Variant A: [Treatment description and rationale]
Risk Assessment: [Negative impact scenarios and rollback procedures]
```

### Experiment Results Report
```
Decision: [Go/No-Go with clear rationale]
Primary Metric Impact: [% change with confidence interval]
Statistical Significance: [p-value and confidence level]
Business Impact: [Revenue/conversion/engagement effect]
Sample Size: [Users per variant with data quality notes]
Segment Analysis: [Performance across user segments]
Key Insights: [Primary findings and unexpected results]
Follow-up Experiments: [Next iteration opportunities]
Organizational Learnings: [Broader insights for future experiments]
```

## Success Metrics

- 95% of experiments reach their pre-calculated sample size before a decision is made.
- Experiment velocity exceeds 15 experiments per quarter.
- 80% of successful experiments are implemented and drive measurable business impact.
- Zero experiment-related production incidents or user experience degradation.
- Organizational learning rate increases with documented patterns and insights.

## Verify

- Hypothesis is stated in 'if X then Y because Z' form before the experiment runs.
- Sample size, duration, and primary metric are committed to in writing before reading any results.
- Control and treatment are specified concretely (config diff, feature flag, audience filter), not described abstractly.
- The experiment record stores raw outcome data, not just the conclusion, so it can be re-analyzed later.
- Results report effect size and a confidence interval (or equivalent uncertainty), not only a point estimate.
- A 'no decision' or 'inconclusive' branch is allowed in the analysis plan; the agent does not force a winner.
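
The last two checks can be sketched as an analysis step that always reports an effect size with a confidence interval and is allowed to return "inconclusive". This is an illustration only, assuming a conversion-rate metric and a normal-approximation (Wald) interval for the difference in proportions; the function name, the `min_practical_effect` parameter, and the decision rule are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def analyze(control_conversions, control_users,
            variant_conversions, variant_users,
            min_practical_effect=0.0, confidence=0.95):
    """Return effect size, confidence interval, and a three-way decision."""
    p_c = control_conversions / control_users
    p_v = variant_conversions / variant_users
    effect = p_v - p_c  # absolute difference in conversion rate
    se = sqrt(p_c * (1 - p_c) / control_users
              + p_v * (1 - p_v) / variant_users)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    low, high = effect - z * se, effect + z * se
    if low > min_practical_effect:
        decision = "go"            # whole interval clears the threshold
    elif high < min_practical_effect:
        decision = "no-go"         # whole interval falls short of it
    else:
        decision = "inconclusive"  # interval straddles the threshold
    return {"effect": effect, "ci": (low, high), "decision": decision}


if __name__ == "__main__":
    # Assumed counts for illustration only.
    print(analyze(100, 1000, 105, 1000))
```

Storing the raw counts next to this output keeps the record re-analyzable if the threshold or method changes later.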
