stats-methods
$ npx mdskill add vm0-ai/vm0-skills/stats-methods
Analyze data distributions to validate claims and forecast trends.
- Detects anomalies, tests hypotheses, and evaluates A/B experiment results.
- Integrates with data pipelines to ingest raw numeric datasets.
- Selects metrics based on data symmetry and outlier presence.
- Outputs clear reports with confidence intervals and distribution charts.
SKILL.md (.github/skills/stats-methods)
---
name: stats-methods
description: Apply statistical techniques to data — descriptive statistics, distributions, hypothesis testing, A/B test evaluation, outlier detection, trend analysis, correlation, forecasting, p-values, confidence intervals, significance testing, and avoiding common statistical pitfalls like Simpson's paradox and survivorship bias.
---
## Summarizing Numeric Data
### Choosing a Center Metric
| Data Characteristic | Recommended Measure | Rationale |
|---|---|---|
| Symmetric, outlier-free | Mean | Maximally efficient estimator |
| Asymmetric or outlier-heavy | Median | Unaffected by extreme values |
| Non-numeric (categorical) | Mode | Only meaningful center for unordered categories; ranked data also supports a median |
| Business KPIs like revenue per user | Both mean and median | The gap between them reveals skewness |
**Guideline:** For any business metric, present the mean alongside the median. When they differ substantially, the distribution is skewed and the mean by itself will mislead.
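As a minimal sketch of the guideline above, the snippet below compares the two measures on a small, made-up revenue-per-user sample. The numbers and the variable names (`revenue`, `gap_ratio`) are illustrative assumptions, not values from any real dataset.

```python
from statistics import mean, median

# Hypothetical revenue-per-user values: mostly small, plus one large account.
revenue = [12, 15, 18, 20, 22, 25, 30, 35, 40, 1200]

avg = mean(revenue)      # pulled upward by the single large value
mid = median(revenue)    # middle of the sorted values
gap_ratio = avg / mid    # a ratio far above 1 suggests a right-skewed distribution

print(f"mean={avg:.1f}  median={mid:.1f}  mean/median={gap_ratio:.1f}")
```

Here the mean is several times the median, which is the signal to report both and to describe the typical user with the median.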
### Quantifying Variability
- **Standard deviation**: Typical distance from the mean; best suited to bell-shaped data.
- **IQR (interquartile range)**: Gap between the 25th and 75th percentiles; resistant to extreme values.
- **Coefficient of variation**: Standard deviation divided by the mean; allows variability to be compared across different scales, but is only meaningful for positive-valued data whose mean is well away from zero.
- **Range**: Maximum minus minimum; gives a quick but outlier-sensitive view of data spread.
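The four measures above can be computed with Python's standard library alone. This is a sketch on made-up values; note that `statistics.quantiles` uses its default "exclusive" method here, so the quartiles may differ slightly from what other tools report.

```python
from statistics import mean, stdev, quantiles

# Hypothetical sample; replace with a real numeric column.
values = [4, 8, 15, 16, 23, 42]

sd = stdev(values)                    # sample standard deviation
q1, _, q3 = quantiles(values, n=4)    # quartile cut points (default method)
iqr = q3 - q1                         # spread of the middle half of the data
cv = sd / mean(values)                # unitless, comparable across scales
value_range = max(values) - min(values)

print(f"sd={sd:.2f}  iqr={iqr:.2f}  cv={cv:.2f}  range={value_range}")
```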
### Telling the Story with Percentiles
Go beyond averages by reporting a percentile ladder:
```
p1: Floor of the distribution (bottom 1%)
p5: Lower boundary of typical values
p25: First quartile
p50: Median — the representative observation
p75: Third quartile
p90: Top 10% threshold (heavy users, premium tier)
p95: Upper boundary of typical values
p99: Extreme top 1%
```
**Sample insight:** "Half of all sessions last under 4.2 minutes, yet the top decile exceeds 22 minutes, which pushes the average to 7.8 minutes."
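One way to produce such a ladder, sketched with the standard library. The helper name `percentile_ladder` and the sample data are assumptions for illustration; the "inclusive" method treats the sample as the full population, which is one of several reasonable interpolation choices.

```python
from statistics import quantiles

def percentile_ladder(values, levels=(1, 5, 25, 50, 75, 90, 95, 99)):
    """Return {percentile: value} for the requested levels."""
    # n=100 yields 99 cut points; cuts[k - 1] is the k-th percentile.
    cuts = quantiles(values, n=100, method="inclusive")
    return {p: cuts[p - 1] for p in levels}

# Hypothetical session lengths in minutes: 1, 2, ..., 100.
ladder = percentile_ladder(list(range(1, 101)))
for level, value in ladder.items():
    print(f"p{level}: {value:.2f}")
```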
### Characterizing Distributions
For every numeric column, document:
- **Shape**: Gaussian, right-tailed, left-tailed, bimodal, uniform, heavy-tailed
- **Center**: Mean vs. median and the magnitude of their difference
- **Spread**: Standard deviation or IQR as appropriate
- **Extremes**: Count and severity of outliers
- **Boundaries**: Natural limits such as zero floors or 100% ceilings
## Trend Analysis and Projection
### Smoothing Noisy Time Series
```python
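# Assumes df is a pandas DataFrame with one row per day and a numeric 'metric' column.
# Weekly smoother: useful for daily data with weekday/weekend cycles.
# min_periods=1 means the first rows are averaged over fewer than `window` points.
df['smooth_7'] = df['metric'].rolling(window=7, min_periods=1).mean()
# Four-week smoother: averages out weekly cycles and most within-month variation.
df['smooth_28'] = df['metric'].rolling(window=28, min_periods=1).mean()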
```
### Period Comparisons
- **Week-over-week**: Same weekday, one week apart
- **Month-over-month**: Calendar month versus prior calendar month
- **Year-over-year**: Same period one year earlier; the gold standard for businesses with seasonal patterns
- **Same-calendar-day**: Matches the exact date from the prior year (the weekday usually differs, so watch for weekday effects)
### Measuring Growth
```
Simple rate: (current - prior) / prior
CAGR: (final / initial) ^ (1 / n_years) - 1
Log rate: ln(current / prior) # more stable for volatile series
```
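The three formulas above translate directly into code. The function names and the 100 → 110 → 121 example series are illustrative assumptions.

```python
import math

def simple_rate(current, prior):
    """Period-over-period growth as a fraction of the prior value."""
    return (current - prior) / prior

def cagr(final, initial, n_years):
    """Compound annual growth rate over n_years."""
    return (final / initial) ** (1 / n_years) - 1

def log_rate(current, prior):
    """Log growth rate; log rates over consecutive periods add up."""
    return math.log(current / prior)

# Hypothetical metric growing 100 -> 110 -> 121 over two years.
print(f"simple two-year rate: {simple_rate(121, 100):.2%}")
print(f"CAGR:                 {cagr(121, 100, 2):.2%}")
print(f"sum of yearly logs:   {log_rate(110, 100) + log_rate(121, 110):.4f}")
print(f"two-year log rate:    {log_rate(121, 100):.4f}")
```

The last two lines print the same value, which illustrates why log rates are convenient when chaining periods.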
### Spotting Seasonal Cycles
1. Visually inspect the raw series first
2. Aggregate by day-of-week to surface weekly rhythms
3. Aggregate by calendar month to surface annual rhythms
4. Always use year-over-year or matched-period comparisons to separate trend from seasonality
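Step 2 can be sketched as follows, using only the standard library. The helper `mean_by_weekday` and the synthetic four weeks of data (weekend values set to half the weekday values) are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def mean_by_weekday(daily):
    """daily: iterable of (date, value) pairs. Returns {weekday: mean value}."""
    buckets = defaultdict(list)
    for day, value in daily:
        buckets[WEEKDAYS[day.weekday()]].append(value)
    return {name: mean(vals) for name, vals in buckets.items()}

# Hypothetical four weeks of daily data starting on a Monday.
start = date(2024, 1, 1)
daily = []
for offset in range(28):
    day = start + timedelta(days=offset)
    value = 50 if day.weekday() >= 5 else 100   # weekends run at half volume
    daily.append((day, value))

profile = mean_by_weekday(daily)
print(profile)
```

Grouping by `day.month` instead of `day.weekday()` gives the calendar-month view from step 3, provided the series spans more than one year.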
### Lightweight Forecasting Approaches
For analysts who need quick projections rather than full modeling:
- **Naive**: Forecast equals the most recent observation. Serves as the minimum-viable baseline.
- **Seasonal naive**: Forecast equals the value from the same period in the prior cycle.
- **Linear extrapolation**: Fit a straight line to recent history. Only appropriate when the trend is clearly linear.
- **Trailing average**: Use a rolling mean as the projected value.
**Always express forecasts as ranges, not point estimates:**
- Good: "Next month should bring 10,000 to 12,000 registrations based on the trailing quarter"
- Misleading: "Next month will yield exactly 11,234 registrations"
**Hand off to a specialist** when the pattern is non-linear, multiple seasonal cycles overlap, external drivers (ad spend, holidays) matter, or when forecast precision drives resource decisions.
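The lightweight approaches above can be sketched in a few lines. The function names, the sample history, and the "mean plus or minus two standard deviations" band are assumptions for illustration; the band is a rough range, not a formal prediction interval.

```python
from statistics import mean, stdev

def naive(history):
    """Forecast equals the most recent observation."""
    return history[-1]

def seasonal_naive(history, season_length):
    """Forecast equals the value one full cycle before the next period."""
    return history[-season_length]

def trailing_average_range(history, window, width=2.0):
    """Rough range: trailing mean +/- width * std of the same window."""
    recent = history[-window:]
    center = mean(recent)
    spread = stdev(recent)
    return center - width * spread, center + width * spread

# Hypothetical registrations with a repeating four-period cycle.
history = [100, 120, 110, 130, 105, 125, 115, 135]

low, high = trailing_average_range(history, window=4)
print(f"naive: {naive(history)}")
print(f"seasonal naive: {seasonal_naive(history, season_length=4)}")
print(f"trailing-average range: {low:.0f} to {high:.0f}")
```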
## Detecting and Handling Outliers
### Identification Techniques
**Z-score approach** (assumes approximate normality):
```python
z = (df['val'] - df['val'].mean()) / df['val'].std()
outliers = df[abs(z) > 3] # beyond 3 standard deviations
```
**IQR fence approach** (no normality assumption, though on strongly skewed data it can flag much of the long tail):
```python
q1 = df['val'].quantile(0.25)
q3 = df['val'].quantile(0.75)
iqr = q3 - q1
lo = q1 - 1.5 * iqr
hi = q3 + 1.5 * iqr
outliers = df[(df['val'] < lo) | (df['val'] > hi)]
```
**Percentile cutoff approach** (most straightforward, but it always flags roughly 2% of rows whether or not they are unusual):
```python
lo = df['val'].quantile(0.01)
hi = df['val'].quantile(0.99)
outliers = df[(df['val'] < lo) | (df['val'] > hi)]
```
### What to Do with Outliers
Never strip outliers automatically. Follow this decision process:
1. **Diagnose**: Is this a recording error, a legitimately extreme observation, or a sign of a separate population?
2. **Errors**: Correct or exclude (e.g., negative ages, epoch-zero timestamps)
3. **Legitimate extremes**: Retain but switch to robust summaries (median, IQR)
4. **Distinct populations**: Analyze separately (e.g., enterprise accounts vs. self-serve)
**Document every exclusion**: "We set aside 47 records (0.3% of the dataset) with order values above $50K; these bulk enterprise transactions are covered in a separate section."
### Detecting Anomalies in Time Series
1. Establish an expected baseline (rolling average or year-ago value)
2. Compute the residual: actual minus expected
3. Flag residuals exceeding 2-3 standard deviations of historical residuals
4. Differentiate one-off spikes (point anomalies) from lasting shifts (change points)
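Steps 1 to 3 might look like the sketch below, which uses a trailing mean as the baseline. The function name `flag_anomalies`, the window of 7, the threshold of 3, and the synthetic series are all assumptions. This simple version keeps every residual in the history, so a large spike widens the threshold for the points that follow it.

```python
from statistics import mean, stdev

def flag_anomalies(series, window=7, threshold=3.0):
    """Return indices whose residual exceeds threshold * std of earlier residuals."""
    residuals = []   # residuals seen so far (the historical residuals)
    flagged = []
    for i in range(window, len(series)):
        expected = mean(series[i - window:i])    # baseline from prior points only
        residual = series[i] - expected          # actual minus expected
        if len(residuals) >= 2:
            scale = stdev(residuals)
            if scale > 0 and abs(residual) > threshold * scale:
                flagged.append(i)
        residuals.append(residual)
    return flagged

# Hypothetical series alternating 100/102 with one injected spike at index 15.
series = [100, 102] * 10
series[15] = 150

flagged = flag_anomalies(series)
print(flagged)
```

Telling a point anomaly from a change point (step 4) needs a further check, such as whether the residuals stay on one side of zero for several periods afterward.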
## Hypothesis Testing Essentials
### When It Applies
Use formal testing whenever you need to distinguish a real signal from random noise:
- Evaluating A/B experiment results
- Measuring the impact of a product change (before vs. after)
- Comparing metrics across customer segments
### Step-by-Step Process
1. **State the null (H0)**: No difference exists (default position)
2. **State the alternative (H1)**: A difference exists
3. **Set the significance threshold (alpha)**: 0.05 is the common convention (a 5% chance of a false positive when H0 is true)
4. **Calculate the test statistic and p-value**
5. **Decide**: p < alpha means sufficient evidence to reject H0
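As an illustration of these five steps, here is a sketch of a two-sided z-test for two conversion rates using only the standard library. The function name and the counts (200 of 5,000 versus 260 of 5,000) are hypothetical; the normal approximation assumes reasonably large groups.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test of H0: both groups share the same conversion rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)             # rate under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se                                 # test statistic
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))         # two-sided p-value
    return z, p_value

alpha = 0.05                                             # step 3
# Hypothetical experiment: control 200/5000, variant 260/5000.
z, p_value = two_proportion_ztest(200, 5000, 260, 5000)  # step 4
reject_h0 = p_value < alpha                              # step 5

print(f"z={z:.2f}  p={p_value:.4f}  reject H0: {reject_h0}")
```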
### Selecting the Right Test
| Question | Appropriate Test | Conditions |
|---|---|---|
| Two group means differ? | Independent samples t-test | Roughly normal, two groups |
| Two conversion rates differ? | Proportions z-test | Binary outcomes |
| Same entities measured twice? | Paired t-test | Pre/post on identical subjects |
| Three or more group means? | ANOVA | Multiple variants or segments |
| Non-normal data, two groups? | Mann-Whitney U | Skewed or ordinal metrics |
| Two categorical variables related? | Chi-squared test | Frequency table data |
### Beyond p-values: Practical Impact
A statistically significant result means only that data this extreme would be unlikely if there were no true effect. It does not say the effect is large, and it does not guarantee the effect matters in practice. Always accompany test results with:
- **Effect magnitude**: "Variant B lifted conversion by 0.3 percentage points"
- **Confidence interval**: The plausible range of the true effect
- **Business translation**: Revenue, user, or efficiency implications
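A sketch of how the effect magnitude and its confidence interval could be reported for two conversion rates. The function name, the counts, and the "per 100,000 visitors" translation are illustrative assumptions; the interval uses the normal approximation with an unpooled standard error.

```python
from math import sqrt
from statistics import NormalDist

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, level=0.95):
    """Return (difference, lower, upper) for rate B minus rate A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for 95%
    diff = p_b - p_a
    return diff, diff - z_crit * se, diff + z_crit * se

# Hypothetical experiment: control 200/5000, variant 260/5000.
diff, low, high = diff_confidence_interval(200, 5000, 260, 5000)

print(f"lift: {diff * 100:.1f} points (95% CI {low * 100:.1f} to {high * 100:.1f})")
# Business translation under an assumed volume of 100,000 visitors.
print(f"extra conversions per 100k visitors: {low * 1e5:.0f} to {high * 1e5:.0f}")
```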
### Sample Size Awareness
- Small samples yield unreliable conclusions even when p-values look good
- Proportions require roughly 30 or more events per group for baseline reliability
- Detecting subtle effects (e.g., a 1-point conversion shift) can demand thousands of observations per arm
- When data is limited, say so: "With 200 observations per group, effects smaller than X% would likely go undetected"
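To make the "thousands of observations per arm" point concrete, the sketch below applies a standard normal-approximation formula for comparing two proportions. The function name, the 80% power default, and the 10% versus 11% rates are assumptions; dedicated power-analysis tools may give slightly different numbers.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(p_base, p_variant, alpha=0.05, power=0.80):
    """Approximate observations needed per group for a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)            # quantile for desired power
    variance = p_base * (1 - p_base) + p_variant * (1 - p_variant)
    effect = p_variant - p_base
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# Hypothetical: detect a shift from 10% to 11% conversion.
n = sample_size_per_arm(0.10, 0.11)
print(f"about {n:,} observations per arm")
```

Under these assumptions a one-point shift needs well over ten thousand observations per arm, while a five-point shift needs only a few hundred.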
## Guarding Against Statistical Pitfalls
### Correlation vs. Causation
Whenever a correlation surfaces, explicitly evaluate:
- **Reverse direction**: Perhaps B drives A rather than A driving B
- **Hidden third factor**: Some unmeasured variable C could be behind both
- **Coincidence**: Enough variable pairs will show spurious associations
**Safe phrasing**: "Users who adopt feature X exhibit 30% higher retention"
**Unsafe phrasing**: "Feature X causes 30% higher retention" (requires experimental evidence)
### The Multiple Testing Trap
Running many tests inflates false positives:
- At alpha = 0.05, testing 20 metrics with no real effects yields one spurious hit on average, and roughly a 64% chance of at least one (assuming independent tests)
- If you explored numerous segments before finding the "interesting" one, acknowledge that
- Apply Bonferroni correction (alpha / number of tests) or transparently report total tests conducted
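The arithmetic behind these bullets, sketched under the assumption that the tests are independent. The function names and the example p-values are hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_significant(p_values, alpha=0.05):
    """Compare each p-value against alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

fwer = familywise_error_rate(0.05, 20)
print(f"chance of at least one false positive in 20 tests: {fwer:.0%}")

# Hypothetical p-values from three segment comparisons.
decisions = bonferroni_significant([0.001, 0.02, 0.04])
print(decisions)
```

With three tests the corrected threshold is about 0.0167, so only the first p-value survives even though all three are below 0.05.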
### Simpson's Paradox
An overall trend can invert when you break the data into subgroups:
- Verify that aggregate conclusions hold within each key segment
- Classic scenario: total conversion rises while every segment's conversion falls, because traffic shifted toward a naturally higher-converting segment
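The classic scenario can be reproduced with a few made-up numbers. The segment names and counts below are hypothetical, chosen so that traffic shifts toward the higher-converting segment between periods.

```python
# (conversions, visitors) per segment; all figures are hypothetical.
before = {"returning": (200, 1000), "new": (450, 9000)}
after = {"returning": (900, 5000), "new": (200, 5000)}

def rate(conversions, visitors):
    return conversions / visitors

def overall(period):
    """Aggregate conversion rate across all segments in a period."""
    total_conv = sum(conv for conv, _ in period.values())
    total_vis = sum(vis for _, vis in period.values())
    return total_conv / total_vis

segment_fell = {name: rate(*after[name]) < rate(*before[name]) for name in before}
overall_rose = overall(after) > overall(before)

for name in before:
    print(f"{name}: {rate(*before[name]):.1%} -> {rate(*after[name]):.1%}")
print(f"overall: {overall(before):.1%} -> {overall(after):.1%}")
```

Both segments convert worse in the second period, yet the overall rate rises because returning visitors grew from 10% to 50% of traffic.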
### Survivorship Bias
Your dataset only contains entities that persisted long enough to be recorded:
- Studying current users ignores everyone who already left
- Profiling winning products overlooks the failures
- Routinely ask: "Who is absent from this data, and would including them change the conclusion?"
### Ecological Fallacy
Group-level patterns may not describe individuals:
- "Nations with higher X tend to have higher Y" does not mean the same holds per person
- Resist applying aggregate statistics to individual-level predictions
### Illusory Precision
Overly specific numbers suggest unjustified confidence:
- "Churn will be 4.73% next quarter" implies an accuracy that rarely exists
- Prefer honest ranges: "Churn is likely between 4% and 6%"
- Round to the level of certainty you actually possess