analysis-qa

$npx mdskill add vm0-ai/vm0-skills/analysis-qa

Validate data analysis for accuracy before stakeholder presentation.

  • Catches statistical errors, join flaws, and metric definition mistakes.
  • Relies on SQL execution, data schema access, and statistical libraries.
  • Compares computed results against expected values and domain rules.
  • Outputs a detailed checklist with flagged issues and documentation.
SKILL.md
.github/skills/analysis-qaView on GitHub ↗
---
name: analysis-qa
description: Quality-check a data analysis before sharing — verify joins, aggregations, denominators, time ranges, and metric definitions. Detect pitfalls like survivorship bias, average-of-averages, join explosion, timezone mismatches, incomplete periods, and selection bias. Includes documentation templates for reproducible analyses.
---

## Review Checklist

Work through every section below before presenting findings to stakeholders.

### Data Foundation

- [ ] **Correct sources**: Confirmed that the tables and datasets used are the appropriate ones for this question
- [ ] **Freshness**: Data recency is sufficient; the "data as of" date is noted
- [ ] **Coverage**: No unexpected time gaps or missing segments in the dataset
- [ ] **Null treatment**: Null rates in critical columns have been reviewed; nulls are excluded, filled, or explicitly flagged
- [ ] **Duplicate control**: Row counts confirm no double-counting from faulty joins or repeated source records
- [ ] **Filter accuracy**: Every WHERE clause and filter condition has been verified; nothing is accidentally excluded or included

### Computation Integrity

- [ ] **GROUP BY correctness**: All non-aggregated columns appear in GROUP BY; the aggregation grain matches the analytical question
- [ ] **Denominator validity**: Rates and percentages use the intended base population; division by zero is prevented
- [ ] **Temporal alignment**: Compared periods span equal durations; partial periods are either excluded or called out
- [ ] **Join behavior**: JOIN types are intentional (INNER vs. LEFT); many-to-many relationships have not silently inflated totals
- [ ] **Metric fidelity**: Calculated metrics align with how the business defines them; any deviations are documented
- [ ] **Additive consistency**: Sub-totals sum to the reported total where expected; non-additive cases (overlap, double-counting) are explained

### Plausibility Assessment

- [ ] **Order of magnitude**: Key figures fall within a believable range; revenue is non-negative; percentages stay within 0-100%
- [ ] **Trend coherence**: Time series show no unexplained jumps or drops
- [ ] **External agreement**: Headline numbers align with dashboards, finance reports, or earlier analyses
- [ ] **Ballpark math**: Total revenue roughly equals per-user revenue times user count, etc.
- [ ] **Boundary behavior**: Results make sense for edge cases — a single day, a single user, a single category

### Presentation Quality

- [ ] **Accurate visuals**: Bar charts begin at zero; axes have labels; scales are consistent across panels
- [ ] **Clean formatting**: Numbers use appropriate precision, consistent currency/percent notation, and thousands separators
- [ ] **Descriptive titles**: Headings convey the insight, not just the metric name; date ranges are included
- [ ] **Transparent caveats**: Limitations and assumptions are stated up front
- [ ] **Reproducibility**: Another analyst could recreate the work from the provided documentation

## Recognizing Common Mistakes

### Inflated Counts from Many-to-Many Joins

**What goes wrong**: Joining two tables with a many-to-many relationship silently multiplies rows, blowing up counts and sums.

**Detection method**:
```sql
-- Compare row counts before and after the join
SELECT COUNT(*) FROM orders;          -- 1,000
SELECT COUNT(*) FROM orders o
JOIN line_items li ON o.id = li.order_id;  -- 3,500 (unexpected inflation)
```

**Prevention**:
- Always compare pre-join and post-join row counts
- Verify the actual cardinality of the join relationship
- Use `COUNT(DISTINCT o.id)` to count entities accurately through multi-row joins

### Survivorship Bias

**What goes wrong**: The analysis only covers entities that still exist, ignoring those that were removed, churned, or failed.

**Typical scenarios**:
- Studying behavior of "active users" while ignoring everyone who left
- Benchmarking against "companies on our platform" while skipping those who evaluated and moved on
- Analyzing traits of "successful" cases without any "unsuccessful" comparison group

**Prevention**: Before drawing conclusions, ask: "Who is absent from this dataset, and would their presence change the story?"

### Partial Period Comparisons

**What goes wrong**: A month, week, or quarter that is still in progress gets compared to a completed one.

**Typical scenarios**:
- "January revenue is $500K vs. December's $800K" when January is only half over
- "Signups are down this week" when checked on Tuesday against a full prior week

**Prevention**: Restrict comparisons to completed periods, or normalize by matching the same number of elapsed days.

### Shifting Denominators

**What goes wrong**: The population used as a denominator changes between periods, making rate comparisons invalid.

**Typical scenarios**:
- Conversion rate appears to improve because the definition of "eligible visitor" was narrowed
- Churn rate shifts because "active user" was redefined mid-analysis

**Prevention**: Lock in consistent definitions across every period being compared. Flag any definition changes.

### Averaging Pre-Computed Averages

**What goes wrong**: Taking the mean of group-level averages ignores differences in group size, producing an incorrect overall figure.

**Illustration**:
- Segment A: 100 customers, $50 average order
- Segment B: 10 customers, $200 average order
- Incorrect overall average: ($50 + $200) / 2 = $125
- Correct weighted average: (100 * $50 + 10 * $200) / 110 = $63.64

**Prevention**: Always compute averages from individual records. Never take the mean of already-aggregated means.

### Timezone Inconsistencies

**What goes wrong**: Different source systems record timestamps in different zones, causing misaligned daily rollups and join mismatches.

**Typical scenarios**:
- Backend events logged in UTC while the reporting layer uses US Pacific
- Two tables that define "today" with different cutoff hours

**Prevention**: Convert all timestamps to a single reference zone (UTC is the safest default) before any analysis. State the timezone in the deliverable.

### Circular Segmentation

**What goes wrong**: Segments are defined using the very outcome being measured, creating tautological findings.

**Typical scenarios**:
- "Users who finished onboarding retain better" — finishing onboarding is itself a retention signal
- "Power users drive more revenue" — revenue generation is what made them power users

**Prevention**: Base segment definitions on characteristics measured before the outcome period, not on the outcome itself.

## Sanity-Checking Results

### Quick Magnitude Tests

| Metric Category | Validation Approach |
|---|---|
| User counts | Cross-reference against known DAU/MAU benchmarks |
| Revenue totals | Compare to known ARR or recent financial reports |
| Conversion rates | Must be 0-100%; compare to dashboard baselines |
| Growth rates | Is 50%+ month-over-month realistic, or does it signal a data problem? |
| Averages | Given the distribution, does this number feel right? |
| Segment shares | Do percentage breakdowns sum to approximately 100%? |

### Cross-Validation Approaches

1. **Dual calculation**: Derive the same metric via two independent query paths and confirm they match
2. **Record-level spot checks**: Select a handful of specific entities and manually trace their numbers end to end
3. **Benchmark comparison**: Verify against published dashboards, finance systems, or prior analysis outputs
4. **Arithmetic reversal**: If total revenue is X and there are N users, does X / N approximate the reported per-user figure?
5. **Micro-slice testing**: Filter to a single day, user, or category and confirm the micro-result is sensible

### Signals That Demand Investigation

- Any metric swinging more than 50% period-over-period without a clear explanation
- Totals or sums that land on suspiciously round numbers (possible filter or default-value artifact)
- Rates pegged at exactly 0% or 100% (may indicate missing data rather than perfect outcomes)
- Results that confirm the hypothesis too neatly (real data is almost always messy)
- Identical values appearing across different time periods or segments (suggests a dimension is being ignored)

## Ensuring Reproducibility

### Analysis Write-Up Template

Every substantial analysis should ship with this documentation:

```markdown

## Analysis: [Title]

### Business Question
[The precise question this work answers]

### Sources
- Table: [schema.table_name] (snapshot date: [date])
- Table: [schema.other_table] (snapshot date: [date])
- External file: [filename] (origin: [description])

### Metric and Segment Definitions
- [Metric A]: [Precise calculation formula]
- [Segment X]: [Exact inclusion/exclusion criteria]
- [Time window]: [Start] through [end], [timezone]

### Analytical Approach
1. [First step and its purpose]
2. [Second step]
3. [Third step]

### Assumptions and Known Limitations
- [Assumption and why it holds]
- [Limitation and its potential effect on conclusions]

### Results
1. [Finding with supporting evidence]
2. [Finding with supporting evidence]

### Queries
[All SQL and code used, annotated with comments]

### Warnings for the Reader
- [Anything the audience should weigh before acting on these results]
```

### Annotating Analytical Code

For SQL or Python that others may reuse:

```python
"""
Title: Monthly Cohort Retention
Author: [Name]
Created: [Date]
Sources: events, users
Last cross-checked: [Date] — matched dashboard within 2%

Objective:
    Build monthly retention cohorts anchored on each user's first event date.

Assumptions:
    - "Active" = at least one recorded event in the calendar month
    - Internal and test accounts excluded (user_type != 'internal')
    - All timestamps normalized to UTC

Output:
    Retention grid: rows are cohort months, columns are months since first event.
    Cell values are retention percentages (0-100).
"""
```

### Maintaining an Audit Trail

- Store all queries and scripts in version control or a shared knowledge base
- Record the exact data snapshot date used for each analysis run
- When refreshing a recurring analysis, document what changed and why
- Link current results to prior versions so trends in the analysis itself are traceable
More from vm0-ai/vm0-skills