data-profiling

$npx mdskill add vm0-ai/vm0-skills/data-profiling

Analyze unfamiliar datasets for structure, quality, and coverage.

  • Helps agents onboard new sources and audit data freshness.
  • Depends on database connectors to query schema and statistics.
  • Decides recommendations by comparing null rates and cardinality.
  • Delivers structured reports on columns, tables, and relationships.
SKILL.md
.github/skills/data-profilingView on GitHub ↗
---
name: data-profiling
description: Profile and explore unfamiliar datasets — assess schema structure, column distributions, data quality, null rates, cardinality, outliers, relationships between tables, and temporal coverage. Use when onboarding to a new data source, auditing data freshness, discovering foreign keys, or deciding what to analyze.
---

## Structural Reconnaissance

### Table-Level Inventory

Start every new dataset encounter by answering these questions:

- What is the total row and column count?
- What does each row represent (the grain)?
- Which column or columns uniquely identify a row?
- When was the most recent data loaded?
- What is the earliest date in the dataset?

### Classifying Columns by Role

Assign every column to one of these categories:

- **Key**: Primary keys, foreign keys, entity identifiers
- **Attribute**: Categorical fields used for grouping or filtering (region, status, plan_type)
- **Measure**: Numeric fields intended for aggregation (revenue, duration, score)
- **Timestamp**: Date or datetime fields (created_at, processed_on, event_date)
- **Free-text**: Unstructured strings (descriptions, comments, names)
- **Flag**: Binary true/false indicators
- **Nested**: JSON blobs, arrays, or embedded structures

## Column-Level Inspection

### Universal Checks (All Column Types)

- Null count and null percentage
- Count of distinct values and cardinality ratio (distinct / total rows)
- Top 5-10 most frequent values with their occurrence counts
- Bottom 5 least frequent values (useful for spotting anomalies)

### Numeric Measures

```
Minimum, maximum, mean, median
Standard deviation
Percentile ladder: p1, p5, p25, p75, p95, p99
Count of zeros
Count of negatives (flag if unexpected)
```

### Text and Categorical Fields

```
Shortest length, longest length, average length
Count of empty strings
Format regularity (do values follow a consistent pattern?)
Case consistency (uniform upper, uniform lower, or mixed?)
Count of values with leading or trailing whitespace
```

### Date and Timestamp Fields

```
Earliest date, latest date
Count of nulls
Count of future dates (flag if the domain forbids them)
Distribution across months or weeks
Gaps in expected daily/weekly cadence
```

### Boolean Fields

```
True count, false count, null count
True proportion
```

## Discovering Relationships

After examining columns individually, look for connections:

- **Foreign key candidates**: ID columns whose values likely reference another table
- **Hierarchical dimensions**: Columns that nest naturally (country > region > city)
- **Correlated measures**: Numeric columns that trend together
- **Computed columns**: Fields that appear derived from other columns in the same table
- **Duplicated information**: Columns that carry identical or near-identical content

## Evaluating Data Quality

### Completeness Ratings

Assign each column a tier:

- **Full** (>99% populated): No action needed
- **Mostly full** (95-99%): Investigate the missing values
- **Gaps present** (80-95%): Determine whether the gaps are systematic and whether the column is still usable
- **Sparse** (<80%): Likely unusable without imputation or supplemental data

### Consistency Checks

Scan for:
- **Value normalization failures**: The same concept spelled multiple ways ("USA", "US", "United States", "us")
- **Type mismatches**: Numerics stored as text, dates in inconsistent formats
- **Broken references**: Foreign key values with no matching parent record
- **Rule violations**: Negative quantities, end dates preceding start dates, percentages outside 0-100
- **Cross-column contradictions**: status = 'shipped' while ship_date is null

### Accuracy Warning Signs

Patterns that suggest the data may be unreliable:
- **Sentinel values**: 0, -1, 999999, "N/A", "TBD", "test", "xxx"
- **Suspiciously dominant defaults**: One value appearing far more often than expected
- **Stale records**: The updated_at column shows no recent activity in an active system
- **Physically impossible values**: Ages above 150, dates decades in the future, negative durations
- **Rounding artifacts**: Disproportionate clustering on multiples of 5 or 10 (implies estimation)

### Freshness Assessment

- What is the timestamp of the most recent row?
- How often should this table refresh?
- Is there measurable lag between event occurrence and warehouse arrival?
- Are there missing days or hours in the time series?

## Recognizing Patterns

### Distribution Shapes

When profiling a numeric column, classify its shape:
- **Bell-shaped**: Mean and median nearly equal; symmetric tails
- **Right-skewed**: Long right tail with a few very large values (typical for revenue, session length)
- **Left-skewed**: Long left tail with a few very small values (less common)
- **Bimodal**: Two distinct peaks (suggests two merged populations)
- **Power-law**: A handful of enormous values dominating; many near-zero values (user engagement metrics)
- **Uniform**: Roughly flat across the range (often synthetic or randomly generated)

### Time-Based Patterns

For any temporal data, investigate:
- **Trend**: Persistent upward or downward drift
- **Seasonality**: Recurring cycles — weekly, monthly, quarterly, annual
- **Weekday effects**: Systematic weekday vs. weekend differences
- **Holiday impacts**: Spikes or dips around known holidays
- **Level shifts**: Abrupt, sustained changes in the baseline
- **Isolated anomalies**: Single data points that break the prevailing pattern

### Segment Discovery

Surface natural groupings by:
- Identifying categorical columns with 3 to 20 distinct values
- Comparing metric distributions across each segment value
- Highlighting segments whose behavior diverges meaningfully from the overall
- Checking whether apparent segments contain meaningful sub-segments

### Correlation Scanning

Across numeric columns:
- Build a pairwise correlation matrix
- Highlight strong associations (|r| > 0.7)
- Remember that correlation never establishes causation — note this explicitly
- Probe for non-linear relationships (quadratic, logarithmic) that Pearson r would miss

## Documenting What You Find

### Dataset Summary Template

```markdown

## Table: [schema.table_name]

**Purpose**: [What this table captures]
**Grain**: [One row per...]
**Primary Key**: [column(s)]
**Approximate Rows**: [count, as of date]
**Refresh Cadence**: [real-time / hourly / daily / weekly]
**Responsible Team**: [owner]

### Important Columns

| Column | Type | Meaning | Sample Values | Notes |
|--------|------|---------|---------------|-------|
| user_id | STRING | Unique user handle | "usr_abc123" | References users.id |
| event_type | STRING | Action category | "click", "view", "purchase" | 15 distinct values |
| revenue | DECIMAL | USD transaction amount | 29.99, 149.00 | Null for non-purchases |
| created_at | TIMESTAMP | Event occurrence time | 2024-01-15 14:23:01 | Partition column |

### Join Paths
- Links to `users` via `user_id`
- Links to `products` via `product_id`
- Parent of `event_details` (one-to-many on event_id)

### Known Caveats
- [Document any quality issues]
- [Note analytical gotchas]

### Typical Query Use Cases
- [List common analytical patterns against this table]
```

### Schema Discovery Queries

When working directly against a warehouse, use these patterns:

```sql
-- Enumerate tables in a schema (PostgreSQL)
SELECT table_name, table_type
FROM information_schema.tables
WHERE table_schema = 'public'
ORDER BY table_name;

-- Inspect column metadata (PostgreSQL)
SELECT column_name, data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_name = 'target_table'
ORDER BY ordinal_position;

-- Rank tables by storage footprint (PostgreSQL)
SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;

-- Row count per table (general approach)
-- Execute individually: SELECT COUNT(*) FROM table_name
```

### Tracing Data Lineage

When navigating an unfamiliar warehouse:

1. Begin at the consumption layer — identify which tables power reports and dashboards
2. Follow dependencies upstream: what feeds those tables?
3. Map the raw / staging / mart architecture
4. Track where transformations enrich, filter, or roll up the data
5. Record any points where data is joined with external sources
More from vm0-ai/vm0-skills