monte-carlo-performance-diagnosis
$
npx mdskill add monte-carlo-data/mc-agent-toolkit/monte-carlo-performance-diagnosisThis skill helps diagnose data pipeline performance issues using Monte Carlo's cross-platform observability data. It works across Airflow, dbt, Databricks, and warehouse query engines to find bottlenecks, detect regressions, and identify root causes.
SKILL.md
.github/skills/monte-carlo-performance-diagnosisView on GitHub ↗
---
name: monte-carlo-performance-diagnosis
description: |
Diagnoses pipeline performance issues -- slow jobs, expensive queries,
latency trends -- using Monte Carlo's cross-platform observability.
Uses a tiered investigation approach: discover problems, bridge to
affected tables, then drill into root causes. Activates when a user
asks about slow pipelines, expensive queries, or performance regressions.
bucket: Optimize
version: 1.0.0
---
# Monte Carlo Performance Diagnosis Skill
This skill helps diagnose data pipeline performance issues using Monte Carlo's cross-platform observability data. It works across Airflow, dbt, Databricks, and warehouse query engines to find bottlenecks, detect regressions, and identify root causes.
Reference files live next to this skill file. **Use the Read tool** (not MCP resources) to access them:
- Tiered investigation approach: `references/investigation-tiers.md` (relative to this file)
- Query analysis patterns: `references/query-analysis.md` (relative to this file)
## When to activate this skill
Activate when the user:
- Asks about slow pipelines, jobs, or queries
- Wants to find expensive or costly queries
- Mentions performance regressions or degradation
- Asks "why is this pipeline slow?" or "what's using the most compute?"
- Wants to compare performance over time or find bottleneck tasks
- Asks about failed or futile query patterns
## When NOT to activate this skill
Do not activate when the user is:
- Investigating data quality issues (use the prevent skill)
- Looking at storage costs (use the storage-cost-analysis skill)
- Creating monitors (use the monitoring-advisor skill)
- Just querying data or exploring table contents
## Prerequisites
The following MCP tools must be available (connect to Monte Carlo's MCP server):
**Discovery tools (Tier 1):**
- `get_jobs_performance` -- find slow/failing jobs across Airflow, dbt, Databricks
- `get_top_slow_queries` -- find slowest query groups by total runtime
**Bridge tool:**
- `get_tables_for_job` -- convert job MCONs to table MCONs
**Diagnosis tools (Tier 2):**
- `get_tasks_performance` -- drill into a job's individual tasks
- `get_change_timeline` -- unified timeline of query changes, volume shifts, Airflow/dbt failures
- `get_query_rca` -- root cause analysis for failed/futile queries
- `get_query_latency_distribution` -- latency trend over time
- `get_asset_lineage` -- trace upstream/downstream impact
**Supporting tools:**
- `get_warehouses` -- list available warehouses
## Workflow
### Step 1: Identify the scope
Determine what the user wants to investigate:
- **Specific job/pipeline**: User mentions a job name or pipeline
- **Specific table**: User mentions a table that's slow to update
- **General discovery**: User wants to find what's slow
Call `get_warehouses` to list available warehouses. Match the user's context to a warehouse.
### Step 2: Tier 1 -- Discovery
If you don't have specific MCONs to investigate, start with discovery:
1. **Find slow jobs**: Call `get_jobs_performance` with optional `integration_type` filter (AIRFLOW, DATABRICKS, DBT) if the user specifies a platform.
- Results include: job name, average duration, trend (7-day), run count, failure rate
- Look for: high `avgDuration`, negative `runDurationTrend7d`, high failure rates
2. **Find expensive queries**: Call `get_top_slow_queries` with optional `warehouse_id` and `query_type` ("read" for SELECTs, "write" for INSERT/CREATE/MERGE).
- Results include: query hash, total runtime, average runtime, run count
- Look for: queries with high total runtime or high individual execution time
Present the top findings to the user before drilling deeper. A typical investigation needs only 3-7 tool calls.
**If both discovery tools return no results:** Tell the user no performance issues were found in the current time window. Suggest broadening the scope (different warehouse, longer time range, or a different platform filter).
### Step 3: Bridge -- Job to Tables
After Tier 1 identifies problematic jobs, convert to table MCONs:
Call `get_tables_for_job(job_mcon=..., integration_type=...)` using the `integration_type` from the job performance results.
This gives you the table MCONs needed for Tier 2 investigation.
### Step 4: Tier 2 -- Diagnosis
Now drill into root causes using the MCONs from discovery or the bridge:
1. **Task bottleneck**: Call `get_tasks_performance` to find which specific task in a job is the bottleneck.
2. **What changed?** Call `get_change_timeline` -- this is your most powerful tool. It returns a unified timeline of:
- Query text changes (schema modifications, new JOINs, filter changes)
- Volume shifts (row count spikes/drops)
- Airflow task failures
- dbt model failures
All in one call. Look for correlations: "query changed on day X, runtime doubled on day X+1."
3. **Why are queries failing?** Call `get_query_rca` to get root cause analysis:
- **Failed** queries: errors, timeouts, permission issues
- **Futile** queries: queries that run but produce no useful output
- Patterns are pre-computed -- the tool groups failures by cause
4. **Is latency degrading?** Call `get_query_latency_distribution` to see the trend:
- Compare p50 vs p95 -- if p95 >> p50 (>5x), the problem is outlier queries
- Look for step-changes in latency (sudden increase = regression)
- For step-change / regression-time-localization use cases, pass `bucket="1h"`. The default downsamples to daily on windows ≥ 3 days, which hides hour-level steps.
5. **Trace impact**: Call `get_asset_lineage` with `direction="DOWNSTREAM"` to see what's affected by a slow table, or `direction="UPSTREAM"` to find what feeds it.
### Step 5: Present findings
Structure your response as:
1. **Problem summary**: What's slow and by how much (with exact numbers from tools)
2. **Root cause**: What changed or what's causing the issue
3. **Impact**: What downstream systems are affected
4. **Recommendations**: Specific actions to fix the issue
### Important rules
- **Quote tool numbers exactly.** If a tool returns "1282 runs, avg 22.5s", say exactly that. Never round, estimate, or fabricate numbers.
- **Always compare to baselines.** Use 7-day trend data (`runDurationTrend7d`) to distinguish regressions from normal variance. Flag if trend data has less than 0.1 confidence.
- **Stop when you have a root cause.** 3-7 tool calls is typical. More than 10 means you're over-investigating.
- **Read vs write queries**: When the user asks about "reads" or "read queries", filter with `query_type="read"`. When they ask about "writes", use `query_type="write"`. Do NOT mix them.
- **Never expose MCONs, UUIDs, or internal identifiers** to the user. Use human-readable names.
- **Cross-platform**: This skill works across Airflow, dbt, and Databricks. Note which platform each finding comes from.
More from monte-carlo-data/mc-agent-toolkit
- automated-triageTriage Monte Carlo alerts interactively or build an automated workflow. Fetch, score, and troubleshoot alerts using MCP tools now, or design a reusable workflow that runs on a schedule.
- connection-auth-rulesBuild a Connection Auth Rules for a Monte Carlo connection type. Fetches live connector schemas and transform steps from the apollo-agent repo.
- generate-validation-notebookGenerate SQL validation notebooks for dbt changes. Pass a GitHub PR URL or local dbt repo path.
- monte-carlo-analyze-root-cause|
- monte-carlo-asset-healthCheck the health of a data table/asset using Monte Carlo. Activates on "how is table X", "check health of X", "is X healthy", "status of X", "check on X table", or any health/status question about a data asset.
- monte-carlo-context-detectionRoute data-related requests to the right Monte Carlo skill or workflow. USE WHEN alerts, incidents, data broken, stale, coverage gaps, data quality, or any ambiguous data observability request.
- monte-carlo-incident-responseOrchestrate incident response — triage, root cause, remediate, prevent recurrence. USE WHEN active alerts, data broken, stale, pipeline failure, or investigate and fix a data incident.
- monte-carlo-instrument-agentInstrument a new AI agent in a Python codebase for Monte Carlo Agent Observability. Detects AI libraries, installs the Monte Carlo OpenTelemetry SDK, and proposes tracing setup and decorator placements as diffs. Asks before editing any file.
- monte-carlo-manage-macCreate, edit, validate, and import Monitors-as-Code YAML files. CLI-first; falls back to MC MCP tools, then manual validation.
- monte-carlo-monitoring-advisorAnalyze data coverage, create monitors for warehouse tables and AI agents. Covers coverage gaps, use-case analysis, data monitor creation, and agent observability.