monte-carlo-incident-response
$
npx mdskill add monte-carlo-data/mc-agent-toolkit/monte-carlo-incident-responseExecute end-to-end incident response for data failures.
- Handles active alerts, broken tables, and pipeline failures.
- Depends on Monte Carlo skills for triage, root cause, and remediation.
- Decides workflow steps based on alert context and user intent.
- Delivers a sequenced investigation and fix plan to the user.
SKILL.md
.github/skills/monte-carlo-incident-responseView on GitHub ↗
---
name: monte-carlo-incident-response
description: Orchestrate incident response — triage, root cause, remediate, prevent recurrence. USE WHEN active alerts, data broken, stale, pipeline failure, or investigate and fix a data incident.
when_to_use: |
Invoke when the user has an active data incident to handle — alerts firing, a table looks stale or broken, a pipeline failed, or they want to investigate root cause on a named table.
Example triggers: "my orders table is stale, figure out why", "I have an unresolved alert on X, help me investigate", "alerts are firing — what should I do?", "investigate the most critical alert".
Covers the full workflow: triage (classify/prioritize alerts) → root cause analysis (lineage, freshness history, query changes) → remediation → prevent recurrence.
Do NOT invoke for coverage or "what should I monitor" requests (use proactive-monitoring instead) or for creating a specific monitor on a known table (use monitoring-advisor).
bucket: Agent-routing
version: 1.0.0
---
# Monte Carlo Incident Response Workflow
This workflow orchestrates the full lifecycle of a data incident by sequencing
existing Monte Carlo skills. It does not contain investigation or remediation
logic itself — each step loads the relevant skill's SKILL.md which has the
actual instructions.
## When to activate this workflow
Activate when:
- Context detection routes here (active alerts detected + incident intent)
- User invokes `/mc-incident-response`
- User asks to "respond to an incident", "handle this alert", "triage and fix"
- User describes a data quality problem: "data is broken", "table is stale", "alert firing"
## When NOT to activate this workflow
- User wants to create monitors or check coverage without an active incident — use proactive monitoring workflow
- User is editing a dbt model — defer to `prevent` skill (auto-activates via hooks)
- User wants to check table health without an incident context — use `asset-health` directly
- A skill is already active and handling the user's request
---
## Workflow Steps
```
Step 1 (conditional): Triage — when user has multiple/unknown alerts
Step 2: Root Cause Analysis — the core investigation
Step 3: Remediation — fix or escalate
Step 4 (optional): Prevent Recurrence — add monitoring
```
### Determine entry point
Before starting, determine which step to enter based on the user's context:
- **User has no specific alert** ("I have alerts firing", "what's going on?") → Start at **Step 1: Triage**
- **User has a specific alert ID or table** ("alert ABC-123", "stg_payments is stale") → Skip to **Step 2: Root Cause Analysis**
- **User knows the root cause** ("the ETL job failed, help me fix it") → Skip to **Step 3: Remediation**
- **Ambiguous** → Ask: "Do you have a specific alert or table you want to investigate, or should I check your recent alerts first?"
---
### Step 1: Triage (conditional)
**Skill:** Read and follow `../automated-triage/SKILL.md`
**Goal:** Fetch recent alerts, score them by confidence and impact, identify which ones need investigation.
**When to run:** Only when the user doesn't already have a specific alert or incident to investigate. This step helps narrow down "I have alerts" into "these specific alerts need attention."
**Scope MCP calls tightly.** On large accounts, broad queries return hundreds of results, overflow the tool-result token limit, spill to disk, and force chunk reads — burning user tokens and exhausting the turn budget. Minimum scoping for tools this workflow touches:
- `get_alerts` → time filter (`created_after`, default last 7 days) + at least one of `warehouse`, `table_names`, `severity`
- `search` → needed to resolve a table name to its MCON (`get_table` requires MCON). Always pass `limit` (e.g. 5), the table name as `query`, and filter by `warehouse_uuid` or `database`/`schema`. `warehouse_types` alone is too broad. If multiple matches return: (1) auto-pick the match whose `warehouse_display_name` matches the user's named warehouse — do NOT stop to ask; (2) failing that, prefer the `is_key_asset: true` match; (3) only ask the user when none of these resolve it
- `get_monitors` → filter by `mcons` or `warehouse_uuid`
If scope is missing, ask the user before calling: "Which warehouse?", "How far back — today, this week?", "Any specific severity?".
**Transition to Step 2:** Once high-priority alert(s) are identified, tell the user:
> "I've identified [N] high-priority alerts. Let me investigate the root cause of [specific alert/table]. Moving to root cause analysis."
Then proceed to Step 2 with the identified alert context.
---
### Step 2: Root Cause Analysis
**Skill:** Read and follow `../analyze-root-cause/SKILL.md`
**Goal:** Investigate why the issue occurred — trace lineage, check ETL changes, analyze query modifications, profile data.
**This is the core step.** Most workflow entries start here.
**Investigate linearly — do not re-call tools.** Walk through the investigation once: (1) find the table, (2) fetch its alerts and freshness, (3) check lineage, (4) check recent queries/ETL. Call each tool at most once per table. If a tool result is insufficient, move to the next signal rather than re-calling with different params — burning turns on redundant calls exhausts the budget before the root cause is reached.
**Transition to Step 3:** When the root cause is identified (or the investigation reaches its limit), summarize findings and tell the user:
> "Root cause identified: [summary]. Would you like me to help remediate this, or is the investigation sufficient?"
If the user wants to proceed, move to Step 3. If they say "that's enough", stop.
---
### Step 3: Remediation
**Skill:** Read and follow `../remediation/SKILL.md`
**Goal:** Fix the issue using available tools, or escalate with full context if the fix requires actions outside the agent's capability.
**Transition to Step 4:** After remediation is complete (fix applied or escalation documented), offer prevention:
> "The issue has been [fixed/escalated]. The root cause was [X]. Want me to help add a monitor to detect this type of issue earlier next time?"
If the user says yes, move to Step 4. If no, the workflow is complete.
---
### Step 4: Prevent Recurrence (optional)
**Skill:** Read and follow `../monitoring-advisor/SKILL.md`
When loading monitoring-advisor for this step, frame the request as direct monitor creation — not coverage analysis. The user already knows what they want to monitor (the thing that just broke). Example framing:
> "Based on the incident, I recommend adding a [freshness/volume/validation] monitor on [table]. Let me create the monitor configuration."
**Goal:** Add or update a monitor to catch this class of issue in the future.
**Do not force this step.** It is optional — offer it after remediation, and respect if the user declines.
---
## Orchestration Rules
- **Users can enter at any step.** The entry point section above determines where to start.
- **Each step loads the actual skill's SKILL.md** via relative path. This workflow does not replicate skill logic — it sequences it.
- **Context carries forward** through conversation naturally. Alert IDs, table names, root cause findings from earlier steps are available to later steps without explicit state passing.
- **No state tracking or hooks.** This is purely prompt-driven sequencing.
- **User can exit anytime.** If they say "that's enough" or "stop", respect it immediately.
- **Do not skip back.** The workflow moves forward. If the user wants to re-investigate after remediation, they can start a new workflow or invoke a skill directly.
More from monte-carlo-data/mc-agent-toolkit
- automated-triageTriage Monte Carlo alerts interactively or build an automated workflow. Fetch, score, and troubleshoot alerts using MCP tools now, or design a reusable workflow that runs on a schedule.
- connection-auth-rulesBuild a Connection Auth Rules for a Monte Carlo connection type. Fetches live connector schemas and transform steps from the apollo-agent repo.
- generate-validation-notebookGenerate SQL validation notebooks for dbt changes. Pass a GitHub PR URL or local dbt repo path.
- monte-carlo-analyze-root-cause|
- monte-carlo-asset-healthCheck the health of a data table/asset using Monte Carlo. Activates on "how is table X", "check health of X", "is X healthy", "status of X", "check on X table", or any health/status question about a data asset.
- monte-carlo-context-detectionRoute data-related requests to the right Monte Carlo skill or workflow. USE WHEN alerts, incidents, data broken, stale, coverage gaps, data quality, or any ambiguous data observability request.
- monte-carlo-instrument-agentInstrument a new AI agent in a Python codebase for Monte Carlo Agent Observability. Detects AI libraries, installs the Monte Carlo OpenTelemetry SDK, and proposes tracing setup and decorator placements as diffs. Asks before editing any file.
- monte-carlo-manage-macCreate, edit, validate, and import Monitors-as-Code YAML files. CLI-first; falls back to MC MCP tools, then manual validation.
- monte-carlo-monitoring-advisorAnalyze data coverage, create monitors for warehouse tables and AI agents. Covers coverage gaps, use-case analysis, data monitor creation, and agent observability.
- monte-carlo-performance-diagnosis|