monte-carlo-incident-response

$ npx mdskill add monte-carlo-data/mc-agent-toolkit/monte-carlo-incident-response

Execute end-to-end incident response for data failures.

  • Handles active alerts, broken tables, and pipeline failures.
  • Depends on Monte Carlo skills for triage, root cause, and remediation.
  • Decides workflow steps based on alert context and user intent.
  • Delivers a sequenced investigation and fix plan to the user.
SKILL.md
.github/skills/monte-carlo-incident-response
---
name: monte-carlo-incident-response
description: Orchestrate incident response — triage, root cause, remediate, prevent recurrence. USE WHEN active alerts, data broken, stale, pipeline failure, or investigate and fix a data incident.
when_to_use: |
  Invoke when the user has an active data incident to handle — alerts firing, a table looks stale or broken, a pipeline failed, or they want to investigate root cause on a named table.
  Example triggers: "my orders table is stale, figure out why", "I have an unresolved alert on X, help me investigate", "alerts are firing — what should I do?", "investigate the most critical alert".

  Covers the full workflow: triage (classify/prioritize alerts) → root cause analysis (lineage, freshness history, query changes) → remediation → prevent recurrence.

  Do NOT invoke for coverage or "what should I monitor" requests (use proactive-monitoring instead) or for creating a specific monitor on a known table (use monitoring-advisor).
bucket: Agent-routing
version: 1.0.0
---

# Monte Carlo Incident Response Workflow

This workflow orchestrates the full lifecycle of a data incident by sequencing
existing Monte Carlo skills. It does not contain investigation or remediation
logic itself; each step loads the relevant skill's SKILL.md, which holds the
actual instructions.

## When to activate this workflow

Activate when:

- Context detection routes here (active alerts detected + incident intent)
- User invokes `/mc-incident-response`
- User asks to "respond to an incident", "handle this alert", "triage and fix"
- User describes a data quality problem: "data is broken", "table is stale", "alert firing"

## When NOT to activate this workflow

- User wants to create monitors or check coverage without an active incident: use the proactive monitoring workflow
- User is editing a dbt model: defer to the `prevent` skill (auto-activates via hooks)
- User wants to check table health without an incident context: use `asset-health` directly
- A skill is already active and handling the user's request

---

## Workflow Steps

```
Step 1 (conditional): Triage — when user has multiple/unknown alerts
Step 2: Root Cause Analysis — the core investigation
Step 3: Remediation — fix or escalate
Step 4 (optional): Prevent Recurrence — add monitoring
```

### Determine entry point

Before starting, determine which step to enter based on the user's context:

- **User has no specific alert** ("I have alerts firing", "what's going on?") → Start at **Step 1: Triage**
- **User has a specific alert ID or table** ("alert ABC-123", "stg_payments is stale") → Skip to **Step 2: Root Cause Analysis**
- **User knows the root cause** ("the ETL job failed, help me fix it") → Skip to **Step 3: Remediation**
- **Ambiguous** → Ask: "Do you have a specific alert or table you want to investigate, or should I check your recent alerts first?"
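
The entry-point rules above can be read as a small decision function. The following is a minimal sketch for illustration only: the function name and the three boolean flags are hypothetical, and it assumes that a known root cause takes priority over a named alert or table. In practice the agent infers these signals from the conversation rather than from explicit flags.

```python
from enum import Enum
from typing import Optional


class Step(Enum):
    TRIAGE = 1
    ROOT_CAUSE = 2
    REMEDIATION = 3


def choose_entry_step(
    has_specific_target: bool,
    knows_root_cause: bool,
    mentions_alerts: bool,
) -> Optional[Step]:
    """Return the step to enter, or None when the request is ambiguous.

    All three flags are hypothetical inputs standing in for what the
    agent infers from the user's message.
    """
    if knows_root_cause:
        # "the ETL job failed, help me fix it"
        return Step.REMEDIATION
    if has_specific_target:
        # "alert ABC-123", "stg_payments is stale"
        return Step.ROOT_CAUSE
    if mentions_alerts:
        # "I have alerts firing", "what's going on?"
        return Step.TRIAGE
    # Ambiguous: ask the clarifying question instead of guessing.
    return None
```

A `None` result corresponds to the **Ambiguous** case, where the workflow asks the user before entering any step.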

---

### Step 1: Triage (conditional)

**Skill:** Read and follow `../automated-triage/SKILL.md`

**Goal:** Fetch recent alerts, score them by confidence and impact, and identify which ones need investigation.

**When to run:** Only when the user doesn't already have a specific alert or incident to investigate. This step helps narrow down "I have alerts" into "these specific alerts need attention."

**Scope MCP calls tightly.** On large accounts, broad queries return hundreds of results, overflow the tool-result token limit, spill to disk, and force chunk reads — burning user tokens and exhausting the turn budget. Minimum scoping for tools this workflow touches:

- `get_alerts` → time filter (`created_after`, default last 7 days) + at least one of `warehouse`, `table_names`, `severity`
- `search` → needed to resolve a table name to its MCON (`get_table` requires MCON). Always pass `limit` (e.g. 5), the table name as `query`, and a filter on `warehouse_uuid` or `database`/`schema`; `warehouse_types` alone is too broad. If multiple matches are returned:
  1. Auto-pick the match whose `warehouse_display_name` matches the user's named warehouse; do NOT stop to ask.
  2. Failing that, prefer the `is_key_asset: true` match.
  3. Ask the user only when neither rule resolves it.
- `get_monitors` → filter by `mcons` or `warehouse_uuid`
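
The disambiguation order for multiple `search` matches can be sketched as follows. This is an illustrative sketch, not the toolkit's implementation: `pick_match` is a hypothetical helper, the `mcon` key and the dict shape of each match are assumptions, and it assumes a rule only resolves the choice when it selects exactly one match. The field names `warehouse_display_name` and `is_key_asset` come from the rules above.

```python
from typing import Optional


def pick_match(matches: list, user_warehouse: Optional[str]) -> Optional[dict]:
    """Pick one search result, or return None when the user must be asked.

    Assumption: each match is a dict that may carry the keys
    `warehouse_display_name` and `is_key_asset`.
    """
    if len(matches) == 1:
        return matches[0]

    # Rule 1: match on the warehouse the user named (case-insensitive
    # comparison is an assumption made for this sketch).
    if user_warehouse:
        wanted = user_warehouse.lower()
        by_warehouse = [
            m for m in matches
            if (m.get("warehouse_display_name") or "").lower() == wanted
        ]
        if len(by_warehouse) == 1:
            return by_warehouse[0]

    # Rule 2: prefer the key asset.
    key_assets = [m for m in matches if m.get("is_key_asset") is True]
    if len(key_assets) == 1:
        return key_assets[0]

    # Rule 3: nothing resolved it, so the caller should ask the user.
    return None
```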

If scope is missing, ask the user before calling: "Which warehouse?", "How far back — today, this week?", "Any specific severity?".
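
As an illustration of the minimum scoping for `get_alerts`, the sketch below checks a parameter dict before a call is made. It is written under assumptions: `scope_get_alerts` is a hypothetical helper, the ISO 8601 format for `created_after` is assumed rather than taken from the tool's schema, and the `severity` value in the usage note is a placeholder. Only the parameter names come from the list above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# At least one of these must be present alongside the time filter.
NARROWING_FILTERS = ("warehouse", "table_names", "severity")


def scope_get_alerts(params: dict, now: Optional[datetime] = None) -> dict:
    """Return a copy of params that satisfies the minimum scoping rules.

    Raises ValueError when no narrowing filter is present, which is the
    point where the agent should ask the user instead of calling the tool.
    """
    if not any(params.get(name) for name in NARROWING_FILTERS):
        raise ValueError(
            "get_alerts needs at least one of: " + ", ".join(NARROWING_FILTERS)
        )

    scoped = dict(params)
    if not scoped.get("created_after"):
        # Default to the last 7 days; the timestamp format is an assumption.
        now = now or datetime.now(timezone.utc)
        scoped["created_after"] = (now - timedelta(days=7)).isoformat()
    return scoped
```

For example, `scope_get_alerts({"severity": "high"})` would add a seven-day `created_after` default, while `scope_get_alerts({})` raises and signals that a question such as "Which warehouse?" should come first.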

**Transition to Step 2:** Once high-priority alert(s) are identified, tell the user:

> "I've identified [N] high-priority alerts. Let me investigate the root cause of [specific alert/table]. Moving to root cause analysis."

Then proceed to Step 2 with the identified alert context.

---

### Step 2: Root Cause Analysis

**Skill:** Read and follow `../analyze-root-cause/SKILL.md`

**Goal:** Investigate why the issue occurred — trace lineage, check ETL changes, analyze query modifications, profile data.

**This is the core step.** Most workflow entries start here.

**Investigate linearly — do not re-call tools.** Walk through the investigation once: (1) find the table, (2) fetch its alerts and freshness, (3) check lineage, (4) check recent queries/ETL. Call each tool at most once per table. If a tool result is insufficient, move to the next signal rather than re-calling with different params — burning turns on redundant calls exhausts the budget before the root cause is reached.
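
The "at most once per table" rule can be pictured as a ledger keyed by tool name and table. This is a sketch of the rule only; the workflow itself is prompt-driven and keeps no such state, and `CallOnceLedger` is a hypothetical name introduced for illustration.

```python
from typing import Any, Callable


class CallOnceLedger:
    """Refuses a second call to the same tool for the same table."""

    def __init__(self) -> None:
        self._seen = set()

    def call(self, tool_name: str, table: str, tool: Callable[[], Any]) -> Any:
        key = (tool_name, table)
        if key in self._seen:
            # A repeat call is refused so the investigation moves on
            # to the next signal instead of retrying with new params.
            raise RuntimeError(
                f"{tool_name} was already called for {table}; move on"
            )
        self._seen.add(key)
        return tool()
```

The same tool may still be called for a different table, since the limit applies per table rather than per investigation.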

**Transition to Step 3:** When the root cause is identified (or the investigation reaches its limit), summarize findings and tell the user:

> "Root cause identified: [summary]. Would you like me to help remediate this, or is the investigation sufficient?"

If the user wants to proceed, move to Step 3. If they say "that's enough", stop.

---

### Step 3: Remediation

**Skill:** Read and follow `../remediation/SKILL.md`

**Goal:** Fix the issue using available tools, or escalate with full context if the fix requires actions outside the agent's capability.

**Transition to Step 4:** After remediation is complete (fix applied or escalation documented), offer prevention:

> "The issue has been [fixed/escalated]. The root cause was [X]. Want me to help add a monitor to detect this type of issue earlier next time?"

If the user says yes, move to Step 4. If no, the workflow is complete.

---

### Step 4: Prevent Recurrence (optional)

**Skill:** Read and follow `../monitoring-advisor/SKILL.md`

When loading monitoring-advisor for this step, frame the request as direct monitor creation — not coverage analysis. The user already knows what they want to monitor (the thing that just broke). Example framing:

> "Based on the incident, I recommend adding a [freshness/volume/validation] monitor on [table]. Let me create the monitor configuration."

**Goal:** Add or update a monitor to catch this class of issue in the future.

**Do not force this step.** It is optional: offer it after remediation, and respect the user's choice if they decline.

---

## Orchestration Rules

- **Users can enter at any step.** The entry point section above determines where to start.
- **Each step loads the actual skill's SKILL.md** via relative path. This workflow does not replicate skill logic — it sequences it.
- **Context carries forward** naturally through the conversation. Alert IDs, table names, and root cause findings from earlier steps are available to later steps without explicit state passing.
- **No state tracking or hooks.** This is purely prompt-driven sequencing.
- **User can exit anytime.** If they say "that's enough" or "stop", respect it immediately.
- **Do not skip back.** The workflow moves forward. If the user wants to re-investigate after remediation, they can start a new workflow or invoke a skill directly.
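
The forward-only rule and the exit rule can be summarized in a few lines. This is an illustrative sketch only: the workflow tracks no state and is sequenced purely by prompt, and the step identifiers and function names below are hypothetical.

```python
from typing import Optional

# Step order as listed under "Workflow Steps".
STEPS = ("triage", "root_cause", "remediation", "prevent_recurrence")


def next_step(current: str, user_wants_to_continue: bool) -> Optional[str]:
    """Return the following step, or None when the workflow ends."""
    if not user_wants_to_continue:
        # "that's enough" or "stop" ends the workflow immediately.
        return None
    position = STEPS.index(current)
    if position + 1 < len(STEPS):
        return STEPS[position + 1]
    return None  # Step 4 is the last step


def is_allowed_move(current: str, target: str) -> bool:
    """Only forward moves are allowed; going back means a new workflow."""
    return STEPS.index(target) > STEPS.index(current)
```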