runbook-incident-response
$ npx mdskill add elophanto/EloPhanto/runbook-incident-response

Execute severity-based incident response from detection to post-mortem.
- Resolve production outages, security breaches, and data loss incidents.
- Integrates with organization_spawn for infrastructure and devops automation.
- Classifies urgency using P0 through P3 severity definitions and examples.
- Delivers structured recovery plans and post-mortem reports to teams.
SKILL.md
.github/skills/runbook-incident-response
---
name: runbook-incident-response
description: Incident response runbook - detection through post-mortem for production issues with severity-based response teams. Adapted from msitarzewski/agency-agents.
---

## Triggers

- incident response
- production down
- service outage
- system failure
- security breach
- data loss
- performance degradation
- error rate spike
- P0 incident
- P1 incident
- rollback needed
- hotfix deployment
- post-mortem
- root cause analysis
- incident triage
- on-call alert
- system recovery

## Instructions

Something is broken in production. Users are affected. Speed of response matters, but so does doing it right. This runbook covers detection through post-mortem. Duration: minutes to hours.

### Severity Classification

| Level | Definition | Examples | Response Time |
|-------|-----------|----------|--------------|
| P0 Critical | Service completely down, data loss, security breach | Database corruption, DDoS, auth failure | Immediate |
| P1 High | Major feature broken, significant degradation | Payment down, 50%+ error rate, 10x latency | < 1 hour |
| P2 Medium | Minor feature broken, workaround available | Search not working, non-critical API errors | < 4 hours |
| P3 Low | Cosmetic issue, minor inconvenience | Styling bug, typo, minor UI glitch | Next sprint |

### Response Teams by Severity

**P0 Critical** (use `organization_spawn` with all):
- Infrastructure Maintainer: Incident commander (assess scope, coordinate)
- DevOps Automator: Deployment/rollback execution
- Backend Architect: Root cause investigation (system)
- Frontend Developer: Client-side investigation
- Support Responder: Status page updates, user notifications
- Executive Summary Generator: Real-time executive updates

**P1 High**:
- Infrastructure Maintainer: Incident commander
- DevOps Automator: Deployment support
- Relevant Developer Agent: Fix implementation
- Support Responder: User communication

**P2 Medium**:
- Relevant Developer Agent: Fix implementation
- Evidence Collector: Verify fix

**P3 Low**:
- Sprint Prioritizer: Add to backlog

### Step 1: Detection & Triage (0-5 minutes)

Trigger: Alert from monitoring / User report / Agent detection

Infrastructure Maintainer:
1. Acknowledge alert
2. Assess scope and impact (users affected, services impacted, data at risk?)
3. Classify severity (P0/P1/P2/P3)
4. Use `organization_spawn` to activate appropriate response team
5. Create incident channel/thread

### Step 2: Investigation (5-30 minutes)

Parallel investigation via `organization_delegate`:
- Infrastructure Maintainer: Check system metrics (CPU, memory, network, disk), review error logs, check recent deployments, verify external dependencies
- Backend Architect (P0/P1): Check database health, review API error rates, check service communication, identify failing component
- DevOps Automator: Review deployment history, check CI/CD status, prepare rollback, verify infrastructure state

Output: Root cause identified or narrowed to component.

### Step 3: Mitigation (15-60 minutes)

Decision tree:
- **Caused by recent deployment**: DevOps Automator executes rollback, Infrastructure Maintainer verifies recovery
- **Caused by infrastructure issue**: Infrastructure Maintainer scales, restarts, or fails over, then verifies recovery
- **Caused by code bug**: Developer implements hotfix, Evidence Collector verifies, DevOps Automator deploys hotfix
- **Caused by external dependency**: Infrastructure Maintainer activates fallback/cache, Support Responder communicates to users

Throughout:
- Support Responder: Update status page every 15 minutes
- Executive Summary Generator: Brief stakeholders (P0 only)

### Step 4: Resolution Verification (Post-fix)

- Evidence Collector: Verify fix resolves issue, screenshot evidence, confirm no new issues
- Infrastructure Maintainer: Verify metrics returning to normal, confirm no cascading failures, monitor 30 minutes post-fix
- API Tester (if API-related): Run regression on affected endpoints, verify response times, confirm error rates at baseline

### Step 5: Post-Mortem (Within 48 hours)

Use `organization_delegate` to Workflow Optimizer:
1. Timeline reconstruction (when introduced, detected, resolved, total impact duration)
2. Root cause analysis (what failed, why, why not caught earlier, 5 Whys)
3. Impact assessment (users affected, revenue impact, reputation, data impact)
4. Prevention measures (monitoring improvements, testing improvements, process changes, infrastructure changes)
5. Action items with owners and deadlines

Use `knowledge_write` to persist post-mortem report. Sprint Prioritizer adds prevention tasks to backlog.

### Communication Templates

**Status Page Update**:
```
[TIMESTAMP] - [SERVICE NAME] Incident
Status: [Investigating / Identified / Monitoring / Resolved]
Impact: [Description of user impact]
Current action: [What we're doing]
Next update: [When to expect next update]
```

**Executive Update (P0 only)**:
```
INCIDENT BRIEF - [TIMESTAMP]
SITUATION: [Service] is [down/degraded] affecting [N users/% of traffic]
CAUSE: [Known/Under investigation] - [Brief description if known]
ACTION: [What's being done] - ETA [time estimate]
IMPACT: [Business impact: revenue, users, reputation]
NEXT UPDATE: [Timestamp]
```

### Escalation Matrix

| Condition | Escalate To |
|-----------|------------|
| P0 not resolved in 30 min | Studio Producer (additional resources, vendor escalation) |
| P1 not resolved in 2 hours | Project Shepherd (resource reallocation) |
| Data breach suspected | Legal Compliance Checker (regulatory notification) |
| User data affected | Legal Compliance Checker + Executive Summary Generator (GDPR/CCPA) |
| Revenue impact > threshold | Finance Tracker + Studio Producer (business impact assessment) |

## Deliverables

- [ ] Incident classified with severity level
- [ ] Response team activated within SLA
- [ ] Root cause identified
- [ ] Fix implemented and verified
- [ ] Status page updated throughout
- [ ] Stakeholders briefed (P0/P1)
- [ ] Post-mortem completed within 48 hours
- [ ] Prevention action items in backlog

## Success Metrics

- P0 detection to resolution: < 30 minutes
- P1 detection to resolution: < 2 hours
- P2 detection to resolution: < 4 hours
- Post-mortem completion rate: 100% for P0/P1
- Repeat incident rate: < 5%
- Status page update frequency during incident: every 15 minutes
- Mean time to detect (MTTD): < 5 minutes
- Mean time to resolve (MTTR): < 30 minutes

## Verify

- The deploy command was actually run and the build/log output (or deploy URL) is captured
- The deployed URL was opened and returned a 2xx; key routes were sampled, not just the index
- Environment variables required by the app are present in the target environment; missing-var failures were ruled out
- A rollback plan (previous deployment ID, git SHA, or one-line revert command) is documented before promoting to production
- Health/observability check (logs, error tracker, status page) was inspected post-deploy; baseline error rate is recorded
- DNS / domain / SSL configuration was confirmed, not assumed to carry over from previous deploys
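
The "Response Teams by Severity" lists and the "Escalation Matrix" above are essentially lookup tables, so they can be sketched as plain data plus one small function. The following is a minimal illustration in Python, not part of the skill itself: the names `Severity`, `RESPONSE_TEAMS`, `Incident`, and `escalation_targets` are hypothetical, the boolean flags are assumed inputs that a human or agent would set during triage, and nothing here calls `organization_spawn` or `organization_delegate`, whose signatures the runbook does not specify.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    """Severity levels from the Severity Classification table."""
    P0 = "Critical"
    P1 = "High"
    P2 = "Medium"
    P3 = "Low"


# Roles copied from "Response Teams by Severity"; the dict layout is an assumption.
RESPONSE_TEAMS = {
    Severity.P0: [
        "Infrastructure Maintainer",
        "DevOps Automator",
        "Backend Architect",
        "Frontend Developer",
        "Support Responder",
        "Executive Summary Generator",
    ],
    Severity.P1: [
        "Infrastructure Maintainer",
        "DevOps Automator",
        "Relevant Developer Agent",
        "Support Responder",
    ],
    Severity.P2: ["Relevant Developer Agent", "Evidence Collector"],
    Severity.P3: ["Sprint Prioritizer"],
}


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    severity: Severity
    minutes_open: int
    data_breach_suspected: bool = False
    user_data_affected: bool = False
    revenue_over_threshold: bool = False  # the threshold itself is not defined in the runbook


def escalation_targets(incident: Incident) -> list[str]:
    """Return the roles to escalate to, following the Escalation Matrix rows in order."""
    targets: list[str] = []
    if incident.severity is Severity.P0 and incident.minutes_open >= 30:
        targets.append("Studio Producer")
    if incident.severity is Severity.P1 and incident.minutes_open >= 120:
        targets.append("Project Shepherd")
    if incident.data_breach_suspected:
        targets.append("Legal Compliance Checker")
    if incident.user_data_affected:
        targets += ["Legal Compliance Checker", "Executive Summary Generator"]
    if incident.revenue_over_threshold:
        targets += ["Finance Tracker", "Studio Producer"]
    # Remove duplicates while keeping first-seen order.
    return list(dict.fromkeys(targets))


if __name__ == "__main__":
    incident = Incident(Severity.P0, minutes_open=45, user_data_affected=True)
    print(RESPONSE_TEAMS[incident.severity])
    print(escalation_targets(incident))
```

Keeping the matrix as ordered checks rather than a single lookup means one incident can match several rows at once (for example, a long-running P0 that also touches user data), which is how the table reads.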