incident-runbook-templates

Name: incident-runbook-templates
Author: wshobson/agents

$npx mdskill add wshobson/agents/incident-runbook-templates

Generates structured incident response runbooks with procedures, escalation paths, and recovery actions

Solves the problem of inconsistent or incomplete incident response documentation
Uses templates for service outages, database issues, and escalation matrices
Applies severity-based logic to prioritize actions and response times
Delivers clear, step-by-step guides for engineers during critical incidents

SKILL.md

.github/skills/incident-runbook-templatesView on GitHub ↗

---
name: incident-runbook-templates
description: Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use this skill when building a service outage runbook for a payment processing system; creating database incident procedures covering connection pool exhaustion, replication lag, and disk space alerts; onboarding new on-call engineers who need step-by-step recovery guides written for a 3 AM brain; or standardizing escalation matrices across multiple engineering teams.
---

# Incident Runbook Templates

Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.

## When to Use This Skill

- Creating incident response procedures
- Building service-specific runbooks
- Establishing escalation paths
- Documenting recovery procedures
- Responding to active incidents
- Onboarding on-call engineers

## Core Concepts

### 1. Incident Severity Levels

| Severity | Impact                     | Response Time     | Example                 |
| -------- | -------------------------- | ----------------- | ----------------------- |
| **SEV1** | Complete outage, data loss | 15 min            | Production down         |
| **SEV2** | Major degradation          | 30 min            | Critical feature broken |
| **SEV3** | Minor impact               | 2 hours           | Non-critical bug        |
| **SEV4** | Minimal impact             | Next business day | Cosmetic issue          |

### 2. Runbook Structure

```
1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix
```

## Detailed patterns and worked examples

Detailed pattern documentation lives in `references/details.md`. Read that file when the navigation tier above is insufficient.

## Best Practices

### Do's
- **Keep runbooks updated** - Review after every incident
- **Test runbooks regularly** - Game days, chaos engineering
- **Include rollback steps** - Always have an escape hatch
- **Document assumptions** - What must be true for steps to work
- **Link to dashboards** - Quick access during stress

### Don'ts
- **Don't assume knowledge** - Write for 3 AM brain
- **Don't skip verification** - Confirm each step worked
- **Don't forget communication** - Keep stakeholders informed
- **Don't work alone** - Escalate early
- **Don't skip postmortems** - Learn from every incident

## Troubleshooting

### Runbook steps work in staging but fail during a real incident

Steps often assume preconditions that are true in a healthy environment but not during an outage. For each command in your runbook, add a prerequisite check and a "what to do if this command fails" note:

```bash
# Step: Check pod status
kubectl get pods -n payments

# Prerequisites: kubectl configured, kubeconfig points to correct cluster
# If this fails: run `aws eks update-kubeconfig --name prod-cluster --region us-east-1`
# Expected output: pods in Running state
```

### On-call engineer panics and skips steps out of order

Add a numbered checklist at the top of the runbook that mirrors the section numbers, so responders can track progress under stress without reading the full document:

```markdown
## Quick Checklist
- [ ] 1. Declare incident severity and open war room
- [ ] 2. Check service health (Section 4.1)
- [ ] 3. Check recent deployments (Section 4.1)
- [ ] 4. Roll back if deploy is suspect (Section 4.1)
- [ ] 5. Post initial notification to #payments-incidents
- [ ] 6. Escalate if > 15 min unresolved
```

### Runbook is outdated — commands reference old cluster names or endpoints

Runbooks rot because they're updated manually. Include a "Last Verified" date and owner at the top, and add a CI check that validates all `curl` endpoints and `kubectl` context names are still valid:

```markdown
## Runbook Metadata
| Field | Value |
|---|---|
| Last verified | 2024-11-15 |
| Owner | @platform-team |
| Review cadence | After every SEV1/SEV2 |
```

### Stakeholder communication is delayed while engineers are heads-down

Assign a dedicated incident communicator role (separate from the incident commander) whose only job is to post status updates. Add a standing agenda in the communication template:

```
Update every 15 minutes (even if no new information):
- Current status (Investigating / Mitigating / Monitoring)
- Impact (what is broken, who is affected, % of traffic)
- What we are doing right now
- Next update in: 15 minutes
```

### Database runbook commands cause additional downtime when run incorrectly

Add explicit warnings before destructive SQL commands and require a dry-run output check before executing:

```sql
-- WARNING: This terminates active connections. Verify count first.
-- DRY RUN (check count before terminating):
SELECT count(*) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '10 minutes';

-- EXECUTE only after verifying count is reasonable (< 50):
SELECT pg_terminate_backend(pid) FROM pg_stat_activity
WHERE state = 'idle' AND query_start < now() - interval '10 minutes';
```

## Related Skills

- `postmortem-writing` - After resolving an incident, use postmortem templates to capture root cause and preventive actions
- `on-call-handoff-patterns` - Structure shift handoffs so the incoming responder has full context on active incidents

More from wshobson/agents

Skill	Description
accessibility-compliance	Implement WCAG 2.2 compliant interfaces with mobile accessibility, inclusive design patterns, and assistive technology support. Use when auditing accessibility, implementing ARIA patterns, building for screen readers, or ensuring inclusive user experiences.
airflow-dag-patterns	Build production Apache Airflow DAGs with best practices for operators, sensors, testing, and deployment. Use when creating data pipelines, orchestrating workflows, or scheduling batch jobs.
angular-migration	Migrate from AngularJS to Angular using hybrid mode, incremental component rewriting, and dependency injection updates. Use when upgrading AngularJS applications, planning framework migrations, or modernizing legacy Angular code.
anti-reversing-techniques	Understand anti-reversing, obfuscation, and protection techniques encountered during software analysis. Use this skill when analyzing malware evasion techniques, when implementing anti-debugging protections for CTF challenges, when reverse engineering packed binaries, or when building security research tools that need to detect virtualized environments.
api-design-principles	Master REST and GraphQL API design principles to build intuitive, scalable, and maintainable APIs that delight developers. Use when designing new APIs, reviewing API specifications, or establishing API design standards.
architecture-decision-records	Write and maintain Architecture Decision Records (ADRs) following best practices for technical decision documentation. Use when documenting significant technical decisions, reviewing past architectural choices, or establishing decision processes.
architecture-patterns	Implement proven backend architecture patterns including Clean Architecture, Hexagonal Architecture, and Domain-Driven Design. Use this skill when designing clean architecture for a new microservice, when refactoring a monolith to use bounded contexts, when implementing hexagonal or onion architecture patterns, or when debugging dependency cycles between application layers.
async-python-patterns	Master Python asyncio, concurrent programming, and async/await patterns for high-performance applications. Use when building async APIs, concurrent systems, or I/O-bound applications requiring non-blocking operations.
attack-tree-construction	Build comprehensive attack trees to visualize threat paths. Use when mapping attack scenarios, identifying defense gaps, or communicating security risks to stakeholders.
auth-implementation-patterns	Master authentication and authorization patterns including JWT, OAuth2, session management, and RBAC to build secure, scalable access control systems. Use when implementing auth systems, securing APIs, or debugging security issues.