postmortem-writing
$
npx mdskill add wshobson/agents/postmortem-writingWrite blameless postmortems with root cause analysis and action items
- Solves the problem of documenting incidents without assigning blame
- Uses structured templates and root cause analysis frameworks
- Analyzes timelines, contributing factors, and system weaknesses
- Delivers clear, actionable reports to improve incident response
SKILL.md
.github/skills/postmortem-writingView on GitHub ↗
--- name: postmortem-writing description: Write effective blameless postmortems with root cause analysis, timelines, and action items. Use when conducting incident reviews, writing postmortem documents, or improving incident response processes. --- # Postmortem Writing Comprehensive guide to writing effective, blameless postmortems that drive organizational learning and prevent incident recurrence. ## When to Use This Skill - Conducting post-incident reviews - Writing postmortem documents - Facilitating blameless postmortem meetings - Identifying root causes and contributing factors - Creating actionable follow-up items - Building organizational learning culture ## Core Concepts ### 1. Blameless Culture | Blame-Focused | Blameless | | ------------------------ | --------------------------------- | | "Who caused this?" | "What conditions allowed this?" | | "Someone made a mistake" | "The system allowed this mistake" | | Punish individuals | Improve systems | | Hide information | Share learnings | | Fear of speaking up | Psychological safety | ### 2. Postmortem Triggers - SEV1 or SEV2 incidents - Customer-facing outages > 15 minutes - Data loss or security incidents - Near-misses that could have been severe - Novel failure modes - Incidents requiring unusual intervention ## Quick Start ### Postmortem Timeline ``` Day 0: Incident occurs Day 1-2: Draft postmortem document Day 3-5: Postmortem meeting Day 5-7: Finalize document, create tickets Week 2+: Action item completion Quarterly: Review patterns across incidents ``` ## Templates and detailed worked examples Full template library and detailed worked examples live in `references/details.md`. Read that file when you need the concrete templates. ## References - [Connection Pool Best Practices](internal-wiki/connection-pools) - [Deployment Runbook](internal-wiki/deployment-runbook) ``` ### Template 2: 5 Whys Analysis ```markdown # 5 Whys Analysis: [Incident] ## Problem Statement Payment service experienced 47-minute outage due to database connection exhaustion. ## Analysis ### Why #1: Why did the service fail? **Answer**: Database connections were exhausted, causing all new requests to fail. **Evidence**: Metrics showed connection count at 100/100 (max), with 500+ pending requests. --- ### Why #2: Why were database connections exhausted? **Answer**: Each incoming request opened a new database connection instead of using the connection pool. **Evidence**: Code diff shows direct `DriverManager.getConnection()` instead of pooled `DataSource`. --- ### Why #3: Why did the code bypass the connection pool? **Answer**: A developer refactored the repository class and inadvertently changed the connection acquisition method. **Evidence**: PR #1234 shows the change, made while fixing a different bug. --- ### Why #4: Why wasn't this caught in code review? **Answer**: The reviewer focused on the functional change (the bug fix) and didn't notice the infrastructure change. **Evidence**: Review comments only discuss business logic. --- ### Why #5: Why isn't there a safety net for this type of change? **Answer**: We lack automated tests that verify connection pool behavior and lack documentation about our connection patterns. **Evidence**: Test suite has no tests for connection handling; wiki has no article on database connections. ## Root Causes Identified 1. **Primary**: Missing automated tests for infrastructure behavior 2. **Secondary**: Insufficient documentation of architectural patterns 3. **Tertiary**: Code review checklist doesn't include infrastructure considerations ## Systemic Improvements | Root Cause | Improvement | Type | | ------------- | --------------------------------- | ---------- | | Missing tests | Add infrastructure behavior tests | Prevention | | Missing docs | Document connection patterns | Prevention | | Review gaps | Update review checklist | Detection | | No canary | Implement canary deployments | Mitigation | ``` ### Template 3: Quick Postmortem (Minor Incidents) ```markdown # Quick Postmortem: [Brief Title] **Date**: 2024-01-15 | **Duration**: 12 min | **Severity**: SEV3 ## What Happened API latency spiked to 5s due to cache miss storm after cache flush. ## Timeline - 10:00 - Cache flush initiated for config update - 10:02 - Latency alerts fire - 10:05 - Identified as cache miss storm - 10:08 - Enabled cache warming - 10:12 - Latency normalized ## Root Cause Full cache flush for minor config update caused thundering herd. ## Fix - Immediate: Enabled cache warming - Long-term: Implement partial cache invalidation (ENG-999) ## Lessons Don't full-flush cache in production; use targeted invalidation. ``` ## Facilitation Guide ### Running a Postmortem Meeting ```markdown ## Meeting Structure (60 minutes) ### 1. Opening (5 min) - Remind everyone of blameless culture - "We're here to learn, not to blame" - Review meeting norms ### 2. Timeline Review (15 min) - Walk through events chronologically - Ask clarifying questions - Identify gaps in timeline ### 3. Analysis Discussion (20 min) - What failed? - Why did it fail? - What conditions allowed this? - What would have prevented it? ### 4. Action Items (15 min) - Brainstorm improvements - Prioritize by impact and effort - Assign owners and due dates ### 5. Closing (5 min) - Summarize key learnings - Confirm action item owners - Schedule follow-up if needed ## Facilitation Tips - Keep discussion on track - Redirect blame to systems - Encourage quiet participants - Document dissenting views - Time-box tangents ``` ## Anti-Patterns to Avoid | Anti-Pattern | Problem | Better Approach | | ----------------------- | -------------------------- | ------------------------------- | | **Blame game** | Shuts down learning | Focus on systems | | **Shallow analysis** | Doesn't prevent recurrence | Ask "why" 5 times | | **No action items** | Waste of time | Always have concrete next steps | | **Unrealistic actions** | Never completed | Scope to achievable tasks | | **No follow-up** | Actions forgotten | Track in ticketing system | ## Best Practices ### Do's - **Start immediately** - Memory fades fast - **Be specific** - Exact times, exact errors - **Include graphs** - Visual evidence - **Assign owners** - No orphan action items - **Share widely** - Organizational learning ### Don'ts - **Don't name and shame** - Ever - **Don't skip small incidents** - They reveal patterns - **Don't make it a blame doc** - That kills learning - **Don't create busywork** - Actions should be meaningful - **Don't skip follow-up** - Verify actions completed
More from wshobson/agents
- accessibility-complianceImplement WCAG 2.2 compliant interfaces with mobile accessibility, inclusive design patterns, and assistive technology support. Use when auditing accessibility, implementing ARIA patterns, building for screen readers, or ensuring inclusive user experiences.
- airflow-dag-patternsBuild production Apache Airflow DAGs with best practices for operators, sensors, testing, and deployment. Use when creating data pipelines, orchestrating workflows, or scheduling batch jobs.
- angular-migrationMigrate from AngularJS to Angular using hybrid mode, incremental component rewriting, and dependency injection updates. Use when upgrading AngularJS applications, planning framework migrations, or modernizing legacy Angular code.
- anti-reversing-techniquesUnderstand anti-reversing, obfuscation, and protection techniques encountered during software analysis. Use this skill when analyzing malware evasion techniques, when implementing anti-debugging protections for CTF challenges, when reverse engineering packed binaries, or when building security research tools that need to detect virtualized environments.
- api-design-principlesMaster REST and GraphQL API design principles to build intuitive, scalable, and maintainable APIs that delight developers. Use when designing new APIs, reviewing API specifications, or establishing API design standards.
- architecture-decision-recordsWrite and maintain Architecture Decision Records (ADRs) following best practices for technical decision documentation. Use when documenting significant technical decisions, reviewing past architectural choices, or establishing decision processes.
- architecture-patternsImplement proven backend architecture patterns including Clean Architecture, Hexagonal Architecture, and Domain-Driven Design. Use this skill when designing clean architecture for a new microservice, when refactoring a monolith to use bounded contexts, when implementing hexagonal or onion architecture patterns, or when debugging dependency cycles between application layers.
- async-python-patternsMaster Python asyncio, concurrent programming, and async/await patterns for high-performance applications. Use when building async APIs, concurrent systems, or I/O-bound applications requiring non-blocking operations.
- attack-tree-constructionBuild comprehensive attack trees to visualize threat paths. Use when mapping attack scenarios, identifying defense gaps, or communicating security risks to stakeholders.
- auth-implementation-patternsMaster authentication and authorization patterns including JWT, OAuth2, session management, and RBAC to build secure, scalable access control systems. Use when implementing auth systems, securing APIs, or debugging security issues.