Postmortem Writing
Comprehensive guide to writing effective, blameless postmortems that drive organizational learning and prevent incident recurrence.
When to Use This Skill
- Conducting post-incident reviews
- Writing postmortem documents
- Facilitating blameless postmortem meetings
- Identifying root causes and contributing factors
- Creating actionable follow-up items
- Building organizational learning culture
Core Concepts
1. Blameless Culture
| Blame-Focused | Blameless |
|---|---|
| "Who caused this?" | "What conditions allowed this?" |
| "Someone made a mistake" | "The system allowed this mistake" |
| Punish individuals | Improve systems |
| Hide information | Share learnings |
| Fear of speaking up | Psychological safety |
2. Postmortem Triggers
- SEV1 or SEV2 incidents
- Customer-facing outages > 15 minutes
- Data loss or security incidents
- Near-misses that could have been severe
- Novel failure modes
- Incidents requiring unusual intervention
Quick Start
Postmortem Timeline
Day 0: Incident occurs
Day 1-2: Draft postmortem document
Day 3-5: Postmortem meeting
Day 5-7: Finalize document, create tickets
Week 2+: Action item completion
Quarterly: Review patterns across incidents
Templates and detailed worked examples
Full template library and detailed worked examples live in references/details.md. Read that file when you need the concrete templates.
References
### Template 2: 5 Whys Analysis
5 Whys Analysis: [Incident]
Problem Statement
Payment service experienced 47-minute outage due to database connection exhaustion.
Analysis
Why #1: Why did the service fail?
Answer: Database connections were exhausted, causing all new requests to fail.
Evidence: Metrics showed connection count at 100/100 (max), with 500+ pending requests.
---
Why #2: Why were database connections exhausted?
Answer: Each incoming request opened a new database connection instead of using the connection pool.
Evidence: Code diff shows direct DriverManager.getConnection() instead of pooled DataSource.
---
Why #3: Why did the code bypass the connection pool?
Answer: A developer refactored the repository class and inadvertently changed the connection acquisition method.
Evidence: PR #1234 shows the change, made while fixing a different bug.
---
Why #4: Why wasn't this caught in code review?
Answer: The reviewer focused on the functional change (the bug fix) and didn't notice the infrastructure change.
Evidence: Review comments only discuss business logic.
---
Why #5: Why isn't there a safety net for this type of change?
Answer: We lack automated tests that verify connection pool behavior and lack documentation about our connection patterns.
Evidence: Test suite has no tests for connection handling; wiki has no article on database connections.
Root Causes Identified
- Primary: Missing automated tests for infrastructure behavior
- Secondary: Insufficient documentation of architectural patterns
- Tertiary: Code review checklist doesn't include infrastructure considerations
Systemic Improvements
| Root Cause | Improvement | Type |
|---|---|---|
| Missing tests | Add infrastructure behavior tests | Prevention |
| Missing docs | Document connection patterns | Prevention |
| Review gaps | Update review checklist | Detection |
| No canary | Implement canary deployments | Mitigation |
### Template 3: Quick Postmortem (Minor Incidents)
Quick Postmortem: [Brief Title]
Date: 2024-01-15 | Duration: 12 min | Severity: SEV3
What Happened
API latency spiked to 5s due to cache miss storm after cache flush.
Timeline
- 10:00 - Cache flush initiated for config update
- 10:02 - Latency alerts fire
- 10:05 - Identified as cache miss storm
- 10:08 - Enabled cache warming
- 10:12 - Latency normalized
Root Cause
Full cache flush for minor config update caused thundering herd.
Fix
- Immediate: Enabled cache warming
- Long-term: Implement partial cache invalidation (ENG-999)
Lessons
Don't full-flush cache in production; use targeted invalidation.
## Facilitation Guide
### Running a Postmortem Meeting
Meeting Structure (60 minutes)
1. Opening (5 min)
- Remind everyone of blameless culture
- "We're here to learn, not to blame"
- Review meeting norms
2. Timeline Review (15 min)
- Walk through events chronologically
- Ask clarifying questions
- Identify gaps in timeline
3. Analysis Discussion (20 min)
- What failed?
- Why did it fail?
- What conditions allowed this?
- What would have prevented it?
4. Action Items (15 min)
- Brainstorm improvements
- Prioritize by impact and effort
- Assign owners and due dates
5. Closing (5 min)
- Summarize key learnings
- Confirm action item owners
- Schedule follow-up if needed
Facilitation Tips
- Keep discussion on track
- Redirect blame to systems
- Encourage quiet participants
- Document dissenting views
- Time-box tangents
## Anti-Patterns to Avoid
| Anti-Pattern | Problem | Better Approach |
| ----------------------- | -------------------------- | ------------------------------- |
| **Blame game** | Shuts down learning | Focus on systems |
| **Shallow analysis** | Doesn't prevent recurrence | Ask "why" 5 times |
| **No action items** | Waste of time | Always have concrete next steps |
| **Unrealistic actions** | Never completed | Scope to achievable tasks |
| **No follow-up** | Actions forgotten | Track in ticketing system |
## Best Practices
### Do's
- **Start immediately** - Memory fades fast
- **Be specific** - Exact times, exact errors
- **Include graphs** - Visual evidence
- **Assign owners** - No orphan action items
- **Share widely** - Organizational learning
### Don'ts
- **Don't name and shame** - Ever
- **Don't skip small incidents** - They reveal patterns
- **Don't make it a blame doc** - That kills learning
- **Don't create busywork** - Actions should be meaningful
- **Don't skip follow-up** - Verify actions completed
