Operations
Incident postmortem
A blameless written analysis of an operational incident (outage, security event, major bug) that identifies root cause, impact, response timeline, and follow-up actions — without assigning individual blame.
In plain English
After something breaks, the team writes up what happened, why, what they did, and what they'll change. The goal is system improvement, not finding someone to blame.
Example
Incident: a roughly 3-hour database outage starting at 2am UTC. The postmortem documents: timeline (alert at 2:07 → on-call paged at 2:10 → root cause identified at 3:30 → fix deployed at 4:45), root cause (a deployment exhausted the database connection pool), impact (~1,200 users affected, no data loss), and action items (cap connection pool, alert on pool saturation, add deployment safety check). It makes no mention of 'whose fault' it was.
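The timeline in a postmortem is often summarized as a few response durations. As a minimal sketch, the intervals in the example above can be computed like this. The calendar date and the names `timeline` and `minutes_between` are assumptions made for illustration; only the times of day come from the example.

```python
from datetime import datetime

# Placeholder date: the example gives only times of day (UTC).
DAY = "2024-01-01"  # hypothetical


def at(hhmm: str) -> datetime:
    """Parse an HH:MM time on the placeholder day."""
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")


# Events taken from the example timeline.
timeline = {
    "alert": at("02:07"),
    "paged": at("02:10"),
    "root_cause": at("03:30"),
    "fix_deployed": at("04:45"),
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two named timeline events."""
    delta = timeline[end] - timeline[start]
    return int(delta.total_seconds() // 60)


time_to_page = minutes_between("alert", "paged")           # 3 minutes
time_to_diagnose = minutes_between("alert", "root_cause")  # 83 minutes
time_to_fix = minutes_between("alert", "fix_deployed")     # 158 minutes

print(time_to_page, time_to_diagnose, time_to_fix)
```

Measured from the first alert, the fix landed after 158 minutes, and diagnosis (83 minutes) took up about half of that. A breakdown like this points at where follow-up work might pay off, without saying anything about who was on call.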
Why it matters
Postmortems convert incidents into system improvements. Teams that do them well learn from every outage; teams that don't repeat the same mistakes. Blameless culture matters: if people fear blame, they hide near-misses, and you lose most of the data you'd otherwise learn from.
Common mistakes
- Skipping postmortems for 'small' incidents — small incidents are warnings; ignored ones become big incidents
- Letting postmortems drag past 1 week from the incident — context decays fast
- Writing them privately as a defensive document — the whole point is shared learning
- Listing action items without owners or dates — they don't get done
- Using postmortems for performance reviews — destroys blameless culture immediately
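The mistake of listing action items without owners or dates is easy to catch mechanically. The sketch below is one hypothetical way to flag such items before a postmortem is marked complete. The `ActionItem` structure, the team names, and the due date are illustrative assumptions, not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None   # team or role responsible (hypothetical field)
    due: Optional[date] = None    # target completion date (hypothetical field)


def unassigned(items: List[ActionItem]) -> List[ActionItem]:
    """Return the items that are missing an owner or a due date."""
    return [item for item in items if not item.owner or item.due is None]


# Action items from the example outage; owners and dates are made up.
items = [
    ActionItem("Cap connection pool", owner="db-team", due=date(2024, 1, 15)),
    ActionItem("Alert on pool saturation", owner="sre-team"),
    ActionItem("Add deployment safety check"),
]

for item in unassigned(items):
    print(f"Needs owner and due date: {item.description}")
```

Assigning ownership to a team or role rather than a named individual is a design choice here: it keeps follow-up accountable while staying consistent with the blameless framing.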