Operations

Incident postmortem

A blameless written analysis of an operational incident (outage, security event, major bug) that identifies root cause, impact, response timeline, and follow-up actions — without assigning individual blame.

By Priya Ranganathan · Last updated June 16, 2026

In plain English

After something breaks, the team writes up what happened, why, what they did, and what they'll change. The goal is system improvement, not finding someone to blame.

Example

Incident: a roughly 3-hour database outage starting at 2am UTC. The postmortem documents:

Timeline: alert at 2:07 → on-call paged at 2:10 → root cause identified at 3:30 → fix deployed at 4:45
Root cause: a deployment overran the database connection pool
Impact: ~1,200 users affected, no data loss
Action items: cap the connection pool, alert on pool saturation, add a deployment safety check

The write-up never says whose fault it was.
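The timeline in this example can be turned into the response intervals a postmortem usually reports. The sketch below is only an illustration: the timestamps come from the example, while the dictionary keys and the `minutes_between` helper are hypothetical names chosen here.

```python
from datetime import datetime

# Timestamps from the example timeline (UTC). Only the time of day is known,
# so no calendar date is attached.
FMT = "%H:%M"
timeline = {
    "alert": "02:07",
    "paged": "02:10",
    "root_cause": "03:30",
    "fix_deployed": "04:45",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from `start` to `end`, both on the same day."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


# Intervals measured from the first alert.
time_to_page = minutes_between(timeline["alert"], timeline["paged"])
time_to_root_cause = minutes_between(timeline["alert"], timeline["root_cause"])
time_to_fix = minutes_between(timeline["alert"], timeline["fix_deployed"])

print(f"alert -> page:       {time_to_page} min")
print(f"alert -> root cause: {time_to_root_cause} min")
print(f"alert -> fix:        {time_to_fix} min")
```

The alert-to-fix interval comes out at 158 minutes (2 h 38 min), slightly under the 3-hour headline figure, which presumably rounds up or counts time before the first alert fired.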

Why it matters

Postmortems convert incidents into system improvements. Teams that do them well learn from every outage; teams that don't repeat the same mistakes. Blameless culture matters: if people fear blame, they hide near-misses, and you lose most of the data you'd otherwise learn from.

Common mistakes

Skipping postmortems for 'small' incidents — small incidents are warnings; ignored ones become big incidents
Letting postmortems drag past 1 week from the incident — context decays fast
Writing them privately as a defensive document — the whole point is shared learning
Listing action items without owners or dates — they don't get done
Using postmortems for performance reviews — destroys blameless culture immediately
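One of the mistakes above, action items without owners or dates, is easy to check mechanically. The following is a minimal sketch, assuming action items are kept as structured records; the `ActionItem` class, the `missing_fields` helper, the team names, and the due date are all hypothetical and only the item descriptions come from the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    """One follow-up from a postmortem (hypothetical structure)."""
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None


def missing_fields(item: ActionItem) -> List[str]:
    """Return the names of the fields an action item still needs."""
    problems = []
    if not item.owner:
        problems.append("owner")
    if item.due is None:
        problems.append("due date")
    return problems


# Descriptions from the example; owners and dates are made up for illustration.
items = [
    ActionItem("Cap connection pool", owner="db-team", due=date(2026, 7, 1)),
    ActionItem("Alert on pool saturation", owner="sre-oncall"),
    ActionItem("Add deployment safety check"),
]

# Map each incomplete item to what it is missing.
incomplete = {
    item.description: missing_fields(item)
    for item in items
    if missing_fields(item)
}

for description, problems in incomplete.items():
    print(f"{description}: missing {', '.join(problems)}")
```

A check like this could run before a postmortem is marked complete, so that items without an owner or a date are caught while the incident is still fresh.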

Related

SLA
KPI