Incidents happen. Systems fail, deploys go wrong, dependencies break. What separates mature engineering organizations isn’t fewer incidents—it’s how they respond. Good incident management minimizes blast radius, speeds recovery, and turns failures into learning.
Here’s how to build effective incident management.
Incident Lifecycle
The Flow
Detection ──► Triage ──► Response ──► Resolution ──► Postmortem
    │            │           │             │              │
    ▼            ▼           ▼             ▼              ▼
  Alert       Severity    Mitigate      Confirm         Learn
  Report      Assign      Diagnose      Verify          Improve
  Notice      Escalate    Fix           Communicate     Prevent
Detection
Alerting Philosophy
alerting_principles:
  alert_on_symptoms:
    good: "Error rate > 5%"
    bad: "CPU > 80%"
    why: Users feel symptoms, not causes
  actionable:
    good: "Database connection pool exhausted"
    bad: "Something might be wrong"
    why: Alerts need a clear response
  appropriate_urgency:
    page: User-facing impact, needs immediate attention
    ticket: Degradation that can wait hours
    log: Information for later analysis
  avoid_alert_fatigue:
    - No "just in case" alerts
    - Review and tune regularly
    - Every page should matter
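To make this concrete, here is a minimal Python sketch of a symptom-based page, independent of any particular monitoring tool: it fires on sustained user-facing error rate, not on CPU, and a single noisy sample is never enough. The Sample shape and the thresholds are illustrative assumptions to adapt to your own stack.

# Minimal sketch of a symptom-based page: alert on the user-facing error
# rate, and only when it stays above the threshold for a sustained window,
# so one noisy scrape never wakes anyone up. Sample shape and thresholds
# are assumptions, not tied to any specific monitoring system.
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    timestamp: float   # unix seconds
    requests: int      # requests observed in this interval
    errors: int        # failed (5xx) responses in this interval


def should_page(samples: List[Sample], threshold: float = 0.05,
                sustained_s: float = 300.0) -> bool:
    """Page only if every sample in the last sustained_s seconds breaches."""
    if not samples:
        return False
    cutoff = samples[-1].timestamp - sustained_s
    window = [s for s in samples if s.timestamp >= cutoff]
    return bool(window) and all(
        (s.errors / s.requests if s.requests else 0.0) > threshold
        for s in window
    )


samples = [Sample(t, requests=1000, errors=80) for t in range(0, 360, 60)]
print(should_page(samples))  # True: an 8% error rate sustained for 5 minutes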
Detection Sources
detection_sources:
  automated:
    - Monitoring alerts
    - Anomaly detection
    - Synthetic monitoring
    - Error tracking (Sentry, etc.)
  human:
    - Customer reports
    - Internal reports
    - Social media
    - Support tickets
  proactive:
    - Chaos engineering
    - Load testing
    - Security scanning
    - Capacity monitoring
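Synthetic monitoring in particular can start very small: a scheduled probe that exercises the same path a customer would. Here is a minimal sketch using only the Python standard library; the URL is a placeholder, and what you do with the result depends on your alerting setup.

# Toy synthetic check: probe a user-facing endpoint the way a customer would
# and report success, HTTP status, and latency. URL and thresholds are
# examples only; wire the result into your alerting pipeline.
import time
import urllib.request
from urllib.error import URLError


def probe(url: str, timeout_s: float = 5.0) -> dict:
    """Fetch the URL once and report success, status, and latency."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
            ok = 200 <= status < 300
    except URLError as exc:
        return {"ok": False, "error": str(exc), "latency_s": time.monotonic() - start}
    return {"ok": ok, "status": status, "latency_s": time.monotonic() - start}


if __name__ == "__main__":
    print(probe("https://status.example.com/healthz"))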
Triage
Severity Classification
severity_levels:
  sev1_critical:
    criteria:
      - Complete service outage
      - Data loss or corruption
      - Security breach
      - Revenue-impacting
    response:
      - All hands on deck
      - Executive notification
      - War room activated
      - Public communication
    target_response: 5 minutes
    target_resolution: 1 hour
  sev2_major:
    criteria:
      - Partial outage
      - Significant degradation
      - Workaround exists but painful
    response:
      - On-call responds
      - Team escalation if needed
      - Status page update
    target_response: 15 minutes
    target_resolution: 4 hours
  sev3_minor:
    criteria:
      - Minor feature impact
      - Easy workaround available
      - Small user subset
    response:
      - On-call investigates
      - Fix in normal workflow
    target_response: 1 hour
    target_resolution: 24 hours
  sev4_low:
    criteria:
      - Cosmetic issues
      - No user impact
      - Can wait for normal process
    response:
      - Create ticket
      - Prioritize with normal work
    target_response: Next business day
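Severity decisions are easier to apply consistently when the criteria are written down somewhere people can read and tools can call. Here is an illustrative Python sketch; the mapping mirrors the table above, but treat the exact thresholds as assumptions to tune for your organization.

# Illustrative triage helper: map rough impact answers to a severity level
# and its response targets. The mapping follows the table above; tune it.
from dataclasses import dataclass


@dataclass(frozen=True)
class Severity:
    name: str
    target_response: str
    target_resolution: str


SEV1 = Severity("SEV1", "5 minutes", "1 hour")
SEV2 = Severity("SEV2", "15 minutes", "4 hours")
SEV3 = Severity("SEV3", "1 hour", "24 hours")
SEV4 = Severity("SEV4", "next business day", "normal workflow")


def classify(full_outage: bool, data_loss: bool, security_breach: bool,
             degraded: bool, workaround_easy: bool, user_impact: bool) -> Severity:
    """Walk the criteria from most to least severe."""
    if full_outage or data_loss or security_breach:
        return SEV1
    if degraded and not workaround_easy:
        return SEV2
    if user_impact:
        return SEV3
    return SEV4


# Example: a partial outage with only a painful workaround is a SEV2.
assert classify(False, False, False, True, False, True) is SEV2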
Initial Assessment
triage_questions:
  impact:
    - How many users are affected?
    - Which features are impacted?
    - Is there data impact?
    - What is the business impact?
  scope:
    - Is it getting worse?
    - Is it contained?
    - Which services are involved?
  urgency:
    - Can it wait?
    - Is there a workaround?
    - What is the trend?
Response
Incident Command
incident_roles:
  incident_commander:
    responsibilities:
      - Owns incident resolution
      - Coordinates response
      - Makes decisions
      - Escalates when needed
    not_responsible_for:
      - Fixing the problem directly
      - Writing the postmortem
  communications_lead:
    responsibilities:
      - Status page updates
      - Internal communication
      - Customer communication
      - Stakeholder updates
  technical_lead:
    responsibilities:
      - Leads technical investigation
      - Coordinates technical responders
      - Proposes solutions
      - Implements fixes
  scribe:
    responsibilities:
      - Documents timeline
      - Records decisions
      - Captures key information
      - Maintains incident channel
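None of this requires special tooling to start. As a purely illustrative sketch, even a small record that forces the roles to be named removes the "who is actually IC?" ambiguity; the field names and channel scheme below follow the templates in this post, not any particular incident tool.

# Purely illustrative: represent an incident with its roles named explicitly.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Incident:
    number: int
    severity: str                    # "SEV1".."SEV4"
    description: str
    incident_commander: str          # owns resolution, does not fix directly
    technical_lead: Optional[str] = None
    communications_lead: Optional[str] = None
    scribe: Optional[str] = None
    timeline: List[str] = field(default_factory=list)

    @property
    def channel(self) -> str:
        return f"#incident-{self.number}"

    def log(self, entry: str) -> None:
        """The scribe appends timestamped notes here during the incident."""
        self.timeline.append(entry)


inc = Incident(42, "SEV2", "Checkout latency spike", incident_commander="@alex")
inc.log("10:20 - IC assigned")
print(inc.channel)  # #incident-42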
Communication
communication_templates:
  initial_notice:
    internal: |
      🔴 INCIDENT: [Brief description]
      Severity: [SEV1/2/3]
      Impact: [User-facing impact]
      Status: Investigating
      IC: @name
      Channel: #incident-[number]
  status_update:
    interval: Every 30 minutes minimum for SEV1/2
    template: |
      Update: [What we know]
      Actions: [What we're doing]
      ETA: [When we expect resolution, if known]
      Next update: [Time]
  customer_facing:
    avoid:
      - Technical jargon
      - Blame
      - Speculation
    include:
      - What's happening
      - Impact on customers
      - What we're doing
      - When to expect the next update
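Templates get used when posting them is one function call away. Here is a sketch that renders the initial notice and sends it to a chat webhook; INCIDENT_WEBHOOK_URL is a placeholder, and the payload assumes a Slack-style incoming webhook, so adjust for whatever chat tool you use.

# Sketch: render the initial-notice template and post it to a chat webhook.
# INCIDENT_WEBHOOK_URL is a placeholder; the {"text": ...} payload assumes a
# Slack-style incoming webhook.
import json
import os
import urllib.request

WEBHOOK_URL = os.environ.get("INCIDENT_WEBHOOK_URL", "")

INITIAL_NOTICE = (
    "🔴 INCIDENT: {description}\n"
    "Severity: {severity}\n"
    "Impact: {impact}\n"
    "Status: Investigating\n"
    "IC: {ic}\n"
    "Channel: #incident-{number}"
)


def post_initial_notice(number: int, severity: str, description: str,
                        impact: str, ic: str) -> None:
    text = INITIAL_NOTICE.format(number=number, severity=severity,
                                 description=description, impact=impact, ic=ic)
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        resp.read()  # any non-2xx or network failure raises here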
Mitigation Strategies
mitigation_priority:
  1_stop_bleeding:
    - Rollback deployment
    - Disable feature flag
    - Scale up
    - Failover
    - Block bad traffic
  2_restore_service:
    - Restart services
    - Clear queues
    - Restore from backup
    - Switch to degraded mode
  3_root_cause:
    - Only after stable
    - Proper investigation
    - No rush to blame
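"Disable the feature flag" only works as a mitigation if the risky path was built behind a kill switch in the first place. Here is a minimal sketch of that pattern; the environment-variable flag store and the recommendation functions are stand-ins for your real flag service and code path.

# Sketch of a kill switch: the risky code path is wrapped in a flag check
# with a safe fallback, so an operator can flip one value instead of
# shipping an emergency deploy. Env-var store and functions are stand-ins.
import os


def flag_enabled(name: str, default: bool = False) -> bool:
    """Read a flag from the environment; your flag service goes here."""
    return os.environ.get(f"FLAG_{name.upper()}", str(default)).lower() in ("1", "true")


def recommendations(user_id: str) -> list:
    if not flag_enabled("NEW_RECS_ENGINE"):
        return cached_recommendations(user_id)  # degraded but safe path
    return new_recs_engine(user_id)             # the path being mitigated


def cached_recommendations(user_id: str) -> list:
    return []  # placeholder fallback


def new_recs_engine(user_id: str) -> list:
    return []  # placeholder for the risky new code path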
Runbooks
# Example runbook
runbook: database_connection_exhaustion
symptoms:
  - "Connection pool exhausted" errors
  - Request timeouts
  - Slow response times
investigation:
  - "Check: pg_stat_activity for connections"
  - "Check: application connection pool metrics"
  - "Check: recent deployments"
  - "Check: traffic spike"
mitigation:
  quick:
    - Increase connection pool size (if headroom)
    - Restart application pods
    - "Kill idle connections: SELECT pg_terminate_backend(pid)..."
  longer_term:
    - Identify connection leak
    - Review connection pool configuration
    - Consider PgBouncer
escalation:
  - "If database unresponsive: page DBA"
  - "If application-side: page service owner"
references:
  - Link to monitoring dashboard
  - Link to database documentation
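Runbooks become even more useful when the investigation and mitigation steps are scripted. As a companion sketch to the runbook above (assuming psycopg2 is installed and DATABASE_URL points at the affected database), this lists long-idle connections and, only with --kill, terminates them; run it read-only first and keep the output for the postmortem timeline.

# List long-idle Postgres connections from pg_stat_activity and, with
# --kill, terminate them via pg_terminate_backend. Assumes psycopg2 and a
# DATABASE_URL environment variable; the 10-minute cutoff is an example.
import os
import sys

import psycopg2

IDLE_QUERY = """
    SELECT pid, usename, application_name, state,
           now() - state_change AS idle_for
    FROM pg_stat_activity
    WHERE state = 'idle'
      AND state_change < now() - interval '10 minutes'
    ORDER BY idle_for DESC;
"""


def main(kill: bool = False) -> None:
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute(IDLE_QUERY)
        for pid, user, app, state, idle_for in cur.fetchall():
            print(f"pid={pid} user={user} app={app} idle_for={idle_for}")
            if kill:
                cur.execute("SELECT pg_terminate_backend(%s);", (pid,))
    conn.close()


if __name__ == "__main__":
    main(kill="--kill" in sys.argv)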
Resolution
Confirming Resolution
resolution_checklist:
  verify:
    - [ ] Error rates back to normal
    - [ ] User-facing metrics recovered
    - [ ] No new related errors
    - [ ] Synthetic checks passing
  communicate:
    - [ ] Update status page
    - [ ] Notify stakeholders
    - [ ] Close incident channel (after a delay)
  preserve:
    - [ ] Capture timeline
    - [ ] Save relevant logs/metrics
    - [ ] Note key decisions
    - [ ] Schedule postmortem
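The "verify" steps can be automated, so resolution is declared from data rather than gut feel. Here is a sketch that asks a Prometheus server for the current error rate and compares it with the alerting threshold; the server URL and the PromQL expression are assumptions for illustration.

# Sketch of an automated "is it really recovered?" check against the
# Prometheus HTTP API. PROM_URL and the PromQL query are placeholders.
import json
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus.internal:9090"
ERROR_RATE_QUERY = (
    'sum(rate(http_requests_total{code=~"5.."}[5m]))'
    " / sum(rate(http_requests_total[5m]))"
)


def current_error_rate() -> float:
    url = f"{PROM_URL}/api/v1/query?" + urllib.parse.urlencode({"query": ERROR_RATE_QUERY})
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = json.load(resp)
    result = body["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


def is_recovered(threshold: float = 0.05) -> bool:
    return current_error_rate() < threshold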
Postmortem
Blameless Culture
blameless_postmortem:
  principles:
    - People are not the root cause
    - Systems allow failures to happen
    - Humans make mistakes; systems should prevent or catch them
    - Focus on learning, not blame
  language:
    - avoid: "John should have checked..."
      prefer: "The deployment process didn't verify..."
    - avoid: "The engineer made an error"
      prefer: "The system accepted invalid input"
Postmortem Template
postmortem:
  title: "[Date] - Brief description"
  summary:
    duration: Start to resolution time
    impact: Users/revenue affected
    severity: SEV level
    root_cause: One-line summary
  timeline:
    - "10:00 - Deployment started"
    - "10:15 - First alerts fired"
    - "10:20 - IC assigned"
    - "10:45 - Root cause identified"
    - "11:00 - Rollback completed"
    - "11:15 - Service recovered"
  root_cause:
    what_happened: Technical explanation
    contributing_factors:
      - Factor 1
      - Factor 2
  detection:
    how_detected: Alert/report/etc.
    could_detect_sooner: Yes/No, and how
  response:
    what_worked: Good parts of the response
    what_didnt: Areas for improvement
    response_time: Was it appropriate?
  action_items:
    - item: "Add validation for X"
      owner: "@name"
      priority: P1
      due: "2 weeks"
    - item: "Improve alert for Y"
      owner: "@name"
      priority: P2
      due: "1 month"
  lessons_learned:
    - Lesson 1
    - Lesson 2
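Action items are the part of a postmortem that most often goes stale. If postmortems are stored as YAML files shaped like the template above (an assumption), a small check like this can flag any whose action items lack an owner, priority, or due date; it requires PyYAML.

# Guard for postmortem quality: flag action items that are missing an
# owner, priority, or due date. Assumes YAML files shaped like the
# template above; requires PyYAML.
import sys

import yaml

REQUIRED_FIELDS = ("item", "owner", "priority", "due")


def check_action_items(path: str) -> list:
    """Return a list of human-readable problems; empty means it passes."""
    with open(path) as fh:
        doc = yaml.safe_load(fh) or {}
    items = (doc.get("postmortem", {}) or {}).get("action_items", []) or []
    problems = [] if items else ["no action items recorded"]
    for idx, item in enumerate(items, start=1):
        for key in REQUIRED_FIELDS:
            if not item.get(key):
                problems.append(f"action item {idx} is missing '{key}'")
    return problems


if __name__ == "__main__":
    issues = check_action_items(sys.argv[1])
    print("\n".join(issues) if issues else "postmortem action items look complete")
    sys.exit(1 if issues else 0)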
Metrics
Incident Metrics
metrics_to_track:
  detection:
    - Time to detect (TTD)
    - Detection source breakdown
  response:
    - Time to respond (TTR)
    - Time to mitigate
    - Time to resolve
  impact:
    - Downtime minutes
    - Users affected
    - Revenue impact
    - SLO impact
  trends:
    - Incidents per week/month
    - Severity distribution
    - Repeat incidents
    - Action item completion rate
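These metrics are simple arithmetic once each incident records a few timestamps. Here is a sketch of the math; the IncidentRecord shape is an assumption, and in practice you would feed it from your incident tracker.

# Sketch of incident-metrics math: per-incident timestamps in, mean time
# to detect, mitigate, and resolve out. Record shape is an assumption.
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import List


@dataclass
class IncidentRecord:
    started: datetime    # when the failure actually began
    detected: datetime   # first alert or report
    mitigated: datetime  # user impact stopped
    resolved: datetime   # incident closed


def minutes(a: datetime, b: datetime) -> float:
    return (b - a).total_seconds() / 60.0


def summarize(incidents: List[IncidentRecord]) -> dict:
    return {
        "mean_time_to_detect_min": mean(minutes(i.started, i.detected) for i in incidents),
        "mean_time_to_mitigate_min": mean(minutes(i.detected, i.mitigated) for i in incidents),
        "mean_time_to_resolve_min": mean(minutes(i.detected, i.resolved) for i in incidents),
        "incident_count": len(incidents),
    }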
Key Takeaways
- Alert on symptoms, not causes; every page should be actionable
- Clear severity levels enable appropriate response
- Defined roles (IC, tech lead, comms) prevent chaos
- Communicate early and often, internally and externally
- Mitigate first, investigate root cause after stability
- Runbooks reduce cognitive load during incidents
- Blameless postmortems focus on systems, not people
- Action items with owners and deadlines drive improvement
- Track metrics to identify patterns and measure improvement
- Regular incident reviews build organizational resilience
Incidents are inevitable. How you handle them defines your reliability culture.