When you’re a five-person startup, incident response is simple: everyone knows about problems immediately, someone fixes them, and you move on. There’s no process because there doesn’t need to be.
At fifty people with multiple teams and services, this informality breaks down. People don’t know who’s responsible for what. Communication during incidents is chaotic. Knowledge from incidents isn’t captured. The same issues recur.
Scaling incident management requires intentional process without bureaucratic overhead. Here’s how to build it.
Incident Response Framework
Clear Roles
During incidents, ambiguity costs time. Define roles clearly:
Incident Commander (IC): Owns the incident. Makes decisions. Coordinates responders. Communicates status. Doesn’t necessarily fix things—coordinates those who do.
Technical Lead: Drives technical investigation and remediation. May delegate to specialists.
Communications Lead: Handles stakeholder communication. Updates status pages. Drafts customer notifications.
Scribe: Documents timeline, decisions, and actions. Creates record for postmortem.
For smaller incidents, one person may fill multiple roles. For major incidents, dedicated people in each role prevent context switching.
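Making the assignments explicit, even in a small record your tooling carries around, removes the "who's running this?" question mid-incident. A minimal sketch in Python; the structure, names, and fields are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IncidentRoles:
    """Explicit role assignments for one incident (hypothetical structure)."""
    incident_commander: str
    technical_lead: str
    communications_lead: Optional[str] = None  # IC often covers this on small incidents
    scribe: Optional[str] = None               # likewise

    def held_by(self, person: str) -> list[str]:
        """Roles a given person holds; useful for spotting overload on major incidents."""
        return [role for role, holder in vars(self).items() if holder == person]

# SEV-3: one person may wear several hats.
small = IncidentRoles(incident_commander="dana", technical_lead="dana")

# SEV-1: dedicated people in every role.
major = IncidentRoles(
    incident_commander="dana",
    technical_lead="arjun",
    communications_lead="mei",
    scribe="tomas",
)
```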
Severity Levels
Not all incidents are equal. Define severity levels with clear criteria:
SEV-1 (Critical):
- Complete outage
- Data loss or corruption
- Security breach
- Major customer impact
Response: All hands, immediate escalation, external communication.
SEV-2 (High):
- Partial outage
- Significant degradation
- Major feature unavailable
Response: On-call team, escalation to relevant teams, status page update.
SEV-3 (Medium):
- Minor degradation
- Non-critical feature impact
- Affecting subset of users
Response: On-call handles, escalate if needed.
SEV-4 (Low):
- Minimal impact
- Cosmetic issues
- Internal tools
Response: Normal workflow, fix when convenient.
Clear severity levels ensure appropriate response. A SEV-1 shouldn’t get SEV-4 treatment, and SEV-4 shouldn’t trigger SEV-1 panic.
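Severity criteria stay consistent when they're encoded somewhere humans and tooling both read, not just described on a wiki page. A minimal sketch, with illustrative criteria flags standing in for your real definitions:

```python
from enum import Enum

class Severity(Enum):
    SEV1 = 1  # critical: all hands, external communication
    SEV2 = 2  # high: on-call team, status page update
    SEV3 = 3  # medium: on-call handles, escalate if needed
    SEV4 = 4  # low: normal workflow

def classify(complete_outage: bool,
             data_loss: bool,
             security_breach: bool,
             partial_outage: bool,
             major_feature_down: bool,
             user_facing: bool) -> Severity:
    """Map incident characteristics to a severity level.

    The flags are illustrative; encode whatever criteria your
    organization actually uses.
    """
    if complete_outage or data_loss or security_breach:
        return Severity.SEV1
    if partial_outage or major_feature_down:
        return Severity.SEV2
    if user_facing:
        return Severity.SEV3
    return Severity.SEV4
```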
Communication Channels
Define where incident communication happens:
Incident channel: Dedicated chat channel per incident. All responders join. Technical discussion happens here.
Status page: External communication. Customers see current status and updates.
Internal updates: Regular updates to leadership and stakeholders. Separate from technical channel.
Separation prevents noise from reaching those who don’t need it and ensures relevant parties stay informed.
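As one example of the split, a small bot can create the per-incident channel and keep stakeholder updates in a separate, quieter channel. A sketch assuming Slack via the slack_sdk package and a bot token with permission to create channels and post; the channel names are assumptions:

```python
import os
from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

def open_incident_channel(incident_id: str) -> str:
    """Create a dedicated channel for technical discussion during one incident."""
    response = client.conversations_create(name=f"inc-{incident_id}")
    return response["channel"]["id"]

def post_internal_update(summary: str) -> None:
    """Leadership and stakeholder updates go to a separate, quieter channel."""
    client.chat_postMessage(channel="#incident-updates", text=summary)

# Status page updates go through your status page provider's API or UI;
# they are intentionally not mixed into either chat channel.
```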
On-Call Structure
Rotation Design
Design on-call rotations that are sustainable:
Coverage: 24/7 coverage requires either follow-the-sun handoffs between teams in different time zones or a rotation in which each person periodically covers nights.
Duration: Week-long rotations are common. Shorter rotations (2-3 days) reduce individual burden but increase handoff frequency.
Handoff: Explicit handoff between rotation shifts. Outgoing on-call briefs incoming on current issues.
Backup: Secondary on-call for escalation or if primary is unavailable.
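A simple generator makes the rotation and its backup visible months in advance. A minimal sketch assuming week-long shifts handed off on a fixed weekday, with the secondary offset by one position:

```python
from datetime import date, timedelta

def build_rotation(engineers: list[str], start: date, weeks: int):
    """Yield (week_start, primary, secondary) tuples for the schedule."""
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        secondary = engineers[(week + 1) % len(engineers)]
        yield start + timedelta(weeks=week), primary, secondary

for week_start, primary, secondary in build_rotation(
    ["ana", "bo", "chen", "dee", "eli"], date(2024, 1, 1), weeks=6
):
    print(f"{week_start}: primary={primary}, secondary={secondary}")
```

With five engineers on a weekly rotation, each person carries the pager roughly every five weeks, which lines up with the burnout guidance below.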
On-Call Expectations
Set clear expectations:
- Response time (e.g., acknowledge within 5 minutes, start working within 15)
- Escalation criteria (when to wake others up)
- Compensation (time off, additional pay)
- Support (laptop, reliable internet, phone)
Unclear expectations create frustration and inconsistent response.
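Expectations that live in tooling get enforced; expectations that live in memory get argued about. A minimal sketch encoding the acknowledge window above and deciding when to escalate to the secondary; the numbers are the example values and should be adjusted per team:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class ResponseExpectations:
    acknowledge_within: timedelta = timedelta(minutes=5)
    start_work_within: timedelta = timedelta(minutes=15)

def should_escalate_to_secondary(paged_at: datetime,
                                 acknowledged_at: Optional[datetime],
                                 expectations: ResponseExpectations,
                                 now: datetime) -> bool:
    """Escalate if the primary hasn't acknowledged within the agreed window."""
    if acknowledged_at is not None:
        return False
    return now - paged_at > expectations.acknowledge_within
```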
Preventing Burnout
On-call is stressful. Prevent burnout:
- Limit on-call frequency (each person on call no more than once every 4-6 weeks)
- Compensate appropriately
- Fix noisy alerts that wake people unnecessarily
- Review on-call load regularly
- Allow trading shifts when needed
Burned-out engineers make poor decisions during incidents.
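A periodic load review can be as simple as counting off-hours pages per engineer and flagging outliers. A sketch with an illustrative Page record and threshold; feed it from your alerting tool's export:

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Page:
    engineer: str
    fired_at: datetime

def off_hours_load(pages: list[Page], threshold: int = 5) -> dict[str, int]:
    """Return engineers whose off-hours page count exceeds the threshold."""
    counts = Counter(
        p.engineer for p in pages if p.fired_at.hour < 8 or p.fired_at.hour >= 20
    )
    return {name: n for name, n in counts.items() if n > threshold}
```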
During the Incident
Initial Response
When an incident is detected:
- Acknowledge: Confirm you’re responding.
- Assess severity: Determine severity level.
- Open incident channel: Create communication channel.
- Page relevant people: If severity warrants.
- Start timeline: Document what you know.
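These first steps are mostly mechanical, which makes them a good candidate for a bot or script so the responder can focus on the problem itself. A minimal sketch with placeholder helpers standing in for your chat and paging integrations:

```python
from datetime import datetime, timezone

def open_incident_channel(incident_id: str) -> str:
    """Placeholder: create a chat channel via your chat tool's API."""
    return f"#inc-{incident_id}"

def page_on_call(incident_id: str, severity: int) -> None:
    """Placeholder: page responders via your alerting tool."""
    print(f"paging for {incident_id} (SEV-{severity})")

def start_incident(incident_id: str, title: str, severity: int) -> dict:
    """Run the mechanical first steps and return an initial incident record."""
    now = datetime.now(timezone.utc)
    incident = {
        "id": incident_id,
        "title": title,
        "severity": severity,
        "channel": open_incident_channel(incident_id),
        "timeline": [(now, "incident opened, severity assessed")],
    }
    if severity <= 2:  # SEV-1/SEV-2: page beyond the primary on-call
        page_on_call(incident_id, severity)
    return incident
```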
Investigation
Follow the symptoms: Start with what users are experiencing. Work backward to causes.
Check recent changes: Deployments, config changes, traffic patterns. Correlation isn’t causation, but changes are prime suspects.
Use observability: Metrics, logs, traces. Let data guide investigation.
Communicate progress: Regular updates even when “still investigating.”
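For the "check recent changes" step, a small helper that pulls everything applied in a lookback window before the incident started gives you an immediate suspect list. A sketch with an illustrative Change record; source it from your deploy pipeline or audit log:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Change:
    kind: str        # "deploy", "config", "infra"
    service: str
    applied_at: datetime
    author: str

def recent_changes(changes: list[Change],
                   incident_start: datetime,
                   lookback: timedelta = timedelta(hours=2)) -> list[Change]:
    """Return changes applied shortly before the incident, newest first."""
    window_start = incident_start - lookback
    suspects = [c for c in changes if window_start <= c.applied_at <= incident_start]
    return sorted(suspects, key=lambda c: c.applied_at, reverse=True)
```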
Mitigation
Fix the problem or reduce impact:
Rollback: If recent change caused it, roll back.
Restart: Sometimes a service just needs a restart.
Scale: If overwhelmed, add capacity.
Disable: Turn off problematic feature.
Redirect: Route traffic away from affected components.
Mitigation first, root cause later. Stop the bleeding before surgery.
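The choice of mitigation usually follows a simple priority order. A rough sketch, with illustrative inputs that in practice come from the investigation in progress:

```python
def choose_mitigation(recent_deploy: bool,
                      feature_flagged: bool,
                      capacity_exhausted: bool) -> str:
    """Pick the fastest mitigation first, mirroring the order above."""
    if recent_deploy:
        return "rollback"          # cheapest when a recent change is the likely cause
    if feature_flagged:
        return "disable feature"   # kill switch limits blast radius
    if capacity_exhausted:
        return "scale out"
    return "restart or redirect traffic, then keep investigating"
```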
Resolution
Confirm the incident is resolved:
- User-facing impact eliminated
- Systems stable
- Monitoring shows normal behavior
Update status page. Close the incident channel (but preserve history).
After the Incident
Postmortem
Every significant incident deserves a postmortem. The goal: learn and improve, not blame.
Postmortem template:
# Incident Postmortem: [Title]
## Summary
[One paragraph summary of what happened]
## Timeline
[Chronological list of events]
## Impact
[What was affected, for how long, how many users]
## Root Cause
[Why did this happen?]
## Contributing Factors
[What made it worse or detection slower?]
## Action Items
[What will we do to prevent recurrence?]
## Lessons Learned
[What did we learn?]
Blameless Culture
Postmortems must be blameless. If people fear blame, they hide information. Hidden information prevents learning.
Focus on systems, not individuals:
- “The deployment process allowed an untested change” not “Alice deployed without testing”
- “Alerting didn’t detect the issue” not “Nobody noticed the problem”
When humans make mistakes, ask what system allowed the mistake.
Action Item Follow-Through
Postmortems without action item follow-through are theater. Track action items to completion:
- Assign owners
- Set deadlines
- Review in regular meetings
- Close when done
Incomplete action items mean the same incident will recur.
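A small script run before the weekly review keeps this honest by surfacing items that are open, unowned, or past due. A sketch with an illustrative ActionItem record; in practice these map to tickets in your tracker:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class ActionItem:
    description: str
    owner: Optional[str]
    due: Optional[date]
    done: bool = False

def needs_attention(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Return open items that are unowned, undated, or past due."""
    return [
        item for item in items
        if not item.done
        and (item.owner is None or item.due is None or item.due < today)
    ]
```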
Trend Analysis
Individual incidents matter. Patterns across incidents matter more.
Track:
- Incident frequency by service, type, cause
- Time to detect, time to resolve
- On-call burden by team
Patterns reveal systemic issues. A service with frequent incidents needs investment. A team with high on-call burden needs support.
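These metrics are straightforward to compute from an incident log. A minimal sketch of frequency by service plus mean time to detect and resolve, with illustrative record fields:

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean

@dataclass
class IncidentRecord:
    service: str
    started_at: datetime    # when the problem began
    detected_at: datetime   # when alerting or a human noticed
    resolved_at: datetime   # when user impact ended

def frequency_by_service(incidents: list[IncidentRecord]) -> Counter:
    return Counter(i.service for i in incidents)

def mean_time_to_detect(incidents: list[IncidentRecord]) -> timedelta:
    return timedelta(seconds=mean(
        (i.detected_at - i.started_at).total_seconds() for i in incidents
    ))

def mean_time_to_resolve(incidents: list[IncidentRecord]) -> timedelta:
    return timedelta(seconds=mean(
        (i.resolved_at - i.detected_at).total_seconds() for i in incidents
    ))
```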
Tools
Alerting: PagerDuty, Opsgenie, VictorOps for on-call management and escalation.
Communication: Slack, Microsoft Teams with dedicated incident channels.
Status pages: Atlassian Statuspage (formerly Statuspage.io) or Cachet for external communication.
Documentation: Confluence, Notion, Google Docs for postmortems and runbooks.
Incident tracking: Jira, custom tools for tracking incidents and action items.
Choose tools that integrate with your workflow. The best tool is the one people actually use.
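As a concrete integration example, most alerting tools accept events over a plain HTTP API. A hedged sketch against PagerDuty's Events API v2 (verify field names against the current docs); the routing key comes from an integration you configure in PagerDuty:

```python
import os
import requests

def trigger_page(summary: str, source: str, severity: str = "critical") -> str:
    """Send a trigger event to PagerDuty's Events API v2 and return the dedup key."""
    response = requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            "routing_key": os.environ["PAGERDUTY_ROUTING_KEY"],  # per-integration key
            "event_action": "trigger",
            "payload": {
                "summary": summary,
                "source": source,
                "severity": severity,  # "critical", "error", "warning", or "info"
            },
        },
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["dedup_key"]
```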
Key Takeaways
- Define clear roles (IC, tech lead, communications, scribe) to eliminate ambiguity
- Severity levels ensure appropriate response proportional to impact
- On-call rotations must be sustainable to prevent burnout
- Mitigate first, investigate root cause after impact is contained
- Blameless postmortems focus on systems, not individuals
- Track and complete action items; incomplete action items mean repeat incidents
- Analyze trends across incidents to identify systemic issues