When you’re a five-person startup, incident response is simple: everyone knows about problems immediately, someone fixes them, and you move on. There’s no process because there doesn’t need to be.
At fifty people with multiple teams and services, this informality breaks down. People don’t know who’s responsible for what. Communication during incidents is chaotic. Knowledge from incidents isn’t captured. The same issues recur.
Scaling incident management requires intentional process without bureaucratic overhead. Here’s how to build it.
Incident Response Framework
Clear Roles
During incidents, ambiguity costs time. Define roles clearly:
Incident Commander (IC): Owns the incident. Makes decisions. Coordinates responders. Communicates status. Doesn’t necessarily fix things—coordinates those who do.
Technical Lead: Drives technical investigation and remediation. May delegate to specialists.
Communications Lead: Handles stakeholder communication. Updates status pages. Drafts customer notifications.
Scribe: Documents timeline, decisions, and actions. Creates record for postmortem.
For smaller incidents, one person may fill multiple roles. For major incidents, dedicated people in each role prevent context switching.
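Making the assignments explicit, even in a small record your tooling carries around, removes the "who's running this?" question mid-incident. A minimal sketch in Python; the structure, names, and fields are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IncidentRoles:
    """Explicit role assignments for one incident (hypothetical structure)."""
    incident_commander: str
    technical_lead: str
    communications_lead: Optional[str] = None  # IC often covers this on small incidents
    scribe: Optional[str] = None               # likewise

    def held_by(self, person: str) -> list[str]:
        """Roles a given person holds; useful for spotting overload on major incidents."""
        return [role for role, holder in vars(self).items() if holder == person]

# SEV-3: one person may wear several hats.
small = IncidentRoles(incident_commander="dana", technical_lead="dana")

# SEV-1: dedicated people in every role.
major = IncidentRoles(
    incident_commander="dana",
    technical_lead="arjun",
    communications_lead="mei",
    scribe="tomas",
)
```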
Severity Levels
Not all incidents are equal. Define severity levels with clear criteria:
SEV-1 (Critical):
- Complete outage
- Data loss or corruption
- Security breach
- Major customer impact
Response: All hands, immediate escalation, external communication.
SEV-2 (High):
- Partial outage
- Significant degradation
- Major feature unavailable
Response: On-call team, escalation to relevant teams, status page update.
SEV-3 (Medium):
- Minor degradation
- Non-critical feature impact
- Affecting subset of users
Response: On-call handles, escalate if needed.
SEV-4 (Low):
- Minimal impact
- Cosmetic issues
- Internal tools
Response: Normal workflow, fix when convenient.
Clear severity levels ensure appropriate response. A SEV-1 shouldn’t get SEV-4 treatment, and SEV-4 shouldn’t trigger SEV-1 panic.
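Severity criteria stay consistent when they're encoded somewhere humans and tooling both read, not just described on a wiki page. A minimal sketch, with illustrative criteria flags standing in for your real definitions:

```python
from enum import Enum

class Severity(Enum):
    SEV1 = 1  # critical: all hands, external communication
    SEV2 = 2  # high: on-call team, status page update
    SEV3 = 3  # medium: on-call handles, escalate if needed
    SEV4 = 4  # low: normal workflow

def classify(complete_outage: bool,
             data_loss: bool,
             security_breach: bool,
             partial_outage: bool,
             major_feature_down: bool,
             user_facing: bool) -> Severity:
    """Map incident characteristics to a severity level.

    The flags are illustrative; encode whatever criteria your
    organization actually uses.
    """
    if complete_outage or data_loss or security_breach:
        return Severity.SEV1
    if partial_outage or major_feature_down:
        return Severity.SEV2
    if user_facing:
        return Severity.SEV3
    return Severity.SEV4
```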
Communication Channels
Define where incident communication happens:
Incident channel: Dedicated chat channel per incident. All responders join. Technical discussion happens here.
Status page: External communication. Customers see current status and updates.
Internal updates: Regular updates to leadership and stakeholders. Separate from technical channel.
Separation prevents noise from reaching those who don’t need it and ensures relevant parties stay informed.
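As one example of the split, a small bot can create the per-incident channel and keep stakeholder updates in a separate, quieter channel. A sketch assuming Slack via the slack_sdk package and a bot token with permission to create channels and post; the channel names are assumptions:

```python
import os
from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

def open_incident_channel(incident_id: str) -> str:
    """Create a dedicated channel for technical discussion during one incident."""
    response = client.conversations_create(name=f"inc-{incident_id}")
    return response["channel"]["id"]

def post_internal_update(summary: str) -> None:
    """Leadership and stakeholder updates go to a separate, quieter channel."""
    client.chat_postMessage(channel="#incident-updates", text=summary)

# Status page updates go through your status page provider's API or UI;
# they are intentionally not mixed into either chat channel.
```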
On-Call Structure
Rotation Design
Design on-call rotations that are sustainable:
Coverage: 24/7 coverage requires either follow-the-sun handoffs between teams in different time zones or a rotation in which each person periodically covers nights.
Duration: Week-long rotations are common. Shorter rotations (2-3 days) reduce individual burden but increase handoff frequency.
Handoff: Explicit handoff between rotation shifts. Outgoing on-call briefs incoming on current issues.
Backup: Secondary on-call for escalation or if primary is unavailable.
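A simple generator makes the rotation and its backup visible months in advance. A minimal sketch assuming week-long shifts handed off on a fixed weekday, with the secondary offset by one position:

```python
from datetime import date, timedelta

def build_rotation(engineers: list[str], start: date, weeks: int):
    """Yield (week_start, primary, secondary) tuples for the schedule."""
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        secondary = engineers[(week + 1) % len(engineers)]
        yield start + timedelta(weeks=week), primary, secondary

for week_start, primary, secondary in build_rotation(
    ["ana", "bo", "chen", "dee", "eli"], date(2024, 1, 1), weeks=6
):
    print(f"{week_start}: primary={primary}, secondary={secondary}")
```

With five engineers on a weekly rotation, each person carries the pager roughly every five weeks, which lines up with the burnout guidance below.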
On-Call Expectations
Set clear expectations:
- Response time (e.g., acknowledge within 5 minutes, start working within 15)
- Escalation criteria (when to wake others up)
- Compensation (time off, additional pay)
- Support (laptop, reliable internet, phone)
Unclear expectations create frustration and inconsistent response.
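Expectations that live in tooling get enforced; expectations that live in memory get argued about. A minimal sketch encoding the acknowledge window above and deciding when to escalate to the secondary; the numbers are the example values and should be adjusted per team:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class ResponseExpectations:
    acknowledge_within: timedelta = timedelta(minutes=5)
    start_work_within: timedelta = timedelta(minutes=15)

def should_escalate_to_secondary(paged_at: datetime,
                                 acknowledged_at: Optional[datetime],
                                 expectations: ResponseExpectations,
                                 now: datetime) -> bool:
    """Escalate if the primary hasn't acknowledged within the agreed window."""
    if acknowledged_at is not None:
        return False
    return now - paged_at > expectations.acknowledge_within
```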
Preventing Burnout
On-call is stressful. Prevent burnout:
- Limit on-call frequency (each person on call no more than once every 4-6 weeks)
- Compensate appropriately
- Fix noisy alerts that wake people unnecessarily
- Review on-call load regularly
- Allow trading shifts when needed
Burned-out engineers make poor decisions during incidents.
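A periodic load review can be as simple as counting off-hours pages per engineer and flagging outliers. A sketch with an illustrative Page record and threshold; feed it from your alerting tool's export:

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Page:
    engineer: str
    fired_at: datetime

def off_hours_load(pages: list[Page], threshold: int = 5) -> dict[str, int]:
    """Return engineers whose off-hours page count exceeds the threshold."""
    counts = Counter(
        p.engineer for p in pages if p.fired_at.hour < 8 or p.fired_at.hour >= 20
    )
    return {name: n for name, n in counts.items() if n > threshold}
```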
During the Incident
Initial Response
When an incident is detected:
- Acknowledge: Confirm you’re responding.
- Assess severity: Determine severity level.
- Open incident channel: Create communication channel.
- Page relevant people: If severity warrants.
- Start timeline: Document what you know.
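These first steps are mostly mechanical, which makes them a good candidate for a bot or script so the responder can focus on the problem itself. A minimal sketch with placeholder helpers standing in for your chat and paging integrations:

```python
from datetime import datetime, timezone

def open_incident_channel(incident_id: str) -> str:
    """Placeholder: create a chat channel via your chat tool's API."""
    return f"#inc-{incident_id}"

def page_on_call(incident_id: str, severity: int) -> None:
    """Placeholder: page responders via your alerting tool."""
    print(f"paging for {incident_id} (SEV-{severity})")

def start_incident(incident_id: str, title: str, severity: int) -> dict:
    """Run the mechanical first steps and return an initial incident record."""
    now = datetime.now(timezone.utc)
    incident = {
        "id": incident_id,
        "title": title,
        "severity": severity,
        "channel": open_incident_channel(incident_id),
        "timeline": [(now, "incident opened, severity assessed")],
    }
    if severity <= 2:  # SEV-1/SEV-2: page beyond the primary on-call
        page_on_call(incident_id, severity)
    return incident
```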
Investigation
Follow the symptoms: Start with what users are experiencing. Work backward to causes.
Check recent changes: Deployments, config changes, traffic patterns. Correlation isn’t causation, but changes are prime suspects.
Use observability: Metrics, logs, traces. Let data guide investigation.
Communicate progress: Regular updates even when “still investigating.”
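For the "check recent changes" step, a small helper that pulls everything applied in a lookback window before the incident started gives you an immediate suspect list. A sketch with an illustrative Change record; source it from your deploy pipeline or audit log:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Change:
    kind: str        # "deploy", "config", "infra"
    service: str
    applied_at: datetime
    author: str

def recent_changes(changes: list[Change],
                   incident_start: datetime,
                   lookback: timedelta = timedelta(hours=2)) -> list[Change]:
    """Return changes applied shortly before the incident, newest first."""
    window_start = incident_start - lookback
    suspects = [c for c in changes if window_start <= c.applied_at <= incident_start]
    return sorted(suspects, key=lambda c: c.applied_at, reverse=True)
```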
Mitigation
Fix the problem or reduce impact:
Rollback: If recent change caused it, roll back.
Restart: Sometimes a service just needs a restart.
Scale: If overwhelmed, add capacity.
Disable: Turn off problematic feature.
Redirect: Route traffic away from affected components.
Mitigation first, root cause later. Stop the bleeding before surgery.
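The choice of mitigation usually follows a simple priority order. A rough sketch, with illustrative inputs that in practice come from the investigation in progress:

```python
def choose_mitigation(recent_deploy: bool,
                      feature_flagged: bool,
                      capacity_exhausted: bool) -> str:
    """Pick the fastest mitigation first, mirroring the order above."""
    if recent_deploy:
        return "rollback"          # cheapest when a recent change is the likely cause
    if feature_flagged:
        return "disable feature"   # kill switch limits blast radius
    if capacity_exhausted:
        return "scale out"
    return "restart or redirect traffic, then keep investigating"
```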
Resolution
Confirm the incident is resolved:
- User-facing impact eliminated
- Systems stable
- Monitoring shows normal behavior
Update status page. Close the incident channel (but preserve history).
After the Incident
Postmortem
Every significant incident deserves a postmortem. The goal: learn and improve, not blame.
Postmortem template:
# Incident Postmortem: [Title]
## Summary
[One paragraph summary of what happened]
## Timeline
[Chronological list of events]
## Impact
[What was affected, for how long, how many users]
## Root Cause
[Why did this happen?]
## Contributing Factors
[What made it worse or detection slower?]
## Action Items
[What will we do to prevent recurrence?]
## Lessons Learned
[What did we learn?]
Blameless Culture
Postmortems must be blameless. If people fear blame, they hide information. Hidden information prevents learning.
Focus on systems, not individuals:
- “The deployment process allowed an untested change” not “Alice deployed without testing”
- “Alerting didn’t detect the issue” not “Nobody noticed the problem”
When humans make mistakes, ask what system allowed the mistake.
Action Item Follow-Through
Postmortems without action item follow-through are theater. Track action items to completion:
- Assign owners
- Set deadlines
- Review in regular meetings
- Close when done
Incomplete action items mean the same incident will recur.
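A small script run before the weekly review keeps this honest by surfacing items that are open, unowned, or past due. A sketch with an illustrative ActionItem record; in practice these map to tickets in your tracker:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class ActionItem:
    description: str
    owner: Optional[str]
    due: Optional[date]
    done: bool = False

def needs_attention(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Return open items that are unowned, undated, or past due."""
    return [
        item for item in items
        if not item.done
        and (item.owner is None or item.due is None or item.due < today)
    ]
```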
Trend Analysis
Individual incidents matter. Patterns across incidents matter more.
Track:
- Incident frequency by service, type, cause
- Time to detect, time to resolve
- On-call burden by team
Patterns reveal systemic issues. A service with frequent incidents needs investment. A team with high on-call burden needs support.
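These metrics are straightforward to compute from an incident log. A minimal sketch of frequency by service plus mean time to detect and resolve, with illustrative record fields:

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean

@dataclass
class IncidentRecord:
    service: str
    started_at: datetime    # when the problem began
    detected_at: datetime   # when alerting or a human noticed
    resolved_at: datetime   # when user impact ended

def frequency_by_service(incidents: list[IncidentRecord]) -> Counter:
    return Counter(i.service for i in incidents)

def mean_time_to_detect(incidents: list[IncidentRecord]) -> timedelta:
    return timedelta(seconds=mean(
        (i.detected_at - i.started_at).total_seconds() for i in incidents
    ))

def mean_time_to_resolve(incidents: list[IncidentRecord]) -> timedelta:
    return timedelta(seconds=mean(
        (i.resolved_at - i.detected_at).total_seconds() for i in incidents
    ))
```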
Tools
Alerting: PagerDuty, Opsgenie, VictorOps for on-call management and escalation.
Communication: Slack, Microsoft Teams with dedicated incident channels.
Status pages: Atlassian Statuspage (formerly Statuspage.io) or Cachet for external communication.
Documentation: Confluence, Notion, Google Docs for postmortems and runbooks.
Incident tracking: Jira, custom tools for tracking incidents and action items.
Choose tools that integrate with your workflow. The best tool is the one people actually use.
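As a concrete integration example, most alerting tools accept events over a plain HTTP API. A hedged sketch against PagerDuty's Events API v2 (verify field names against the current docs); the routing key comes from an integration you configure in PagerDuty:

```python
import os
import requests

def trigger_page(summary: str, source: str, severity: str = "critical") -> str:
    """Send a trigger event to PagerDuty's Events API v2 and return the dedup key."""
    response = requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            "routing_key": os.environ["PAGERDUTY_ROUTING_KEY"],  # per-integration key
            "event_action": "trigger",
            "payload": {
                "summary": summary,
                "source": source,
                "severity": severity,  # "critical", "error", "warning", or "info"
            },
        },
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["dedup_key"]
```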
Key Takeaways
- Define clear roles (IC, tech lead, communications, scribe) to eliminate ambiguity
- Severity levels ensure appropriate response proportional to impact
- On-call rotations must be sustainable to prevent burnout
- Mitigate first, investigate root cause after impact is contained
- Blameless postmortems focus on systems, not individuals
- Track and complete action items; incomplete action items mean repeat incidents
- Analyze trends across incidents to identify systemic issues