Incident Management for Growing Teams

October 23, 2017

When you’re a five-person startup, incident response is simple: everyone knows about problems immediately, someone fixes them, and you move on. There’s no process because there doesn’t need to be.

At fifty people with multiple teams and services, this informality breaks down. People don’t know who’s responsible for what. Communication during incidents is chaotic. Knowledge from incidents isn’t captured. The same issues recur.

Scaling incident management requires intentional process without bureaucratic overhead. Here’s how to build it.

Incident Response Framework

Clear Roles

During incidents, ambiguity costs time. Define roles clearly:

Incident Commander (IC): Owns the incident. Makes decisions. Coordinates responders. Communicates status. Doesn’t necessarily fix things—coordinates those who do.

Technical Lead: Drives technical investigation and remediation. May delegate to specialists.

Communications Lead: Handles stakeholder communication. Updates status pages. Drafts customer notifications.

Scribe: Documents timeline, decisions, and actions. Creates record for postmortem.

For smaller incidents, one person may fill multiple roles. For major incidents, dedicated people in each role prevent context switching.

Severity Levels

Not all incidents are equal. Define severity levels with clear criteria:

SEV-1 (Critical): Complete outage, data loss, or security incident affecting most or all users.

Response: All hands, immediate escalation, external communication.

SEV-2 (High): Major functionality degraded or unavailable for a significant portion of users.

Response: On-call team, escalation to relevant teams, status page update.

SEV-3 (Medium): Minor functionality impaired; a workaround exists and impact is limited.

Response: On-call handles, escalate if needed.

SEV-4 (Low): Cosmetic issues or minor bugs with negligible user impact.

Response: Normal workflow, fix when convenient.

Clear severity levels ensure appropriate response. A SEV-1 shouldn’t get SEV-4 treatment, and SEV-4 shouldn’t trigger SEV-1 panic.
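Encoding the levels as data keeps humans and tooling working from the same definitions. Here is a minimal sketch in Python; the field names and flags are illustrative assumptions, not part of any particular paging tool.

# severity.py -- sketch of severity levels as data, so chat bots and paging
# tools apply the same definitions humans use. Field names are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class Severity:
    name: str
    criteria: str
    response: str
    page_immediately: bool
    update_status_page: bool

SEVERITIES = {
    "SEV-1": Severity(
        name="Critical",
        criteria="Complete outage, data loss, or security incident",
        response="All hands, immediate escalation, external communication",
        page_immediately=True,
        update_status_page=True,
    ),
    "SEV-2": Severity(
        name="High",
        criteria="Major functionality degraded for many users",
        response="On-call team, escalate to relevant teams, status page update",
        page_immediately=True,
        update_status_page=True,
    ),
    "SEV-3": Severity(
        name="Medium",
        criteria="Minor functionality impaired, workaround exists",
        response="On-call handles, escalate if needed",
        page_immediately=False,
        update_status_page=False,
    ),
    "SEV-4": Severity(
        name="Low",
        criteria="Cosmetic or low-impact issue",
        response="Normal workflow, fix when convenient",
        page_immediately=False,
        update_status_page=False,
    ),
}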

Communication Channels

Define where incident communication happens:

Incident channel: Dedicated chat channel per incident. All responders join. Technical discussion happens here.

Status page: External communication. Customers see current status and updates.

Internal updates: Regular updates to leadership and stakeholders. Separate from technical channel.

Separation prevents noise from reaching those who don’t need it and ensures relevant parties stay informed.
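If you use Slack, opening the per-incident channel can be scripted so every incident starts the same way. A minimal sketch using the slack_sdk Web API client, assuming a bot token with channel-creation and posting scopes; the channel naming convention and message format are ours, not the API's.

# open_incident_channel.py -- sketch: create a per-incident Slack channel
# and post an initial summary. Requires the slack_sdk package and a bot
# token; scopes and error handling are kept minimal for illustration.
import os
from datetime import date
from slack_sdk import WebClient

def open_incident_channel(incident_id: str, severity: str, summary: str) -> str:
    client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

    # e.g. "inc-2017-10-23-checkout-errors" -- our naming convention, not Slack's
    name = f"inc-{date.today().isoformat()}-{incident_id}".lower()
    channel = client.conversations_create(name=name)
    channel_id = channel["channel"]["id"]

    client.chat_postMessage(
        channel=channel_id,
        text=f"*{severity}* incident opened: {summary}\n"
             "All technical discussion happens here. "
             "Stakeholder updates go out separately.",
    )
    return channel_id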

On-Call Structure

Rotation Design

Design on-call rotations that are sustainable:

Coverage: 24/7 coverage requires either follow-the-sun (teams in different time zones) or rotation (each person covers nights periodically).

Duration: Week-long rotations are common. Shorter rotations (2-3 days) reduce individual burden but increase handoff frequency.

Handoff: Explicit handoff between rotation shifts. Outgoing on-call briefs incoming on current issues.

Backup: Secondary on-call for escalation or if primary is unavailable.
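A rotation is just a deterministic mapping from week to engineer, which makes it easy to sanity-check before loading into a paging tool. A small sketch assuming week-long shifts with the next person in line acting as secondary; the names and start date are placeholders.

# rotation.py -- sketch of a week-long on-call rotation with a primary and
# a secondary (the next person in line). Purely illustrative; real schedules
# usually live in the paging tool, with overrides for vacations and swaps.
from datetime import date, timedelta

ENGINEERS = ["ana", "bo", "chen", "dara", "eli"]   # placeholder names
ROTATION_START = date(2017, 10, 23)                # a Monday

def on_call_for(day):
    """Return (primary, secondary) for the given day."""
    weeks_elapsed = (day - ROTATION_START).days // 7
    primary = ENGINEERS[weeks_elapsed % len(ENGINEERS)]
    secondary = ENGINEERS[(weeks_elapsed + 1) % len(ENGINEERS)]
    return primary, secondary

if __name__ == "__main__":
    # Print the next eight handoffs so outgoing and incoming know who briefs whom.
    for week in range(8):
        monday = ROTATION_START + timedelta(weeks=week)
        print(monday, *on_call_for(monday))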

On-Call Expectations

Set clear expectations: how quickly a page must be acknowledged, what the on-call engineer is empowered to do (roll back, restart, escalate), and when to pull in the secondary. Unclear expectations create frustration and inconsistent response.

Preventing Burnout

On-call is stressful. Prevent burnout: keep rotations staffed so shifts are infrequent, compensate heavy off-hours pages with time off, and treat noisy alerts as bugs to fix rather than background noise. Burned-out engineers make poor decisions during incidents.

During the Incident

Initial Response

When an incident is detected:

  1. Acknowledge: Confirm you’re responding.
  2. Assess severity: Determine severity level.
  3. Open incident channel: Create communication channel.
  4. Page relevant people: If severity warrants.
  5. Start timeline: Document what you know.
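The timeline is easiest to keep if the first responder starts it the moment they acknowledge, even as a plain list of timestamped notes. A sketch of a minimal incident record whose fields mirror the steps above; it is not tied to any particular tool.

# incident_record.py -- sketch of a minimal incident record mirroring the
# initial-response steps: acknowledge, assess severity, open a channel,
# page, and start the timeline. Fields are illustrative only.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Incident:
    title: str
    severity: str                      # e.g. "SEV-2"
    channel: str = ""                  # chat channel for responders
    timeline: list = field(default_factory=list)

    def log(self, event):
        """Append a timestamped entry; this becomes the postmortem timeline."""
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        self.timeline.append(f"{stamp}  {event}")

# Usage sketch:
incident = Incident(title="Checkout errors spiking", severity="SEV-2")
incident.log("Acknowledged page, assessing impact")
incident.channel = "#inc-2017-10-23-checkout-errors"
incident.log(f"Opened {incident.channel}, paging payments on-call")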

Investigation

Follow the symptoms: Start with what users are experiencing. Work backward to causes.

Check recent changes: Deployments, config changes, traffic patterns. Correlation isn’t causation, but changes are prime suspects.

Use observability: Metrics, logs, traces. Let data guide investigation.

Communicate progress: Regular updates even when “still investigating.”
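If your metrics live in Prometheus, letting data guide the investigation can start with a single query against its HTTP API. A sketch that checks the current 5xx rate; the endpoint is Prometheus's standard query API, but the metric name and server address are assumptions about your setup.

# error_rate.py -- sketch: query Prometheus for the current 5xx rate so the
# investigation starts from data rather than guesses. The metric name
# (http_requests_total) and PROM_URL are assumptions about your stack.
import requests

PROM_URL = "http://prometheus.internal:9090"   # placeholder address

def error_rate_per_second():
    query = 'sum(rate(http_requests_total{status=~"5.."}[5m]))'
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query})
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return float(results[0]["value"][1]) if results else 0.0

print(f"Current 5xx rate: {error_rate_per_second():.2f} req/s")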

Mitigation

Fix the problem or reduce impact:

Rollback: If recent change caused it, roll back.

Restart: Sometimes a service just needs a restart.

Scale: If overwhelmed, add capacity.

Disable: Turn off problematic feature.

Redirect: Route traffic away from affected components.

Mitigation first, root cause later. Stop the bleeding before surgery.
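Disabling a feature is only fast if flags can be flipped without a deploy. A sketch that assumes services read feature flags from Redis at request time; the key scheme, and the choice of Redis itself, are assumptions about your stack.

# kill_switch.py -- sketch of flipping a feature flag off during an incident,
# assuming services read flags from Redis on each request. Key names are
# illustrative; any dynamic config store works the same way.
import redis

r = redis.Redis(host="redis.internal", port=6379)

def disable_feature(flag, reason):
    r.set(f"feature:{flag}:enabled", "false")
    r.set(f"feature:{flag}:disabled_reason", reason)   # breadcrumb for the postmortem
    print(f"Disabled {flag}: {reason}")

disable_feature("new-checkout-flow", "SEV-2: error spike after 14:05 deploy")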

Resolution

Confirm the incident is resolved: verify that user-facing symptoms are gone and metrics are back to baseline, not just that a fix was deployed. Then update the status page and close the incident channel (but preserve its history).
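It helps to watch the system stay healthy for a quiet period before declaring resolution. A sketch that polls a health endpoint; the URL, interval, and duration are placeholders for your own checks.

# verify_resolution.py -- sketch: confirm the service stays healthy for a
# quiet period before closing the incident. URL, interval, and duration
# are placeholders.
import time
import requests

HEALTH_URL = "https://api.example.com/healthz"   # placeholder endpoint

def stable_for(minutes=15, interval_s=30):
    deadline = time.time() + minutes * 60
    while time.time() < deadline:
        try:
            if requests.get(HEALTH_URL, timeout=5).status_code != 200:
                return False             # symptom is back -- not resolved
        except requests.RequestException:
            return False                 # can't reach the service -- not resolved
        time.sleep(interval_s)
    return True

if stable_for():
    print("Healthy for 15 minutes -- safe to resolve and update the status page.")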

After the Incident

Postmortem

Every significant incident deserves a postmortem. The goal: learn and improve, not blame.

Postmortem template:

# Incident Postmortem: [Title]

## Summary
[One paragraph summary of what happened]

## Timeline
[Chronological list of events]

## Impact
[What was affected, for how long, how many users]

## Root Cause
[Why did this happen?]

## Contributing Factors
[What made it worse or detection slower?]

## Action Items
[What will we do to prevent recurrence?]

## Lessons Learned
[What did we learn?]

Blameless Culture

Postmortems must be blameless. If people fear blame, they hide information. Hidden information prevents learning.

Focus on systems, not individuals. When humans make mistakes, ask what system allowed the mistake: a missing guardrail, an unclear runbook, a confusing interface.

Action Item Follow-Through

Postmortems without action item follow-through are theater. Track action items to completion: give each one an owner and a due date, and review open items regularly. Incomplete action items mean the same incident will recur.
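If action items live in Jira, an open-items report can run on a schedule and land in the team channel so nothing quietly stalls. A sketch against Jira's REST search endpoint; the project key, label, and credentials are assumptions about your configuration.

# open_action_items.py -- sketch: list unfinished postmortem action items
# from Jira. The project key (OPS), label (postmortem-action), and auth
# method are assumptions about your setup.
import os
import requests

JIRA_URL = "https://yourcompany.atlassian.net"     # placeholder
AUTH = (os.environ["JIRA_USER"], os.environ["JIRA_TOKEN"])

def open_action_items():
    jql = "project = OPS AND labels = postmortem-action AND statusCategory != Done"
    resp = requests.get(
        f"{JIRA_URL}/rest/api/2/search",
        params={"jql": jql, "fields": "summary,assignee"},
        auth=AUTH,
    )
    resp.raise_for_status()
    return [
        f'{issue["key"]}: {issue["fields"]["summary"]}'
        for issue in resp.json()["issues"]
    ]

for item in open_action_items():
    print(item)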

Trend Analysis

Individual incidents matter. Patterns across incidents matter more.

Track incident frequency and severity by service, time to detection and resolution, and on-call load by team. Patterns reveal systemic issues. A service with frequent incidents needs investment. A team with high on-call burden needs support.
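A trend report does not need a BI tool to get started. A short script over an incident log export answers the first questions; the CSV columns assumed here (service, severity) are a guess at your tracker's export format.

# incident_trends.py -- sketch: count incidents per service and per severity
# from a CSV export. Column names are assumptions about your tracker.
import csv
from collections import Counter

by_service = Counter()
by_severity = Counter()

with open("incidents.csv", newline="") as f:
    for row in csv.DictReader(f):
        by_service[row["service"]] += 1
        by_severity[row["severity"]] += 1

print("Incidents per service (most first):")
for service, count in by_service.most_common():
    print(f"  {service}: {count}")

print("Incidents per severity:")
for severity, count in sorted(by_severity.items()):
    print(f"  {severity}: {count}")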

Tools

Alerting: PagerDuty, Opsgenie, VictorOps for on-call management and escalation.

Communication: Slack, Microsoft Teams with dedicated incident channels.

Status pages: Atlassian Statuspage (formerly Statuspage.io), Cachet for external communication.

Documentation: Confluence, Notion, Google Docs for postmortems and runbooks.

Incident tracking: Jira, custom tools for tracking incidents and action items.

Choose tools that integrate with your workflow. The best tool is the one people actually use.

Key Takeaways