Incident Management: From Detection to Resolution

November 29, 2021

Incidents happen. Systems fail, deploys go wrong, dependencies break. What separates mature engineering organizations isn’t fewer incidents—it’s how they respond. Good incident management minimizes blast radius, speeds recovery, and turns failures into learning.

Here’s how to build an effective incident management practice.

Incident Lifecycle

The Flow

Detection ──► Triage ──► Response ──► Resolution ──► Postmortem
    │           │          │            │              │
    ▼           ▼          ▼            ▼              ▼
  Alert      Severity   Mitigate    Confirm         Learn
  Report     Assign     Diagnose    Verify          Improve
  Notice     Escalate   Fix         Communicate     Prevent

Detection

Alerting Philosophy

alerting_principles:
  alert_on_symptoms:
    good: "Error rate > 5%"
    bad: "CPU > 80%"
    why: Users feel symptoms, not causes

  actionable:
    good: "Database connection pool exhausted"
    bad: "Something might be wrong"
    why: Alerts need a clear response

  appropriate_urgency:
    page: User-facing impact, needs immediate attention
    ticket: Degradation that can wait hours
    log: Information for later analysis

  avoid_alert_fatigue:
    - No "just in case" alerts
    - Review and tune regularly
    - Every page should matter
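
As a rough sketch of what "alert on symptoms" looks like in practice, the decision below pages only when a user-visible signal crosses a threshold. The metric names and thresholds are illustrative placeholders, not recommendations.

from typing import Optional

def alert_decision(total_requests: int, failed_requests: int,
                   p99_latency_ms: float) -> Optional[str]:
    """Return 'page', 'ticket', or None based on user-facing symptoms only."""
    if total_requests == 0:
        return None                # no traffic in the window, nothing to judge
    error_rate = failed_requests / total_requests
    if error_rate > 0.05:          # users are seeing failures right now
        return "page"
    if p99_latency_ms > 2000:      # degradation users can feel, but survivable
        return "ticket"
    return None                    # CPU, memory, etc. never enter the decision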

Detection Sources

detection_sources:
  automated:
    - Monitoring alerts
    - Anomaly detection
    - Synthetic monitoring
    - Error tracking (Sentry, etc.)

  human:
    - Customer reports
    - Internal reports
    - Social media
    - Support tickets

  proactive:
    - Chaos engineering
    - Load testing
    - Security scanning
    - Capacity monitoring

Triage

Severity Classification

severity_levels:
  sev1_critical:
    criteria:
      - Complete service outage
      - Data loss or corruption
      - Security breach
      - Revenue-impacting
    response:
      - All hands on deck
      - Executive notification
      - War room activated
      - Public communication
    target_response: 5 minutes
    target_resolution: 1 hour

  sev2_major:
    criteria:
      - Partial outage
      - Significant degradation
      - Workaround exists but painful
    response:
      - On-call responds
      - Team escalation if needed
      - Status page update
    target_response: 15 minutes
    target_resolution: 4 hours

  sev3_minor:
    criteria:
      - Minor feature impact
      - Easy workaround available
      - Small user subset
    response:
      - On-call investigates
      - Fix in normal workflow
    target_response: 1 hour
    target_resolution: 24 hours

  sev4_low:
    criteria:
      - Cosmetic issues
      - No user impact
      - Can wait for normal process
    response:
      - Create ticket
      - Prioritize with normal work
    target_response: Next business day
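
Keeping these targets in code (or in config that code reads) makes severity less of a judgment call that every responder makes differently. A minimal sketch with the targets above expressed as data; notify_oncall() is a hypothetical hook into your paging tool.

# Sketch: severity policy as data, so response targets drive automation.
SEVERITY_POLICY = {
    "sev1": {"response_min": 5,  "resolution_min": 60,   "war_room": True},
    "sev2": {"response_min": 15, "resolution_min": 240,  "war_room": False},
    "sev3": {"response_min": 60, "resolution_min": 1440, "war_room": False},
    "sev4": {"response_min": None, "resolution_min": None, "war_room": False},  # next business day
}

def open_incident(severity: str, summary: str) -> dict:
    """Create an incident record that carries its own response targets."""
    policy = SEVERITY_POLICY[severity]
    incident = {"severity": severity, "summary": summary, **policy}
    # notify_oncall(incident)  # hypothetical: page on-call, open #incident-<n>
    return incident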

Initial Assessment

triage_questions:
  impact:
    - How many users affected?
    - Which features impacted?
    - Is there data impact?
    - What's the business impact?

  scope:
    - Is it getting worse?
    - Is it contained?
    - Which services are involved?

  urgency:
    - Can it wait?
    - Is there a workaround?
    - What's the trend?

Response

Incident Command

incident_roles:
  incident_commander:
    responsibilities:
      - Owns incident resolution
      - Coordinates response
      - Makes decisions
      - Escalates when needed
    not_responsible_for:
      - Fixing the problem directly
      - Writing the postmortem

  communications_lead:
    responsibilities:
      - Status page updates
      - Internal communication
      - Customer communication
      - Stakeholder updates

  technical_lead:
    responsibilities:
      - Leads technical investigation
      - Coordinates technical responders
      - Proposes solutions
      - Implements fixes

  scribe:
    responsibilities:
      - Documents timeline
      - Records decisions
      - Captures key information
      - Maintains incident channel

Communication

communication_templates:
  initial_notice:
    internal: |
      🔴 INCIDENT: [Brief description]
      Severity: [SEV1/2/3]
      Impact: [User-facing impact]
      Status: Investigating
      IC: @name
      Channel: #incident-[number]

  status_update:
    interval: Every 30 minutes minimum for SEV1/2
    template: |
      Update: [What we know]
      Actions: [What we're doing]
      ETA: [When we expect resolution, if known]
      Next update: [Time]

  customer_facing:
    avoid:
      - Technical jargon
      - Blame
      - Speculation
    include:
      - What's happening
      - Impact on customers
      - What we're doing
      - When to expect the next update
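
Updates posted by hand tend to drift off cadence. A small helper that fills the status-update template and posts it to the incident channel keeps the rhythm honest; the sketch below assumes a Slack-style incoming webhook that accepts a JSON "text" payload, and the URL is a placeholder.

import requests  # assumes the requests library; the webhook URL is a placeholder

WEBHOOK_URL = "https://hooks.example.com/services/incident-channel"

def post_status_update(known: str, actions: str, eta: str, next_update: str) -> None:
    """Fill the status-update template above and post it to the incident channel."""
    text = (
        f"Update: {known}\n"
        f"Actions: {actions}\n"
        f"ETA: {eta}\n"
        f"Next update: {next_update}"
    )
    # Slack-style incoming webhooks accept a JSON body with a 'text' field.
    resp = requests.post(WEBHOOK_URL, json={"text": text}, timeout=10)
    resp.raise_for_status()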

Mitigation Strategies

mitigation_priority:
  1_stop_bleeding:
    - Rollback deployment
    - Disable feature flag
    - Scale up
    - Failover
    - Block bad traffic

  2_restore_service:
    - Restart services
    - Clear queues
    - Restore from backup
    - Switch to degraded mode

  3_root_cause:
    - Only after stable
    - Proper investigation
    - No rush to blame
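
"Disable feature flag" is only a fast mitigation if the kill switch is a single, rehearsed command. A sketch assuming flags live in Redis under a feature:<name> key; the key layout and client are assumptions for illustration, not a standard.

import redis  # assumes redis-py and a Redis-backed flag store

r = redis.Redis(host="flags.internal", port=6379)

def kill_feature(flag_name: str) -> None:
    """Hard-disable a feature flag as a stop-the-bleeding mitigation."""
    # Key layout (feature:<name> = "off") is illustrative only.
    r.set(f"feature:{flag_name}", "off")
    print(f"feature:{flag_name} set to off -- watch error rates for recovery")

# Example: kill_feature("new_checkout_flow")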

Runbooks

# Example runbook
runbook: database_connection_exhaustion

symptoms:
  - "Connection pool exhausted" errors
  - Request timeouts
  - Slow response times

investigation:
  - Check: pg_stat_activity for connections
  - Check: Application connection pool metrics
  - Check: Recent deployments
  - Check: Traffic spike

mitigation:
  quick:
    - Increase connection pool size (if headroom)
    - Restart application pods
    - Kill idle connections: SELECT pg_terminate_backend(pid)...

  longer_term:
    - Identify connection leak
    - Review connection pool configuration
    - Consider PgBouncer

escalation:
  - If database unresponsive: Page DBA
  - If application-side: Page service owner

references:
  - Link to monitoring dashboard
  - Link to database documentation
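
The investigation steps above can usually be scripted, so the on-call isn't composing queries from memory mid-incident. A sketch using psycopg2 to summarize pg_stat_activity by connection state; the DSN is a placeholder.

import psycopg2  # assumes psycopg2 is available; the DSN below is a placeholder

DSN = "postgresql://readonly@db.internal:5432/app"

def connection_summary():
    """Count connections by state -- the first check for pool exhaustion."""
    with psycopg2.connect(DSN) as conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT state, count(*) FROM pg_stat_activity "
                "GROUP BY state ORDER BY count(*) DESC"
            )
            return cur.fetchall()

# Many 'idle' or 'idle in transaction' rows usually point at a connection
# leak rather than genuine load.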

Resolution

Confirming Resolution

resolution_checklist:
  verify:
    - [ ] Error rates back to normal
    - [ ] User-facing metrics recovered
    - [ ] No new related errors
    - [ ] Synthetic checks passing

  communicate:
    - [ ] Update status page
    - [ ] Notify stakeholders
    - [ ] Close incident channel (after delay)

  preserve:
    - [ ] Capture timeline
    - [ ] Save relevant logs/metrics
    - [ ] Note key decisions
    - [ ] Schedule postmortem
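
The "verify" items can be partly automated so resolution is declared against metrics rather than gut feel: hold the incident open until the error rate has stayed healthy for a full soak window. query_error_rate() below is a hypothetical hook into your metrics backend.

import time

def query_error_rate() -> float:
    """Hypothetical hook: return the current user-facing error rate (0.0-1.0)."""
    raise NotImplementedError("wire this to your metrics backend")

def confirm_resolution(threshold: float = 0.01, soak_minutes: int = 15) -> bool:
    """Return True only if the error rate stays below threshold for the soak window."""
    deadline = time.time() + soak_minutes * 60
    while time.time() < deadline:
        if query_error_rate() > threshold:
            return False        # still unhealthy; keep the incident open
        time.sleep(60)          # re-check once a minute
    return True                 # healthy for the whole window -- safe to resolve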

Postmortem

Blameless Culture

blameless_postmortem:
  principles:
    - People are not the root cause
    - Systems allow failures
    - Humans make mistakes—systems should prevent/catch
    - Focus on learning, not blame

  language:
    avoid: "John should have checked..."
    prefer: "The deployment process didn't verify..."

    avoid: "The engineer made an error"
    prefer: "The system accepted invalid input"

Postmortem Template

postmortem:
  title: "[Date] - Brief description"

  summary:
    duration: Start to resolution time
    impact: Users/revenue affected
    severity: SEV level
    root_cause: One-line summary

  timeline:
    - "10:00 - Deployment started"
    - "10:15 - First alerts fired"
    - "10:20 - IC assigned"
    - "10:45 - Root cause identified"
    - "11:00 - Rollback completed"
    - "11:15 - Service recovered"

  root_cause:
    what_happened: Technical explanation
    contributing_factors:
      - Factor 1
      - Factor 2

  detection:
    how_detected: Alert/report/etc
    could_detect_sooner: Yes/No, how

  response:
    what_worked: Good parts of response
    what_didnt: Areas for improvement
    response_time: Was it appropriate?

  action_items:
    - item: "Add validation for X"
      owner: "@name"
      priority: P1
      due: "2 weeks"

    - item: "Improve alert for Y"
      owner: "@name"
      priority: P2
      due: "1 month"

  lessons_learned:
    - Lesson 1
    - Lesson 2

Metrics

Incident Metrics

metrics_to_track:
  detection:
    - Time to detect (TTD)
    - Detection source breakdown

  response:
    - Time to respond (TTR)
    - Time to mitigate
    - Time to resolve

  impact:
    - Downtime minutes
    - Users affected
    - Revenue impact
    - SLO impact

  trends:
    - Incidents per week/month
    - Severity distribution
    - Repeat incidents
    - Action item completion rate
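
These roll up easily if incidents are recorded with timestamps. A minimal sketch computing mean time to detect and resolve; the single record mirrors the postmortem timeline above, and the field names are assumptions about how incidents are stored.

from datetime import datetime
from statistics import mean

# Illustrative record only; date and field names are placeholders.
incidents = [
    {"started": "2021-11-01T10:00", "detected": "2021-11-01T10:15",
     "resolved": "2021-11-01T11:15"},  # deploy started, alerts fired, service recovered
]

def minutes_between(start: str, end: str) -> float:
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds() / 60

ttd = [minutes_between(i["started"], i["detected"]) for i in incidents]
ttr = [minutes_between(i["detected"], i["resolved"]) for i in incidents]
print(f"mean time to detect:  {mean(ttd):.0f} min")
print(f"mean time to resolve: {mean(ttr):.0f} min")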

Key Takeaways

Incidents are inevitable. How you handle them defines your reliability culture.