Business Continuity for Engineering Teams

April 6, 2020

The current crisis has tested every organization’s ability to continue operating. Engineering teams that prepared for disruption are weathering it better than those that assumed normal operations would continue.

Here’s how to build engineering practices that ensure business continuity.

What Business Continuity Means

Beyond Disaster Recovery

Disaster recovery is about systems. Business continuity is about the business—including people, processes, and capabilities.

disaster_recovery:
  - Server fails → failover to backup
  - Database corrupted → restore from backup
  - Region down → route to other regions

business_continuity:
  - Key person unavailable → others can do the work
  - Office inaccessible → work continues remotely
  - Normal processes impossible → alternatives exist
  - Vendor unavailable → backup options ready

Key Questions

For every critical function, ask:
1. What happens if this stops for an hour? A day? A week?
2. Who can perform this function? What if they're unavailable?
3. What systems are required? What if they're down?
4. What external dependencies exist? What if they fail?
5. What's the minimum viable operation?
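
These questions are easier to keep honest when the answers live somewhere structured rather than in people's heads. A minimal sketch in Python, assuming you keep a simple inventory of critical functions (the field names are illustrative, not a standard):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ContinuityAssessment:
    """Answers to the key questions for one critical function."""
    function: str                       # e.g. "Payment processing"
    impact_1h: str                      # what breaks after an hour
    impact_1d: str                      # ...after a day
    impact_1w: str                      # ...after a week
    operators: List[str]                # who can perform this function
    required_systems: List[str]         # systems it depends on
    external_dependencies: List[str]    # vendors, networks, APIs
    minimum_viable: str                 # smallest acceptable level of service

def gaps(a: ContinuityAssessment) -> List[str]:
    """Flag the obvious continuity gaps in one assessment."""
    findings = []
    if len(a.operators) < 2:
        findings.append(f"{a.function}: only one person can run this")
    if not a.minimum_viable:
        findings.append(f"{a.function}: no minimum viable operation defined")
    return findings
```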

Technical Resilience

Eliminate Single Points of Failure

single_points:
  infrastructure:
    - One database server → Add replica
    - One availability zone → Multi-AZ deployment
    - One region → Multi-region where critical
    - One cloud provider → Multi-cloud (high effort; weigh carefully)

  access:
    - One VPN gateway → Multiple gateways
    - One authentication provider → Backup auth
    - One network path → Redundant connectivity

  operations:
    - One person knows the system → Document and cross-train
    - One way to deploy → Multiple paths
    - One monitoring system → Redundant alerting
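
The infrastructure items above can be audited rather than taken on faith. A minimal sketch using boto3, assuming your databases run on AWS RDS and credentials are already configured:

```python
import boto3

def single_az_databases(region: str = "us-east-1") -> list:
    """Return RDS instances that are still a single point of failure."""
    rds = boto3.client("rds", region_name=region)
    response = rds.describe_db_instances()  # pagination omitted for brevity
    return [
        db["DBInstanceIdentifier"]
        for db in response["DBInstances"]
        if not db.get("MultiAZ")
    ]

if __name__ == "__main__":
    for name in single_az_databases():
        print(f"WARNING: {name} runs in a single availability zone")
```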

Infrastructure Access

Ensure teams can reach production:

access_continuity:
  vpn:
    - Capacity for 100% remote workforce
    - Multiple regions/endpoints
    - Tested under load

  authentication:
    - Works from any network
    - Backup methods (MFA options)
    - Emergency access procedures

  tools:
    - Cloud-accessible or mirrored
    - No dependency on office network
    - Documented access procedures
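
"Works from any network" is a claim you can check with a script run from a home connection or a phone hotspot. A small sketch; the hostnames are placeholders for your own gateways:

```python
# Verify that remote-access endpoints are reachable from wherever this runs.
import socket

ENDPOINTS = {
    "vpn-us-east": ("vpn-us-east.example.com", 443),
    "vpn-eu-west": ("vpn-eu-west.example.com", 443),
    "sso":         ("sso.example.com", 443),
}

def check(host: str, port: int, timeout: float = 5.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for name, (host, port) in ENDPOINTS.items():
    status = "ok" if check(host, port) else "UNREACHABLE"
    print(f"{name:12s} {status}")
```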

Deployment Capability

Can you ship code from anywhere?

deployment_requirements:
  ci_cd:
    - Hosted CI/CD (GitHub Actions, etc.)
    - No dependency on office infrastructure
    - Secrets accessible securely

  artifacts:
    - Cloud-hosted registries
    - Redundant storage
    - Access from anywhere

  production_access:
    - Secure remote access
    - Audit logging
    - Emergency procedures documented
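
One way to rehearse this is a script that confirms the pieces needed to ship from outside the office network: hosted CI reachable, artifact registry reachable, deploy secrets retrievable. A sketch assuming GitHub-hosted CI, a private registry at a placeholder URL, and secrets in AWS Secrets Manager:

```python
import urllib.error
import urllib.request
import boto3

def reachable(url: str, timeout: float = 5.0) -> bool:
    """True if the endpoint answers at all (any HTTP status counts)."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True   # got an HTTP response (e.g. 401), so the endpoint is up
    except (urllib.error.URLError, OSError):
        return False

checks = {
    "hosted CI (GitHub)": reachable("https://api.github.com"),
    "artifact registry":  reachable("https://registry.example.com/v2/"),
}

try:  # placeholder secret name -- substitute your own
    boto3.client("secretsmanager").get_secret_value(SecretId="deploy/prod")
    checks["deploy secrets"] = True
except Exception:
    checks["deploy secrets"] = False

for name, ok in checks.items():
    print(f"{name:22s} {'ok' if ok else 'FAILED'}")
```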

Knowledge Continuity

Documentation

What happens if key people are unavailable?

critical_documentation:
  architecture:
    - System diagrams
    - Data flows
    - Integration points

  operations:
    - Runbooks for common tasks
    - Incident response procedures
    - Escalation paths

  access:
    - How to get credentials
    - Who can grant access
    - Emergency procedures

  decisions:
    - Why we built it this way
    - Trade-offs considered
    - Context for future changes

Cross-Training

Reduce key-person dependencies:

practices:
  rotation:
    - Rotate on-call across team
    - Different people deploy each time
    - Pair programming on critical systems

  shadowing:
    - Juniors shadow seniors on operations
    - Document while shadowing
    - Gradually increase responsibility

  exercises:
    - Regular "wheel of misfortune" drills
    - Random person handles incident
    - Expose gaps, improve docs
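
The "wheel of misfortune" drill above needs almost no tooling. A tiny sketch; the names and scenarios are placeholders:

```python
# Pick this month's drill: one responder, one surprise scenario.
import random

engineers = ["Alice", "Bob", "Carol", "Dana", "Evan", "Farah"]
scenarios = [
    "Primary database failover",
    "Expired TLS certificate on the main load balancer",
    "CI/CD pipeline down during a critical hotfix",
    "On-call engineer unreachable mid-incident",
]

responder = random.choice(engineers)
scenario = random.choice(scenarios)
print(f"{responder} handles: {scenario}")
print("Everyone else observes; gaps go straight into the runbook backlog.")
```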

Bus Factor Analysis

## Bus Factor Assessment

For each critical system/function:

| System/Function | Primary | Backup | Documented | Bus Factor |
|-----------------|---------|--------|------------|------------|
| Payment processing | Alice | Bob | Yes | 2 |
| Customer DB admin | Alice | None | No | 1 ⚠️ |
| CI/CD pipeline | Bob | Carol | Yes | 2 |
| Incident response | Everyone | - | Yes | 6 |

Action items:
- Customer DB: Cross-train Bob, document procedures
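
The assessment table is easy to regenerate from a simple ownership map, so it never goes stale. A sketch mirroring the data above (Dana, Evan, and Farah stand in for the rest of the team):

```python
# Compute bus factor from an ownership map and flag the risky rows.
operators = {
    "Payment processing": {"Alice", "Bob"},
    "Customer DB admin":  {"Alice"},
    "CI/CD pipeline":     {"Bob", "Carol"},
    "Incident response":  {"Alice", "Bob", "Carol", "Dana", "Evan", "Farah"},
}

for system, people in sorted(operators.items(), key=lambda kv: len(kv[1])):
    bus_factor = len(people)
    flag = "  <-- single point of failure" if bus_factor < 2 else ""
    print(f"{system:20s} bus factor {bus_factor}{flag}")
```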

Process Continuity

Remote-Ready Processes

Can your processes work without an office?

evaluate:
  meetings:
    - Can all be done via video?
    - Tooling in place?
    - Time zones accommodated?

  collaboration:
    - Can design work happen remotely?
    - Code review process works async?
    - Documentation accessible?

  communication:
    - Async channels established?
    - Urgent escalation works?
    - Decisions can be made?

  onboarding:
    - New hires can be effective remotely?
    - Equipment delivery works?
    - Training is accessible?

Minimum Viable Operations

What’s essential vs. nice-to-have?

critical:
  - Production systems running
  - Security incidents responded to
  - Customer-impacting issues fixed
  - Critical bugs addressed

important:
  - New features shipped
  - Technical debt addressed
  - Documentation improved
  - Training and development

deferrable:
  - Nice-to-have features
  - Cosmetic improvements
  - Long-term projects
  - Non-critical optimization

Testing Continuity

Regular Drills

Practice before you need it:

drills:
  technical:
    - Failover testing (monthly)
    - Backup restoration (quarterly)
    - Disaster recovery (annually)

  operational:
    - Remote work day (monthly)
    - Key person unavailable simulation
    - Vendor failure scenario

  incident:
    - Tabletop exercises
    - Game day chaos testing
    - Communication drills
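
Part of the quarterly backup drill can be automated: before anyone restores anything, confirm that recent backups actually exist. A minimal sketch assuming backups land in an S3 bucket (bucket and prefix are placeholders); the drill itself should still restore one into a scratch environment:

```python
from datetime import datetime, timedelta, timezone
import boto3

def latest_backup_age(bucket: str, prefix: str) -> timedelta:
    """Return how old the newest backup object under the prefix is."""
    s3 = boto3.client("s3")
    objects = s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", [])
    if not objects:
        raise RuntimeError(f"no backups found under s3://{bucket}/{prefix}")
    newest = max(obj["LastModified"] for obj in objects)
    return datetime.now(timezone.utc) - newest

age = latest_backup_age("example-backups", "postgres/daily/")
print(f"Newest backup is {age} old")
assert age < timedelta(days=2), "backup is stale -- restore drill is moot"
```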

Post-Event Learning

When things go wrong, learn:

## Post-Incident Review

### What happened?
Timeline of events

### What went well?
- Monitoring detected issue quickly
- Runbook worked as documented

### What didn't work?
- Backup contact was out of date
- Took 20 minutes to find credentials

### Action items:
- [ ] Update contact information (Owner: Alice, Due: Friday)
- [ ] Move credentials to password manager (Owner: Bob, Due: Next week)

Vendor and Dependency Continuity

Critical Vendor Assessment

## Vendor Continuity Assessment

| Vendor | Criticality | Alternative | Time to Switch | Notes |
|--------|-------------|-------------|----------------|-------|
| AWS | High | GCP/Azure | Months | Multi-region helps |
| Stripe | High | Braintree | Weeks | Abstract payment layer |
| PagerDuty | Medium | Opsgenie | Days | Export runbooks |
| Jira | Low | Linear | Weeks | Export data possible |

Dependency Isolation

Reduce blast radius of vendor issues:

practices:
  - Abstract third-party integrations
  - Queue between your system and vendors
  - Graceful degradation when vendor down
  - Caching to survive temporary outages
  - Multiple providers where critical
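
What "abstract third-party integrations" and "graceful degradation" look like in code: a sketch of a payment layer that tries a primary provider, falls back to a backup, and queues the charge when both are down (the gateway objects stand in for real vendor SDKs such as Stripe or Braintree):

```python
from typing import Protocol, List, Tuple

class PaymentGateway(Protocol):
    def charge(self, customer_id: str, amount_cents: int) -> str: ...

class PaymentService:
    """Wraps vendor SDKs so callers never depend on one provider directly."""

    def __init__(self, primary: PaymentGateway, backup: PaymentGateway):
        self.primary = primary
        self.backup = backup
        self.retry_queue: List[Tuple[str, int]] = []

    def charge(self, customer_id: str, amount_cents: int) -> str:
        for gateway in (self.primary, self.backup):
            try:
                return gateway.charge(customer_id, amount_cents)
            except Exception:  # in production, catch the vendor's specific errors
                continue
        # Both vendors down: degrade gracefully and retry later
        self.retry_queue.append((customer_id, amount_cents))
        return "deferred"
```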

Communication During Crisis

Clear Channels

channels:
  urgent:
    - PagerDuty for incidents
    - Phone tree for emergencies
    - SMS for critical updates

  regular:
    - Slack for day-to-day
    - Email for formal communication
    - Video for synchronous discussion

  external:
    - Status page for customers
    - Email updates for stakeholders
    - Social media for public communication

Escalation Paths

escalation:
  level_1: On-call engineer
  level_2: Engineering manager
  level_3: VP Engineering
  level_4: Executive team

  criteria:
    - Customer impact duration
    - Revenue impact
    - Data security issues
    - Public attention
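
These criteria are worth encoding so the on-call engineer is not deciding under pressure. A sketch with illustrative thresholds (set your own with the leadership team):

```python
def escalation_level(customer_impact_minutes: int,
                     revenue_impacting: bool,
                     data_security_issue: bool,
                     public_attention: bool) -> int:
    """Map incident attributes to the escalation levels above."""
    if data_security_issue or public_attention:
        return 4                      # executive team
    if revenue_impacting or customer_impact_minutes >= 120:
        return 3                      # VP Engineering
    if customer_impact_minutes >= 30:
        return 2                      # engineering manager
    return 1                          # on-call engineer

assert escalation_level(10, False, False, False) == 1
assert escalation_level(45, False, False, False) == 2
assert escalation_level(200, True, False, False) == 3
assert escalation_level(5, False, True, False) == 4
```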

Key Takeaways

Crisis reveals what was already weak. Build resilience before you need it, and your team will handle disruption with confidence.