The current crisis has tested every organization’s ability to continue operating. Engineering teams that prepared for disruption are weathering it better than those who assumed normalcy would continue.
Here’s how to build engineering practices that ensure business continuity.
## What Business Continuity Means

### Beyond Disaster Recovery
Disaster recovery is about systems. Business continuity is about the business—including people, processes, and capabilities.
```yaml
disaster_recovery:
  - Server fails → failover to backup
  - Database corrupted → restore from backup
  - Region down → route to other regions

business_continuity:
  - Key person unavailable → others can do the work
  - Office inaccessible → work continues remotely
  - Normal processes impossible → alternatives exist
  - Vendor unavailable → backup options ready
```
### Key Questions
For every critical function, ask:
1. What happens if this stops for an hour? A day? A week?
2. Who can perform this function? What if they're unavailable?
3. What systems are required? What if they're down?
4. What external dependencies exist? What if they fail?
5. What's the minimum viable operation?
## Technical Resilience

### Eliminate Single Points of Failure
```yaml
single_points:
  infrastructure:
    - One database server → Add replica
    - One availability zone → Multi-AZ deployment
    - One region → Multi-region where critical
    - One cloud provider → Multi-cloud (consider; high effort)
  access:
    - One VPN gateway → Multiple gateways
    - One authentication provider → Backup auth
    - One network path → Redundant connectivity
  operations:
    - One person knows the system → Document and cross-train
    - One way to deploy → Multiple paths
    - One monitoring system → Redundant alerting
```
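Most of these mitigations are infrastructure work, but the operations items can be sketched in code. Below is a minimal Python sketch of the "redundant alerting" idea: page through a primary provider and fall back to a backup if the call fails. The webhook URLs and payload shape are placeholders, not any particular vendor's API.

```python
import requests  # third-party HTTP client: pip install requests

# Placeholder webhook endpoints for two independent alerting providers.
PRIMARY_ALERT_WEBHOOK = "https://alerts.primary.example.com/v1/notify"
BACKUP_ALERT_WEBHOOK = "https://alerts.backup.example.com/v1/notify"

def send_alert(summary: str, severity: str = "critical") -> str:
    """Page via the primary provider; fall back to the backup if it fails."""
    payload = {"summary": summary, "severity": severity}
    for name, url in [("primary", PRIMARY_ALERT_WEBHOOK), ("backup", BACKUP_ALERT_WEBHOOK)]:
        try:
            response = requests.post(url, json=payload, timeout=5)
            response.raise_for_status()
            return name  # delivered through this provider
        except requests.RequestException:
            continue  # unreachable or errored; try the next provider
    raise RuntimeError("All alerting providers failed; fall back to the phone tree")

if __name__ == "__main__":
    print("Alert delivered via:", send_alert("Checkout error rate above 5%"))
```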
### Infrastructure Access
Ensure teams can reach production:
```yaml
access_continuity:
  vpn:
    - Capacity for 100% remote workforce
    - Multiple regions/endpoints
    - Tested under load
  authentication:
    - Works from any network
    - Backup methods (MFA options)
    - Emergency access procedures
  tools:
    - Cloud-accessible or mirrored
    - No dependency on office network
    - Documented access procedures
```
### Deployment Capability
Can you ship code from anywhere?
```yaml
deployment_requirements:
  ci_cd:
    - Hosted CI/CD (GitHub Actions, etc.)
    - No dependency on office infrastructure
    - Secrets accessible securely
  artifacts:
    - Cloud-hosted registries
    - Redundant storage
    - Access from anywhere
  production_access:
    - Secure remote access
    - Audit logging
    - Emergency procedures documented
```
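As one concrete piece of the checklist above, keeping secrets in a cloud secret manager means a deploy can fetch them from any network. The sketch below assumes AWS Secrets Manager via boto3; the secret name and its JSON fields are hypothetical.

```python
import json

import boto3  # AWS SDK for Python: pip install boto3
from botocore.exceptions import ClientError

def load_deploy_credentials(secret_id: str = "prod/deploy/registry") -> dict:
    """Fetch deployment credentials from AWS Secrets Manager.

    Works from any network where AWS credentials are available (a CI runner
    role, an engineer's SSO session); nothing depends on the office network.
    """
    client = boto3.client("secretsmanager")
    try:
        response = client.get_secret_value(SecretId=secret_id)
    except ClientError as err:
        raise RuntimeError(f"Could not read secret {secret_id}: {err}") from err
    return json.loads(response["SecretString"])

if __name__ == "__main__":
    creds = load_deploy_credentials()
    print("Loaded registry credentials for:", creds.get("username"))
```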
## Knowledge Continuity

### Documentation
What happens if key people are unavailable?
```yaml
critical_documentation:
  architecture:
    - System diagrams
    - Data flows
    - Integration points
  operations:
    - Runbooks for common tasks
    - Incident response procedures
    - Escalation paths
  access:
    - How to get credentials
    - Who can grant access
    - Emergency procedures
  decisions:
    - Why we built it this way
    - Trade-offs considered
    - Context for future changes
```
### Cross-Training
Reduce key-person dependencies:
```yaml
practices:
  rotation:
    - Rotate on-call across team
    - Different people deploy each time
    - Pair programming on critical systems
  shadowing:
    - Juniors shadow seniors on operations
    - Document while shadowing
    - Gradually increase responsibility
  exercises:
    - Regular "wheel of misfortune" drills
    - Random person handles incident
    - Expose gaps, improve docs
```
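To make the "wheel of misfortune" drills in the list above routine, even the selection can be automated. A tiny Python sketch, with a placeholder roster and scenario list:

```python
import random

# Placeholder roster and scenarios; replace with your team's own.
TEAM = ["Alice", "Bob", "Carol", "Dana", "Evan"]
SCENARIOS = [
    "Primary database failover",
    "Expired TLS certificate on the API gateway",
    "Payment provider returning 500s",
    "CI/CD credentials revoked",
]

def spin_wheel_of_misfortune(seed=None) -> tuple[str, str]:
    """Pick a random responder and scenario for this week's drill."""
    rng = random.Random(seed)
    return rng.choice(TEAM), rng.choice(SCENARIOS)

if __name__ == "__main__":
    responder, scenario = spin_wheel_of_misfortune()
    print(f"{responder} handles: {scenario}")
    print("Everyone else observes and notes documentation gaps.")
```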
### Bus Factor Analysis

```markdown
## Bus Factor Assessment

For each critical system/function:

| System/Function | Primary | Backup | Documented | Bus Factor |
|-----------------|---------|--------|------------|------------|
| Payment processing | Alice | Bob | Yes | 2 |
| Customer DB admin | Alice | None | No | 1 ⚠️ |
| CI/CD pipeline | Bob | Carol | Yes | 2 |
| Incident response | Everyone | - | Yes | 6 |

Action items:
- Customer DB: Cross-train Bob, document procedures
```
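The assessment is easy to keep honest with a small script that flags anything with a bus factor below 2 or missing documentation. A minimal Python sketch mirroring the example table above (names beyond the table are placeholders):

```python
# Ownership map mirroring the example assessment table.
COVERAGE = {
    "Payment processing": {"people": ["Alice", "Bob"], "documented": True},
    "Customer DB admin": {"people": ["Alice"], "documented": False},
    "CI/CD pipeline": {"people": ["Bob", "Carol"], "documented": True},
    "Incident response": {"people": ["Alice", "Bob", "Carol", "Dana", "Evan", "Frank"],
                          "documented": True},
}

def bus_factor_actions(coverage: dict) -> list[str]:
    """Return action items for systems with bus factor < 2 or missing docs."""
    actions = []
    for system, info in coverage.items():
        bus_factor = len(info["people"])
        if bus_factor < 2:
            actions.append(f"{system}: cross-train a backup (bus factor {bus_factor})")
        if not info["documented"]:
            actions.append(f"{system}: write runbooks and access documentation")
    return actions

if __name__ == "__main__":
    for item in bus_factor_actions(COVERAGE):
        print("TODO:", item)
```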
## Process Continuity

### Remote-Ready Processes
Can your processes work without an office?
```yaml
evaluate:
  meetings:
    - Can all be done via video?
    - Tooling in place?
    - Time zones accommodated?
  collaboration:
    - Can design work happen remotely?
    - Code review process works async?
    - Documentation accessible?
  communication:
    - Async channels established?
    - Urgent escalation works?
    - Decisions can be made?
  onboarding:
    - New hires can be effective remotely?
    - Equipment delivery works?
    - Training is accessible?
```
### Minimum Viable Operations
What’s essential vs. nice-to-have?
```yaml
critical:
  - Production systems running
  - Security incidents responded to
  - Customer-impacting issues fixed
  - Critical bugs addressed

important:
  - New features shipped
  - Technical debt addressed
  - Documentation improved
  - Training and development

deferrable:
  - Nice-to-have features
  - Cosmetic improvements
  - Long-term projects
  - Non-critical optimization
```
## Testing Continuity

### Regular Drills
Practice before you need it:
```yaml
drills:
  technical:
    - Failover testing (monthly)
    - Backup restoration (quarterly)
    - Disaster recovery (annually)
  operational:
    - Remote work day (monthly)
    - Key person unavailable simulation
    - Vendor failure scenario
  incident:
    - Tabletop exercises
    - Game day chaos testing
    - Communication drills
```
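The quarterly backup-restoration drill above is the one most worth scripting, so it actually runs. A minimal Python sketch, assuming PostgreSQL tooling and a scratch database; the connection string, dump path, and `customers` table are placeholders:

```python
import subprocess

# Placeholder connection string and backup path.
SCRATCH_DB_URL = "postgresql://drill_user@scratch-db.internal/restore_drill"
BACKUP_FILE = "/backups/latest/app.dump"

def run_restore_drill() -> None:
    """Restore the latest backup into a scratch database and sanity-check it."""
    # --clean --if-exists drops any objects already in the scratch DB before restoring.
    subprocess.run(
        ["pg_restore", "--clean", "--if-exists", "--no-owner",
         f"--dbname={SCRATCH_DB_URL}", BACKUP_FILE],
        check=True,
    )
    # Sanity check: the restored data should contain at least one customer row.
    result = subprocess.run(
        ["psql", SCRATCH_DB_URL, "-t", "-c", "SELECT count(*) FROM customers;"],
        check=True, capture_output=True, text=True,
    )
    row_count = int(result.stdout.strip())
    assert row_count > 0, "Restore produced an empty customers table"
    print(f"Restore drill passed: {row_count} customer rows")

if __name__ == "__main__":
    run_restore_drill()
```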
### Post-Event Learning
When things go wrong, learn:
```markdown
## Post-Incident Review

### What happened?
Timeline of events

### What went well?
- Monitoring detected issue quickly
- Runbook worked as documented

### What didn't work?
- Backup contact was out of date
- Took 20 minutes to find credentials

### Action items
- [ ] Update contact information (Owner: Alice, Due: Friday)
- [ ] Move credentials to password manager (Owner: Bob, Due: Next week)
```
## Vendor and Dependency Continuity

### Critical Vendor Assessment

```markdown
## Vendor Continuity Assessment

| Vendor | Criticality | Alternative | Time to Switch | Notes |
|--------|-------------|-------------|----------------|-------|
| AWS | High | GCP/Azure | Months | Multi-region helps |
| Stripe | High | Braintree | Weeks | Abstract payment layer |
| PagerDuty | Medium | Opsgenie | Days | Export runbooks |
| Jira | Low | Linear | Weeks | Export data possible |
```
### Dependency Isolation
Reduce blast radius of vendor issues:
```yaml
practices:
  - Abstract third-party integrations
  - Queue between your system and vendors
  - Graceful degradation when vendor down
  - Caching to survive temporary outages
  - Multiple providers where critical
```
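The first and third practices fit naturally together: put a thin interface in front of each vendor, then fail over or degrade behind it. A generic Python sketch (not any specific payment vendor's API; the provider classes and queue helper are illustrative):

```python
from typing import Protocol

class PaymentProvider(Protocol):
    """Thin abstraction so business code never imports a vendor SDK directly."""
    def charge(self, amount_cents: int, token: str) -> str: ...

class ProviderUnavailable(Exception):
    pass

class PrimaryProvider:
    def charge(self, amount_cents: int, token: str) -> str:
        # The primary vendor's SDK/API call goes here (omitted in this sketch).
        raise ProviderUnavailable("primary gateway timed out")

class BackupProvider:
    def charge(self, amount_cents: int, token: str) -> str:
        # The backup vendor's SDK/API call goes here (omitted in this sketch).
        return "backup-charge-123"

def charge_with_fallback(providers: list[PaymentProvider], amount_cents: int, token: str) -> str:
    """Try each provider in order; degrade gracefully instead of failing the checkout."""
    for provider in providers:
        try:
            return provider.charge(amount_cents, token)
        except ProviderUnavailable:
            continue
    # Last resort: queue the charge for retry so the customer flow still completes.
    enqueue_for_retry(amount_cents, token)
    return "queued"

def enqueue_for_retry(amount_cents: int, token: str) -> None:
    print(f"Queued charge of {amount_cents} cents for later retry")

if __name__ == "__main__":
    print(charge_with_fallback([PrimaryProvider(), BackupProvider()], 2500, "tok_test"))
```

The same pattern applies to email, SMS, or alerting vendors: business code depends on the interface, and switching or adding a provider touches one module.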
## Communication During Crisis

### Clear Channels
```yaml
channels:
  urgent:
    - PagerDuty for incidents
    - Phone tree for emergencies
    - SMS for critical updates
  regular:
    - Slack for day-to-day
    - Email for formal communication
    - Video for synchronous discussion
  external:
    - Status page for customers
    - Email updates for stakeholders
    - Social media for public communication
```
### Escalation Paths
```yaml
escalation:
  level_1: On-call engineer
  level_2: Engineering manager
  level_3: VP Engineering
  level_4: Executive team

criteria:
  - Customer impact duration
  - Revenue impact
  - Data security issues
  - Public attention
```
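The criteria above can be encoded so on-call engineers don't have to debate levels mid-incident. A minimal Python sketch with illustrative thresholds (the cutoffs are placeholders, not a recommendation):

```python
def escalation_level(customer_impact_minutes: int, revenue_impact: bool,
                     data_security_issue: bool, public_attention: bool) -> int:
    """Map the escalation criteria to a level; thresholds are illustrative."""
    if data_security_issue or public_attention:
        return 4  # executive team immediately
    if revenue_impact or customer_impact_minutes >= 120:
        return 3  # VP Engineering
    if customer_impact_minutes >= 30:
        return 2  # engineering manager
    return 1      # on-call engineer handles it

if __name__ == "__main__":
    print(escalation_level(customer_impact_minutes=45, revenue_impact=False,
                           data_security_issue=False, public_attention=False))  # -> 2
```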
## Key Takeaways
- Business continuity is about people and processes, not just systems
- Identify single points of failure in people, knowledge, and systems
- Ensure remote access to everything needed for operations
- Document critical knowledge; don’t let it live only in heads
- Cross-train team members; bus factor should be at least 2
- Test continuity regularly through drills and exercises
- Know your critical vs. deferrable work for reduced capacity
- Assess vendor dependencies and have alternatives
- Clear communication channels and escalation paths are essential
- Learn from every incident to improve resilience
Crisis reveals what was already weak. Build resilience before you need it, and your team will handle disruption with confidence.