Site Reliability Engineering (SRE) emerged from Google’s need to run services at massive scale. The core insight: apply software engineering approaches to operations problems. Since Google published its SRE book in 2016, the practice has spread widely.
Here are the core principles that make SRE work.
Foundational Concepts
Service Level Objectives (SLOs)
SLOs define the target reliability for a service:
SLO: 99.9% of requests succeed with latency < 200ms
Components:
Service Level Indicators (SLIs): What you measure
- Request success rate
- Latency percentiles
- Throughput
- Error rate
Service Level Objectives (SLOs): Your targets
- 99.9% availability
- P99 latency < 200ms
- Error rate < 0.1%
Service Level Agreements (SLAs): External commitments
- Contractual promises (usually looser than SLOs)
- Financial consequences for breach
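To make the SLI/SLO distinction concrete, here is a minimal Python sketch of how an availability SLI and a latency SLI might be computed from raw request records. The `Request` record and the 200ms threshold are illustrative assumptions, not a prescribed schema; in practice these numbers usually come from your metrics pipeline rather than in-process lists.

```python
from dataclasses import dataclass

@dataclass
class Request:            # hypothetical per-request record for illustration
    success: bool
    latency_ms: float

def availability_sli(requests: list[Request]) -> float:
    """Fraction of requests that succeeded."""
    if not requests:
        return 1.0
    return sum(r.success for r in requests) / len(requests)

def latency_sli(requests: list[Request], threshold_ms: float = 200.0) -> float:
    """Fraction of successful requests answered under the latency threshold."""
    ok = [r for r in requests if r.success]
    if not ok:
        return 1.0
    return sum(r.latency_ms < threshold_ms for r in ok) / len(ok)

# 3 of 4 requests succeed; 2 of the 3 successes come in under 200ms.
sample = [Request(True, 120), Request(True, 180), Request(True, 350), Request(False, 90)]
print(availability_sli(sample))               # 0.75
print(latency_sli(sample, threshold_ms=200))  # ~0.67
```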
Error Budgets
The error budget is the difference between perfect reliability and your SLO:
If SLO = 99.9% availability
Error budget = 0.1% (about 8.7 hours/year of downtime)
How error budgets work:
When budget is healthy:
- Ship features faster
- Take more risks
- Experiment
When budget is depleted:
- Focus on reliability
- Reduce change velocity
- Fix systemic issues
Error budgets align incentives. Development wants features. Operations wants stability. Error budgets make the tradeoff explicit and data-driven.
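To make the arithmetic concrete, here is a small sketch (an illustration, not a standard tool) that turns an SLO into an error budget and tracks how much of it has been spent:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total downtime (in minutes) the SLO allows over the window."""
    return (1.0 - slo) * window_days * 24 * 60

def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget

# 99.9% over a year allows roughly 8.7 hours of downtime.
print(error_budget_minutes(0.999, window_days=365) / 60)  # ~8.76 hours
# 20 minutes of downtime against a 30-day 99.9% SLO (~43 minutes of budget).
print(budget_remaining(0.999, downtime_minutes=20))       # ~0.54 of the budget left
```

When the remaining budget trends toward zero, the policy above applies: slow change velocity and invest in reliability work.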
Embracing Risk
100% reliability is impossible and uneconomic. The question is: how much reliability is enough?
Reliability costs:
- More redundancy
- More testing
- Slower releases
- Higher complexity
Consider:
- What reliability do users actually need?
- What’s the cost of additional reliability?
- Where’s the point of diminishing returns?
Different services need different reliability. A payment system needs higher reliability than an internal analytics dashboard.
Toil Elimination
What Is Toil?
Toil is work that:
- Is manual
- Is repetitive
- Can be automated
- Scales with service size
- Lacks enduring value
- Is interrupt-driven
Examples:
- Manually provisioning resources
- Responding to alerts that have known fixes
- Running deployment scripts
- Manual data migrations
Not toil:
- Designing systems
- Writing code
- Training
- Strategic planning
Why Eliminate Toil?
Toil doesn’t scale. If you spend 50% of your time on toil today, what happens when you have 10x more services?
SRE target: less than 50% of time on toil, at least 50% on engineering work.
Automation Strategy
Order of preference:
- Eliminate the need: Can we not do this at all?
- Automate fully: No human involvement
- Automate with human trigger: One-click execution
- Automate partially: Some manual steps remain
- Document: If you can’t automate, at least document the procedure
Automation decisions:
- How often does this happen?
- How long does it take?
- How error-prone is manual execution?
- How long would automation take?
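The questions above can be folded into a rough back-of-the-envelope check. This sketch (the numbers and the 12-month horizon are assumptions for illustration) compares toil saved against the cost of building the automation:

```python
def automation_pays_off(times_per_month: float, manual_minutes: float,
                        build_hours: float, horizon_months: int = 12) -> bool:
    """Does the toil removed over the horizon exceed the time to automate it?
    (Ignores how error-prone manual execution is, which usually tips the
    scale further toward automating.)"""
    toil_hours_saved = times_per_month * manual_minutes / 60 * horizon_months
    return toil_hours_saved > build_hours

# A 20-minute task run 15 times a month is ~60 hours of toil per year,
# so spending a week automating it pays for itself well within the year.
print(automation_pays_off(times_per_month=15, manual_minutes=20, build_hours=40))  # True
```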
Monitoring and Alerting
The Four Golden Signals
Monitor these for any service:
Latency: Time to respond to requests
- Distinguish successful from failed requests
- Track percentiles, not just averages
Traffic: Demand on your service
- Requests per second
- Transactions per second
- Network I/O
Errors: Rate of failed requests
- HTTP 5xx
- Application errors
- Explicit failure modes
Saturation: How full your service is
- Memory usage
- CPU usage
- Queue depth
- I/O utilization
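A minimal sketch of summarizing the four signals for one observation window might look like the following. The input shapes are assumptions; in a real system these values come from your monitoring stack (counters and histograms), not in-process lists.

```python
import math

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile, pct in 0..100."""
    ordered = sorted(values)
    rank = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[rank]

def golden_signals(latencies_ms: list[float], error_count: int, request_count: int,
                   window_seconds: float, cpu_utilization: float) -> dict:
    """One window's worth of latency, traffic, errors, and saturation."""
    return {
        "latency_p50_ms": percentile(latencies_ms, 50),
        "latency_p99_ms": percentile(latencies_ms, 99),
        "traffic_qps": request_count / window_seconds,
        "error_rate": error_count / request_count if request_count else 0.0,
        "saturation_cpu": cpu_utilization,  # use whichever resource constrains you
    }

print(golden_signals(latencies_ms=[12, 15, 18, 22, 450],
                     error_count=1, request_count=5,
                     window_seconds=60, cpu_utilization=0.62))
```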
Alerting Philosophy
Alert on symptoms, not causes:
# Bad - alerting on cause
Alert: CPU > 80%
# Good - alerting on symptom
Alert: Request latency P99 > 500ms
Every alert should be actionable:
- If you can’t do anything, don’t alert
- Alerts should require intelligent human response
- Automatable responses shouldn’t be alerts
Page only for user-impacting issues:
- Is this affecting users now?
- Will it affect users soon if not addressed?
- If neither, it can wait for business hours
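Condensed into a single decision, a paging check might look roughly like this sketch. The thresholds mirror the latency example above and the 0.1% error objective, but they are assumptions to adapt to your own SLOs.

```python
def should_page(p99_latency_ms: float, error_rate: float,
                latency_objective_ms: float = 500.0,
                error_objective: float = 0.001) -> bool:
    """Page only when a user-visible symptom breaches its objective.
    Causes like CPU or queue depth stay on dashboards until they show up here."""
    return p99_latency_ms > latency_objective_ms or error_rate > error_objective

print(should_page(p99_latency_ms=620, error_rate=0.0004))  # True: users are feeling it
print(should_page(p99_latency_ms=180, error_rate=0.0004))  # False: can wait for business hours
```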
Reducing Alert Fatigue
Symptoms of alert fatigue:
- Alerts being ignored
- Pages during every on-call shift
- “Normal” alerts
- Autopilot responses
Solutions:
- Fix underlying issues (not just silence alerts)
- Tune thresholds with data
- Automate responses
- Remove non-actionable alerts
Release Engineering
Progressive Rollouts
Don’t ship to 100% immediately:
Canary (1%) → Stage 1 (10%) → Stage 2 (50%) → Full (100%)
At each stage:
- Monitor error rates and latency
- Compare to baseline
- Automatic rollback on regression
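Sketched as code, the staged rollout loop looks like the following. The `set_traffic_fraction` and `regressed` hooks are stand-ins (assumptions) for your deployment tooling and your canary analysis against the baseline.

```python
import time

STAGES = [0.01, 0.10, 0.50, 1.00]  # canary -> stage 1 -> stage 2 -> full

def progressive_rollout(set_traffic_fraction, regressed, soak_seconds: int = 600) -> bool:
    """Walk the stages, holding at each one long enough to collect metrics,
    and roll back automatically if the canary analysis reports a regression."""
    for fraction in STAGES:
        set_traffic_fraction(fraction)
        time.sleep(soak_seconds)       # let error rates and latency accumulate
        if regressed():                # e.g. error rate or p99 worse than baseline
            set_traffic_fraction(0.0)  # automatic rollback to the old version
            return False
    return True

# Example wiring with stubs:
# progressive_rollout(lambda f: print(f"routing {f:.0%}"), lambda: False, soak_seconds=1)
```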
Rollback Capability
Fast rollback is essential:
# One command to roll back
kubectl rollout undo deployment/api
# Or: deploy previous version
deploy --version=v1.2.3
Rollback should be:
- Fast (minutes, not hours)
- Safe (tested, not improvised)
- Available (always deployable)
Change Management
Not all changes are equal:
| Change Type | Risk | Rollback |
|---|---|---|
| Feature flag | Low | Instant |
| Config change | Medium | Fast |
| Code deploy | Medium | Minutes |
| Schema migration | High | Complicated |
| Infrastructure | High | Hours |
Higher risk changes need:
- More review
- Slower rollout
- Better rollback plans
- More monitoring
Capacity Planning
Organic Growth
Forecast based on historical growth:
Current: 1000 QPS
Growth rate: 10% per month
In 12 months: 1000 * 1.1^12 ≈ 3138 QPS
Add headroom for:
- Traffic spikes
- Degradation scenarios
- Launch events
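The forecast above, plus headroom, is simple compounding. Here is a sketch; the 30% headroom figure is an assumption for illustration, not a rule.

```python
def forecast_qps(current_qps: float, monthly_growth: float, months: int,
                 headroom: float = 0.30) -> float:
    """Compound the organic growth rate, then add headroom for spikes,
    degraded-capacity scenarios, and launch events."""
    organic = current_qps * (1 + monthly_growth) ** months
    return organic * (1 + headroom)

# 1000 QPS growing 10%/month: ~3138 QPS organic, ~4080 QPS with 30% headroom.
print(forecast_qps(1000, 0.10, 12))
```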
Inorganic Growth
Step changes in demand:
- New customer launches
- Marketing campaigns
- Viral events
- Acquisitions
Plan for these explicitly. Organic models don’t capture step changes.
Capacity Testing
Verify capacity assumptions:
- Load testing: Can we handle expected load?
- Stress testing: Where do we break?
- Soak testing: Does performance degrade over time?
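A very rough load-test harness can be sketched with just the standard library. This is an illustration under stated assumptions: the `http_get` client in the usage comment is hypothetical, the pacing is crude, and real load testing normally uses a dedicated tool.

```python
import concurrent.futures
import time

def load_test(call, target_qps: int, duration_s: int = 10) -> dict:
    """Drive `call` (one request against the system under test) at roughly
    target_qps for duration_s seconds, then report latency and error counts."""
    def one_request():
        start = time.monotonic()
        try:
            call()
            return time.monotonic() - start, False
        except Exception:
            return time.monotonic() - start, True

    futures = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            futures += [pool.submit(one_request) for _ in range(target_qps)]
            time.sleep(1)  # crude pacing: submit one batch per second

    results = [f.result() for f in futures]
    latencies = sorted(lat for lat, _ in results)
    return {
        "requests": len(results),
        "errors": sum(err for _, err in results),
        "p99_s": latencies[max(0, int(0.99 * len(latencies)) - 1)] if latencies else None,
    }

# Example: point `call` at your client, e.g.
# load_test(lambda: http_get("https://your-service/healthz"), target_qps=200)
```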
Incident Management
Incident Response
When incidents happen:
- Mitigate: Stop the bleeding first
- Communicate: Keep stakeholders informed
- Investigate: Find root cause after mitigation
- Fix: Address root cause
- Learn: Postmortem and improvement
Blameless Postmortems
Focus on systems, not individuals:
- What happened? (Timeline)
- Why did it happen? (Contributing factors)
- How can we prevent recurrence? (Action items)
Never: “Bob made a mistake.” Instead: “The deployment process allowed an untested change to reach production.”
On-Call Practices
Sustainable on-call:
- Rotation duration (1 week typical)
- Response time expectations
- Compensation (time off, pay)
- Escalation paths
- Secondary on-call
Implementing SRE
Start Small
Don’t try to adopt everything at once:
- SLOs first: Define what reliability means
- Error budgets: Create shared understanding
- Monitoring: Measure what matters
- Toil reduction: Automate one thing at a time
Organizational Models
- Embedded SRE: SREs within product teams
- Centralized SRE: Separate SRE organization
- Hybrid: Consulting SREs with embedded on-call
Start with what fits your organization.
Scaling SRE
As you grow:
- Document practices
- Train new SREs
- Build tooling
- Create playbooks
- Share learnings
Key Takeaways
- SLOs define target reliability; error budgets make tradeoffs explicit
- Eliminate toil through automation; target <50% toil time
- Monitor the four golden signals: latency, traffic, errors, saturation
- Alert on symptoms that require human response; reduce alert fatigue
- Use progressive rollouts with automatic rollback capability
- Plan capacity for both organic and inorganic growth
- Conduct blameless postmortems focused on systemic improvements
- Start with SLOs and error budgets; expand practices over time
SRE is a mindset as much as a set of practices. Apply engineering approaches to operational problems.