Site Reliability Engineering (SRE) emerged from Google’s need to run services at massive scale. The core insight: apply software engineering approaches to operations problems. Since Google published its SRE book in 2016, the practice has spread widely.
Here are the core principles that make SRE work.
Foundational Concepts
Service Level Objectives (SLOs)
SLOs define the target reliability for a service:
SLO: 99.9% of requests succeed with latency < 200ms
Components:
Service Level Indicators (SLIs): What you measure
- Request success rate
- Latency percentiles
- Throughput
- Error rate
Service Level Objectives (SLOs): Your targets
- 99.9% availability
- P99 latency < 200ms
- Error rate < 0.1%
Service Level Agreements (SLAs): External commitments
- Contractual promises (usually looser than SLOs)
- Financial consequences for breach
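To make the SLI/SLO distinction concrete, here is a minimal Python sketch of how an availability SLI and a latency SLI might be computed from raw request records. The `Request` record and the 200ms threshold are illustrative assumptions, not a prescribed schema; in practice these numbers usually come from your metrics pipeline rather than in-process lists.

```python
from dataclasses import dataclass

@dataclass
class Request:            # hypothetical per-request record for illustration
    success: bool
    latency_ms: float

def availability_sli(requests: list[Request]) -> float:
    """Fraction of requests that succeeded."""
    if not requests:
        return 1.0
    return sum(r.success for r in requests) / len(requests)

def latency_sli(requests: list[Request], threshold_ms: float = 200.0) -> float:
    """Fraction of successful requests answered under the latency threshold."""
    ok = [r for r in requests if r.success]
    if not ok:
        return 1.0
    return sum(r.latency_ms < threshold_ms for r in ok) / len(ok)

# 3 of 4 requests succeed; 2 of the 3 successes come in under 200ms.
sample = [Request(True, 120), Request(True, 180), Request(True, 350), Request(False, 90)]
print(availability_sli(sample))               # 0.75
print(latency_sli(sample, threshold_ms=200))  # ~0.67
```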
Error Budgets
The error budget is the difference between perfect reliability and your SLO:
If SLO = 99.9% availability
Error budget = 0.1% (about 8.7 hours/year of downtime)
How error budgets work:
When budget is healthy:
- Ship features faster
- Take more risks
- Experiment
When budget is depleted:
- Focus on reliability
- Reduce change velocity
- Fix systemic issues
Error budgets align incentives. Development wants features. Operations wants stability. Error budgets make the tradeoff explicit and data-driven.
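To make the arithmetic concrete, here is a small sketch (an illustration, not a standard tool) that turns an SLO into an error budget and tracks how much of it has been spent:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total downtime (in minutes) the SLO allows over the window."""
    return (1.0 - slo) * window_days * 24 * 60

def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget

# 99.9% over a year allows roughly 8.7 hours of downtime.
print(error_budget_minutes(0.999, window_days=365) / 60)  # ~8.76 hours
# 20 minutes of downtime against a 30-day 99.9% SLO (~43 minutes of budget).
print(budget_remaining(0.999, downtime_minutes=20))       # ~0.54 of the budget left
```

When the remaining budget trends toward zero, the policy above applies: slow change velocity and invest in reliability work.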
Embracing Risk
100% reliability is impossible and uneconomic. The question is: how much reliability is enough?
Reliability costs:
- More redundancy
- More testing
- Slower releases
- Higher complexity
Consider:
- What reliability do users actually need?
- What’s the cost of additional reliability?
- Where’s the point of diminishing returns?
Different services need different reliability. A payment system needs higher reliability than an internal analytics dashboard.
Toil Elimination
What Is Toil?
Toil is work that:
- Is manual
- Is repetitive
- Can be automated
- Scales with service size
- Lacks enduring value
- Is interrupt-driven
Examples:
- Manually provisioning resources
- Responding to alerts that have known fixes
- Running deployment scripts
- Manual data migrations
Not toil:
- Designing systems
- Writing code
- Training
- Strategic planning
Why Eliminate Toil?
Toil doesn’t scale. If you spend 50% of your time on toil today, what happens when you have 10x more services?
SRE target: less than 50% of time on toil, at least 50% on engineering work.
Automation Strategy
Order of preference:
- Eliminate the need: Can we not do this at all?
- Automate fully: No human involvement
- Automate with human trigger: One-click execution
- Automate partially: Some manual steps remain
- Document: If you can’t automate, at least document the procedure
Automation decisions:
- How often does this happen?
- How long does it take?
- How error-prone is manual execution?
- How long would automation take?
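The questions above can be folded into a rough back-of-the-envelope check. This sketch (the numbers and the 12-month horizon are assumptions for illustration) compares toil saved against the cost of building the automation:

```python
def automation_pays_off(times_per_month: float, manual_minutes: float,
                        build_hours: float, horizon_months: int = 12) -> bool:
    """Does the toil removed over the horizon exceed the time to automate it?
    (Ignores how error-prone manual execution is, which usually tips the
    scale further toward automating.)"""
    toil_hours_saved = times_per_month * manual_minutes / 60 * horizon_months
    return toil_hours_saved > build_hours

# A 20-minute task run 15 times a month is ~60 hours of toil per year,
# so spending a week automating it pays for itself well within the year.
print(automation_pays_off(times_per_month=15, manual_minutes=20, build_hours=40))  # True
```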
Monitoring and Alerting
The Four Golden Signals
Monitor these for any service:
Latency: Time to respond to requests
- Distinguish successful from failed requests
- Track percentiles, not just averages
Traffic: Demand on your service
- Requests per second
- Transactions per second
- Network I/O
Errors: Rate of failed requests
- HTTP 5xx
- Application errors
- Explicit failure modes
Saturation: How full your service is
- Memory usage
- CPU usage
- Queue depth
- I/O utilization
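A minimal sketch of summarizing the four signals for one observation window might look like the following. The input shapes are assumptions; in a real system these values come from your monitoring stack (counters and histograms), not in-process lists.

```python
import math

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile, pct in 0..100."""
    ordered = sorted(values)
    rank = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[rank]

def golden_signals(latencies_ms: list[float], error_count: int, request_count: int,
                   window_seconds: float, cpu_utilization: float) -> dict:
    """One window's worth of latency, traffic, errors, and saturation."""
    return {
        "latency_p50_ms": percentile(latencies_ms, 50),
        "latency_p99_ms": percentile(latencies_ms, 99),
        "traffic_qps": request_count / window_seconds,
        "error_rate": error_count / request_count if request_count else 0.0,
        "saturation_cpu": cpu_utilization,  # use whichever resource constrains you
    }

print(golden_signals(latencies_ms=[12, 15, 18, 22, 450],
                     error_count=1, request_count=5,
                     window_seconds=60, cpu_utilization=0.62))
```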
Alerting Philosophy
Alert on symptoms, not causes:
# Bad - alerting on cause
Alert: CPU > 80%
# Good - alerting on symptom
Alert: Request latency P99 > 500ms
Every alert should be actionable:
- If you can’t do anything, don’t alert
- Alerts should require intelligent human response
- Automatable responses shouldn’t be alerts
Page only for user-impacting issues:
- Is this affecting users now?
- Will it affect users soon if not addressed?
- If neither, it can wait for business hours
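Condensed into a single decision, a paging check might look roughly like this sketch. The thresholds mirror the latency example above and the 0.1% error objective, but they are assumptions to adapt to your own SLOs.

```python
def should_page(p99_latency_ms: float, error_rate: float,
                latency_objective_ms: float = 500.0,
                error_objective: float = 0.001) -> bool:
    """Page only when a user-visible symptom breaches its objective.
    Causes like CPU or queue depth stay on dashboards until they show up here."""
    return p99_latency_ms > latency_objective_ms or error_rate > error_objective

print(should_page(p99_latency_ms=620, error_rate=0.0004))  # True: users are feeling it
print(should_page(p99_latency_ms=180, error_rate=0.0004))  # False: can wait for business hours
```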
Reducing Alert Fatigue
Symptoms of alert fatigue:
- Alerts being ignored
- Pages during every on-call shift
- “Normal” alerts
- Autopilot responses
Solutions:
- Fix underlying issues (not just silence alerts)
- Tune thresholds with data
- Automate responses
- Remove non-actionable alerts
Release Engineering
Progressive Rollouts
Don’t ship to 100% immediately:
Canary (1%) → Stage 1 (10%) → Stage 2 (50%) → Full (100%)
At each stage:
- Monitor error rates and latency
- Compare to baseline
- Automatic rollback on regression
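Sketched as code, the staged rollout loop looks like the following. The `set_traffic_fraction` and `regressed` hooks are stand-ins (assumptions) for your deployment tooling and your canary analysis against the baseline.

```python
import time

STAGES = [0.01, 0.10, 0.50, 1.00]  # canary -> stage 1 -> stage 2 -> full

def progressive_rollout(set_traffic_fraction, regressed, soak_seconds: int = 600) -> bool:
    """Walk the stages, holding at each one long enough to collect metrics,
    and roll back automatically if the canary analysis reports a regression."""
    for fraction in STAGES:
        set_traffic_fraction(fraction)
        time.sleep(soak_seconds)       # let error rates and latency accumulate
        if regressed():                # e.g. error rate or p99 worse than baseline
            set_traffic_fraction(0.0)  # automatic rollback to the old version
            return False
    return True

# Example wiring with stubs:
# progressive_rollout(lambda f: print(f"routing {f:.0%}"), lambda: False, soak_seconds=1)
```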
Rollback Capability
Fast rollback is essential:
# One command to roll back
kubectl rollout undo deployment/api
# Or: deploy previous version
deploy --version=v1.2.3
Rollback should be:
- Fast (minutes, not hours)
- Safe (tested, not improvised)
- Available (always deployable)
Change Management
Not all changes are equal:
| Change Type | Risk | Rollback |
|---|---|---|
| Feature flag | Low | Instant |
| Config change | Medium | Fast |
| Code deploy | Medium | Minutes |
| Schema migration | High | Complicated |
| Infrastructure | High | Hours |
Higher risk changes need:
- More review
- Slower rollout
- Better rollback plans
- More monitoring
Capacity Planning
Organic Growth
Forecast based on historical growth:
Current: 1000 QPS
Growth rate: 10% per month
In 12 months: 1000 * 1.1^12 ≈ 3138 QPS
Add headroom for:
- Traffic spikes
- Degradation scenarios
- Launch events
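The forecast above, plus headroom, is simple compounding. Here is a sketch; the 30% headroom figure is an assumption for illustration, not a rule.

```python
def forecast_qps(current_qps: float, monthly_growth: float, months: int,
                 headroom: float = 0.30) -> float:
    """Compound the organic growth rate, then add headroom for spikes,
    degraded-capacity scenarios, and launch events."""
    organic = current_qps * (1 + monthly_growth) ** months
    return organic * (1 + headroom)

# 1000 QPS growing 10%/month: ~3138 QPS organic, ~4080 QPS with 30% headroom.
print(forecast_qps(1000, 0.10, 12))
```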
Inorganic Growth
Step changes in demand:
- New customer launches
- Marketing campaigns
- Viral events
- Acquisitions
Plan for these explicitly. Organic models don’t capture step changes.
Capacity Testing
Verify capacity assumptions:
- Load testing: Can we handle expected load?
- Stress testing: Where do we break?
- Soak testing: Does performance degrade over time?
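A very rough load-test harness can be sketched with just the standard library. This is an illustration under stated assumptions: the `http_get` client in the usage comment is hypothetical, the pacing is crude, and real load testing normally uses a dedicated tool.

```python
import concurrent.futures
import time

def load_test(call, target_qps: int, duration_s: int = 10) -> dict:
    """Drive `call` (one request against the system under test) at roughly
    target_qps for duration_s seconds, then report latency and error counts."""
    def one_request():
        start = time.monotonic()
        try:
            call()
            return time.monotonic() - start, False
        except Exception:
            return time.monotonic() - start, True

    futures = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            futures += [pool.submit(one_request) for _ in range(target_qps)]
            time.sleep(1)  # crude pacing: submit one batch per second

    results = [f.result() for f in futures]
    latencies = sorted(lat for lat, _ in results)
    return {
        "requests": len(results),
        "errors": sum(err for _, err in results),
        "p99_s": latencies[max(0, int(0.99 * len(latencies)) - 1)] if latencies else None,
    }

# Example: point `call` at your client, e.g.
# load_test(lambda: http_get("https://your-service/healthz"), target_qps=200)
```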
Incident Management
Incident Response
When incidents happen:
- Mitigate: Stop the bleeding first
- Communicate: Keep stakeholders informed
- Investigate: Find root cause after mitigation
- Fix: Address root cause
- Learn: Postmortem and improvement
Blameless Postmortems
Focus on systems, not individuals:
- What happened? (Timeline)
- Why did it happen? (Contributing factors)
- How can we prevent recurrence? (Action items)
Never: “Bob made a mistake.” Instead: “The deployment process allowed an untested change to reach production.”
On-Call Practices
Sustainable on-call:
- Rotation duration (1 week typical)
- Response time expectations
- Compensation (time off, pay)
- Escalation paths
- Secondary on-call
Implementing SRE
Start Small
Don’t try to adopt everything at once:
- SLOs first: Define what reliability means
- Error budgets: Create shared understanding
- Monitoring: Measure what matters
- Toil reduction: Automate one thing at a time
Organizational Models
- Embedded SRE: SREs within product teams
- Centralized SRE: Separate SRE organization
- Hybrid: Consulting SREs with embedded on-call
Start with what fits your organization.
Scaling SRE
As you grow:
- Document practices
- Train new SREs
- Build tooling
- Create playbooks
- Share learnings
Key Takeaways
- SLOs define target reliability; error budgets make tradeoffs explicit
- Eliminate toil through automation; target <50% toil time
- Monitor the four golden signals: latency, traffic, errors, saturation
- Alert on symptoms that require human response; reduce alert fatigue
- Use progressive rollouts with automatic rollback capability
- Plan capacity for both organic and inorganic growth
- Conduct blameless postmortems focused on systemic improvements
- Start with SLOs and error budgets; expand practices over time
SRE is a mindset as much as a set of practices. Apply engineering approaches to operational problems.