Site Reliability Engineering: Core Principles

April 30, 2018

Site Reliability Engineering (SRE) emerged from Google’s need to run services at massive scale. The core insight: apply software engineering approaches to operations problems. Since the publication of Google’s SRE book in 2016, the practice has spread widely.

Here are the core principles that make SRE work.

Foundational Concepts

Service Level Objectives (SLOs)

SLOs define the target reliability for a service:

SLO: 99.9% of requests succeed with latency < 200ms

Components:

Service Level Indicators (SLIs): What you measure

Service Level Objectives (SLOs): Your targets

Service Level Agreements (SLAs): External commitments
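
To make the distinction concrete, here is a minimal Python sketch of measuring an SLI over a window of requests and checking it against the SLO above; the Request record and function names are illustrative, not from any particular monitoring system.

# Illustrative sketch: compute an SLI over a window of requests and compare
# it to the SLO "99.9% of requests succeed with latency < 200ms".
# The Request record and field names are hypothetical.
from dataclasses import dataclass

@dataclass
class Request:
    ok: bool           # did the request succeed?
    latency_ms: float  # observed latency in milliseconds

def sli_good_ratio(requests: list[Request], latency_budget_ms: float = 200.0) -> float:
    """SLI: the fraction of requests that succeeded within the latency budget."""
    if not requests:
        return 1.0
    good = sum(1 for r in requests if r.ok and r.latency_ms < latency_budget_ms)
    return good / len(requests)

def meets_slo(requests: list[Request], slo_target: float = 0.999) -> bool:
    """SLO check: is the measured SLI at or above the target?"""
    return sli_good_ratio(requests) >= slo_target

window = [Request(ok=True, latency_ms=120.0), Request(ok=False, latency_ms=950.0)]
print(sli_good_ratio(window), meets_slo(window))  # 0.5 False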

Error Budgets

The error budget is the difference between perfect reliability and your SLO:

If SLO = 99.9% availability
Error budget = 0.1% (about 8.7 hours/year of downtime)

How error budgets work:

When budget is healthy: keep shipping features and taking calculated risks; release velocity can stay high.

When budget is depleted: slow or freeze feature releases and spend the engineering effort on reliability until the budget recovers.

Error budgets align incentives. Development wants features. Operations wants stability. Error budgets make the tradeoff explicit and data-driven.
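
As a rough illustration of that tradeoff, the sketch below turns an availability SLO into an error budget and uses the remaining budget to gate feature releases; the one-line policy and function names are hypothetical, not a prescribed rule.

# Illustrative sketch: derive the error budget from the SLO and gate releases
# on how much of it remains. The release policy here is an example only.
HOURS_PER_YEAR = 24 * 365  # 8760

def error_budget_hours(slo: float, window_hours: float = HOURS_PER_YEAR) -> float:
    """Allowed downtime in the window, e.g. 99.9% -> about 8.76 hours/year."""
    return (1.0 - slo) * window_hours

def budget_remaining(slo: float, downtime_hours: float,
                     window_hours: float = HOURS_PER_YEAR) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget_hours(slo, window_hours)
    return (budget - downtime_hours) / budget

def can_ship_features(slo: float, downtime_hours: float) -> bool:
    """Example policy: release normally while any budget remains."""
    return budget_remaining(slo, downtime_hours) > 0.0

print(error_budget_hours(0.999))       # ~8.76 hours of downtime per year
print(can_ship_features(0.999, 6.0))   # True: budget still healthy
print(can_ship_features(0.999, 9.5))   # False: budget spent, focus on reliability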

Embracing Risk

100% reliability is impossible and uneconomic. The question is: how much reliability is enough?

Reliability costs: each additional nine demands more redundancy, more careful engineering, and more operational overhead, for an improvement users may never notice.

Consider: what level of reliability do users actually experience through their own networks and devices, and what could the team build instead with the effort the next nine would consume?

Different services need different reliability. A payment system needs higher reliability than an internal analytics dashboard.

Toil Elimination

What Is Toil?

Toil is work that is manual, repetitive, automatable, tactical, and devoid of enduring value, and that scales linearly as the service grows.

Examples: manually restarting services, applying routine configuration changes by hand, working through recurring tickets.

Not toil: writing automation, designing systems, running postmortems, and other engineering work that produces lasting value.

Why Eliminate Toil?

Toil doesn’t scale. If you spend 50% of time on toil today, what happens when you have 10x more services?

SRE target: Less than 50% time on toil, at least 50% on engineering.

Automation Strategy

Order of preference:

  1. Eliminate the need: Can we not do this at all?
  2. Automate fully: No human involvement
  3. Automate with human trigger: One-click execution
  4. Automate partially: Some manual steps remain
  5. Document: If you can’t automate it, at least document the procedure

Automation decisions: weigh the engineering cost of building and maintaining the automation against how often the task recurs and how error-prone it is by hand.

Monitoring and Alerting

The Four Golden Signals

Monitor these for any service:

Latency: Time to respond to requests

Traffic: Demand on your service

Errors: Rate of failed requests

Saturation: How full your service is
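
As a rough sketch of what computing these looks like, the function below derives all four signals from one window of request samples; the sample format and capacity figure are hypothetical, and a real system would usually read these from a metrics backend instead.

# Illustrative sketch: derive the four golden signals from one window of
# request samples. Samples look like {"latency_ms": 42.0, "error": False}.
def golden_signals(samples: list[dict], window_seconds: float,
                   capacity_qps: float) -> dict:
    latencies = sorted(s["latency_ms"] for s in samples)
    p99_index = max(0, int(0.99 * len(latencies)) - 1)
    qps = len(samples) / window_seconds
    return {
        "latency_p99_ms": latencies[p99_index] if latencies else 0.0,          # Latency
        "traffic_qps": qps,                                                     # Traffic
        "error_rate": sum(s["error"] for s in samples) / max(len(samples), 1),  # Errors
        "saturation": qps / capacity_qps,                                       # Saturation
    }

window = [{"latency_ms": 35.0, "error": False}, {"latency_ms": 480.0, "error": True}]
print(golden_signals(window, window_seconds=60.0, capacity_qps=100.0))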

Alerting Philosophy

Alert on symptoms, not causes:

# Bad - alerting on cause
Alert: CPU > 80%

# Good - alerting on symptom
Alert: Request latency P99 > 500ms

Every alert should be actionable: if the person paged cannot do anything useful in response, it should not page anyone.

Page only for user-impacting issues: everything else belongs in tickets or dashboards that can wait for working hours.
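
One common way to keep paging tied to user impact is to alert on how fast the error budget is burning rather than on resource metrics. Below is a minimal single-window sketch of that idea in Python; the threshold and function names are illustrative, not a standard.

# Illustrative sketch: page on a user-visible symptom (error-budget burn rate)
# rather than on a cause like CPU. The threshold here is an example only.
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being spent.
    At burn rate 1.0 the budget lasts exactly the SLO window."""
    budget_rate = 1.0 - slo          # e.g. 0.001 for a 99.9% SLO
    return error_rate / budget_rate

def should_page(error_rate: float, slo: float, threshold: float = 10.0) -> bool:
    """Page only when users are being hurt fast enough to need a human now."""
    return burn_rate(error_rate, slo) >= threshold

print(should_page(error_rate=0.0005, slo=0.999))  # False: within budget pace
print(should_page(error_rate=0.02,   slo=0.999))  # True: burning 20x too fast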

Reducing Alert Fatigue

Symptoms of alert fatigue: pages are acknowledged and ignored, noisy alerts get silenced without follow-up, and responders stop believing that a page means real trouble.

Solutions: delete or tune noisy alerts, tie thresholds to the SLO, consolidate duplicates, and route non-urgent issues to tickets instead of pages.

Release Engineering

Progressive Rollouts

Don’t ship to 100% immediately:

Canary (1%) → Stage 1 (10%) → Stage 2 (50%) → Full (100%)

At each stage: watch error rates, latency, and the other key SLIs, let the change soak long enough for problems to surface, and roll back immediately if metrics regress (a minimal rollout loop is sketched below).
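
A minimal sketch of such a rollout loop, assuming hypothetical deploy_to_fraction, slis_look_healthy, and rollback hooks that wrap your real deployment and monitoring systems:

# Illustrative sketch of a staged rollout: push to a small slice, check the
# key SLIs, and stop (or roll back) at the first sign of regression.
import time

STAGES = [0.01, 0.10, 0.50, 1.00]   # canary -> 10% -> 50% -> full

def progressive_rollout(version: str,
                        deploy_to_fraction,
                        slis_look_healthy,
                        rollback,
                        soak_seconds: int = 600) -> bool:
    for fraction in STAGES:
        deploy_to_fraction(version, fraction)   # widen the rollout
        time.sleep(soak_seconds)                # let problems surface
        if not slis_look_healthy():             # symptom check, not CPU
            rollback(version)                   # undo everything, investigate
            return False
    return True                                  # reached 100% safely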

Rollback Capability

Fast rollback is essential:

# One command to roll back
kubectl rollout undo deployment/api

# Or: deploy previous version
deploy --version=v1.2.3

Rollback should be: fast, well rehearsed, and safe to run without debate, ideally a single automated command like the ones above.

Change Management

Not all changes are equal:

Change type          Risk      Rollback
Feature flag         Low       Instant
Config change        Medium    Fast
Code deploy          Medium    Minutes
Schema migration     High      Complicated
Infrastructure       High      Hours

Higher risk changes need: more review, a staged rollout plan, a tested rollback path, and a scheduled window with someone watching the metrics.

Capacity Planning

Organic Growth

Forecast based on historical growth:

Current: 1000 QPS
Growth rate: 10% per month
In 12 months: 1000 * 1.1^12 ≈ 3138 QPS

Add headroom for: traffic spikes, forecast error, and the loss of a zone or a handful of instances (N+1 redundancy); a small forecasting sketch follows below.
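
Here is the growth arithmetic above as a small sketch with example headroom factors; the 30% spike allowance and the four-zone redundancy factor are illustrative numbers, not recommendations.

# Illustrative sketch of the organic-growth forecast above, with headroom
# for spikes, forecast error, and losing a zone (N+1). Figures are examples.
def forecast_qps(current_qps: float, monthly_growth: float, months: int) -> float:
    return current_qps * (1 + monthly_growth) ** months

def provisioned_qps(expected_qps: float,
                    spike_headroom: float = 0.3,    # e.g. 30% for bursts
                    redundancy_factor: float = 4/3) -> float:
    """Capacity to provision so peak load still fits with one of four zones down."""
    return expected_qps * (1 + spike_headroom) * redundancy_factor

expected = forecast_qps(1000, 0.10, 12)                   # ~3138 QPS in 12 months
print(round(expected), round(provisioned_qps(expected)))  # ~3138, ~5440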

Inorganic Growth

Step changes in demand: product launches, marketing campaigns, large new customers, or expansion into new markets.

Plan for these explicitly. Organic models don’t capture step changes.

Capacity Testing

Verify capacity assumptions:

Load testing: Can we handle expected load?

Stress testing: Where do we break?

Soak testing: Does performance degrade over time?
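
For illustration, here is a toy closed-loop load test in Python; the target URL is a placeholder, and real capacity testing normally relies on dedicated load-generation tooling rather than a script like this.

# Illustrative toy load test: fire requests at a target, then report error
# rate and p99 latency. The URL and concurrency are placeholders.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def timed_request(url: str) -> tuple[bool, float]:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            ok = resp.status == 200
    except Exception:
        ok = False
    return ok, (time.monotonic() - start) * 1000  # latency in ms

def load_test(url: str, total_requests: int = 500, concurrency: int = 20) -> None:
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_request, [url] * total_requests))
    latencies = sorted(ms for _, ms in results)
    error_rate = sum(1 for ok, _ in results if not ok) / total_requests
    p99 = latencies[int(0.99 * len(latencies)) - 1]
    print(f"error rate: {error_rate:.2%}, p99 latency: {p99:.0f} ms")

# load_test("http://staging.example.com/healthz")  # placeholder target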

Incident Management

Incident Response

When incidents happen:

  1. Mitigate: Stop the bleeding first
  2. Communicate: Keep stakeholders informed
  3. Investigate: Find root cause after mitigation
  4. Fix: Address root cause
  5. Learn: Postmortem and improvement

Blameless Postmortems

Focus on systems, not individuals:

What happened? (Timeline)

Why did it happen? (Contributing factors)

How can we prevent recurrence? (Action items)

Never: “Bob made a mistake.” Instead: “The deployment process allowed an untested change to reach production.”

On-Call Practices

Sustainable on-call: keep the rotation large enough that no one is paged too often, cap the share of time any engineer spends on call, allow recovery time after rough shifts, and compensate on-call work.

Implementing SRE

Start Small

Don’t try to adopt everything at once:

  1. SLOs first: Define what reliability means
  2. Error budgets: Create shared understanding
  3. Monitoring: Measure what matters
  4. Toil reduction: Automate one thing at a time

Organizational Models

Embedded SRE: SREs within product teams

Centralized SRE: Separate SRE organization

Hybrid: Consulting SREs with embedded on-call

Start with what fits your organization.

Scaling SRE

As you grow: standardize SLO definitions and tooling across services, build shared platforms so automation is written once, and formalize how product teams engage SRE.

Key Takeaways

SRE is a mindset as much as a set of practices. Apply engineering approaches to operational problems.