Effective SLOs: Beyond the Basics

May 20, 2019

Service Level Objectives (SLOs) have become standard practice, but many implementations miss the point. Teams set targets that don’t reflect user experience, create error budgets they don’t use, and generate dashboards nobody watches.

Here’s how to implement SLOs that actually improve reliability.

Choosing the Right SLIs

Measure What Users Experience

Bad SLI: Server CPU utilization
Good SLI: Request success rate as seen by users

Bad SLI: Database query latency
Good SLI: Page load time as experienced by users

Bad SLI: Pod health status
Good SLI: Successful checkout completion rate

Users don’t care about your infrastructure metrics.

The Four Golden Signals

For most services:

Availability: Percentage of successful requests

SLI = successful_requests / total_requests

Latency: Response time distribution

SLI = requests_under_threshold / total_requests

Throughput: Requests handled per second

SLI = requests_per_second (capacity indicator)

Error Rate: Percentage of errors

SLI = error_requests / total_requests
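
To make the ratios concrete, here is a minimal Python sketch that computes them from a batch of hypothetical request records (the field names, threshold, and numbers are made up for illustration):

# Compute ratio-based SLIs from a batch of request records.
# The record fields (status, duration_ms) are hypothetical.
requests = [
    {"status": 200, "duration_ms": 45},
    {"status": 200, "duration_ms": 320},
    {"status": 503, "duration_ms": 1200},
    {"status": 200, "duration_ms": 80},
]

total = len(requests)
successful = sum(1 for r in requests if r["status"] < 500)
under_threshold = sum(1 for r in requests if r["duration_ms"] < 300)
errors = total - successful

availability_sli = successful / total      # successful_requests / total_requests
latency_sli = under_threshold / total      # requests_under_threshold / total_requests
error_rate = errors / total                # error_requests / total_requests

print(f"availability={availability_sli:.2%} latency={latency_sli:.2%} errors={error_rate:.2%}")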

User Journey SLIs

Beyond technical metrics, measure user journeys:

Checkout Journey SLI:
- Cart loads: 99.9%
- Payment processed: 99.5%
- Order confirmed: 99.9%
- Confirmation email: 99.0%

Combined journey success: 99.9% × 99.5% × 99.9% × 99.0% = 98.3%
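
Because every step must succeed for the journey to succeed, the combined number is the product of the steps. A quick Python sketch of that arithmetic, assuming the steps fail independently:

# Combined success of a serial user journey = product of step success rates,
# assuming each step is required and failures are independent.
journey_steps = {
    "cart_loads": 0.999,
    "payment_processed": 0.995,
    "order_confirmed": 0.999,
    "confirmation_email": 0.990,
}

combined = 1.0
for step, rate in journey_steps.items():
    combined *= rate

print(f"combined journey success: {combined:.1%}")  # ~98.3%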

Setting Realistic Targets

Start with Reality

Measure current performance before setting targets:

Week 1-4: Measure current availability
Result: 99.7% average, 99.2% worst week

Target options:
- 99.9% (aspirational, requires investment)
- 99.5% (achievable with current system)
- 99.7% (matches current reality)

Don’t set 99.99% if you can’t meet 99.9%.
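
A baseline like the one above is just simple arithmetic over your weekly measurements. A small Python sketch with hypothetical numbers:

# Hypothetical weekly availability measurements over a four-week baseline.
weekly_availability = [0.999, 0.998, 0.992, 0.998]

average = sum(weekly_availability) / len(weekly_availability)
worst = min(weekly_availability)

print(f"average: {average:.1%}, worst week: {worst:.1%}")  # 99.7%, 99.2%
# A target between the worst week and the average is achievable today;
# anything above the average requires investment.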

Consider the User

Different services need different reliability:

Service              SLO      Rationale
Payment processing   99.95%   High business impact
Product catalog      99.5%    Degraded experience acceptable
Admin dashboard      99.0%    Internal, lower priority
Analytics            95.0%    Async, can catch up

Account for Dependencies

If Service A calls Service B calls Service C:
- Service C: 99.9% availability
- Service B: 99.9% × 99.9% = 99.8%
- Service A: 99.8% × 99.9% = 99.7%

Service A can't be more reliable than its dependencies.
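
The arithmetic is the same multiplication as the journey SLI, assuming hard (blocking) dependencies and independent failures. A Python sketch:

# Availability of a service that serially depends on others, assuming
# every request needs all dependencies and failures are independent.
def chain_availability(own, dependencies):
    result = own
    for dep in dependencies:
        result *= dep
    return result

service_c = 0.999
service_b = chain_availability(0.999, [service_c])   # ~99.8%
service_a = chain_availability(0.999, [service_b])   # ~99.7%

print(f"B={service_b:.2%} A={service_a:.2%}")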

Error Budgets That Work

Calculating Error Budgets

SLO: 99.9% availability
Error budget: 0.1% = 43.2 minutes per month

SLO: 99.5% availability
Error budget: 0.5% = 3.6 hours per month

SLO: 99.0% availability
Error budget: 1.0% = 7.2 hours per month
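
The budget is simply the allowed failure fraction times the length of the window. A Python sketch, assuming a 30-day (43,200-minute) month:

# Error budget = (1 - SLO) x window length, here a 30-day month.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

for slo in (0.999, 0.995, 0.99):
    budget_minutes = (1 - slo) * MINUTES_PER_MONTH
    print(f"SLO {slo:.3%}: {budget_minutes:.1f} min/month "
          f"({budget_minutes / 60:.1f} h)")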

Using Error Budgets

Error budgets are a policy tool:

Budget Status → Actions

Healthy (>50%):
- Ship features normally
- Experiment with new approaches
- Accept calculated risks

Declining (25-50%):
- Slow down feature velocity
- Focus on reliability improvements
- More careful change management

Critical (<25%):
- Feature freeze
- All hands on reliability
- Post-mortem on consumption

Exhausted (0%):
- Complete freeze
- Emergency reliability work only
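
If the thresholds are written down, the decision can be mechanical rather than renegotiated per release. A Python sketch of that mapping (the tier names and cutoffs mirror the policy above; the function is illustrative, not a real API):

# Map remaining error budget (fraction of the period's budget) to a policy tier.
def budget_policy(budget_remaining):
    if budget_remaining <= 0:
        return "exhausted: complete freeze, emergency reliability work only"
    if budget_remaining < 0.25:
        return "critical: feature freeze, all hands on reliability"
    if budget_remaining < 0.50:
        return "declining: slow feature velocity, focus on reliability"
    return "healthy: ship normally, accept calculated risks"

print(budget_policy(0.67))  # healthy
print(budget_policy(0.30))  # declining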

Error Budget Burns

Track how fast budget is consumed:

Weekly burn rate = errors_this_week / weekly_budget

If burn rate > 1: Consuming faster than budget allows
If burn rate < 0.5: Could ship faster
If burn rate ≈ 1: Sustainable pace

Alert on burn rate, not just budget remaining.
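
A Python sketch of the burn-rate arithmetic, using hypothetical traffic against a 99.9% SLO:

# Weekly burn rate = errors observed this week / errors the budget allows per week.
slo = 0.999
weekly_requests = 10_000_000
weekly_budget = (1 - slo) * weekly_requests   # 10,000 allowed failed requests

errors_this_week = 8_000
burn_rate = errors_this_week / weekly_budget

print(f"burn rate: {burn_rate:.1f}x")  # 0.8x -> sustainable, slightly under budget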

Implementation

Monitoring Setup

# Prometheus recording rules
groups:
  - name: slos
    rules:
      - record: slo:requests:success_rate
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))

      - record: slo:latency:p99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m]))
            by (le)
          )

Dashboards

SLO Dashboard Layout:
┌────────────────────────────────────────┐
│ Current Availability: 99.87%           │
│ Target: 99.9%                          │
│ Status: ✓ Meeting SLO                  │
├────────────────────────────────────────┤
│ Error Budget Remaining: 67%            │
│ ████████████░░░░░░░░                   │
│ Burn Rate: 0.8x (sustainable)          │
├────────────────────────────────────────┤
│ 30-Day Trend                           │
│ [availability over time graph]         │
└────────────────────────────────────────┘

Alerting

Alert on error budget burn, not instantaneous SLI:

# Multi-window burn rate alert
- alert: SLOBurnRateTooHigh
  expr: |
    (
      slo:error_budget:burn_rate:1h > 14.4
      and
      slo:error_budget:burn_rate:5m > 14.4
    ) or (
      slo:error_budget:burn_rate:6h > 6
      and
      slo:error_budget:burn_rate:30m > 6
    )
  annotations:
    summary: "Error budget burn rate too high"
    description: "At current burn rate, error budget exhausted in {{ value }} hours"

SLO Review Process

Weekly Review

1. Current SLI vs SLO
2. Error budget remaining
3. Incidents this week
4. Budget consumption trend
5. Action items

Quarterly Review

1. Was the SLO appropriate?
   - Too tight (constantly breaking)?
   - Too loose (never challenged)?

2. Did the SLI reflect user experience?
   - User complaints vs SLI dips?
   - Gaps in measurement?

3. Were error budgets useful?
   - Did they influence decisions?
   - Were policies followed?

4. Adjustments needed?

Common Mistakes

Too Many SLOs

Bad: 47 SLOs covering every metric
Good: 3-5 SLOs covering critical user journeys

More SLOs means less focus.

Measuring Symptoms, Not Experience

Bad SLO: Database cluster has 3 healthy nodes
Good SLO: Database queries succeed within 100ms

Users don’t care about your node count.

Ignoring Error Budgets

Bad: "Error budget is just a number, we ship anyway"
Good: "We're at 30% budget, let's slow down and fix reliability"

Error budgets only work if they have consequences.

Set and Forget

SLOs need regular review. Traffic patterns, architecture, and user expectations all change, and a target set a year ago may no longer reflect any of them. Revisit your SLOs at least quarterly, using the review process above.

Key Takeaways

SLOs work when they reflect user needs and drive actual decisions. Otherwise, they’re just dashboards.