Service Level Objectives (SLOs) have become standard practice, but many implementations miss the point. Teams set targets that don’t reflect user experience, create error budgets they don’t use, and generate dashboards nobody watches.
Here’s how to implement SLOs that actually improve reliability.
Choosing the Right SLIs
Measure What Users Experience
Bad SLI: Server CPU utilization
Good SLI: Request success rate as seen by users
Bad SLI: Database query latency
Good SLI: Page load time as experienced by users
Bad SLI: Pod health status
Good SLI: Successful checkout completion rate
Users don’t care about your infrastructure metrics.
Core Request SLIs
For most services, four measurements (adapted from the four golden signals) cover the basics:
Availability: Percentage of successful requests
SLI = successful_requests / total_requests
Latency: Response time distribution
SLI = requests_under_threshold / total_requests
Throughput: Requests handled per second
SLI = requests_per_second (capacity indicator)
Error Rate: Percentage of errors
SLI = error_requests / total_requests
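As a rough illustration (not tied to any particular monitoring stack), here is how those four SLIs fall out of raw request counts over a single window; all of the numbers are invented:

```python
# Illustrative sketch: the four request-based SLIs computed from
# hypothetical raw counts over one measurement window.

def compute_slis(total: int, successful: int, under_threshold: int,
                 errors: int, window_seconds: float) -> dict:
    """Return the four SLIs for a single measurement window."""
    return {
        "availability": successful / total,        # successful_requests / total_requests
        "latency_sli": under_threshold / total,    # requests_under_threshold / total_requests
        "throughput_rps": total / window_seconds,  # capacity indicator, not a ratio
        "error_rate": errors / total,              # error_requests / total_requests
    }

# Example: a 5-minute window with made-up numbers.
print(compute_slis(total=120_000, successful=119_800,
                   under_threshold=118_500, errors=200,
                   window_seconds=300))
```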
User Journey SLIs
Beyond technical metrics, measure user journeys:
Checkout Journey SLI:
- Cart loads: 99.9%
- Payment processed: 99.5%
- Order confirmed: 99.9%
- Confirmation email: 99.0%
Combined journey success: 99.9% × 99.5% × 99.9% × 99.0% = 98.3%
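A minimal sketch of the journey arithmetic, using the checkout numbers above: the journey only succeeds if every step succeeds, so the combined SLI is the product of the per-step success rates.

```python
# Illustrative sketch: combined journey SLI as the product of step SLIs.
checkout_steps = {
    "cart_loads": 0.999,
    "payment_processed": 0.995,
    "order_confirmed": 0.999,
    "confirmation_email": 0.990,
}

journey_sli = 1.0
for step, success_rate in checkout_steps.items():
    journey_sli *= success_rate

print(f"Combined checkout journey success: {journey_sli:.1%}")  # ~98.3%
```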
Setting Realistic Targets
Start with Reality
Measure current performance before setting targets:
Week 1-4: Measure current availability
Result: 99.7% average, 99.2% worst week
Target options:
- 99.9% (aspirational, requires investment)
- 99.5% (achievable with current system)
- 99.7% (matches current reality)
Don’t set 99.99% if you can’t meet 99.9%.
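A small sketch of that reasoning, with made-up weekly measurements matching the example above:

```python
# Illustrative sketch: derive candidate SLO targets from measured availability
# instead of picking a number first. Weekly figures are invented.
weekly_availability = [0.999, 0.998, 0.992, 0.999]  # four weeks of measurement

average = sum(weekly_availability) / len(weekly_availability)
worst = min(weekly_availability)

print(f"Average: {average:.1%}, worst week: {worst:.1%}")
# Candidate targets, per the example above:
# - at the average (99.7%): matches current reality
# - between worst week and average (99.5%): achievable with the current system
# - above the average (99.9%): aspirational, requires investment
```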
Consider the User
Different services need different reliability:
| Service | SLO | Rationale |
|---|---|---|
| Payment processing | 99.95% | High business impact |
| Product catalog | 99.5% | Degraded experience acceptable |
| Admin dashboard | 99.0% | Internal, lower priority |
| Analytics | 95.0% | Async, can catch up |
Account for Dependencies
If Service A calls Service B calls Service C:
- Service C: 99.9% availability
- Service B: 99.9% × 99.9% = 99.8%
- Service A: 99.8% × 99.9% = 99.7%
Service A can't be more reliable than its dependencies.
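A quick sketch of the dependency arithmetic, assuming hard serial dependencies (every downstream call must succeed):

```python
# Illustrative sketch: a service with hard serial dependencies can, at best,
# be as available as the product of its own and its dependencies' availability.

def chained_availability(own: float, dependencies: list[float]) -> float:
    """Upper bound on availability for a service with hard serial dependencies."""
    result = own
    for dep in dependencies:
        result *= dep
    return result

service_c = 0.999
service_b = chained_availability(0.999, [service_c])  # ~99.8%
service_a = chained_availability(0.999, [service_b])  # ~99.7%

print(f"Service B ceiling: {service_b:.2%}")
print(f"Service A ceiling: {service_a:.2%}")
```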
Error Budgets That Work
Calculating Error Budgets
SLO: 99.9% availability
Error budget: 0.1% = 43.2 minutes per month
SLO: 99.5% availability
Error budget: 0.5% = 3.6 hours per month
SLO: 99.0% availability
Error budget: 1.0% = 7.2 hours per month
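The conversion is straightforward; a small sketch, assuming a 30-day month:

```python
# Illustrative sketch: convert an availability SLO into allowed downtime per month.

def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime (minutes) per period for a given availability SLO."""
    total_minutes = days * 24 * 60
    return (1.0 - slo) * total_minutes

for slo in (0.999, 0.995, 0.99):
    minutes = error_budget_minutes(slo)
    print(f"SLO {slo:.1%}: {minutes:.1f} minutes (~{minutes / 60:.1f} hours) per month")
# 99.9% -> 43.2 minutes, 99.5% -> 3.6 hours, 99.0% -> 7.2 hours
```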
Using Error Budgets
Error budgets are a policy tool:
Budget Remaining → Actions
Healthy (>50%):
- Ship features normally
- Experiment with new approaches
- Accept calculated risks
Depleting (25-50%):
- Slow down feature velocity
- Focus on reliability improvements
- More careful change management
Critical (<25%):
- Feature freeze
- All hands on reliability
- Post-mortem on consumption
Exhausted (0%):
- Complete freeze
- Emergency reliability work only
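One way to make the policy unambiguous is to encode it, so "what do we do now?" always has one agreed answer. A minimal sketch, using the thresholds above:

```python
# Illustrative sketch: map remaining error budget to the agreed policy stage.

def budget_policy(remaining_fraction: float) -> str:
    """Map remaining error budget (0.0-1.0) to an action."""
    if remaining_fraction > 0.50:
        return "healthy: ship normally, accept calculated risks"
    if remaining_fraction > 0.25:
        return "depleting: slow feature velocity, focus on reliability"
    if remaining_fraction > 0.0:
        return "critical: feature freeze, all hands on reliability"
    return "exhausted: complete freeze, emergency reliability work only"

print(budget_policy(0.67))  # healthy
print(budget_policy(0.30))  # depleting
print(budget_policy(0.10))  # critical
```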
Error Budget Burns
Track how fast budget is consumed:
Weekly burn rate = errors_this_week / weekly_budget
If burn rate > 1: Consuming faster than budget allows
If burn rate < 0.5: Could ship faster
If burn rate ≈ 1: Sustainable pace
Alert on burn rate, not just budget remaining.
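A sketch of the burn-rate arithmetic with invented request counts and an assumed 99.9% SLO:

```python
# Illustrative sketch: burn rate = error budget consumed / budget allotted
# for the same period.

def burn_rate(failed_requests: int, total_requests: int, slo: float) -> float:
    """How fast the error budget is being consumed relative to the allowed rate."""
    allowed_failure_fraction = 1.0 - slo              # e.g. 0.001 for a 99.9% SLO
    actual_failure_fraction = failed_requests / total_requests
    return actual_failure_fraction / allowed_failure_fraction

weekly = burn_rate(failed_requests=900, total_requests=1_000_000, slo=0.999)
print(f"Weekly burn rate: {weekly:.1f}x")  # 0.9x: roughly sustainable
# > 1: consuming faster than the budget allows; < 0.5: room to ship faster
```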
Implementation
Monitoring Setup
```yaml
# Prometheus recording rules
groups:
  - name: slos
    rules:
      - record: slo:requests:success_rate
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
      - record: slo:latency:p99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          )
```
Dashboards
SLO Dashboard Layout:
```
┌────────────────────────────────────────┐
│ Current Availability: 99.97%           │
│ Target: 99.9%                          │
│ Status: ✓ Meeting SLO                  │
├────────────────────────────────────────┤
│ Error Budget Remaining: 70%            │
│ ██████████████░░░░░░                   │
│ Burn Rate: 0.8x (sustainable)          │
├────────────────────────────────────────┤
│ 30-Day Trend                           │
│ [availability over time graph]         │
└────────────────────────────────────────┘
```
Alerting
Alert on error budget burn rate, not on instantaneous SLI dips. The example below assumes recording rules such as slo:error_budget:burn_rate:1h have been defined elsewhere as the error rate over that window divided by the error budget fraction:
```yaml
# Multi-window burn rate alert
- alert: SLOBurnRateTooHigh
  expr: |
    (
      slo:error_budget:burn_rate:1h > 14.4
      and
      slo:error_budget:burn_rate:5m > 14.4
    ) or (
      slo:error_budget:burn_rate:6h > 6
      and
      slo:error_budget:burn_rate:30m > 6
    )
  annotations:
    summary: "Error budget burn rate too high"
    description: "Error budget burn rate is {{ $value }}x the sustainable rate"
```
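The 14.4 and 6 multipliers follow the common multi-window pattern: each pairs a long window with the fraction of a 30-day budget you are willing to see consumed in it (2% in 1 hour, 5% in 6 hours), while the short window confirms the burn is still happening. A sketch of that arithmetic:

```python
# Illustrative sketch: derive burn-rate alert thresholds from
# "fraction of monthly budget spent within a window".
HOURS_PER_MONTH = 30 * 24  # 720

def burn_rate_threshold(budget_fraction_spent: float, window_hours: float) -> float:
    """Burn rate at which the given budget fraction is consumed within the window."""
    return budget_fraction_spent * HOURS_PER_MONTH / window_hours

print(burn_rate_threshold(0.02, 1))  # 14.4 -> 2% of the budget in 1 hour
print(burn_rate_threshold(0.05, 6))  # 6.0  -> 5% of the budget in 6 hours
```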
SLO Review Process
Weekly Review
1. Current SLI vs SLO
2. Error budget remaining
3. Incidents this week
4. Budget consumption trend
5. Action items
Quarterly Review
1. Was the SLO appropriate?
   - Too tight (constantly breaking)?
   - Too loose (never challenged)?
2. Did the SLI reflect user experience?
   - User complaints vs SLI dips?
   - Gaps in measurement?
3. Were error budgets useful?
   - Did they influence decisions?
   - Were policies followed?
4. Adjustments needed?
Common Mistakes
Too Many SLOs
Bad: 47 SLOs covering every metric
Good: 3-5 SLOs covering critical user journeys
More SLOs mean less focus.
Measuring Symptoms, Not Experience
Bad SLO: Database cluster has 3 healthy nodes
Good SLO: Database queries succeed within 100ms
Users don’t care about your node count.
Ignoring Error Budgets
Bad: "Error budget is just a number, we ship anyway"
Good: "We're at 30% budget, let's slow down and fix reliability"
Error budgets only work if they have consequences.
Set and Forget
SLOs need regular review:
- Business requirements change
- Systems evolve
- User expectations shift
Key Takeaways
- Measure what users experience, not infrastructure metrics
- Start with current reality; don’t set aspirational targets you can’t meet
- Different services need different reliability levels
- Error budgets are policy tools; define actions for each budget state
- Alert on burn rate, not instantaneous SLI dips
- Keep SLOs focused (3-5 per service)
- Review SLOs quarterly and adjust as needed
SLOs work when they reflect user needs and drive actual decisions. Otherwise, they’re just dashboards.