SLO Cookbook: A Practical Guide to Service Level Objectives
Introduction
Service Level Objectives (SLOs) are a key tool for managing service reliability. This cookbook provides practical guidance for implementing SLOs that actually improve system reliability rather than just creating meaningless metrics.
Fundamentals
The SLI/SLO/SLA Hierarchy
- SLI (Service Level Indicator): A quantifiable measure of service quality
- SLO (Service Level Objective): A target range of values for an SLI
- SLA (Service Level Agreement): A business agreement with consequences for missing SLO targets
Golden Rule of SLOs
Start simple, iterate based on learning. Your first SLOs won't be perfect, and that's okay.
Choosing Good SLIs
The Four Golden Signals
- Latency: How long requests take to complete
- Traffic: How many requests are coming in
- Errors: How many requests are failing
- Saturation: How "full" your service is
SLI Selection Criteria
A good SLI should be:
- Measurable: You can collect data for it
- Meaningful: It reflects user experience
- Controllable: You can take action to improve it
- Proportional: Changes in the SLI reflect changes in user happiness
Service Type Specific SLIs
HTTP APIs
- Request latency: P95 or P99 response time
- Availability: Proportion of successful requests (non-5xx)
- Throughput: Requests per second capacity
# Availability SLI
sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Latency SLI (P95 across all series; buckets are summed by le before the quantile is taken)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
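To make the availability and latency SLIs above concrete, here is a minimal Python sketch of the same calculations on raw samples. The function names, the sample data, and the nearest-rank percentile method are assumptions for illustration; Prometheus estimates quantiles from histogram buckets rather than from raw values.

```python
import math

def availability(status_codes):
    """Proportion of requests that did not return a 5xx status."""
    if not status_codes:
        raise ValueError("no requests observed")
    good = sum(1 for code in status_codes if not 500 <= code <= 599)
    return good / len(status_codes)

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples observed")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical sample of 100 requests
codes = [200] * 97 + [404] + [500, 503]
latencies_s = [0.05] * 90 + [0.2] * 8 + [1.5, 2.0]

print(availability(codes))          # 0.98
print(percentile(latencies_s, 95))  # 0.2
```

Note that the 404 counts as a good request here, matching the non-5xx definition used above.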
Batch Jobs
- Freshness: Age of the last successful run
- Correctness: Proportion of jobs completing successfully
- Throughput: Items processed per unit time
Data Pipelines
- Data freshness: Time since last successful update
- Data quality: Proportion of records passing validation
- Processing latency: Time from ingestion to availability
Anti-Patterns in SLI Selection
❌ Don't use: CPU usage, memory usage, disk space as primary SLIs
- These are symptoms, not user-facing impacts
❌ Don't use: Counts instead of rates or proportions
- "Number of errors" vs "Error rate"
❌ Don't use: Internal metrics that users don't care about
- Queue depth, cache hit rate (unless they directly impact user experience)
Setting SLO Targets
The Art of Target Setting
Setting SLO targets is a balancing act between:
- User happiness: Targets should reflect acceptable user experience
- Business value: Tighter SLOs cost more to maintain
- Current performance: Targets should be achievable but aspirational
Target Setting Strategies
Historical Performance Method
- Collect 4-6 weeks of historical data
- Calculate the worst user-visible performance in that period
- Set your SLO slightly better than the worst acceptable performance
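The baseline step above can be sketched as follows. This only summarizes the historical data (overall availability and the worst single day); choosing the actual target from those numbers remains a judgment call. The function name and the sample counts are hypothetical.

```python
def summarize_baseline(daily_counts):
    """Return (overall_availability, worst_day_availability).

    daily_counts: list of (good_requests, total_requests), one entry per day.
    Days with no traffic are skipped for the worst-day figure.
    """
    total_good = sum(good for good, _ in daily_counts)
    total_all = sum(total for _, total in daily_counts)
    if total_all == 0:
        raise ValueError("no traffic in the baseline period")
    worst_day = min(good / total for good, total in daily_counts if total > 0)
    return total_good / total_all, worst_day

# Hypothetical four-day excerpt of a longer 4-6 week baseline
history = [(9990, 10000), (9950, 10000), (10000, 10000), (9980, 10000)]
overall, worst = summarize_baseline(history)
print(f"overall={overall:.4f} worst_day={worst:.4f}")  # overall=0.9980 worst_day=0.9950
```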
User Journey Mapping
- Map critical user journeys
- Identify acceptable performance for each step
- Work backwards to component SLOs
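One way to work backwards from a journey target to component targets is sketched below. It assumes the journey is a chain of serial steps that fail independently and that every step gets an equal share of the allowed failures; real journeys often violate both assumptions, so treat the result as a starting point. The function name is hypothetical.

```python
def per_step_target(journey_target, steps):
    """Equal per-step success target so that `steps` serial steps meet journey_target."""
    if not 0 < journey_target <= 1 or steps < 1:
        raise ValueError("journey_target must be in (0, 1] and steps must be >= 1")
    return journey_target ** (1 / steps)

# A 99% journey target spread over three serial steps
target = per_step_target(0.99, 3)
print(f"{target:.4%}")  # roughly 99.6655% per step
```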
Error Budget Approach
- Decide how much unreliability you can afford
- Set SLO targets based on acceptable error budget consumption
- Example: 99.9% availability allows about 43.2 minutes of downtime per 30 days
SLO Target Examples by Service Criticality
Critical Services (Revenue Impact)
- Availability: 99.95% - 99.99%
- Latency (P95): 100-200ms
- Error Rate: < 0.1%
High Priority Services
- Availability: 99.9% - 99.95%
- Latency (P95): 200-500ms
- Error Rate: < 0.5%
Standard Services
- Availability: 99.5% - 99.9%
- Latency (P95): 500ms - 1s
- Error Rate: < 1%
Error Budget Management
What is an Error Budget?
Your error budget is the maximum amount of unreliability you can accumulate while still meeting your SLO. It's calculated as:
Error Budget = (1 - SLO) × Time Window
For a 99.9% availability SLO over 30 days:
Error Budget = (1 - 0.999) × 30 days = 0.001 × 43,200 minutes = 43.2 minutes
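The formula can be checked with a few lines of Python. This is a minimal sketch that treats the budget as minutes of complete downtime; the function name is an assumption. A 30-day window is 43,200 minutes, so a 99.9% SLO leaves 43.2 minutes.

```python
def error_budget_minutes(slo, window_days):
    """Allowed minutes of full downtime for an availability SLO over the window."""
    if not 0 <= slo <= 1:
        raise ValueError("slo must be a fraction between 0 and 1")
    return (1 - slo) * window_days * 24 * 60

for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%}: {error_budget_minutes(slo, 30):.1f} minutes per 30 days")
# 99.00%: 432.0 minutes per 30 days
# 99.90%: 43.2 minutes per 30 days
# 99.99%: 4.3 minutes per 30 days
```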
Error Budget Policies
Define what happens when you consume your error budget:
Conservative Policy (High-Risk Services)
- > 50% consumed: Freeze non-critical feature releases
- > 75% consumed: Focus entirely on reliability improvements
- > 90% consumed: Consider emergency measures (traffic shaping, etc.)
Balanced Policy (Standard Services)
- > 75% consumed: Increase focus on reliability work
- > 90% consumed: Pause feature work, focus on reliability
Aggressive Policy (Early Stage Services)
- > 90% consumed: Review but continue normal operations
- 100% consumed: Evaluate SLO appropriateness
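A policy like the ones above can be encoded as a small lookup so that dashboards or release tooling apply it consistently. The sketch below encodes only the conservative policy; the data structure, the function name, and the "normal operations" fallback label are assumptions.

```python
# Thresholds are checked from highest to lowest; the first match wins.
CONSERVATIVE_POLICY = [
    (0.90, "consider emergency measures"),
    (0.75, "focus entirely on reliability improvements"),
    (0.50, "freeze non-critical feature releases"),
]

def policy_action(consumed, policy):
    """Return the action for a consumed budget fraction (0.8 means 80% consumed)."""
    for threshold, action in policy:
        if consumed > threshold:
            return action
    return "normal operations"

print(policy_action(0.80, CONSERVATIVE_POLICY))  # focus entirely on reliability improvements
print(policy_action(0.30, CONSERVATIVE_POLICY))  # normal operations
```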
Burn Rate Alerting
Multi-window burn rate alerts help you catch SLO violations before they become critical:
# Fast burn: 2% of a 30-day budget consumed in 1 hour (burn rate 14.4; 0.001 = 1 - 0.999 SLO)
- alert: FastBurnSLOViolation
  expr: |
    (
      (1 - (sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m])))) > (14.4 * 0.001)
      and
      (1 - (sum(rate(http_requests_total{code!~"5.."}[1h])) / sum(rate(http_requests_total[1h])))) > (14.4 * 0.001)
    )
  for: 2m
# Slow burn: 10% of a 30-day budget consumed in 3 days (burn rate 1.0)
- alert: SlowBurnSLOViolation
  expr: |
    (
      (1 - (sum(rate(http_requests_total{code!~"5.."}[6h])) / sum(rate(http_requests_total[6h])))) > (1.0 * 0.001)
      and
      (1 - (sum(rate(http_requests_total{code!~"5.."}[3d])) / sum(rate(http_requests_total[3d])))) > (1.0 * 0.001)
    )
  for: 15m
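The multipliers 14.4 and 1.0 in the alerts above follow from simple arithmetic: a burn rate is the share of the budget you are willing to lose in the alert window, scaled by how many such windows fit in the SLO period. The sketch below assumes a 30-day SLO period and a 99.9% target; the function names are hypothetical.

```python
def burn_rate(budget_fraction, alert_window_hours, slo_period_hours=30 * 24):
    """Burn rate that consumes budget_fraction of the budget within the alert window."""
    return budget_fraction * slo_period_hours / alert_window_hours

def error_ratio_threshold(rate, slo):
    """Error ratio at which the budget burns at `rate` times the sustainable pace."""
    return rate * (1 - slo)

fast = burn_rate(0.02, 1)        # 2% of the budget in 1 hour
slow = burn_rate(0.10, 3 * 24)   # 10% of the budget in 3 days
print(round(fast, 2), round(slow, 2))                 # 14.4 1.0
print(round(error_ratio_threshold(fast, 0.999), 6))   # 0.0144
```

A burn rate of 1.0 means the budget would last exactly the SLO period; 14.4 means it would be gone in 720 / 14.4 = 50 hours.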
Implementation Patterns
The SLO Implementation Ladder
Level 1: Basic SLOs
- Choose 1-2 SLIs that matter most to users
- Set aspirational but achievable targets
- Implement basic alerting when SLOs are missed
Level 2: Operational SLOs
- Add burn rate alerting
- Create error budget dashboards
- Establish error budget policies
- Regular SLO review meetings
Level 3: Advanced SLOs
- Multi-window burn rate alerts
- Automated error budget policy enforcement
- SLO-driven incident prioritization
- Integration with CI/CD for deployment decisions
SLO Measurement Architecture
Push vs Pull Metrics
- Pull (Prometheus): Good for infrastructure metrics, real-time alerting
- Push (StatsD): Good for application metrics, business events
Measurement Points
- Server-side: More reliable, easier to implement
- Client-side: Better reflects user experience
- Synthetic: Consistent, predictable, may not reflect real user experience
SLO Dashboard Design
Essential elements for SLO dashboards:
- Current SLO Achievement: Large, prominent display
- Error Budget Remaining: Visual indicator (gauge, progress bar)
- Burn Rate: Time series showing error budget consumption rate
- Historical Trends: 4-week view of SLO achievement
- Alerts: Current and recent SLO-related alerts
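The "Error Budget Remaining" panel can be driven by a calculation like the one below. This is a sketch for a request-based availability SLO; the function name and the sample counts are assumptions.

```python
def budget_remaining(good, total, slo):
    """Fraction of the error budget left, given event counts for the SLO window.

    Negative values mean the budget is overspent.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    allowed_bad = (1 - slo) * total
    if allowed_bad == 0:
        raise ValueError("a 100% SLO leaves no error budget")
    actual_bad = total - good
    return 1 - actual_bad / allowed_bad

# 250 failed requests out of 1,000,000 against a 99.9% SLO (1,000 failures allowed)
remaining = budget_remaining(good=999_750, total=1_000_000, slo=0.999)
print(f"{remaining:.0%} of the error budget remains")  # 75% of the error budget remains
```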
Advanced Topics
Dependency SLOs
For services with dependencies:
SLO_service ≤ min(SLO_inherent, ∏SLO_dependencies)
If your service depends on 3 other services, each with a 99.9% SLO:
Maximum_SLO = 0.999³ ≈ 0.997 = 99.7%
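The bound above is easy to compute for any dependency list. The sketch below applies the formula as written and assumes every dependency is a hard, serial dependency whose failures are independent; the function name is hypothetical.

```python
import math

def max_achievable_slo(inherent_slo, dependency_slos):
    """Upper bound from the formula above: min(inherent, product of dependency SLOs)."""
    return min(inherent_slo, math.prod(dependency_slos))

bound = max_achievable_slo(0.9999, [0.999, 0.999, 0.999])
print(f"{bound:.4%}")  # 99.7003%
```

Caching, retries, or graceful degradation can soften a dependency so that it no longer counts fully against this bound.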
User Journey SLOs
Track end-to-end user experiences:
# Registration success rate
sum(rate(user_registration_success_total[5m])) / sum(rate(user_registration_attempts_total[5m]))
# Purchase completion latency (P95 across all series)
histogram_quantile(0.95, sum by (le) (rate(purchase_completion_duration_seconds_bucket[5m])))
SLOs for Batch Systems
Special considerations for non-request/response systems:
Freshness SLO
# Data should be no more than 4 hours old
(time() - last_successful_update_timestamp) < (4 * 3600)
Throughput SLO
# Should process at least 1000 items per hour
increase(items_processed_total[1h]) >= 1000
Quality SLO
# At least 99.5% of records should pass validation
sum(rate(records_valid_total[5m])) / sum(rate(records_processed_total[5m])) >= 0.995
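The three batch checks can also be expressed outside the metrics system, for example inside the job itself. The sketch below mirrors the thresholds used above (4 hours, 1000 items per hour, 99.5%); the function names and parameters are assumptions.

```python
def is_fresh(last_success_ts, now_ts, max_age_s=4 * 3600):
    """True if the last successful update is younger than max_age_s seconds."""
    return now_ts - last_success_ts < max_age_s

def items_per_hour(count_start, count_end, elapsed_s):
    """Hourly throughput from two readings of a monotonically increasing counter."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return (count_end - count_start) * 3600 / elapsed_s

def quality_ok(valid, processed, target=0.995):
    """True if the share of valid records meets the target."""
    if processed <= 0:
        raise ValueError("processed must be positive")
    return valid / processed >= target

print(is_fresh(last_success_ts=0, now_ts=3 * 3600))  # True
print(items_per_hour(0, 600, 1800))                  # 1200.0
print(quality_ok(990, 1000))                         # False
```

Note that the throughput figure is a per-hour count, not a per-second rate, so it can be compared directly with a "1000 items per hour" target.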
Common Mistakes and How to Avoid Them
Mistake 1: Too Many SLOs
Problem: Drowning in metrics, losing focus
Solution: Start with 1-2 SLOs per service, add more only when needed
Mistake 2: Internal Metrics as SLIs
Problem: Optimizing for metrics that don't impact users
Solution: Always ask "If this metric changes, do users notice?"
Mistake 3: Perfectionist SLOs
Problem: 99.99% SLO when 99.9% would be fine
Solution: Higher SLOs cost exponentially more; pick the minimum acceptable level
Mistake 4: Ignoring Error Budgets
Problem: Treating any SLO miss as an emergency
Solution: Error budgets exist to be spent; use them to balance feature velocity and reliability
Mistake 5: Static SLOs
Problem: Setting SLOs once and never updating them
Solution: Review SLOs quarterly; adjust based on user feedback and business changes
SLO Review Process
Monthly SLO Review Agenda
- SLO Achievement Review: Did we meet our SLOs?
- Error Budget Analysis: How did we spend our error budget?
- Incident Correlation: Which incidents impacted our SLOs?
- SLI Quality Assessment: Are our SLIs still meaningful?
- Target Adjustment: Should we change any targets?
Quarterly SLO Health Check
- User Impact Validation: Survey users about acceptable performance
- Business Alignment: Do SLOs still reflect business priorities?
- Measurement Quality: Are we measuring the right things?
- Cost/Benefit Analysis: Are tighter SLOs worth the investment?
Tooling and Automation
Essential Tools
- Metrics Collection: Prometheus, InfluxDB, CloudWatch
- Alerting: Alertmanager, PagerDuty, OpsGenie
- Dashboards: Grafana, DataDog, New Relic
- SLO Platforms: Sloth, Pyrra, Service Level Blue
Automation Opportunities
- Burn rate alert generation from SLO definitions
- Dashboard creation from SLO specifications
- Error budget calculation and tracking
- Release blocking based on error budget consumption
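As an illustration of the first item, burn rate alert expressions can be generated from a small SLO definition instead of being written by hand. The sketch below only builds PromQL strings and does not talk to Prometheus; the function names are hypothetical, and the metric name and `code` label are assumed to follow the HTTP availability example used earlier in this cookbook.

```python
def error_ratio(metric, window):
    """PromQL for the share of 5xx responses over the given window."""
    return (
        f'(1 - (sum(rate({metric}{{code!~"5.."}}[{window}])) '
        f'/ sum(rate({metric}[{window}]))))'
    )

def burn_rate_expr(metric, slo, burn_rate, short_window, long_window):
    """Multi-window condition: both windows must exceed burn_rate x error budget."""
    threshold = f"({burn_rate} * {round(1 - slo, 10)})"
    return (
        f"{error_ratio(metric, short_window)} > {threshold}\n"
        f"and\n"
        f"{error_ratio(metric, long_window)} > {threshold}"
    )

expr = burn_rate_expr("http_requests_total", 0.999, 14.4, "5m", "1h")
print(expr)
```

Tools such as Sloth and Pyrra take this idea further and generate complete rule files from SLO specifications.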
Getting Started Checklist
- Identify your service's critical user journeys
- Choose 1-2 SLIs that best reflect user experience
- Collect 4-6 weeks of baseline data
- Set initial SLO targets based on historical performance
- Implement basic SLO monitoring and alerting
- Create an SLO dashboard
- Define error budget policies
- Schedule monthly SLO reviews
- Plan for quarterly SLO health checks
Remember: SLOs are a journey, not a destination. Start simple, learn from experience, and iterate toward better reliability management.