# SLO Cookbook: A Practical Guide to Service Level Objectives

## Introduction

Service Level Objectives (SLOs) are a key tool for managing service reliability. This cookbook provides practical guidance for implementing SLOs that actually improve system reliability rather than just creating meaningless metrics.

## Fundamentals

### The SLI/SLO/SLA Hierarchy

- **SLI (Service Level Indicator)**: A quantifiable measure of service quality
- **SLO (Service Level Objective)**: A target range of values for an SLI
- **SLA (Service Level Agreement)**: A business agreement with consequences for missing SLO targets

### Golden Rule of SLOs

**Start simple, iterate based on learning.** Your first SLOs won't be perfect, and that's okay.

## Choosing Good SLIs

### The Four Golden Signals

1. **Latency**: How long requests take to complete
2. **Traffic**: How many requests are coming in
3. **Errors**: How many requests are failing
4. **Saturation**: How "full" your service is

### SLI Selection Criteria

A good SLI should be:

- **Measurable**: You can collect data for it
- **Meaningful**: It reflects user experience
- **Controllable**: You can take action to improve it
- **Proportional**: Changes in the SLI reflect changes in user happiness

### Service Type Specific SLIs

#### HTTP APIs

- **Request latency**: P95 or P99 response time
- **Availability**: Proportion of successful requests (non-5xx)
- **Throughput**: Requests per second capacity

```prometheus
# Availability SLI: share of requests that did not return a 5xx
sum(rate(http_requests_total{code!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Latency SLI: service-wide P95, aggregated across all series by bucket boundary
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```

#### Batch Jobs

- **Freshness**: Age of the last successful run
- **Correctness**: Proportion of jobs completing successfully
- **Throughput**: Items processed per unit time

#### Data Pipelines

- **Data freshness**: Time since last successful update
- **Data quality**: Proportion of records passing validation
- **Processing latency**: Time from
ingestion to availability

### Anti-Patterns in SLI Selection

❌ **Don't use**: CPU usage, memory usage, disk space as primary SLIs
- These are symptoms, not user-facing impacts

❌ **Don't use**: Counts instead of rates or proportions
- "Number of errors" vs "Error rate"

❌ **Don't use**: Internal metrics that users don't care about
- Queue depth, cache hit rate (unless they directly impact user experience)

## Setting SLO Targets

### The Art of Target Setting

Setting SLO targets is a balancing act between:

- **User happiness**: Targets should reflect acceptable user experience
- **Business value**: Tighter SLOs cost more to maintain
- **Current performance**: Targets should be achievable but aspirational

### Target Setting Strategies

#### Historical Performance Method

1. Collect 4-6 weeks of historical data
2. Calculate the worst user-visible performance in that period
3. Set your SLO slightly better than the worst performance that was still acceptable to users

#### User Journey Mapping

1. Map critical user journeys
2. Identify acceptable performance for each step
3. Work backwards to component SLOs

#### Error Budget Approach

1. Decide how much unreliability you can afford
2. Set SLO targets based on acceptable error budget consumption
3. Example: 99.9% availability ≈ 43.8 minutes of downtime per average calendar month (43.2 minutes over a 30-day window)

### SLO Target Examples by Service Criticality

#### Critical Services (Revenue Impact)

- **Availability**: 99.95% - 99.99%
- **Latency (P95)**: 100-200ms
- **Error Rate**: < 0.1%

#### High Priority Services

- **Availability**: 99.9% - 99.95%
- **Latency (P95)**: 200-500ms
- **Error Rate**: < 0.5%

#### Standard Services

- **Availability**: 99.5% - 99.9%
- **Latency (P95)**: 500ms - 1s
- **Error Rate**: < 1%

## Error Budget Management

### What is an Error Budget?

Your error budget is the maximum amount of unreliability you can accumulate while still meeting your SLO.
It's calculated as:

```
Error Budget = (1 - SLO) × Time Window
```

For a 99.9% availability SLO over 30 days:

```
Error Budget = (1 - 0.999) × 30 days
             = 0.001 × 43,200 minutes
             = 43.2 minutes
```

### Error Budget Policies

Define what happens when you consume your error budget:

#### Conservative Policy (High-Risk Services)

- **> 50% consumed**: Freeze non-critical feature releases
- **> 75% consumed**: Focus entirely on reliability improvements
- **> 90% consumed**: Consider emergency measures (traffic shaping, etc.)

#### Balanced Policy (Standard Services)

- **> 75% consumed**: Increase focus on reliability work
- **> 90% consumed**: Pause feature work, focus on reliability

#### Aggressive Policy (Early Stage Services)

- **> 90% consumed**: Review but continue normal operations
- **100% consumed**: Evaluate SLO appropriateness

### Burn Rate Alerting

Multi-window burn rate alerts help you catch SLO violations before they become critical. In the rules below, `0.001` is the error budget fraction for a 99.9% SLO, and the multipliers (14.4 and 1.0) are burn rates relative to a 30-day window:

```yaml
# Fast burn: 2% budget consumed in 1 hour
- alert: FastBurnSLOViolation
  expr: |
    (
      (1 - (sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m])))) > (14.4 * 0.001)
      and
      (1 - (sum(rate(http_requests_total{code!~"5.."}[1h])) / sum(rate(http_requests_total[1h])))) > (14.4 * 0.001)
    )
  for: 2m

# Slow burn: 10% budget consumed in 3 days
- alert: SlowBurnSLOViolation
  expr: |
    (
      (1 - (sum(rate(http_requests_total{code!~"5.."}[6h])) / sum(rate(http_requests_total[6h])))) > (1.0 * 0.001)
      and
      (1 - (sum(rate(http_requests_total{code!~"5.."}[3d])) / sum(rate(http_requests_total[3d])))) > (1.0 * 0.001)
    )
  for: 15m
```

## Implementation Patterns

### The SLO Implementation Ladder

#### Level 1: Basic SLOs

- Choose 1-2 SLIs that matter most to users
- Set aspirational but achievable targets
- Implement basic alerting when SLOs are missed

#### Level 2: Operational SLOs

- Add burn rate alerting
- Create error budget dashboards
- Establish error budget policies
- Regular SLO review meetings

#### Level 3: Advanced SLOs

- Multi-window
burn rate alerts
- Automated error budget policy enforcement
- SLO-driven incident prioritization
- Integration with CI/CD for deployment decisions

### SLO Measurement Architecture

#### Push vs Pull Metrics

- **Pull** (Prometheus): Good for infrastructure metrics, real-time alerting
- **Push** (StatsD): Good for application metrics, business events

#### Measurement Points

- **Server-side**: More reliable, easier to implement
- **Client-side**: Better reflects user experience
- **Synthetic**: Consistent, predictable, may not reflect real user experience

### SLO Dashboard Design

Essential elements for SLO dashboards:

1. **Current SLO Achievement**: Large, prominent display
2. **Error Budget Remaining**: Visual indicator (gauge, progress bar)
3. **Burn Rate**: Time series showing error budget consumption rate
4. **Historical Trends**: 4-week view of SLO achievement
5. **Alerts**: Current and recent SLO-related alerts

## Advanced Topics

### Dependency SLOs

For services with dependencies:

```
SLO_service ≤ min(SLO_inherent, ∏SLO_dependencies)
```

If your service depends on 3 other services each with 99.9% SLO:

```
Maximum_SLO = 0.999³ ≈ 0.997 = 99.7%
```

### User Journey SLOs

Track end-to-end user experiences:

```prometheus
# Registration success rate
sum(rate(user_registration_success_total[5m]))
/
sum(rate(user_registration_attempts_total[5m]))

# Purchase completion latency (P95, aggregated by bucket boundary)
histogram_quantile(0.95, sum by (le) (rate(purchase_completion_duration_seconds_bucket[5m])))
```

### SLOs for Batch Systems

Special considerations for non-request/response systems:

#### Freshness SLO

```prometheus
# Data should be no more than 4 hours old
(time() - last_successful_update_timestamp) < (4 * 3600)
```

#### Throughput SLO

```prometheus
# Should process at least 1000 items per hour
# (increase() gives the total over the window; rate() would be per second)
increase(items_processed_total[1h]) >= 1000
```

#### Quality SLO

```prometheus
# At least 99.5% of records should pass validation
sum(rate(records_valid_total[5m])) / sum(rate(records_processed_total[5m])) >= 0.995
```

## Common Mistakes and How to Avoid Them

### Mistake 1: Too Many SLOs

**Problem**: Drowning in metrics, losing focus

**Solution**: Start with 1-2 SLOs per service, add more only when needed

### Mistake 2: Internal Metrics as SLIs

**Problem**: Optimizing for metrics that don't impact users

**Solution**: Always ask "If this metric changes, do users notice?"

### Mistake 3: Perfectionist SLOs

**Problem**: 99.99% SLO when 99.9% would be fine

**Solution**: Higher SLOs cost exponentially more; pick the minimum acceptable level

### Mistake 4: Ignoring Error Budgets

**Problem**: Treating any SLO miss as an emergency

**Solution**: Error budgets exist to be spent; use them to balance feature velocity and reliability

### Mistake 5: Static SLOs

**Problem**: Setting SLOs once and never updating them

**Solution**: Review SLOs quarterly; adjust based on user feedback and business changes

## SLO Review Process

### Monthly SLO Review Agenda

1. **SLO Achievement Review**: Did we meet our SLOs?
2. **Error Budget Analysis**: How did we spend our error budget?
3. **Incident Correlation**: Which incidents impacted our SLOs?
4. **SLI Quality Assessment**: Are our SLIs still meaningful?
5. **Target Adjustment**: Should we change any targets?

### Quarterly SLO Health Check

1. **User Impact Validation**: Survey users about acceptable performance
2. **Business Alignment**: Do SLOs still reflect business priorities?
3. **Measurement Quality**: Are we measuring the right things?
4. **Cost/Benefit Analysis**: Are tighter SLOs worth the investment?

## Tooling and Automation

### Essential Tools

1. **Metrics Collection**: Prometheus, InfluxDB, CloudWatch
2. **Alerting**: Alertmanager, PagerDuty, OpsGenie
3. **Dashboards**: Grafana, DataDog, New Relic
4. **SLO Platforms**: Sloth, Pyrra, Service Level Blue

### Automation Opportunities

- **Burn rate alert generation** from SLO definitions
- **Dashboard creation** from SLO specifications
- **Error budget calculation** and tracking
- **Release blocking** based on error budget consumption

## Getting Started Checklist

- [ ] Identify your service's critical user journeys
- [ ] Choose 1-2 SLIs that best reflect user experience
- [ ] Collect 4-6 weeks of baseline data
- [ ] Set initial SLO targets based on historical performance
- [ ] Implement basic SLO monitoring and alerting
- [ ] Create an SLO dashboard
- [ ] Define error budget policies
- [ ] Schedule monthly SLO reviews
- [ ] Plan for quarterly SLO health checks

Remember: SLOs are a journey, not a destination. Start simple, learn from experience, and iterate toward better reliability management.
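
The arithmetic used throughout this cookbook (the error budget formula, the 14.4 and 1.0 burn rate multipliers, and the dependency ceiling) can be sanity-checked with a few lines of Python. This is only a minimal sketch: the function names are hypothetical, not part of any SLO tool, and a 30-day window is assumed unless stated otherwise.

```python
import math

MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo: float, window_days: float = 30) -> float:
    """Error Budget = (1 - SLO) x Time Window, expressed in minutes."""
    return (1 - slo) * window_days * MINUTES_PER_DAY


def budget_fraction_consumed(burn_rate: float, hours: float, window_days: float = 30) -> float:
    """Fraction of the window's budget spent when burning at `burn_rate` for `hours`.

    A burn rate of 1.0 spends exactly the whole budget over the full window.
    """
    return burn_rate * hours / (window_days * 24)


def composite_slo_ceiling(dependency_slos) -> float:
    """Upper bound on availability when every dependency is required.

    Assumes independent failures, so the bound is the product of the SLOs.
    """
    return math.prod(dependency_slos)


if __name__ == "__main__":
    # 99.9% SLO over a 30-day window
    print(f"{error_budget_minutes(0.999):.1f} minutes")      # 43.2 minutes
    # Fast burn: 14.4x for 1 hour
    print(f"{budget_fraction_consumed(14.4, 1):.1%}")        # 2.0%
    # Slow burn: 1.0x for 3 days (72 hours)
    print(f"{budget_fraction_consumed(1.0, 72):.1%}")        # 10.0%
    # Three required dependencies at 99.9% each
    print(f"{composite_slo_ceiling([0.999] * 3):.4f}")       # 0.9970
```

Note that a 30-day window is exactly 43,200 minutes, which gives 43.2 minutes of budget at 99.9%; the commonly quoted 43.8 minutes corresponds to an average calendar month of about 30.44 days, which can be reproduced with `error_budget_minutes(0.999, window_days=365.25 / 12)`.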