add brain

This commit is contained in:
2026-03-12 15:17:52 +07:00
parent fd9f558fa1
commit e7821a7a9d
355 changed files with 93784 additions and 24 deletions

# Alert Design Patterns: A Guide to Effective Alerting
## Introduction
Well-designed alerts are the difference between a reliable system and 3 AM pages about non-issues. This guide provides patterns and anti-patterns for creating alerts that provide value without causing fatigue.
## Fundamental Principles
### The Golden Rules of Alerting
1. **Every alert should be actionable** - If you can't do something about it, don't alert
2. **Every alert should require human intelligence** - If a script can handle it, automate the response
3. **Every alert should be novel** - Don't alert on known, ongoing issues
4. **Every alert should represent a user-visible impact** - Internal metrics matter only if users are affected
### Alert Classification
#### Critical Alerts
- Service is completely down
- Data loss is occurring
- Security breach detected
- SLO burn rate indicates imminent SLO violation
#### Warning Alerts
- Service degradation affecting some users
- Approaching resource limits
- Dependent service issues
- Elevated error rates within SLO
#### Info Alerts
- Deployment notifications
- Capacity planning triggers
- Configuration changes
- Maintenance windows
## Alert Design Patterns
### Pattern 1: Symptoms, Not Causes
**Good**: Alert on user-visible symptoms
```yaml
- alert: HighLatency
expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
for: 5m
annotations:
summary: "API latency is high"
description: "95th percentile latency is {{ $value }}s, above 500ms threshold"
```
**Bad**: Alert on internal metrics that may not affect users
```yaml
- alert: HighCPU
expr: cpu_usage > 80
# This might not affect users at all!
```
### Pattern 2: Multi-Window Alerting
Reduce false positives by requiring sustained problems:
```yaml
- alert: ServiceDown
expr: (
avg_over_time(up[2m]) == 0 # Short window: immediate detection
and
avg_over_time(up[10m]) < 0.8 # Long window: avoid flapping
)
for: 1m
```
### Pattern 3: Burn Rate Alerting
Alert based on error budget consumption rate:
```yaml
# Fast burn: 2% of monthly budget in 1 hour
- alert: ErrorBudgetFastBurn
expr: (
error_rate_5m > (14.4 * error_budget_slo)
and
error_rate_1h > (14.4 * error_budget_slo)
)
for: 2m
labels:
severity: critical
# Slow burn: 10% of monthly budget in 3 days
- alert: ErrorBudgetSlowBurn
expr: (
error_rate_6h > (1.0 * error_budget_slo)
and
error_rate_3d > (1.0 * error_budget_slo)
)
for: 15m
labels:
severity: warning
```
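The 14.4 and 1.0 multipliers above are not magic numbers: they follow from the budget fraction, the alerting window, and the SLO period. A minimal Python sketch of the derivation, assuming a 30-day SLO period:

```python
# Burn-rate multiplier: how fast the error budget is burning relative to a
# steady burn that would consume it exactly over the full SLO period.
# Consuming `budget_fraction` of the budget in `window_hours` implies:
#   multiplier = budget_fraction * period_hours / window_hours

def burn_rate_multiplier(budget_fraction, window_hours, period_hours=30 * 24):
    return budget_fraction * period_hours / window_hours

# Fast burn: 2% of a 30-day budget in 1 hour -> multiplier 14.4
fast = burn_rate_multiplier(0.02, 1)
# Slow burn: 10% of a 30-day budget in 3 days -> multiplier 1.0
slow = burn_rate_multiplier(0.10, 3 * 24)
```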
### Pattern 4: Hysteresis
Use different thresholds for firing and resolving to prevent flapping:
```yaml
- alert: HighErrorRate
  # Fire at 5%; once firing, stay firing until error_rate drops below 3%.
  # Prometheus has no built-in hysteresis -- an alert resolves as soon as
  # its expr stops being true -- so the lower resolve threshold is
  # implemented with a self-reference to the ALERTS metric.
  expr: |
    error_rate > 0.05
    or
    (error_rate > 0.03 and on()
     ALERTS{alertname="HighErrorRate", alertstate="firing"} == 1)
  for: 5m
```
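The fire-high, resolve-low behavior can be modeled as a tiny state machine, which is handy for unit-testing threshold choices against recorded error rates before shipping them. A minimal sketch using the 5%/3% thresholds from the example:

```python
# Hysteresis sketch (illustrative thresholds): fire above the high
# threshold, stay firing until the series drops below the low threshold.

def hysteresis(samples, fire_at=0.05, resolve_at=0.03):
    firing = False
    states = []
    for value in samples:
        if not firing and value > fire_at:
            firing = True
        elif firing and value < resolve_at:
            firing = False
        states.append(firing)
    return states

# An error rate oscillating around 5% fires once instead of flapping:
print(hysteresis([0.02, 0.06, 0.045, 0.051, 0.04, 0.02]))
# [False, True, True, True, True, False]
```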
### Pattern 5: Composite Alerts
Alert when multiple conditions indicate a problem:
```yaml
- alert: ServiceDegraded
expr: (
(latency_p95 > latency_threshold)
or
(error_rate > error_threshold)
or
(availability < availability_threshold)
) and (
request_rate > min_request_rate # Only alert if we have traffic
)
```
### Pattern 6: Contextual Alerting
Include relevant context in alerts:
```yaml
- alert: DatabaseConnections
expr: db_connections_active / db_connections_max > 0.8
for: 5m
annotations:
summary: "Database connection pool nearly exhausted"
description: "{{ $labels.database }} has {{ $value | humanizePercentage }} connection utilization"
runbook_url: "https://runbooks.company.com/database-connections"
impact: "New requests may be rejected, causing 500 errors"
suggested_action: "Check for connection leaks or increase pool size"
```
## Alert Routing and Escalation
### Routing by Impact and Urgency
#### Critical Path Services
```yaml
route:
group_by: ['service']
routes:
- match:
service: 'payment-api'
severity: 'critical'
receiver: 'payment-team-pager'
continue: true
- match:
service: 'payment-api'
severity: 'warning'
receiver: 'payment-team-slack'
```
#### Time-Based Routing
```yaml
route:
  routes:
    - match:
        severity: 'critical'
      receiver: 'oncall-pager'
    - match:
        severity: 'warning'
      receiver: 'team-slack'
      # Only active 9 AM - 5 PM; 'business_hours' is defined in the
      # top-level time_intervals section of the Alertmanager config
      active_time_intervals: ['business_hours']
    - match:
        severity: 'warning'
      receiver: 'team-email'  # Lower urgency outside business hours
```
### Escalation Patterns
#### Linear Escalation
```yaml
receivers:
- name: 'primary-oncall'
pagerduty_configs:
- escalation_policy: 'P1-Escalation'
# 0 min: Primary on-call
# 5 min: Secondary on-call
# 15 min: Engineering manager
# 30 min: Director of engineering
```
#### Severity-Based Escalation
```yaml
# Critical: Immediate escalation
- match:
severity: 'critical'
receiver: 'critical-escalation'
# Warning: Team-first escalation
- match:
severity: 'warning'
receiver: 'team-escalation'
```
## Alert Fatigue Prevention
### Grouping and Suppression
#### Time-Based Grouping
```yaml
route:
group_wait: 30s # Wait 30s to group similar alerts
group_interval: 2m # Send grouped alerts every 2 minutes
repeat_interval: 1h # Re-send unresolved alerts every hour
```
#### Dependent Service Suppression
```yaml
# Prometheus alert rules
- alert: ServiceDown
  expr: up == 0
- alert: HighLatency
  expr: latency_p95 > 1

# Alertmanager configuration (separate file): suppress HighLatency for a
# service while ServiceDown is already firing for that same service
inhibit_rules:
  - source_match:
      alertname: 'ServiceDown'
    target_match:
      alertname: 'HighLatency'
    equal: ['service']
```
### Alert Throttling
```yaml
# Limit to 1 alert per 10 minutes for noisy conditions
- alert: HighMemoryUsage
expr: memory_usage_percent > 85
for: 10m # Longer 'for' duration reduces noise
annotations:
summary: "Memory usage has been high for 10+ minutes"
```
### Smart Defaults
```yaml
# Use business logic to set intelligent thresholds
- alert: LowTraffic
expr: request_rate < (
avg_over_time(request_rate[7d]) * 0.1 # 10% of weekly average
)
# Only alert during business hours when low traffic is unusual
for: 30m
```
## Runbook Integration
### Runbook Structure Template
```markdown
# Alert: {{ $labels.alertname }}
## Immediate Actions
1. Check service status dashboard
2. Verify if users are affected
3. Look at recent deployments/changes
## Investigation Steps
1. Check logs for errors in the last 30 minutes
2. Verify dependent services are healthy
3. Check resource utilization (CPU, memory, disk)
4. Review recent alerts for patterns
## Resolution Actions
- If deployment-related: Consider rollback
- If resource-related: Scale up or optimize queries
- If dependency-related: Engage appropriate team
## Escalation
- Primary: @team-oncall
- Secondary: @engineering-manager
- Emergency: @site-reliability-team
```
### Runbook Integration in Alerts
```yaml
annotations:
runbook_url: "https://runbooks.company.com/alerts/{{ $labels.alertname }}"
quick_debug: |
1. curl -s https://{{ $labels.instance }}/health
2. kubectl logs {{ $labels.pod }} --tail=50
3. Check dashboard: https://grafana.company.com/d/service-{{ $labels.service }}
```
## Testing and Validation
### Alert Testing Strategies
#### Chaos Engineering Integration
```python
# Test that alerts fire during controlled failures
def test_alert_during_cpu_spike():
with chaos.cpu_spike(target='payment-api', duration='2m'):
assert wait_for_alert('HighCPU', timeout=180)
def test_alert_during_network_partition():
with chaos.network_partition(target='database'):
assert wait_for_alert('DatabaseUnreachable', timeout=60)
```
#### Historical Alert Analysis
```prometheus
# Alerts that fired in the last 30 days without a matching incident
count by (alertname) (
  count_over_time(ALERTS{alertstate="firing"}[30d])
)
unless on (alertname)
count by (alertname) (
  count_over_time(incident_created{source="alert"}[30d])
)
```
### Alert Quality Metrics
#### Alert Precision
```
Precision = True Positives / (True Positives + False Positives)
```
Track alerts that resulted in actual incidents vs false alarms.
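The precision formula is simple enough to fold into a periodic alert-review script. A minimal sketch, with illustrative counts from an incident review:

```python
# Alert precision from incident-review data: what fraction of fired
# alerts corresponded to a real incident.

def precision(true_positives, false_positives):
    total = true_positives + false_positives
    return true_positives / total if total else 0.0

# 18 alerts matched real incidents, 6 were false alarms:
print(round(precision(18, 6), 2))  # 0.75
```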
#### Time to Resolution
```prometheus
# Average time from alert firing to resolution, per alert, over 30 days
avg by (alertname) (
  avg_over_time(
    (alert_resolved_timestamp - alert_fired_timestamp)[30d:1h]
  )
)
```
#### Alert Fatigue Indicators
```prometheus
# Alerts per day by team
sum by (team) (
increase(alerts_fired_total[1d])
)
# Percentage of alerts acknowledged within 15 minutes
sum(alerts_acked_within_15m) / sum(alerts_fired) * 100
```
## Advanced Patterns
### Machine Learning-Enhanced Alerting
#### Anomaly Detection
```yaml
- alert: AnomalousTraffic
expr: |
abs(request_rate - predict_linear(request_rate[1h], 300)) /
stddev_over_time(request_rate[1h]) > 3
for: 10m
annotations:
summary: "Traffic pattern is anomalous"
description: "Current traffic deviates from predicted pattern by >3 standard deviations"
```
#### Dynamic Thresholds
```yaml
- alert: DynamicHighLatency
expr: |
latency_p95 > (
quantile_over_time(0.95, latency_p95[7d]) + # Historical 95th percentile
2 * stddev_over_time(latency_p95[7d]) # Plus 2 standard deviations
)
```
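Assuming a week of recorded `latency_p95` samples, the same quantile-plus-two-sigma threshold can be prototyped offline before committing it to a rule. A rough sketch using a nearest-rank percentile:

```python
# Dynamic threshold sketch: historical p95 plus two standard deviations.
# `samples` stands in for a week of recorded latency_p95 values.
import statistics

def dynamic_threshold(samples, quantile=0.95, sigmas=2):
    ordered = sorted(samples)
    # nearest-rank quantile of the historical window
    idx = min(len(ordered) - 1, int(quantile * len(ordered)))
    historical_quantile = ordered[idx]
    return historical_quantile + sigmas * statistics.stdev(samples)

latencies = [0.20, 0.22, 0.21, 0.25, 0.23, 0.24, 0.30]
# The threshold sits above everything seen historically, so only genuine
# outliers alert:
print(dynamic_threshold(latencies) > max(latencies))  # True
```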
### Business Hours Awareness
```yaml
# Different thresholds for business vs off hours
- alert: HighLatencyBusinessHours
expr: latency_p95 > 0.2 # Stricter during business hours
for: 2m
# Active 9 AM - 5 PM weekdays
- alert: HighLatencyOffHours
expr: latency_p95 > 0.5 # More lenient after hours
for: 5m
# Active nights and weekends
```
### Progressive Alerting
```yaml
# Escalating alert severity based on duration
- alert: ServiceLatencyElevated
expr: latency_p95 > 0.5
for: 5m
labels:
severity: info
- alert: ServiceLatencyHigh
expr: latency_p95 > 0.5
for: 15m # Same condition, longer duration
labels:
severity: warning
- alert: ServiceLatencyCritical
expr: latency_p95 > 0.5
for: 30m # Same condition, even longer duration
labels:
severity: critical
```
## Anti-Patterns to Avoid
### Anti-Pattern 1: Alerting on Everything
**Problem**: Too many alerts create noise and fatigue
**Solution**: Be selective; only alert on user-impacting issues
### Anti-Pattern 2: Vague Alert Messages
**Problem**: "Service X is down" - which instance? what's the impact?
**Solution**: Include specific details and context
### Anti-Pattern 3: Alerts Without Runbooks
**Problem**: Alerts that don't explain what to do
**Solution**: Every alert must have an associated runbook
### Anti-Pattern 4: Static Thresholds
**Problem**: 80% CPU might be normal during peak hours
**Solution**: Use contextual, adaptive thresholds
### Anti-Pattern 5: Ignoring Alert Quality
**Problem**: Accepting high false positive rates
**Solution**: Regularly review and tune alert precision
## Implementation Checklist
### Pre-Implementation
- [ ] Define alert severity levels and escalation policies
- [ ] Create runbook templates
- [ ] Set up alert routing configuration
- [ ] Define SLOs that alerts will protect
### Alert Development
- [ ] Each alert has clear success criteria
- [ ] Alert conditions tested against historical data
- [ ] Runbook created and accessible
- [ ] Severity and routing configured
- [ ] Context and suggested actions included
### Post-Implementation
- [ ] Monitor alert precision and recall
- [ ] Regular review of alert fatigue metrics
- [ ] Quarterly alert effectiveness review
- [ ] Team training on alert response procedures
### Quality Assurance
- [ ] Test alerts fire during controlled failures
- [ ] Verify alerts resolve when conditions improve
- [ ] Confirm runbooks are accurate and helpful
- [ ] Validate escalation paths work correctly
Remember: Great alerts are invisible when things work and invaluable when things break. Focus on quality over quantity, and always optimize for the human who will respond to the alert at 3 AM.

# Dashboard Best Practices: Design for Insight and Action
## Introduction
A well-designed dashboard is like a good story - it guides you through the data with purpose and clarity. This guide provides practical patterns for creating dashboards that inform decisions and enable quick troubleshooting.
## Design Principles
### The Hierarchy of Information
#### Primary Information (Top Third)
- Service health status
- SLO achievement
- Critical alerts
- Business KPIs
#### Secondary Information (Middle Third)
- Golden signals (latency, traffic, errors, saturation)
- Resource utilization
- Throughput and performance metrics
#### Tertiary Information (Bottom Third)
- Detailed breakdowns
- Historical trends
- Dependency status
- Debug information
### Visual Design Principles
#### Rule of 7±2
- Maximum 7±2 panels per screen
- Group related information together
- Use sections to organize complexity
#### Color Psychology
- **Red**: Critical issues, danger, immediate attention needed
- **Yellow/Orange**: Warnings, caution, degraded state
- **Green**: Healthy, normal operation, success
- **Blue**: Information, neutral metrics, capacity
- **Gray**: Disabled, unknown, or baseline states
#### Chart Selection Guide
- **Line charts**: Time series, trends, comparisons over time
- **Bar charts**: Categorical comparisons, top N lists
- **Gauges**: Single value with defined good/bad ranges
- **Stat panels**: Key metrics, percentages, counts
- **Heatmaps**: Distribution data, correlation analysis
- **Tables**: Detailed breakdowns, multi-dimensional data
## Dashboard Archetypes
### The Overview Dashboard
**Purpose**: High-level health check and business metrics
**Audience**: Executives, managers, cross-team stakeholders
**Update Frequency**: 5-15 minutes
```yaml
sections:
- title: "Business Health"
panels:
- service_availability_summary
- revenue_per_hour
- active_users
- conversion_rate
- title: "System Health"
panels:
- critical_alerts_count
- slo_achievement_summary
- error_budget_remaining
- deployment_status
```
### The SRE Operational Dashboard
**Purpose**: Real-time monitoring and incident response
**Audience**: SRE, on-call engineers
**Update Frequency**: 15-30 seconds
```yaml
sections:
- title: "Service Status"
panels:
- service_up_status
- active_incidents
- recent_deployments
- title: "Golden Signals"
panels:
- latency_percentiles
- request_rate
- error_rate
- resource_saturation
- title: "Infrastructure"
panels:
- cpu_memory_utilization
- network_io
- disk_space
```
### The Developer Debug Dashboard
**Purpose**: Deep-dive troubleshooting and performance analysis
**Audience**: Development teams
**Update Frequency**: 30 seconds - 2 minutes
```yaml
sections:
- title: "Application Performance"
panels:
- endpoint_latency_breakdown
- database_query_performance
- cache_hit_rates
- queue_depths
- title: "Errors and Logs"
panels:
- error_rate_by_endpoint
- log_volume_by_level
- exception_types
- slow_queries
```
## Layout Patterns
### The F-Pattern Layout
Based on eye-tracking studies, users scan in an F-pattern:
```
[Critical Status] [SLO Summary ] [Error Budget ]
[Latency ] [Traffic ] [Errors ]
[Saturation ] [Resource Use ] [Detailed View]
[Historical ] [Dependencies ] [Debug Info ]
```
### The Z-Pattern Layout
For executive dashboards, follow the Z-pattern:
```
[Business KPIs ] → [System Status]
↓ ↓
[Trend Analysis ] ← [Key Metrics ]
```
### Responsive Design
#### Desktop (1920x1080)
- 24-column grid
- Panels can be 6, 8, 12, or 24 units wide
- 4-6 rows visible without scrolling
#### Laptop (1366x768)
- Stack wider panels vertically
- Reduce panel heights
- Prioritize most critical information
#### Mobile (768px width)
- Single column layout
- Simplified panels
- Touch-friendly controls
## Effective Panel Design
### Stat Panels
```yaml
# Good: Clear value with context
- title: "API Availability"
type: stat
targets:
- expr: avg(up{service="api"}) * 100
field_config:
unit: percent
thresholds:
steps:
- color: red
value: 0
- color: yellow
value: 99
- color: green
value: 99.9
options:
color_mode: background
text_mode: value_and_name
```
### Time Series Panels
```yaml
# Good: Multiple related metrics with clear legend
- title: "Request Latency"
type: timeseries
targets:
- expr: histogram_quantile(0.50, rate(http_duration_bucket[5m]))
legend: "P50"
- expr: histogram_quantile(0.95, rate(http_duration_bucket[5m]))
legend: "P95"
- expr: histogram_quantile(0.99, rate(http_duration_bucket[5m]))
legend: "P99"
field_config:
unit: ms
custom:
draw_style: line
fill_opacity: 10
options:
legend:
display_mode: table
placement: bottom
values: [min, max, mean, last]
```
### Table Panels
```yaml
# Good: Top N with relevant columns
- title: "Slowest Endpoints"
type: table
targets:
- expr: topk(10, histogram_quantile(0.95, sum by (handler)(rate(http_duration_bucket[5m]))))
format: table
instant: true
transformations:
- id: organize
options:
exclude_by_name:
Time: true
rename_by_name:
Value: "P95 Latency (ms)"
handler: "Endpoint"
```
## Color and Visualization Best Practices
### Threshold Configuration
```yaml
# Traffic light system with meaningful boundaries
thresholds:
steps:
- color: green # Good performance
value: null # Default
- color: yellow # Degraded performance
value: 95 # 95th percentile of historical normal
- color: orange # Poor performance
value: 99 # 99th percentile of historical normal
- color: red # Critical performance
value: 99.9 # Worst case scenario
```
### Color Blind Friendly Palettes
```yaml
# Use patterns and shapes in addition to color
field_config:
overrides:
- matcher:
id: byName
options: "Critical"
properties:
- id: color
value:
mode: fixed
fixed_color: "#d73027" # Red-orange for protanopia
- id: custom.draw_style
value: "points" # Different shape
```
### Consistent Color Semantics
- **Success/Health**: Green (#28a745)
- **Warning/Degraded**: Yellow (#ffc107)
- **Error/Critical**: Red (#dc3545)
- **Information**: Blue (#007bff)
- **Neutral**: Gray (#6c757d)
## Time Range Strategy
### Default Time Ranges by Dashboard Type
#### Real-time Operational
- **Default**: Last 15 minutes
- **Quick options**: 5m, 15m, 1h, 4h
- **Auto-refresh**: 15-30 seconds
#### Troubleshooting
- **Default**: Last 1 hour
- **Quick options**: 15m, 1h, 4h, 12h, 1d
- **Auto-refresh**: 1 minute
#### Business Review
- **Default**: Last 24 hours
- **Quick options**: 1d, 7d, 30d, 90d
- **Auto-refresh**: 5 minutes
#### Capacity Planning
- **Default**: Last 7 days
- **Quick options**: 7d, 30d, 90d, 1y
- **Auto-refresh**: 15 minutes
### Time Range Annotations
```yaml
# Add context for time-based events
annotations:
- name: "Deployments"
datasource: "Prometheus"
expr: "deployment_timestamp"
title_format: "Deploy {{ version }}"
text_format: "Deployed version {{ version }} to {{ environment }}"
- name: "Incidents"
datasource: "Incident API"
query: "incidents.json?service={{ service }}"
color: "red"
```
## Interactive Features
### Template Variables
```yaml
# Service selector
- name: service
type: query
query: label_values(up, service)
current:
text: All
value: $__all
include_all: true
multi: true
# Environment selector
- name: environment
type: query
query: label_values(up{service="$service"}, environment)
current:
text: production
value: production
```
### Drill-Down Links
```yaml
# Panel-level drill-downs
- title: "Error Rate"
type: timeseries
# ... other config ...
options:
data_links:
- title: "View Error Logs"
url: "/d/logs-dashboard?var-service=${__field.labels.service}&from=${__from}&to=${__to}"
- title: "Error Traces"
url: "/d/traces-dashboard?var-service=${__field.labels.service}"
```
### Dynamic Panel Titles
```yaml
- title: "${service} - Request Rate" # Uses template variable
type: timeseries
# Title updates automatically when service variable changes
```
## Performance Optimization
### Query Optimization
#### Use Recording Rules
```yaml
# Instead of complex queries in dashboards
groups:
- name: http_requests
rules:
- record: http_request_rate_5m
expr: sum(rate(http_requests_total[5m])) by (service, method, handler)
- record: http_request_latency_p95_5m
expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le))
```
#### Limit Data Points
```yaml
# Good: Reasonable resolution for dashboard
- expr: http_request_rate_5m[1h]
interval: 15s # One point every 15 seconds
# Bad: Too many points for visualization
- expr: http_request_rate_1s[1h] # 3600 points!
```
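The point count is just the time range divided by the query step, which makes the budget easy to sanity-check; a trivial helper:

```python
# Data points per series = time range / query step.
# Keep this in the low hundreds for responsive dashboards.

def points_per_series(range_seconds, step_seconds):
    return range_seconds // step_seconds

print(points_per_series(3600, 15))  # 240 -- fine for a panel
print(points_per_series(3600, 1))   # 3600 -- far more than a panel can render
```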
### Dashboard Performance
#### Panel Limits
- **Maximum panels per dashboard**: 20-30
- **Maximum queries per panel**: 10
- **Maximum time series per panel**: 50
#### Caching Strategy
```yaml
# Use appropriate cache headers
cache_timeout: 30 # Cache for 30 seconds on fast-changing panels
cache_timeout: 300 # Cache for 5 minutes on slow-changing panels
```
## Accessibility
### Screen Reader Support
```yaml
# Provide text alternatives for visual elements
- title: "Service Health Status"
type: stat
options:
text_mode: value_and_name # Includes both value and description
field_config:
mappings:
- options:
"1":
text: "Healthy"
color: "green"
"0":
text: "Unhealthy"
color: "red"
```
### Keyboard Navigation
- Ensure all interactive elements are keyboard accessible
- Provide logical tab order
- Include skip links for complex dashboards
### High Contrast Mode
```yaml
# Test dashboards work in high contrast mode
theme: high_contrast
colors:
- "#000000" # Pure black
- "#ffffff" # Pure white
- "#ffff00" # Pure yellow
- "#ff0000" # Pure red
```
## Testing and Validation
### Dashboard Testing Checklist
#### Functional Testing
- [ ] All panels load without errors
- [ ] Template variables filter correctly
- [ ] Time range changes update all panels
- [ ] Drill-down links work as expected
- [ ] Auto-refresh functions properly
#### Visual Testing
- [ ] Dashboard renders correctly on different screen sizes
- [ ] Colors are distinguishable and meaningful
- [ ] Text is readable at normal zoom levels
- [ ] Legends and labels are clear
#### Performance Testing
- [ ] Dashboard loads in < 5 seconds
- [ ] No queries timeout under normal load
- [ ] Auto-refresh doesn't cause browser lag
- [ ] Memory usage remains reasonable
#### Usability Testing
- [ ] New team members can understand the dashboard
- [ ] Action items are clear during incidents
- [ ] Key information is quickly discoverable
- [ ] Dashboard supports common troubleshooting workflows
## Maintenance and Governance
### Dashboard Lifecycle
#### Creation
1. Define dashboard purpose and audience
2. Identify key metrics and success criteria
3. Design layout following established patterns
4. Implement with consistent styling
5. Test with real data and user scenarios
#### Maintenance
- **Weekly**: Check for broken panels or queries
- **Monthly**: Review dashboard usage analytics
- **Quarterly**: Gather user feedback and iterate
- **Annually**: Major review and potential redesign
#### Retirement
- Archive dashboards that are no longer used
- Migrate users to replacement dashboards
- Document lessons learned
### Dashboard Standards
```yaml
# Organization dashboard standards
standards:
naming_convention: "[Team] [Service] - [Purpose]"
tags: [team, service_type, environment, purpose]
refresh_intervals: [15s, 30s, 1m, 5m, 15m]
time_ranges: [5m, 15m, 1h, 4h, 1d, 7d, 30d]
color_scheme: "company_standard"
max_panels_per_dashboard: 25
```
## Advanced Patterns
### Composite Dashboards
```yaml
# Dashboard that includes panels from other dashboards
- title: "Service Overview"
type: dashlist
targets:
- "service-health"
- "service-performance"
- "service-business-metrics"
options:
show_headings: true
max_items: 10
```
### Dynamic Dashboard Generation
```python
# Generate dashboards from service definitions
def generate_service_dashboard(service_config):
panels = []
# Always include golden signals
panels.extend(generate_golden_signals_panels(service_config))
# Add service-specific panels
if service_config.type == 'database':
panels.extend(generate_database_panels(service_config))
elif service_config.type == 'queue':
panels.extend(generate_queue_panels(service_config))
return {
'title': f"{service_config.name} - Operational Dashboard",
'panels': panels,
'variables': generate_variables(service_config)
}
```
### A/B Testing for Dashboards
```yaml
# Test different dashboard designs with different teams
experiment:
name: "dashboard_layout_test"
variants:
- name: "traditional_layout"
weight: 50
config: "dashboard_v1.json"
- name: "f_pattern_layout"
weight: 50
config: "dashboard_v2.json"
success_metrics:
- "time_to_insight"
- "user_satisfaction"
- "troubleshooting_efficiency"
```
Remember: A dashboard should tell a story about your system's health and guide users toward the right actions. Focus on clarity over complexity, and always optimize for the person who will use it during a stressful incident.

# SLO Cookbook: A Practical Guide to Service Level Objectives
## Introduction
Service Level Objectives (SLOs) are a key tool for managing service reliability. This cookbook provides practical guidance for implementing SLOs that actually improve system reliability rather than just creating meaningless metrics.
## Fundamentals
### The SLI/SLO/SLA Hierarchy
- **SLI (Service Level Indicator)**: A quantifiable measure of service quality
- **SLO (Service Level Objective)**: A target range of values for an SLI
- **SLA (Service Level Agreement)**: A business agreement with consequences for missing SLO targets
### Golden Rule of SLOs
**Start simple, iterate based on learning.** Your first SLOs won't be perfect, and that's okay.
## Choosing Good SLIs
### The Four Golden Signals
1. **Latency**: How long requests take to complete
2. **Traffic**: How many requests are coming in
3. **Errors**: How many requests are failing
4. **Saturation**: How "full" your service is
### SLI Selection Criteria
A good SLI should be:
- **Measurable**: You can collect data for it
- **Meaningful**: It reflects user experience
- **Controllable**: You can take action to improve it
- **Proportional**: Changes in the SLI reflect changes in user happiness
### Service Type Specific SLIs
#### HTTP APIs
- **Request latency**: P95 or P99 response time
- **Availability**: Proportion of successful requests (non-5xx)
- **Throughput**: Requests per second capacity
```prometheus
# Availability SLI
sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Latency SLI
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```
#### Batch Jobs
- **Freshness**: Age of the last successful run
- **Correctness**: Proportion of jobs completing successfully
- **Throughput**: Items processed per unit time
#### Data Pipelines
- **Data freshness**: Time since last successful update
- **Data quality**: Proportion of records passing validation
- **Processing latency**: Time from ingestion to availability
### Anti-Patterns in SLI Selection
**Don't use**: CPU usage, memory usage, disk space as primary SLIs
- These are symptoms, not user-facing impacts
**Don't use**: Counts instead of rates or proportions
- "Number of errors" vs "Error rate"
**Don't use**: Internal metrics that users don't care about
- Queue depth, cache hit rate (unless they directly impact user experience)
## Setting SLO Targets
### The Art of Target Setting
Setting SLO targets is a balancing act between:
- **User happiness**: Targets should reflect acceptable user experience
- **Business value**: Tighter SLOs cost more to maintain
- **Current performance**: Targets should be achievable but aspirational
### Target Setting Strategies
#### Historical Performance Method
1. Collect 4-6 weeks of historical data
2. Identify the worst user-visible performance in that period that users still tolerated
3. Set the SLO target just tighter than that level, so it is achievable today but still protects users
#### User Journey Mapping
1. Map critical user journeys
2. Identify acceptable performance for each step
3. Work backwards to component SLOs
#### Error Budget Approach
1. Decide how much unreliability you can afford
2. Set SLO targets based on acceptable error budget consumption
3. Example: 99.9% availability = 43.2 minutes of downtime over a 30-day month
### SLO Target Examples by Service Criticality
#### Critical Services (Revenue Impact)
- **Availability**: 99.95% - 99.99%
- **Latency (P95)**: 100-200ms
- **Error Rate**: < 0.1%
#### High Priority Services
- **Availability**: 99.9% - 99.95%
- **Latency (P95)**: 200-500ms
- **Error Rate**: < 0.5%
#### Standard Services
- **Availability**: 99.5% - 99.9%
- **Latency (P95)**: 500ms - 1s
- **Error Rate**: < 1%
## Error Budget Management
### What is an Error Budget?
Your error budget is the maximum amount of unreliability you can accumulate while still meeting your SLO. It's calculated as:
```
Error Budget = (1 - SLO) × Time Window
```
For a 99.9% availability SLO over 30 days:
```
Error Budget = (1 - 0.999) × 30 days = 0.001 × 30 days = 43.2 minutes
```
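That arithmetic generalizes to any target; a small helper for comparing budgets across candidate SLOs, assuming a 30-day window:

```python
# Error budget in minutes for a given availability SLO over a window.

def error_budget_minutes(slo, window_days=30):
    return (1 - slo) * window_days * 24 * 60

for slo in (0.999, 0.9995, 0.9999):
    print(slo, round(error_budget_minutes(slo), 1))
# 0.999  -> 43.2 minutes
# 0.9995 -> 21.6 minutes
# 0.9999 -> 4.3 minutes
```

The steep drop-off is why each extra "nine" costs so much more to defend.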
### Error Budget Policies
Define what happens when you consume your error budget:
#### Conservative Policy (High-Risk Services)
- **> 50% consumed**: Freeze non-critical feature releases
- **> 75% consumed**: Focus entirely on reliability improvements
- **> 90% consumed**: Consider emergency measures (traffic shaping, etc.)
#### Balanced Policy (Standard Services)
- **> 75% consumed**: Increase focus on reliability work
- **> 90% consumed**: Pause feature work, focus on reliability
#### Aggressive Policy (Early Stage Services)
- **> 90% consumed**: Review but continue normal operations
- **100% consumed**: Evaluate SLO appropriateness
### Burn Rate Alerting
Multi-window burn rate alerts help you catch SLO violations before they become critical:
```yaml
# Fast burn: 2% budget consumed in 1 hour
- alert: FastBurnSLOViolation
expr: (
(1 - (sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m])))) > (14.4 * 0.001)
and
(1 - (sum(rate(http_requests_total{code!~"5.."}[1h])) / sum(rate(http_requests_total[1h])))) > (14.4 * 0.001)
)
for: 2m
# Slow burn: 10% budget consumed in 3 days
- alert: SlowBurnSLOViolation
expr: (
(1 - (sum(rate(http_requests_total{code!~"5.."}[6h])) / sum(rate(http_requests_total[6h])))) > (1.0 * 0.001)
and
(1 - (sum(rate(http_requests_total{code!~"5.."}[3d])) / sum(rate(http_requests_total[3d])))) > (1.0 * 0.001)
)
for: 15m
```
## Implementation Patterns
### The SLO Implementation Ladder
#### Level 1: Basic SLOs
- Choose 1-2 SLIs that matter most to users
- Set aspirational but achievable targets
- Implement basic alerting when SLOs are missed
#### Level 2: Operational SLOs
- Add burn rate alerting
- Create error budget dashboards
- Establish error budget policies
- Regular SLO review meetings
#### Level 3: Advanced SLOs
- Multi-window burn rate alerts
- Automated error budget policy enforcement
- SLO-driven incident prioritization
- Integration with CI/CD for deployment decisions
### SLO Measurement Architecture
#### Push vs Pull Metrics
- **Pull** (Prometheus): Good for infrastructure metrics, real-time alerting
- **Push** (StatsD): Good for application metrics, business events
#### Measurement Points
- **Server-side**: More reliable, easier to implement
- **Client-side**: Better reflects user experience
- **Synthetic**: Consistent, predictable, may not reflect real user experience
### SLO Dashboard Design
Essential elements for SLO dashboards:
1. **Current SLO Achievement**: Large, prominent display
2. **Error Budget Remaining**: Visual indicator (gauge, progress bar)
3. **Burn Rate**: Time series showing error budget consumption rate
4. **Historical Trends**: 4-week view of SLO achievement
5. **Alerts**: Current and recent SLO-related alerts
## Advanced Topics
### Dependency SLOs
For services with dependencies:
```
SLO_service ≤ min(SLO_inherent, ∏SLO_dependencies)
```
If your service depends on 3 other services each with 99.9% SLO:
```
Maximum_SLO = 0.999³ = 0.997 = 99.7%
```
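The dependency ceiling is worth computing whenever a new SLO is proposed; a small sketch:

```python
# Availability ceiling for a service with serial dependencies: the product
# of dependency SLOs bounds what the service itself can credibly promise.
import math

def max_slo(dependency_slos, inherent_slo=1.0):
    return min(inherent_slo, math.prod(dependency_slos))

# Three dependencies at 99.9% each cap the composite at ~99.7%:
print(round(max_slo([0.999, 0.999, 0.999]), 4))  # 0.997
```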
### User Journey SLOs
Track end-to-end user experiences:
```prometheus
# Registration success rate
sum(rate(user_registration_success_total[5m])) / sum(rate(user_registration_attempts_total[5m]))
# Purchase completion latency
histogram_quantile(0.95, rate(purchase_completion_duration_seconds_bucket[5m]))
```
### SLOs for Batch Systems
Special considerations for non-request/response systems:
#### Freshness SLO
```prometheus
# Data should be no more than 4 hours old
(time() - last_successful_update_timestamp) < (4 * 3600)
```
#### Throughput SLO
```prometheus
# Should process at least 1000 items per hour
rate(items_processed_total[1h]) >= 1000
```
#### Quality SLO
```prometheus
# At least 99.5% of records should pass validation
sum(rate(records_valid_total[5m])) / sum(rate(records_processed_total[5m])) >= 0.995
```
## Common Mistakes and How to Avoid Them
### Mistake 1: Too Many SLOs
**Problem**: Drowning in metrics, losing focus
**Solution**: Start with 1-2 SLOs per service, add more only when needed
### Mistake 2: Internal Metrics as SLIs
**Problem**: Optimizing for metrics that don't impact users
**Solution**: Always ask "If this metric changes, do users notice?"
### Mistake 3: Perfectionist SLOs
**Problem**: 99.99% SLO when 99.9% would be fine
**Solution**: Higher SLOs cost exponentially more; pick the minimum acceptable level
### Mistake 4: Ignoring Error Budgets
**Problem**: Treating any SLO miss as an emergency
**Solution**: Error budgets exist to be spent; use them to balance feature velocity and reliability
### Mistake 5: Static SLOs
**Problem**: Setting SLOs once and never updating them
**Solution**: Review SLOs quarterly; adjust based on user feedback and business changes
## SLO Review Process
### Monthly SLO Review Agenda
1. **SLO Achievement Review**: Did we meet our SLOs?
2. **Error Budget Analysis**: How did we spend our error budget?
3. **Incident Correlation**: Which incidents impacted our SLOs?
4. **SLI Quality Assessment**: Are our SLIs still meaningful?
5. **Target Adjustment**: Should we change any targets?
### Quarterly SLO Health Check
1. **User Impact Validation**: Survey users about acceptable performance
2. **Business Alignment**: Do SLOs still reflect business priorities?
3. **Measurement Quality**: Are we measuring the right things?
4. **Cost/Benefit Analysis**: Are tighter SLOs worth the investment?
## Tooling and Automation
### Essential Tools
1. **Metrics Collection**: Prometheus, InfluxDB, CloudWatch
2. **Alerting**: Alertmanager, PagerDuty, OpsGenie
3. **Dashboards**: Grafana, DataDog, New Relic
4. **SLO Platforms**: Sloth, Pyrra, OpenSLO
### Automation Opportunities
- **Burn rate alert generation** from SLO definitions
- **Dashboard creation** from SLO specifications
- **Error budget calculation** and tracking
- **Release blocking** based on error budget consumption
## Getting Started Checklist
- [ ] Identify your service's critical user journeys
- [ ] Choose 1-2 SLIs that best reflect user experience
- [ ] Collect 4-6 weeks of baseline data
- [ ] Set initial SLO targets based on historical performance
- [ ] Implement basic SLO monitoring and alerting
- [ ] Create an SLO dashboard
- [ ] Define error budget policies
- [ ] Schedule monthly SLO reviews
- [ ] Plan for quarterly SLO health checks
Remember: SLOs are a journey, not a destination. Start simple, learn from experience, and iterate toward better reliability management.