# Alert Design Patterns: A Guide to Effective Alerting

## Introduction

Well-designed alerts are the difference between a reliable system and 3 AM pages about non-issues. This guide provides patterns and anti-patterns for creating alerts that provide value without causing fatigue.

## Fundamental Principles

### The Golden Rules of Alerting

1. **Every alert should be actionable** - If you can't do something about it, don't alert
2. **Every alert should require human intelligence** - If a script can handle it, automate the response
3. **Every alert should be novel** - Don't alert on known, ongoing issues
4. **Every alert should represent a user-visible impact** - Internal metrics matter only if users are affected

### Alert Classification

#### Critical Alerts

- Service is completely down
- Data loss is occurring
- Security breach detected
- SLO burn rate indicates imminent SLO violation

#### Warning Alerts

- Service degradation affecting some users
- Approaching resource limits
- Dependent service issues
- Elevated error rates within SLO

#### Info Alerts

- Deployment notifications
- Capacity planning triggers
- Configuration changes
- Maintenance windows

## Alert Design Patterns

### Pattern 1: Symptoms, Not Causes

**Good**: Alert on user-visible symptoms

```yaml
- alert: HighLatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
  for: 5m
  annotations:
    summary: "API latency is high"
    description: "95th percentile latency is {{ $value }}s, above 500ms threshold"
```

**Bad**: Alert on internal metrics that may not affect users

```yaml
- alert: HighCPU
  expr: cpu_usage > 80  # This might not affect users at all!
```

### Pattern 2: Multi-Window Alerting

Reduce false positives by requiring sustained problems:

```yaml
- alert: ServiceDown
  expr: |
    (
      avg_over_time(up[2m]) == 0      # Short window: immediate detection
      and
      avg_over_time(up[10m]) < 0.8    # Long window: avoid flapping
    )
  for: 1m
```

### Pattern 3: Burn Rate Alerting

Alert based on error budget consumption rate:

```yaml
# Fast burn: 2% of monthly budget in 1 hour
- alert: ErrorBudgetFastBurn
  expr: |
    (
      error_rate_5m > (14.4 * error_budget_slo)
      and
      error_rate_1h > (14.4 * error_budget_slo)
    )
  for: 2m
  labels:
    severity: critical

# Slow burn: 10% of monthly budget in 3 days
- alert: ErrorBudgetSlowBurn
  expr: |
    (
      error_rate_6h > (1.0 * error_budget_slo)
      and
      error_rate_3d > (1.0 * error_budget_slo)
    )
  for: 15m
  labels:
    severity: warning
```

### Pattern 4: Hysteresis

Use different thresholds for firing and resolving to prevent flapping:

```yaml
- alert: HighErrorRate
  # Prometheus has no built-in hysteresis, so the rule consults its own ALERTS series:
  # fire above 5%, then stay active until error_rate drops below 3%.
  # (If the rule sets extra labels, add them to the ignoring list.)
  expr: |
    error_rate > 0.05
    or (
      error_rate > 0.03
      and ignoring (alertname, alertstate)
      ALERTS{alertname="HighErrorRate", alertstate="firing"}
    )
  for: 5m
```

### Pattern 5: Composite Alerts

Alert when multiple conditions indicate a problem:

```yaml
- alert: ServiceDegraded
  expr: |
    (
      (latency_p95 > latency_threshold)
      or
      (error_rate > error_threshold)
      or
      (availability < availability_threshold)
    ) and (
      request_rate > min_request_rate  # Only alert if we have traffic
    )
```

### Pattern 6: Contextual Alerting

Include relevant context in alerts:

```yaml
- alert: DatabaseConnections
  expr: db_connections_active / db_connections_max > 0.8
  for: 5m
  annotations:
    summary: "Database connection pool nearly exhausted"
    description: "{{ $labels.database }} has {{ $value | humanizePercentage }} connection utilization"
    runbook_url: "https://runbooks.company.com/database-connections"
    impact: "New requests may be rejected, causing 500 errors"
    suggested_action: "Check for connection leaks or increase pool size"
```

## Alert Routing and Escalation

### Routing by Impact and Urgency

#### Critical Path Services

```yaml
route:
  group_by: ['service']
  routes:
    - match:
        service: 'payment-api'
        severity: 'critical'
      receiver: 'payment-team-pager'
      continue: true
    - match:
        service: 'payment-api'
        severity: 'warning'
      receiver: 'payment-team-slack'
```

#### Time-Based Routing

```yaml
# 'business_hours' and 'after_hours' must be defined under the top-level time_intervals key
route:
  routes:
    - match:
        severity: 'critical'
      receiver: 'oncall-pager'
    - match:
        severity: 'warning'
      active_time_intervals: ['business_hours']  # 9 AM - 5 PM
      receiver: 'team-slack'
      continue: true  # An inactive route still matches, so keep evaluating the next route
    - match:
        severity: 'warning'
      active_time_intervals: ['after_hours']
      receiver: 'team-email'  # Lower urgency outside business hours
```

### Escalation Patterns

#### Linear Escalation

```yaml
receivers:
  - name: 'primary-oncall'
    pagerduty_configs:
      # Alertmanager only needs the integration key (placeholder below); the
      # 'P1-Escalation' policy is attached to the service inside PagerDuty.
      - routing_key: '<pagerduty-integration-key>'
      # 0 min: Primary on-call
      # 5 min: Secondary on-call
      # 15 min: Engineering manager
      # 30 min: Director of engineering
```

#### Severity-Based Escalation

```yaml
# Critical: Immediate escalation
- match:
    severity: 'critical'
  receiver: 'critical-escalation'

# Warning: Team-first escalation
- match:
    severity: 'warning'
  receiver: 'team-escalation'
```

## Alert Fatigue Prevention

### Grouping and Suppression

#### Time-Based Grouping

```yaml
route:
  group_wait: 30s       # Wait 30s to group similar alerts
  group_interval: 2m    # Wait 2m before notifying about new alerts added to an existing group
  repeat_interval: 1h   # Re-send unresolved alerts every hour
```

#### Dependent Service Suppression

```yaml
# Prometheus alerting rules
- alert: ServiceDown
  expr: up == 0
- alert: HighLatency
  expr: latency_p95 > 1
  # This alert is suppressed when ServiceDown is firing

# Alertmanager configuration
inhibit_rules:
  - source_match:
      alertname: 'ServiceDown'
    target_match:
      alertname: 'HighLatency'
    equal: ['service']
```

### Alert Throttling

```yaml
# Require the condition to hold for 10 minutes before firing, which filters out short spikes
- alert: HighMemoryUsage
  expr: memory_usage_percent > 85
  for: 10m  # Longer 'for' duration reduces noise
  annotations:
    summary: "Memory usage has been high for 10+ minutes"
```

### Smart Defaults

```yaml
# Use business logic to set intelligent thresholds
- alert: LowTraffic
  expr: |
    request_rate < (
      avg_over_time(request_rate[7d]) * 0.1  # 10% of weekly average
    )
  # Only alert during business hours, when low traffic is unusual
  # (this expression alone does not enforce that; restrict it via routing time intervals)
  for: 30m
```

## Runbook Integration

### Runbook Structure Template

```markdown
# Alert: {{ $labels.alertname }}

## Immediate Actions
1. Check service status dashboard
2. Verify if users are affected
3. Look at recent deployments/changes

## Investigation Steps
1. Check logs for errors in the last 30 minutes
2. Verify dependent services are healthy
3. Check resource utilization (CPU, memory, disk)
4. Review recent alerts for patterns

## Resolution Actions
- If deployment-related: Consider rollback
- If resource-related: Scale up or optimize queries
- If dependency-related: Engage appropriate team

## Escalation
- Primary: @team-oncall
- Secondary: @engineering-manager
- Emergency: @site-reliability-team
```

### Runbook Integration in Alerts

```yaml
annotations:
  runbook_url: "https://runbooks.company.com/alerts/{{ $labels.alertname }}"
  quick_debug: |
    1. curl -s https://{{ $labels.instance }}/health
    2. kubectl logs {{ $labels.pod }} --tail=50
    3. Check dashboard: https://grafana.company.com/d/service-{{ $labels.service }}
```

## Testing and Validation

### Alert Testing Strategies

#### Chaos Engineering Integration

```python
# Test that alerts fire during controlled failures.
# `chaos` and `wait_for_alert` are placeholders for your fault-injection
# tooling and an alert-polling helper; they are not a specific library.
def test_alert_during_cpu_spike():
    with chaos.cpu_spike(target='payment-api', duration='2m'):
        assert wait_for_alert('HighCPU', timeout=180)

def test_alert_during_network_partition():
    with chaos.network_partition(target='database'):
        assert wait_for_alert('DatabaseUnreachable', timeout=60)
```

#### Historical Alert Analysis

```prometheus
# Query to find alerts that fired without incidents
count by (alertname) (
  count_over_time(ALERTS{alertstate="firing"}[30d])
)
unless on (alertname) (
  count by (alertname) (
    count_over_time(incident_created{source="alert"}[30d])
  )
)
```

### Alert Quality Metrics

#### Alert Precision

```
Precision = True Positives / (True Positives + False Positives)
```

Track alerts that resulted in actual incidents vs false alarms.

#### Time to Resolution

```prometheus
# Average time from alert firing to resolution
avg by (alertname) (
  avg_over_time(
    (alert_resolved_timestamp - alert_fired_timestamp)[30d:]
  )
)
```

#### Alert Fatigue Indicators

```prometheus
# Alerts per day by team
sum by (team) (
  increase(alerts_fired_total[1d])
)

# Percentage of alerts acknowledged within 15 minutes
sum(alerts_acked_within_15m) / sum(alerts_fired) * 100
```

## Advanced Patterns

### Machine Learning-Enhanced Alerting

#### Anomaly Detection

```yaml
- alert: AnomalousTraffic
  expr: |
    abs(request_rate - predict_linear(request_rate[1h], 300)) /
    stddev_over_time(request_rate[1h]) > 3
  for: 10m
  annotations:
    summary: "Traffic pattern is anomalous"
    description: "Current traffic deviates from predicted pattern by >3 standard deviations"
```

#### Dynamic Thresholds

```yaml
- alert: DynamicHighLatency
  expr: |
    latency_p95 > (
      quantile_over_time(0.95, latency_p95[7d]) +  # Historical 95th percentile
      2 * stddev_over_time(latency_p95[7d])        # Plus 2 standard deviations
    )
```

### Business Hours Awareness

```yaml
# Different thresholds for business vs off hours (hour() and day_of_week() are in UTC)
- alert: HighLatencyBusinessHours
  expr: |
    latency_p95 > 0.2                      # Stricter during business hours
    and on () (hour() >= 9 < 17)           # Active 9 AM - 5 PM
    and on () (day_of_week() >= 1 <= 5)    # Weekdays only
  for: 2m

- alert: HighLatencyOffHours
  expr: |
    latency_p95 > 0.5                      # More lenient after hours
    unless on () ((hour() >= 9 < 17) and (day_of_week() >= 1 <= 5))  # Active nights and weekends
  for: 5m
```

### Progressive Alerting

```yaml
# Escalating alert severity based on duration
- alert: ServiceLatencyElevated
  expr: latency_p95 > 0.5
  for: 5m
  labels:
    severity: info

- alert: ServiceLatencyHigh
  expr: latency_p95 > 0.5
  for: 15m  # Same condition, longer duration
  labels:
    severity: warning

- alert: ServiceLatencyCritical
  expr: latency_p95 > 0.5
  for: 30m  # Same condition, even longer duration
  labels:
    severity: critical
```

## Anti-Patterns to Avoid

### Anti-Pattern 1: Alerting on Everything

**Problem**: Too many alerts create noise and fatigue
**Solution**: Be selective; only alert on user-impacting issues

### Anti-Pattern 2: Vague Alert Messages

**Problem**: "Service X is down" - which instance? what's the impact?
**Solution**: Include specific details and context

### Anti-Pattern 3: Alerts Without Runbooks

**Problem**: Alerts that don't explain what to do
**Solution**: Every alert must have an associated runbook

### Anti-Pattern 4: Static Thresholds

**Problem**: 80% CPU might be normal during peak hours
**Solution**: Use contextual, adaptive thresholds

### Anti-Pattern 5: Ignoring Alert Quality

**Problem**: Accepting high false positive rates
**Solution**: Regularly review and tune alert precision

## Implementation Checklist

### Pre-Implementation

- [ ] Define alert severity levels and escalation policies
- [ ] Create runbook templates
- [ ] Set up alert routing configuration
- [ ] Define SLOs that alerts will protect

### Alert Development

- [ ] Each alert has clear success criteria
- [ ] Alert conditions tested against historical data
- [ ] Runbook created and accessible
- [ ] Severity and routing configured
- [ ] Context and suggested actions included

### Post-Implementation

- [ ] Monitor alert precision and recall
- [ ] Regular review of alert fatigue metrics
- [ ] Quarterly alert effectiveness review
- [ ] Team training on alert response procedures

### Quality Assurance

- [ ] Test alerts fire during controlled failures
- [ ] Verify alerts resolve when conditions improve
- [ ] Confirm runbooks are accurate and helpful
- [ ] Validate escalation paths work correctly

Remember: Great alerts are invisible when things work and invaluable when things break. Focus on quality over quantity, and always optimize for the human who will respond to the alert at 3 AM.
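
As a worked supplement to Pattern 3, the burn-rate multipliers used there (14.4 and 1.0) can be derived from the fraction of error budget consumed and the length of the alert window. The sketch below is illustrative only: it assumes "monthly" means a 30-day SLO period, the function names `burn_rate_multiplier` and `error_rate_threshold` are hypothetical helpers rather than part of any library, and the 99.9% target is just an example value.

```python
# Assumption: the SLO period is 30 days ("monthly budget" in Pattern 3).
SLO_PERIOD_HOURS = 30 * 24


def burn_rate_multiplier(budget_fraction: float, window_hours: float,
                         slo_period_hours: float = SLO_PERIOD_HOURS) -> float:
    """Return the burn rate that consumes `budget_fraction` of the error
    budget within `window_hours` (1.0 means the budget lasts exactly one period)."""
    if not 0 < budget_fraction <= 1:
        raise ValueError("budget_fraction must be in (0, 1]")
    if window_hours <= 0 or slo_period_hours <= 0:
        raise ValueError("window_hours and slo_period_hours must be positive")
    return budget_fraction * slo_period_hours / window_hours


def error_rate_threshold(multiplier: float, slo_target: float) -> float:
    """Return the error-rate threshold: multiplier times the allowed error rate."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be in (0, 1)")
    return multiplier * (1 - slo_target)


if __name__ == "__main__":
    fast = burn_rate_multiplier(0.02, 1)        # 2% of budget in 1 hour
    slow = burn_rate_multiplier(0.10, 3 * 24)   # 10% of budget in 3 days
    print(f"fast burn multiplier: {fast:.1f}")
    print(f"slow burn multiplier: {slow:.1f}")
    # Example SLO target of 99.9% (an assumed value, not from the guide)
    print(f"fast burn error-rate threshold: {error_rate_threshold(fast, 0.999):.4f}")
```

Deriving the multipliers this way makes it easier to adapt the fast-burn and slow-burn rules to a different SLO period or budget fraction without copying constants blindly.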