Alert Design Patterns: A Guide to Effective Alerting
Introduction
Well-designed alerts are the difference between a reliable system and 3 AM pages about non-issues. This guide collects patterns and anti-patterns for building alerts that deliver value without causing fatigue.
Fundamental Principles
The Golden Rules of Alerting
- Every alert should be actionable - If you can't do something about it, don't alert
- Every alert should require human intelligence - If a script can handle it, automate the response
- Every alert should be novel - Don't alert on known, ongoing issues
- Every alert should represent a user-visible impact - Internal metrics matter only if users are affected
Alert Classification
Critical Alerts
- Service is completely down
- Data loss is occurring
- Security breach detected
- SLO burn rate indicates imminent SLO violation
Warning Alerts
- Service degradation affecting some users
- Approaching resource limits
- Dependent service issues
- Elevated error rates within SLO
Info Alerts
- Deployment notifications
- Capacity planning triggers
- Configuration changes
- Maintenance windows
Alert Design Patterns
Pattern 1: Symptoms, Not Causes
Good: Alert on user-visible symptoms
- alert: HighLatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
  for: 5m
  annotations:
    summary: "API latency is high"
    description: "95th percentile latency is {{ $value }}s, above 500ms threshold"
Bad: Alert on internal metrics that may not affect users
- alert: HighCPU
  expr: cpu_usage > 80
  # This might not affect users at all!
Pattern 2: Multi-Window Alerting
Reduce false positives by requiring sustained problems:
- alert: ServiceDown
  # Block scalar (|) so the inline PromQL comments are not parsed as YAML comments
  expr: |
    (
      avg_over_time(up[2m]) == 0     # Short window: immediate detection
      and
      avg_over_time(up[10m]) < 0.8   # Long window: avoid flapping
    )
  for: 1m
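To make the two-window logic concrete, here is a minimal Python sketch of the same condition applied to a list of scrape results. The function name, the sample layout (1 = up, 0 = down, oldest first) and the window sizes are assumptions for illustration, not part of any Prometheus API.

```python
def service_down(up_samples, short_n, long_n, long_threshold=0.8):
    """Return True when the short window is entirely down AND the long
    window's average availability is below `long_threshold`."""
    if not up_samples:
        return False
    short = up_samples[-short_n:]   # most recent samples
    long = up_samples[-long_n:]     # wider window for context
    return sum(short) == 0 and sum(long) / len(long) < long_threshold


# Four failed scrapes after six good ones: both windows agree
print(service_down([1, 1, 1, 1, 1, 1, 0, 0, 0, 0], short_n=2, long_n=10))
```

A single failed scrape leaves the short window non-zero, and a brief outage after a long healthy stretch leaves the long window at or above the threshold, so neither condition fires on its own.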
Pattern 3: Burn Rate Alerting
Alert based on error budget consumption rate:
# Fast burn: 2% of monthly budget in 1 hour
- alert: ErrorBudgetFastBurn
  expr: |
    (
      error_rate_5m > (14.4 * error_budget_slo)
      and
      error_rate_1h > (14.4 * error_budget_slo)
    )
  for: 2m
  labels:
    severity: critical
# Slow burn: 10% of monthly budget in 3 days
- alert: ErrorBudgetSlowBurn
  expr: |
    (
      error_rate_6h > (1.0 * error_budget_slo)
      and
      error_rate_3d > (1.0 * error_budget_slo)
    )
  for: 15m
  labels:
    severity: warning
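The 14.4 and 1.0 multipliers are not arbitrary: each is the fraction of budget consumed, scaled by the ratio of the SLO period to the alert window. The sketch below reproduces that arithmetic, assuming a 30-day SLO period; the function name is hypothetical.

```python
def burn_rate(budget_fraction, window_hours, slo_period_hours=30 * 24):
    """Burn rate that consumes `budget_fraction` of the error budget
    within `window_hours`, for an SLO measured over `slo_period_hours`."""
    return budget_fraction * slo_period_hours / window_hours


fast = burn_rate(0.02, 1)        # 2% of the budget in 1 hour
slow = burn_rate(0.10, 3 * 24)   # 10% of the budget in 3 days
print(round(fast, 2), round(slow, 2))
```

For a 99.9% SLO the error budget is 0.001, so the fast-burn rule would fire at an error rate above roughly 14.4 * 0.001 = 1.44%.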
Pattern 4: Hysteresis
Use different thresholds for firing and resolving to prevent flapping:
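- alert: HighErrorRate
  expr: |
    error_rate > 0.05    # Fire at 5%
    or (error_rate > 0.03 and ignoring(alertname, alertstate) ALERTS{alertname="HighErrorRate", alertstate="firing"})
  for: 5m
  # A rule resolves as soon as its expr stops matching, so the 3% resolve
  # threshold must be part of the expr: once firing, stay firing above 3%.
  # This prevents flapping around the 5% threshold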
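Hysteresis is easiest to see as a two-threshold state machine. The following sketch is illustrative only (the function name and sample values are made up, and it ignores the `for` duration): it turns on above the fire threshold and turns off only below the lower resolve threshold.

```python
def hysteresis(values, fire_above=0.05, resolve_below=0.03):
    """Return the firing state after each sample, using separate
    thresholds for entering and leaving the firing state."""
    firing = False
    states = []
    for v in values:
        if not firing and v > fire_above:
            firing = True            # crossed the upper threshold
        elif firing and v < resolve_below:
            firing = False           # dropped below the lower threshold
        states.append(firing)
    return states


samples = [0.02, 0.06, 0.04, 0.051, 0.045, 0.02]
print(hysteresis(samples))
```

With a single 5% threshold the same samples would flip state four times; with the 3% resolve threshold the alert fires once and resolves once.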
Pattern 5: Composite Alerts
Combine related conditions into one alert: fire when any symptom is present, but only while the service is receiving traffic:
- alert: ServiceDegraded
  expr: |
    (
      (latency_p95 > latency_threshold)
      or
      (error_rate > error_threshold)
      or
      (availability < availability_threshold)
    ) and (
      request_rate > min_request_rate   # Only alert if we have traffic
    )
Pattern 6: Contextual Alerting
Include relevant context in alerts:
- alert: DatabaseConnections
  expr: db_connections_active / db_connections_max > 0.8
  for: 5m
  annotations:
    summary: "Database connection pool nearly exhausted"
    description: "{{ $labels.database }} has {{ $value | humanizePercentage }} connection utilization"
    runbook_url: "https://runbooks.company.com/database-connections"
    impact: "New requests may be rejected, causing 500 errors"
    suggested_action: "Check for connection leaks or increase pool size"
Alert Routing and Escalation
Routing by Impact and Urgency
Critical Path Services
route:
  group_by: ['service']
  routes:
    - match:
        service: 'payment-api'
        severity: 'critical'
      receiver: 'payment-team-pager'
      continue: true
    - match:
        service: 'payment-api'
        severity: 'warning'
      receiver: 'payment-team-slack'
Time-Based Routing
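# Alertmanager matchers only see alert labels, so time-of-day routing
# uses named time intervals rather than a 'time' matcher.
time_intervals:
  - name: 'business_hours'   # 9 AM - 5 PM, weekdays
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'
route:
  routes:
    - match:
        severity: 'critical'
      receiver: 'oncall-pager'
    - match:
        severity: 'warning'
      receiver: 'team-slack'
      active_time_intervals: ['business_hours']
      continue: true           # Let the after-hours route see the alert too
    - match:
        severity: 'warning'
      receiver: 'team-email'   # Lower urgency outside business hours
      mute_time_intervals: ['business_hours']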
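The routing decision above boils down to one predicate: is this moment inside business hours? A minimal sketch of that predicate follows, assuming a 9 AM to 5 PM Monday-to-Friday window in the timezone of the `datetime` passed in; both function names are hypothetical.

```python
from datetime import datetime, time


def is_business_hours(moment, start=time(9, 0), end=time(17, 0)):
    """True on Monday-Friday between `start` (inclusive) and `end` (exclusive)."""
    return moment.weekday() < 5 and start <= moment.time() < end


def warning_receiver(moment):
    """Pick the receiver for a warning alert raised at `moment`."""
    return "team-slack" if is_business_hours(moment) else "team-email"


print(warning_receiver(datetime(2024, 1, 8, 10, 0)))   # a Monday morning
print(warning_receiver(datetime(2024, 1, 6, 10, 0)))   # a Saturday morning
```

Whatever mechanism enforces the schedule, state the timezone explicitly; a window evaluated in UTC can be hours away from the team's actual working day.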
Escalation Patterns
Linear Escalation
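receivers:
  - name: 'primary-oncall'
    pagerduty_configs:
      - routing_key: '<pagerduty-integration-key>'   # Placeholder
# Alertmanager only hands the alert to PagerDuty; the escalation steps are
# defined in the PagerDuty escalation policy (here 'P1-Escalation'):
# 0 min: Primary on-call
# 5 min: Secondary on-call
# 15 min: Engineering manager
# 30 min: Director of engineering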
Severity-Based Escalation
# Critical: Immediate escalation
- match:
    severity: 'critical'
  receiver: 'critical-escalation'
# Warning: Team-first escalation
- match:
    severity: 'warning'
  receiver: 'team-escalation'
Alert Fatigue Prevention
Grouping and Suppression
Time-Based Grouping
route:
  group_wait: 30s       # Wait 30s before the first notification so similar alerts are grouped
  group_interval: 2m    # Wait 2 minutes before notifying about new alerts added to a group
  repeat_interval: 1h   # Re-send unresolved alerts every hour
Dependent Service Suppression
# Prometheus alerting rules
- alert: ServiceDown
  expr: up == 0
- alert: HighLatency
  expr: latency_p95 > 1
  # This alert is suppressed when ServiceDown is firing
# Alertmanager configuration
inhibit_rules:
  - source_match:
      alertname: 'ServiceDown'
    target_match:
      alertname: 'HighLatency'
    equal: ['service']
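The inhibition rule reads as: suppress the target alert when a source alert is firing and both carry the same value for every label in `equal`. The sketch below models that check on plain dictionaries; it is a simplified illustration, not Alertmanager's implementation, and the function name is hypothetical.

```python
def is_inhibited(target, firing_alerts, source_name="ServiceDown",
                 target_name="HighLatency", equal=("service",)):
    """True if `target` matches the rule's target side and some firing
    source alert shares the same values for all `equal` labels."""
    if target.get("alertname") != target_name:
        return False
    return any(
        src.get("alertname") == source_name
        and all(src.get(label) == target.get(label) for label in equal)
        for src in firing_alerts
    )


firing = [{"alertname": "ServiceDown", "service": "payment-api"}]
print(is_inhibited({"alertname": "HighLatency", "service": "payment-api"}, firing))
print(is_inhibited({"alertname": "HighLatency", "service": "search"}, firing))
```

The `equal` list is what keeps an outage in one service from hiding latency problems in another.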
Alert Throttling
# Require the condition to hold for 10 minutes before firing. This filters
# short spikes; notification frequency is governed by group_interval and repeat_interval.
- alert: HighMemoryUsage
  expr: memory_usage_percent > 85
  for: 10m   # Longer 'for' duration reduces noise
  annotations:
    summary: "Memory usage has been high for 10+ minutes"
Smart Defaults
# Use business logic to set intelligent thresholds
- alert: LowTraffic
  expr: |
    request_rate < avg_over_time(request_rate[7d]) * 0.1   # 10% of weekly average
    # Only alert during business hours, when low traffic is unusual (hour() is UTC)
    and on() (hour() >= 9 < 17)
    and on() (day_of_week() > 0 < 6)
  for: 30m
Runbook Integration
Runbook Structure Template
# Alert: {{ $labels.alertname }}
## Immediate Actions
1. Check service status dashboard
2. Verify if users are affected
3. Look at recent deployments/changes
## Investigation Steps
1. Check logs for errors in the last 30 minutes
2. Verify dependent services are healthy
3. Check resource utilization (CPU, memory, disk)
4. Review recent alerts for patterns
## Resolution Actions
- If deployment-related: Consider rollback
- If resource-related: Scale up or optimize queries
- If dependency-related: Engage appropriate team
## Escalation
- Primary: @team-oncall
- Secondary: @engineering-manager
- Emergency: @site-reliability-team
Runbook Integration in Alerts
annotations:
  runbook_url: "https://runbooks.company.com/alerts/{{ $labels.alertname }}"
  quick_debug: |
    1. curl -s https://{{ $labels.instance }}/health
    2. kubectl logs {{ $labels.pod }} --tail=50
    3. Check dashboard: https://grafana.company.com/d/service-{{ $labels.service }}
Testing and Validation
Alert Testing Strategies
Chaos Engineering Integration
# Test that alerts fire during controlled failures.
# `chaos` and `wait_for_alert` stand in for your chaos tooling and alert-polling helper.
def test_alert_during_cpu_spike():
    with chaos.cpu_spike(target='payment-api', duration='2m'):
        assert wait_for_alert('HighCPU', timeout=180)


def test_alert_during_network_partition():
    with chaos.network_partition(target='database'):
        assert wait_for_alert('DatabaseUnreachable', timeout=60)
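The tests above rely on a `wait_for_alert` helper that this guide does not define. One possible shape is a polling loop with a deadline, sketched below. The `fetch_firing` callable is an assumption: it should return the names of currently firing alerts from whatever source you use, and the `sleep` and `clock` parameters exist only so the loop can be tested without real waiting.

```python
import time


def wait_for_alert(alertname, timeout, fetch_firing,
                   poll_interval=5.0, sleep=time.sleep, clock=time.monotonic):
    """Poll `fetch_firing()` until `alertname` appears or `timeout` seconds pass.

    Returns True if the alert was seen, False on timeout."""
    deadline = clock() + timeout
    while True:
        if alertname in fetch_firing():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)
```

To match the two-argument call style used in the tests, bind the data source once, for example with `functools.partial(wait_for_alert, fetch_firing=my_fetcher)`, where `my_fetcher` is your own function.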
Historical Alert Analysis
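# Query to find alerts that fired in the last 30 days without a matching incident
# (incident_created is assumed to be a metric exported by your incident tooling)
count by (alertname) (
  count_over_time(ALERTS{alertstate="firing"}[30d])
) unless on (alertname) (
  count by (alertname) (
    count_over_time(incident_created{source="alert"}[30d])
  )
)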
Alert Quality Metrics
Alert Precision
Precision = True Positives / (True Positives + False Positives)
Track alerts that resulted in actual incidents vs false alarms.
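As a small worked example of the precision formula, the sketch below computes precision per alert name from a review log. The input format, a list of `(alertname, was_real_incident)` pairs, is an assumption about how you record post-incident reviews.

```python
from collections import Counter


def alert_precision(events):
    """Compute TP / (TP + FP) per alert name.

    `events` is an iterable of (alertname, was_real_incident) pairs."""
    fired = Counter()
    true_positives = Counter()
    for name, was_real in events:
        fired[name] += 1
        if was_real:
            true_positives[name] += 1
    return {name: true_positives[name] / fired[name] for name in fired}


review_log = [
    ("HighLatency", True),
    ("HighLatency", True),
    ("HighLatency", False),
    ("HighCPU", False),
]
print(alert_precision(review_log))
```

Alerts with persistently low precision are the first candidates for retuning or removal.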
Time to Resolution
# Average time from alert firing to resolution, per alert
# (both timestamp metrics are assumed to come from your own exporter)
avg by (alertname) (
  avg_over_time(
    (alert_resolved_timestamp - alert_fired_timestamp)[30d:5m]
  )
)
Alert Fatigue Indicators
# Alerts per day by team
sum by (team) (
increase(alerts_fired_total[1d])
)
# Percentage of alerts acknowledged within 15 minutes
sum(alerts_acked_within_15m) / sum(alerts_fired) * 100
Advanced Patterns
Machine Learning-Enhanced Alerting
Anomaly Detection
- alert: AnomalousTraffic
  # predict_linear extrapolates the last hour's linear trend 300s ahead
  expr: |
    abs(request_rate - predict_linear(request_rate[1h], 300)) /
    stddev_over_time(request_rate[1h]) > 3
  for: 10m
  annotations:
    summary: "Traffic pattern is anomalous"
    description: "Current traffic deviates from the linear trend of the last hour by more than 3 standard deviations"
Dynamic Thresholds
- alert: DynamicHighLatency
  expr: |
    latency_p95 > (
      quantile_over_time(0.95, latency_p95[7d]) +   # Historical 95th percentile
      2 * stddev_over_time(latency_p95[7d])         # Plus 2 standard deviations
    )
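The threshold in this rule is "historical quantile plus k standard deviations". The sketch below computes the same quantity offline from a list of past values, which can help when sanity-checking a threshold before deploying it. It assumes a linearly interpolated quantile and the population standard deviation; the function name is hypothetical.

```python
import statistics


def dynamic_threshold(history, q=0.95, k=2.0):
    """Return the q-quantile of `history` plus k population standard deviations."""
    ordered = sorted(history)
    # Linear interpolation between the two closest ranks
    rank = q * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    quantile = ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)
    return quantile + k * statistics.pstdev(ordered)


past_latencies = [0.20, 0.22, 0.21, 0.25, 0.30, 0.23, 0.24]
print(dynamic_threshold(past_latencies))
```

Because the history window includes past incidents, a long outage inflates both terms and can raise the threshold; consider excluding known incident periods when evaluating it.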
Business Hours Awareness
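# Different thresholds for business vs off hours (hour() and day_of_week() are UTC)
- alert: HighLatencyBusinessHours
  expr: |
    latency_p95 > 0.2   # Stricter during business hours
    and on() (hour() >= 9 < 17) and on() (day_of_week() > 0 < 6)   # 9 AM - 5 PM weekdays
  for: 2m
- alert: HighLatencyOffHours
  expr: |
    latency_p95 > 0.5   # More lenient after hours
    unless on() ((hour() >= 9 < 17) and (day_of_week() > 0 < 6))   # Nights and weekends
  for: 5m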
Progressive Alerting
# Escalating alert severity based on duration
- alert: ServiceLatencyElevated
  expr: latency_p95 > 0.5
  for: 5m
  labels:
    severity: info
- alert: ServiceLatencyHigh
  expr: latency_p95 > 0.5
  for: 15m   # Same condition, longer duration
  labels:
    severity: warning
- alert: ServiceLatencyCritical
  expr: latency_p95 > 0.5
  for: 30m   # Same condition, even longer duration
  labels:
    severity: critical
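The three rules above form a ladder from duration to severity. The sketch below expresses that ladder as a lookup, using the same 5, 15 and 30 minute steps; the function name is hypothetical.

```python
def severity_for(duration_minutes):
    """Return the highest severity reached after the condition has held
    for `duration_minutes`, or None if no rule would be firing yet."""
    ladder = [(30, "critical"), (15, "warning"), (5, "info")]
    for minimum, severity in ladder:
        if duration_minutes >= minimum:
            return severity
    return None


for minutes in (3, 5, 20, 45):
    print(minutes, severity_for(minutes))
```

Note that the rules are independent, so after 30 minutes all three are firing at once. Pairing this pattern with inhibition rules keeps only the highest severity visible.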
Anti-Patterns to Avoid
Anti-Pattern 1: Alerting on Everything
Problem: Too many alerts create noise and fatigue
Solution: Be selective; only alert on user-impacting issues
Anti-Pattern 2: Vague Alert Messages
Problem: "Service X is down" - which instance? what's the impact?
Solution: Include specific details and context
Anti-Pattern 3: Alerts Without Runbooks
Problem: Alerts that don't explain what to do
Solution: Every alert must have an associated runbook
Anti-Pattern 4: Static Thresholds
Problem: 80% CPU might be normal during peak hours
Solution: Use contextual, adaptive thresholds
Anti-Pattern 5: Ignoring Alert Quality
Problem: Accepting high false positive rates
Solution: Regularly review and tune alert precision
Implementation Checklist
Pre-Implementation
- Define alert severity levels and escalation policies
- Create runbook templates
- Set up alert routing configuration
- Define SLOs that alerts will protect
Alert Development
- Each alert has clear success criteria
- Alert conditions tested against historical data
- Runbook created and accessible
- Severity and routing configured
- Context and suggested actions included
Post-Implementation
- Monitor alert precision and recall
- Regular review of alert fatigue metrics
- Quarterly alert effectiveness review
- Team training on alert response procedures
Quality Assurance
- Test alerts fire during controlled failures
- Verify alerts resolve when conditions improve
- Confirm runbooks are accurate and helpful
- Validate escalation paths work correctly
Remember: Great alerts are invisible when things work and invaluable when things break. Focus on quality over quantity, and always optimize for the human who will respond to the alert at 3 AM.