2026-03-12 15:17:52 +07:00

Alert Design Patterns: A Guide to Effective Alerting

Introduction

Well-designed alerts are the difference between a reliable system and 3 AM pages about non-issues. This guide provides patterns and anti-patterns for creating alerts that provide value without causing fatigue.

Fundamental Principles

The Golden Rules of Alerting

Every alert should be actionable - If you can't do something about it, don't alert
Every alert should require human intelligence - If a script can handle it, automate the response
Every alert should be novel - Don't alert on known, ongoing issues
Every alert should represent a user-visible impact - Internal metrics matter only if users are affected

Alert Classification

Critical Alerts

Service is completely down
Data loss is occurring
Security breach detected
SLO burn rate indicates imminent SLO violation

Warning Alerts

Service degradation affecting some users
Approaching resource limits
Dependent service issues
Elevated error rates within SLO

Info Alerts

Deployment notifications
Capacity planning triggers
Configuration changes
Maintenance windows

Alert Design Patterns

Pattern 1: Symptoms, Not Causes

Good: Alert on user-visible symptoms

- alert: HighLatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
  for: 5m
  annotations:
    summary: "API latency is high"
    description: "95th percentile latency is {{ $value }}s, above 500ms threshold"

Bad: Alert on internal metrics that may not affect users

- alert: HighCPU
  expr: cpu_usage > 80
  # This might not affect users at all!

Pattern 2: Multi-Window Alerting

Reduce false positives by requiring sustained problems:

- alert: ServiceDown
  expr: |
    (
      avg_over_time(up[2m]) == 0  # Short window: immediate detection
      and
      avg_over_time(up[10m]) < 0.8  # Long window: avoid flapping
    )
  for: 1m
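
To see how the two windows interact, the rule can be replayed over a list of `up` samples. This is a rough sketch that assumes one sample every 30 seconds (so 2 minutes is 4 samples and 10 minutes is 20); `window_average` and `service_down` are illustrative names, not part of Prometheus.

```python
from typing import Sequence


def window_average(samples: Sequence[int], count: int) -> float:
    """Average of the most recent `count` samples (fewer if history is short)."""
    recent = samples[-count:]
    return sum(recent) / len(recent)


def service_down(up_samples: Sequence[int],
                 short_samples: int = 4,    # 2m at an assumed 30s scrape interval
                 long_samples: int = 20,    # 10m at the same interval
                 long_threshold: float = 0.8) -> bool:
    """Mirror of: avg_over_time(up[2m]) == 0 and avg_over_time(up[10m]) < 0.8."""
    if not up_samples:
        return False  # no data, nothing to evaluate
    short_window_down = window_average(up_samples, short_samples) == 0
    long_window_degraded = window_average(up_samples, long_samples) < long_threshold
    return short_window_down and long_window_degraded


down_for_2m = [1] * 16 + [0] * 4      # long-window average is exactly 0.8
down_for_2m30s = [1] * 15 + [0] * 5   # long-window average is 0.75

print(service_down(down_for_2m))      # False
print(service_down(down_for_2m30s))   # True
```

Under these assumed numbers the expression first becomes true after slightly more than two minutes of continuous downtime, and the rule's `for: 1m` is added on top of that before the alert fires.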

Pattern 3: Burn Rate Alerting

Alert based on error budget consumption rate:

# Fast burn: 2% of monthly budget in 1 hour
- alert: ErrorBudgetFastBurn  
  expr: |
    (
      error_rate_5m > (14.4 * error_budget_slo)
      and
      error_rate_1h > (14.4 * error_budget_slo)
    )
  for: 2m
  labels:
    severity: critical
    
# Slow burn: 10% of monthly budget in 3 days
- alert: ErrorBudgetSlowBurn
  expr: |
    (
      error_rate_6h > (1.0 * error_budget_slo)
      and
      error_rate_3d > (1.0 * error_budget_slo)
    )
  for: 15m
  labels:
    severity: warning
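
The multipliers 14.4 and 1.0 follow from simple arithmetic: a burn rate is the fraction of budget consumed, scaled by how many alert windows fit in the SLO period. The sketch below assumes a 30-day (720-hour) period for "monthly"; the function names and the 99.9% target in the last line are illustrative.

```python
HOURS_PER_SLO_PERIOD = 30 * 24  # assumed 30-day SLO period (720 hours)


def burn_rate(budget_fraction: float, window_hours: float,
              period_hours: float = HOURS_PER_SLO_PERIOD) -> float:
    """Burn rate that consumes `budget_fraction` of the budget in `window_hours`."""
    return budget_fraction * period_hours / window_hours


def error_rate_threshold(slo_target: float, rate: float) -> float:
    """Error rate at which the budget (1 - slo_target) burns at `rate`."""
    return rate * (1.0 - slo_target)


fast = burn_rate(0.02, window_hours=1)        # 2% of budget in 1 hour
slow = burn_rate(0.10, window_hours=3 * 24)   # 10% of budget in 3 days

print(f"fast burn multiplier: {fast:.1f}")    # 14.4
print(f"slow burn multiplier: {slow:.1f}")    # 1.0
print(f"fast-burn error rate for a 99.9% SLO: {error_rate_threshold(0.999, fast):.4f}")
```

A burn rate of 1.0 means the budget runs out exactly at the end of the period, which is why the slow-burn alert uses long windows and a lower severity.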

Pattern 4: Hysteresis

Use different thresholds for firing and resolving to prevent flapping:

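- alert: HighErrorRate
  expr: |
    error_rate > 0.05  # Fire at 5%
    or (
      error_rate > 0.03  # Once firing, keep firing until below 3%
      and on (service) ALERTS{alertname="HighErrorRate", alertstate="firing"}
    )
  for: 5m

# Prometheus resolves an alert as soon as its expression stops matching, so the
# lower threshold must be written into the expression (matching on `service` is
# an assumption). This prevents flapping around the 5% threshold.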
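
The two-threshold behaviour can be sketched as a small state machine, independent of any monitoring system. `HysteresisAlert` is a hypothetical name; the 5% and 3% values are the ones used above.

```python
from dataclasses import dataclass


@dataclass
class HysteresisAlert:
    fire_above: float = 0.05     # start firing above 5%
    resolve_below: float = 0.03  # stop firing only below 3%
    firing: bool = False

    def update(self, value: float) -> bool:
        """Feed one observation and return whether the alert is firing."""
        if self.firing:
            if value < self.resolve_below:
                self.firing = False
        elif value > self.fire_above:
            self.firing = True
        return self.firing


alert = HysteresisAlert()
observations = [0.04, 0.06, 0.045, 0.055, 0.04, 0.02, 0.04]
states = [alert.update(v) for v in observations]
print(states)  # [False, True, True, True, True, False, False]
```

With a single 5% threshold the same series would fire, resolve, fire, and resolve again; with the gap between thresholds it fires once and resolves once.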

Pattern 5: Composite Alerts

Alert when any of several symptoms appears, but only while the service is receiving traffic:

- alert: ServiceDegraded
  expr: |
    (
      (latency_p95 > latency_threshold)
      or
      (error_rate > error_threshold)
      or
      (availability < availability_threshold)
    ) and (
      request_rate > min_request_rate  # Only alert if we have traffic
    )

Pattern 6: Contextual Alerting

Include relevant context in alerts:

- alert: DatabaseConnections
  expr: db_connections_active / db_connections_max > 0.8
  for: 5m
  annotations:
    summary: "Database connection pool nearly exhausted"
    description: "{{ $labels.database }} has {{ $value | humanizePercentage }} connection utilization"
    runbook_url: "https://runbooks.company.com/database-connections"
    impact: "New requests may be rejected, causing 500 errors"
    suggested_action: "Check for connection leaks or increase pool size"

Alert Routing and Escalation

Routing by Impact and Urgency

Critical Path Services

route:
  group_by: ['service']
  routes:
  - match:
      service: 'payment-api'
      severity: 'critical'
    receiver: 'payment-team-pager'
    continue: true
  - match:
      service: 'payment-api' 
      severity: 'warning'
    receiver: 'payment-team-slack'

Time-Based Routing

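# Alertmanager does not match on time as a label; define named time intervals
# and attach them to routes instead.
time_intervals:
- name: business_hours  # 9 AM - 5 PM, Monday to Friday
  time_intervals:
  - weekdays: ['monday:friday']
    times:
    - start_time: '09:00'
      end_time: '17:00'

route:
  receiver: 'team-email'
  routes:
  - match:
      severity: 'critical'
    receiver: 'oncall-pager'
  - match:
      severity: 'warning'
    receiver: 'team-slack'
    active_time_intervals: ['business_hours']
    continue: true  # Let the email route below see the alert too
  - match:
      severity: 'warning'
    receiver: 'team-email'  # Lower urgency outside business hours
    mute_time_intervals: ['business_hours']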

Escalation Patterns

Linear Escalation

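receivers:
- name: 'primary-oncall'
  pagerduty_configs:
  - routing_key: '<pagerduty-integration-key>'
    # The escalation policy ('P1-Escalation') is configured in PagerDuty on the
    # service behind this key, not in Alertmanager:
    # 0 min: Primary on-call
    # 5 min: Secondary on-call
    # 15 min: Engineering manager
    # 30 min: Director of engineering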

Severity-Based Escalation

# Critical: Immediate escalation
- match:
    severity: 'critical'
  receiver: 'critical-escalation'
  
# Warning: Team-first escalation
- match:
    severity: 'warning'
  receiver: 'team-escalation'

Alert Fatigue Prevention

Grouping and Suppression

Time-Based Grouping

route:
  group_wait: 30s        # Wait 30s before the first notification so related alerts batch together
  group_interval: 2m     # Wait at least 2m before notifying about new alerts in an existing group
  repeat_interval: 1h    # Re-send notifications for still-firing alerts every hour

Dependent Service Suppression

# Prometheus alerting rules
- alert: ServiceDown
  expr: up == 0

- alert: HighLatency
  expr: latency_p95 > 1

# Alertmanager configuration: HighLatency is suppressed while ServiceDown
# is firing for the same service
inhibit_rules:
- source_match:
    alertname: 'ServiceDown'
  target_match:
    alertname: 'HighLatency'
  equal: ['service']

Alert Throttling

# Require the condition to hold for 10 minutes before firing; how often
# notifications repeat is controlled by Alertmanager's repeat_interval
- alert: HighMemoryUsage
  expr: memory_usage_percent > 85
  for: 10m  # Longer 'for' duration reduces noise
  annotations:
    summary: "Memory usage has been high for 10+ minutes"

Smart Defaults

# Use business logic to set intelligent thresholds
- alert: LowTraffic
  expr: |
    request_rate < (
      avg_over_time(request_rate[7d]) * 0.1  # 10% of weekly average
    )
    # Only alert during business hours (09:00-17:00 UTC), when low traffic is unusual
    and on () (hour() >= 9 and hour() < 17)
  for: 30m

Runbook Integration

Runbook Structure Template

# Alert: {{ $labels.alertname }}

## Immediate Actions
1. Check service status dashboard
2. Verify if users are affected
3. Look at recent deployments/changes

## Investigation Steps
1. Check logs for errors in the last 30 minutes
2. Verify dependent services are healthy  
3. Check resource utilization (CPU, memory, disk)
4. Review recent alerts for patterns

## Resolution Actions
- If deployment-related: Consider rollback
- If resource-related: Scale up or optimize queries
- If dependency-related: Engage appropriate team

## Escalation
- Primary: @team-oncall
- Secondary: @engineering-manager  
- Emergency: @site-reliability-team

Runbook Integration in Alerts

annotations:
  runbook_url: "https://runbooks.company.com/alerts/{{ $labels.alertname }}"
  quick_debug: |
    1. curl -s https://{{ $labels.instance }}/health
    2. kubectl logs {{ $labels.pod }} --tail=50
    3. Check dashboard: https://grafana.company.com/d/service-{{ $labels.service }}

Testing and Validation

Alert Testing Strategies

Chaos Engineering Integration

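# Test that alerts fire during controlled failures.
# `chaos` and `wait_for_alert` are placeholders for your fault-injection
# tooling and an alert-polling helper; they are not a real library.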
def test_alert_during_cpu_spike():
    with chaos.cpu_spike(target='payment-api', duration='2m'):
        assert wait_for_alert('HighCPU', timeout=180)
        
def test_alert_during_network_partition():
    with chaos.network_partition(target='database'):
        assert wait_for_alert('DatabaseUnreachable', timeout=60)
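
A helper like `wait_for_alert` could be written as a polling loop against Alertmanager's v2 HTTP API (`GET /api/v2/alerts`). The sketch below is one possible shape, not an existing library function: the base URL is an assumed local address, and the `fetch`, `sleep`, and `clock` parameters exist only so the loop can be tested without a running Alertmanager.

```python
import json
import time
import urllib.request
from typing import Callable, Set

ALERTMANAGER_URL = "http://localhost:9093"  # assumed address; adjust for your setup


def fetch_active_alert_names(base_url: str = ALERTMANAGER_URL) -> Set[str]:
    """Return the alertname label of every active alert known to Alertmanager."""
    url = f"{base_url}/api/v2/alerts?active=true"
    with urllib.request.urlopen(url, timeout=5) as response:
        alerts = json.load(response)
    return {alert["labels"].get("alertname", "") for alert in alerts}


def wait_for_alert(name: str,
                   timeout: float = 60,
                   poll_interval: float = 5,
                   fetch: Callable[[], Set[str]] = fetch_active_alert_names,
                   sleep: Callable[[float], None] = time.sleep,
                   clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll until an alert called `name` is active or `timeout` seconds pass."""
    deadline = clock() + timeout
    while True:
        if name in fetch():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)
```

Keep the timeout longer than the rule's `for:` duration plus Alertmanager's `group_wait`, otherwise the test can fail even though the alert is configured correctly.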

Historical Alert Analysis

# Query to find alerts that fired without incidents
count by (alertname) (
  count_over_time(ALERTS{alertstate="firing"}[30d])
) unless on (alertname) (
  count by (alertname) (
    count_over_time(incident_created{source="alert"}[30d])
  )
)

Alert Quality Metrics

Alert Precision

Precision = True Positives / (True Positives + False Positives)

Track alerts that resulted in actual incidents vs false alarms.
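
Computed per alert name, the same ratio shows which rules deserve tuning first. This is a minimal sketch that assumes each fired alert has been labelled after the fact as a real incident or a false alarm; the `history` data is made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def alert_precision(outcomes: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Precision per alert name from (alertname, was_real_incident) pairs."""
    fired: Counter = Counter()
    true_positives: Counter = Counter()
    for name, was_real_incident in outcomes:
        fired[name] += 1
        if was_real_incident:
            true_positives[name] += 1
    return {name: true_positives[name] / fired[name] for name in fired}


# Illustrative data only: 2 of 3 HighLatency alerts and 1 of 4 HighCPU alerts were real.
history = [
    ("HighLatency", True), ("HighLatency", True), ("HighLatency", False),
    ("HighCPU", False), ("HighCPU", False), ("HighCPU", True), ("HighCPU", False),
]

for name, precision in sorted(alert_precision(history).items()):
    print(f"{name}: {precision:.0%}")
```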

Time to Resolution

# Average time from alert firing to resolution
avg by (alertname) (
  avg_over_time(
    (alert_resolved_timestamp - alert_fired_timestamp)[30d:]
  )
)

Alert Fatigue Indicators

# Alerts per day by team
sum by (team) (
  increase(alerts_fired_total[1d])
)

# Percentage of alerts acknowledged within 15 minutes
sum(alerts_acked_within_15m) / sum(alerts_fired) * 100

Advanced Patterns

Machine Learning-Enhanced Alerting

Anomaly Detection

- alert: AnomalousTraffic
  expr: |
    # Predict the current value from the hour of data that ended 5 minutes ago
    abs(request_rate - predict_linear(request_rate[1h] offset 5m, 300)) /
    stddev_over_time(request_rate[1h]) > 3
  for: 10m
  annotations:
    summary: "Traffic pattern is anomalous"
    description: "Current traffic deviates from predicted pattern by >3 standard deviations"
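
The "more than 3 standard deviations" test is a z-score. The sketch below shows the arithmetic on a plain list of samples; note that it swaps in the window mean as the baseline rather than a linear prediction, so it is a simplified stand-in for the rule above, and `z_score` is an illustrative name.

```python
from statistics import mean, pstdev
from typing import Sequence


def z_score(history: Sequence[float], current: float) -> float:
    """Distance of `current` from the history mean, in standard deviations."""
    baseline = mean(history)
    sigma = pstdev(history)  # population standard deviation over the window
    if sigma == 0:
        # A perfectly flat history: any deviation at all is infinitely unusual.
        return 0.0 if current == baseline else float("inf")
    return abs(current - baseline) / sigma


recent_request_rates = [100, 102, 98, 101, 99]  # illustrative samples

print(z_score(recent_request_rates, 101) > 3)   # False
print(z_score(recent_request_rates, 110) > 3)   # True
```

The zero-variance branch matters in practice: a metric that is flat for the whole window makes the denominator zero, so the rule needs a defined answer for that case.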

Dynamic Thresholds

- alert: DynamicHighLatency
  expr: |
    latency_p95 > (
      quantile_over_time(0.95, latency_p95[7d]) +  # Historical 95th percentile
      2 * stddev_over_time(latency_p95[7d])        # Plus 2 standard deviations
    )
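
To make the threshold arithmetic concrete, here is the same calculation on a plain list of historical values: a linearly interpolated 95th percentile plus two population standard deviations. This is a sketch with illustrative names and data, and its quantile may differ in small details from the monitoring system's own implementation.

```python
from statistics import pstdev
from typing import Sequence


def quantile(values: Sequence[float], q: float) -> float:
    """Quantile with linear interpolation between the two nearest ranks."""
    if not values:
        raise ValueError("quantile() needs at least one value")
    ordered = sorted(values)
    rank = q * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def dynamic_threshold(history: Sequence[float], q: float = 0.95, k: float = 2.0) -> float:
    """Historical q-quantile plus k standard deviations."""
    return quantile(history, q) + k * pstdev(history)


latency_history = [0.1, 0.2, 0.3, 0.4, 0.5]  # illustrative p95 latencies in seconds
threshold = dynamic_threshold(latency_history)
print(f"alert when latency_p95 > {threshold:.3f}s")
```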

Business Hours Awareness

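# Different thresholds for business vs off hours
# (hour() and day_of_week() are evaluated in UTC)
- alert: HighLatencyBusinessHours
  expr: |
    latency_p95 > 0.2  # Stricter during business hours
    and on () (hour() >= 9 and hour() < 17 and day_of_week() >= 1 and day_of_week() <= 5)
  for: 2m
  # Active 9 AM - 5 PM weekdays

- alert: HighLatencyOffHours
  expr: |
    latency_p95 > 0.5  # More lenient after hours
    unless on () (hour() >= 9 and hour() < 17 and day_of_week() >= 1 and day_of_week() <= 5)
  for: 5m
  # Active nights and weekends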
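
The schedule logic behind the two rules is small enough to state directly. The sketch below assumes the timestamp is already in the team's local time zone; the function names are illustrative and the thresholds are the ones used above.

```python
from datetime import datetime


def is_business_hours(ts: datetime) -> bool:
    """True on Monday to Friday between 09:00 (inclusive) and 17:00 (exclusive)."""
    return ts.weekday() < 5 and 9 <= ts.hour < 17


def latency_threshold_seconds(ts: datetime) -> float:
    """Stricter threshold during business hours, more lenient otherwise."""
    return 0.2 if is_business_hours(ts) else 0.5


print(latency_threshold_seconds(datetime(2026, 3, 12, 10, 0)))  # Thursday morning: 0.2
print(latency_threshold_seconds(datetime(2026, 3, 14, 10, 0)))  # Saturday morning: 0.5
```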

Progressive Alerting

# Escalating alert severity based on duration
- alert: ServiceLatencyElevated
  expr: latency_p95 > 0.5
  for: 5m
  labels:
    severity: info
    
- alert: ServiceLatencyHigh
  expr: latency_p95 > 0.5
  for: 15m  # Same condition, longer duration
  labels:
    severity: warning
    
- alert: ServiceLatencyCritical  
  expr: latency_p95 > 0.5
  for: 30m  # Same condition, even longer duration
  labels:
    severity: critical
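
The three rules amount to a lookup from "how long has the condition held" to a severity. A minimal sketch of that mapping, with illustrative names and the 5, 15, and 30 minute steps from above:

```python
from typing import Optional

# (minimum minutes over threshold, severity), checked from most to least severe
SEVERITY_STEPS = [(30, "critical"), (15, "warning"), (5, "info")]


def severity_for(minutes_over_threshold: float) -> Optional[str]:
    """Highest severity whose duration requirement has been met, if any."""
    for minimum_minutes, severity in SEVERITY_STEPS:
        if minutes_over_threshold >= minimum_minutes:
            return severity
    return None


for minutes in (3, 5, 20, 45):
    print(minutes, severity_for(minutes))
# 3 None
# 5 info
# 20 warning
# 45 critical
```

Unlike this lookup, the rules themselves do not replace one another: after 30 minutes all three are firing at once, so the lower severities usually need an inhibit rule to keep them from duplicating the critical notification.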

Anti-Patterns to Avoid

Anti-Pattern 1: Alerting on Everything

Problem: Too many alerts create noise and fatigue
Solution: Be selective; only alert on user-impacting issues

Anti-Pattern 2: Vague Alert Messages

Problem: "Service X is down" - which instance? what's the impact?
Solution: Include specific details and context

Anti-Pattern 3: Alerts Without Runbooks

Problem: Alerts that don't explain what to do
Solution: Every alert must have an associated runbook

Anti-Pattern 4: Static Thresholds

Problem: 80% CPU might be normal during peak hours
Solution: Use contextual, adaptive thresholds

Anti-Pattern 5: Ignoring Alert Quality

Problem: Accepting high false positive rates
Solution: Regularly review and tune alert precision

Implementation Checklist

Pre-Implementation

Define alert severity levels and escalation policies
Create runbook templates
Set up alert routing configuration
Define SLOs that alerts will protect

Alert Development

Each alert has clear success criteria
Alert conditions tested against historical data
Runbook created and accessible
Severity and routing configured
Context and suggested actions included

Post-Implementation

Monitor alert precision and recall
Regular review of alert fatigue metrics
Quarterly alert effectiveness review
Team training on alert response procedures

Quality Assurance

Test alerts fire during controlled failures
Verify alerts resolve when conditions improve
Confirm runbooks are accurate and helpful
Validate escalation paths work correctly

Remember: Great alerts are invisible when things work and invaluable when things break. Focus on quality over quantity, and always optimize for the human who will respond to the alert at 3 AM.
