add brain

This commit is contained in:
2026-03-12 15:17:52 +07:00
parent fd9f558fa1
commit e7821a7a9d
355 changed files with 93784 additions and 24 deletions

# Alert Design Patterns: A Guide to Effective Alerting
## Introduction
Well-designed alerts are the difference between a reliable system and 3 AM pages about non-issues. This guide provides patterns and anti-patterns for creating alerts that provide value without causing fatigue.
## Fundamental Principles
### The Golden Rules of Alerting
1. **Every alert should be actionable** - If you can't do something about it, don't alert
2. **Every alert should require human intelligence** - If a script can handle it, automate the response
3. **Every alert should be novel** - Don't alert on known, ongoing issues
4. **Every alert should represent a user-visible impact** - Internal metrics matter only if users are affected
### Alert Classification
#### Critical Alerts
- Service is completely down
- Data loss is occurring
- Security breach detected
- SLO burn rate indicates imminent SLO violation
#### Warning Alerts
- Service degradation affecting some users
- Approaching resource limits
- Dependent service issues
- Elevated error rates within SLO
#### Info Alerts
- Deployment notifications
- Capacity planning triggers
- Configuration changes
- Maintenance windows
## Alert Design Patterns
### Pattern 1: Symptoms, Not Causes
**Good**: Alert on user-visible symptoms
```yaml
- alert: HighLatency
expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
for: 5m
annotations:
summary: "API latency is high"
description: "95th percentile latency is {{ $value }}s, above 500ms threshold"
```
**Bad**: Alert on internal metrics that may not affect users
```yaml
- alert: HighCPU
expr: cpu_usage > 80
# This might not affect users at all!
```
### Pattern 2: Multi-Window Alerting
Reduce false positives by requiring sustained problems:
```yaml
- alert: ServiceDown
expr: (
avg_over_time(up[2m]) == 0 # Short window: immediate detection
and
avg_over_time(up[10m]) < 0.8 # Long window: avoid flapping
)
for: 1m
```
### Pattern 3: Burn Rate Alerting
Alert based on error budget consumption rate:
```yaml
# Fast burn: 2% of monthly budget in 1 hour
- alert: ErrorBudgetFastBurn
expr: (
error_rate_5m > (14.4 * error_budget_slo)
and
error_rate_1h > (14.4 * error_budget_slo)
)
for: 2m
labels:
severity: critical
# Slow burn: 10% of monthly budget in 3 days
- alert: ErrorBudgetSlowBurn
expr: (
error_rate_6h > (1.0 * error_budget_slo)
and
error_rate_3d > (1.0 * error_budget_slo)
)
for: 15m
labels:
severity: warning
```
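The 14.4 and 1.0 multipliers above are not magic numbers: they follow from the budget fraction, the alerting window, and the SLO period. A minimal Python sketch of the derivation, assuming a 30-day SLO period:

```python
# Burn-rate multiplier: how fast the error budget is burning relative to a
# steady burn that would consume it exactly over the full SLO period.
# Consuming `budget_fraction` of the budget in `window_hours` implies:
#   multiplier = budget_fraction * period_hours / window_hours

def burn_rate_multiplier(budget_fraction, window_hours, period_hours=30 * 24):
    return budget_fraction * period_hours / window_hours

# Fast burn: 2% of a 30-day budget in 1 hour -> multiplier 14.4
fast = burn_rate_multiplier(0.02, 1)
# Slow burn: 10% of a 30-day budget in 3 days -> multiplier 1.0
slow = burn_rate_multiplier(0.10, 3 * 24)
```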
### Pattern 4: Hysteresis
Use different thresholds for firing and resolving to prevent flapping:
```yaml
- alert: HighErrorRate
  # Fire at 5%; once firing, stay firing until error_rate drops below 3%.
  # Prometheus has no built-in hysteresis -- an alert resolves as soon as
  # its expr stops being true -- so the lower resolve threshold is
  # implemented with a self-reference to the ALERTS metric.
  expr: |
    error_rate > 0.05
    or
    (error_rate > 0.03 and on()
     ALERTS{alertname="HighErrorRate", alertstate="firing"} == 1)
  for: 5m
```
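The fire-high, resolve-low behavior can be modeled as a tiny state machine, which is handy for unit-testing threshold choices against recorded error rates before shipping them. A minimal sketch using the 5%/3% thresholds from the example:

```python
# Hysteresis sketch (illustrative thresholds): fire above the high
# threshold, stay firing until the series drops below the low threshold.

def hysteresis(samples, fire_at=0.05, resolve_at=0.03):
    firing = False
    states = []
    for value in samples:
        if not firing and value > fire_at:
            firing = True
        elif firing and value < resolve_at:
            firing = False
        states.append(firing)
    return states

# An error rate oscillating around 5% fires once instead of flapping:
print(hysteresis([0.02, 0.06, 0.045, 0.051, 0.04, 0.02]))
# [False, True, True, True, True, False]
```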
### Pattern 5: Composite Alerts
Alert when multiple conditions indicate a problem:
```yaml
- alert: ServiceDegraded
expr: (
(latency_p95 > latency_threshold)
or
(error_rate > error_threshold)
or
(availability < availability_threshold)
) and (
request_rate > min_request_rate # Only alert if we have traffic
)
```
### Pattern 6: Contextual Alerting
Include relevant context in alerts:
```yaml
- alert: DatabaseConnections
expr: db_connections_active / db_connections_max > 0.8
for: 5m
annotations:
summary: "Database connection pool nearly exhausted"
description: "{{ $labels.database }} has {{ $value | humanizePercentage }} connection utilization"
runbook_url: "https://runbooks.company.com/database-connections"
impact: "New requests may be rejected, causing 500 errors"
suggested_action: "Check for connection leaks or increase pool size"
```
## Alert Routing and Escalation
### Routing by Impact and Urgency
#### Critical Path Services
```yaml
route:
group_by: ['service']
routes:
- match:
service: 'payment-api'
severity: 'critical'
receiver: 'payment-team-pager'
continue: true
- match:
service: 'payment-api'
severity: 'warning'
receiver: 'payment-team-slack'
```
#### Time-Based Routing
```yaml
route:
  routes:
    - match:
        severity: 'critical'
      receiver: 'oncall-pager'
    - match:
        severity: 'warning'
      receiver: 'team-slack'
      # Only active 9 AM - 5 PM; 'business_hours' is defined in the
      # top-level time_intervals section of the Alertmanager config
      active_time_intervals: ['business_hours']
    - match:
        severity: 'warning'
      receiver: 'team-email'  # Lower urgency outside business hours
```
### Escalation Patterns
#### Linear Escalation
```yaml
receivers:
- name: 'primary-oncall'
pagerduty_configs:
- escalation_policy: 'P1-Escalation'
# 0 min: Primary on-call
# 5 min: Secondary on-call
# 15 min: Engineering manager
# 30 min: Director of engineering
```
#### Severity-Based Escalation
```yaml
# Critical: Immediate escalation
- match:
severity: 'critical'
receiver: 'critical-escalation'
# Warning: Team-first escalation
- match:
severity: 'warning'
receiver: 'team-escalation'
```
## Alert Fatigue Prevention
### Grouping and Suppression
#### Time-Based Grouping
```yaml
route:
group_wait: 30s # Wait 30s to group similar alerts
group_interval: 2m # Send grouped alerts every 2 minutes
repeat_interval: 1h # Re-send unresolved alerts every hour
```
#### Dependent Service Suppression
```yaml
# Prometheus alert rules
- alert: ServiceDown
  expr: up == 0
- alert: HighLatency
  expr: latency_p95 > 1

# Alertmanager configuration (separate file): suppress HighLatency for a
# service while ServiceDown is already firing for that same service
inhibit_rules:
  - source_match:
      alertname: 'ServiceDown'
    target_match:
      alertname: 'HighLatency'
    equal: ['service']
```
### Alert Throttling
```yaml
# Limit to 1 alert per 10 minutes for noisy conditions
- alert: HighMemoryUsage
expr: memory_usage_percent > 85
for: 10m # Longer 'for' duration reduces noise
annotations:
summary: "Memory usage has been high for 10+ minutes"
```
### Smart Defaults
```yaml
# Use business logic to set intelligent thresholds
- alert: LowTraffic
expr: request_rate < (
avg_over_time(request_rate[7d]) * 0.1 # 10% of weekly average
)
# Only alert during business hours when low traffic is unusual
for: 30m
```
## Runbook Integration
### Runbook Structure Template
```markdown
# Alert: {{ $labels.alertname }}
## Immediate Actions
1. Check service status dashboard
2. Verify if users are affected
3. Look at recent deployments/changes
## Investigation Steps
1. Check logs for errors in the last 30 minutes
2. Verify dependent services are healthy
3. Check resource utilization (CPU, memory, disk)
4. Review recent alerts for patterns
## Resolution Actions
- If deployment-related: Consider rollback
- If resource-related: Scale up or optimize queries
- If dependency-related: Engage appropriate team
## Escalation
- Primary: @team-oncall
- Secondary: @engineering-manager
- Emergency: @site-reliability-team
```
### Runbook Integration in Alerts
```yaml
annotations:
runbook_url: "https://runbooks.company.com/alerts/{{ $labels.alertname }}"
quick_debug: |
1. curl -s https://{{ $labels.instance }}/health
2. kubectl logs {{ $labels.pod }} --tail=50
3. Check dashboard: https://grafana.company.com/d/service-{{ $labels.service }}
```
## Testing and Validation
### Alert Testing Strategies
#### Chaos Engineering Integration
```python
# Test that alerts fire during controlled failures
def test_alert_during_cpu_spike():
with chaos.cpu_spike(target='payment-api', duration='2m'):
assert wait_for_alert('HighCPU', timeout=180)
def test_alert_during_network_partition():
with chaos.network_partition(target='database'):
assert wait_for_alert('DatabaseUnreachable', timeout=60)
```
#### Historical Alert Analysis
```prometheus
# Alerts that fired in the last 30 days without a matching incident
count by (alertname) (
  count_over_time(ALERTS{alertstate="firing"}[30d])
)
unless on (alertname)
count by (alertname) (
  count_over_time(incident_created{source="alert"}[30d])
)
```
### Alert Quality Metrics
#### Alert Precision
```
Precision = True Positives / (True Positives + False Positives)
```
Track alerts that resulted in actual incidents vs false alarms.
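The precision formula is simple enough to fold into a periodic alert-review script. A minimal sketch, with illustrative counts from an incident review:

```python
# Alert precision from incident-review data: what fraction of fired
# alerts corresponded to a real incident.

def precision(true_positives, false_positives):
    total = true_positives + false_positives
    return true_positives / total if total else 0.0

# 18 alerts matched real incidents, 6 were false alarms:
print(round(precision(18, 6), 2))  # 0.75
```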
#### Time to Resolution
```prometheus
# Average time from alert firing to resolution, per alert, over 30 days
avg by (alertname) (
  avg_over_time(
    (alert_resolved_timestamp - alert_fired_timestamp)[30d:1h]
  )
)
```
#### Alert Fatigue Indicators
```prometheus
# Alerts per day by team
sum by (team) (
increase(alerts_fired_total[1d])
)
# Percentage of alerts acknowledged within 15 minutes
sum(alerts_acked_within_15m) / sum(alerts_fired) * 100
```
## Advanced Patterns
### Machine Learning-Enhanced Alerting
#### Anomaly Detection
```yaml
- alert: AnomalousTraffic
expr: |
abs(request_rate - predict_linear(request_rate[1h], 300)) /
stddev_over_time(request_rate[1h]) > 3
for: 10m
annotations:
summary: "Traffic pattern is anomalous"
description: "Current traffic deviates from predicted pattern by >3 standard deviations"
```
#### Dynamic Thresholds
```yaml
- alert: DynamicHighLatency
expr: |
latency_p95 > (
quantile_over_time(0.95, latency_p95[7d]) + # Historical 95th percentile
2 * stddev_over_time(latency_p95[7d]) # Plus 2 standard deviations
)
```
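Assuming a week of recorded `latency_p95` samples, the same quantile-plus-two-sigma threshold can be prototyped offline before committing it to a rule. A rough sketch using a nearest-rank percentile:

```python
# Dynamic threshold sketch: historical p95 plus two standard deviations.
# `samples` stands in for a week of recorded latency_p95 values.
import statistics

def dynamic_threshold(samples, quantile=0.95, sigmas=2):
    ordered = sorted(samples)
    # nearest-rank quantile of the historical window
    idx = min(len(ordered) - 1, int(quantile * len(ordered)))
    historical_quantile = ordered[idx]
    return historical_quantile + sigmas * statistics.stdev(samples)

latencies = [0.20, 0.22, 0.21, 0.25, 0.23, 0.24, 0.30]
# The threshold sits above everything seen historically, so only genuine
# outliers alert:
print(dynamic_threshold(latencies) > max(latencies))  # True
```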
### Business Hours Awareness
```yaml
# Different thresholds for business vs off hours
- alert: HighLatencyBusinessHours
expr: latency_p95 > 0.2 # Stricter during business hours
for: 2m
# Active 9 AM - 5 PM weekdays
- alert: HighLatencyOffHours
expr: latency_p95 > 0.5 # More lenient after hours
for: 5m
# Active nights and weekends
```
### Progressive Alerting
```yaml
# Escalating alert severity based on duration
- alert: ServiceLatencyElevated
expr: latency_p95 > 0.5
for: 5m
labels:
severity: info
- alert: ServiceLatencyHigh
expr: latency_p95 > 0.5
for: 15m # Same condition, longer duration
labels:
severity: warning
- alert: ServiceLatencyCritical
expr: latency_p95 > 0.5
for: 30m # Same condition, even longer duration
labels:
severity: critical
```
## Anti-Patterns to Avoid
### Anti-Pattern 1: Alerting on Everything
**Problem**: Too many alerts create noise and fatigue
**Solution**: Be selective; only alert on user-impacting issues
### Anti-Pattern 2: Vague Alert Messages
**Problem**: "Service X is down" - which instance? what's the impact?
**Solution**: Include specific details and context
### Anti-Pattern 3: Alerts Without Runbooks
**Problem**: Alerts that don't explain what to do
**Solution**: Every alert must have an associated runbook
### Anti-Pattern 4: Static Thresholds
**Problem**: 80% CPU might be normal during peak hours
**Solution**: Use contextual, adaptive thresholds
### Anti-Pattern 5: Ignoring Alert Quality
**Problem**: Accepting high false positive rates
**Solution**: Regularly review and tune alert precision
## Implementation Checklist
### Pre-Implementation
- [ ] Define alert severity levels and escalation policies
- [ ] Create runbook templates
- [ ] Set up alert routing configuration
- [ ] Define SLOs that alerts will protect
### Alert Development
- [ ] Each alert has clear success criteria
- [ ] Alert conditions tested against historical data
- [ ] Runbook created and accessible
- [ ] Severity and routing configured
- [ ] Context and suggested actions included
### Post-Implementation
- [ ] Monitor alert precision and recall
- [ ] Regular review of alert fatigue metrics
- [ ] Quarterly alert effectiveness review
- [ ] Team training on alert response procedures
### Quality Assurance
- [ ] Test alerts fire during controlled failures
- [ ] Verify alerts resolve when conditions improve
- [ ] Confirm runbooks are accurate and helpful
- [ ] Validate escalation paths work correctly
Remember: Great alerts are invisible when things work and invaluable when things break. Focus on quality over quantity, and always optimize for the human who will respond to the alert at 3 AM.

# Dashboard Best Practices: Design for Insight and Action
## Introduction
A well-designed dashboard is like a good story - it guides you through the data with purpose and clarity. This guide provides practical patterns for creating dashboards that inform decisions and enable quick troubleshooting.
## Design Principles
### The Hierarchy of Information
#### Primary Information (Top Third)
- Service health status
- SLO achievement
- Critical alerts
- Business KPIs
#### Secondary Information (Middle Third)
- Golden signals (latency, traffic, errors, saturation)
- Resource utilization
- Throughput and performance metrics
#### Tertiary Information (Bottom Third)
- Detailed breakdowns
- Historical trends
- Dependency status
- Debug information
### Visual Design Principles
#### Rule of 7±2
- Maximum 7±2 panels per screen
- Group related information together
- Use sections to organize complexity
#### Color Psychology
- **Red**: Critical issues, danger, immediate attention needed
- **Yellow/Orange**: Warnings, caution, degraded state
- **Green**: Healthy, normal operation, success
- **Blue**: Information, neutral metrics, capacity
- **Gray**: Disabled, unknown, or baseline states
#### Chart Selection Guide
- **Line charts**: Time series, trends, comparisons over time
- **Bar charts**: Categorical comparisons, top N lists
- **Gauges**: Single value with defined good/bad ranges
- **Stat panels**: Key metrics, percentages, counts
- **Heatmaps**: Distribution data, correlation analysis
- **Tables**: Detailed breakdowns, multi-dimensional data
## Dashboard Archetypes
### The Overview Dashboard
**Purpose**: High-level health check and business metrics
**Audience**: Executives, managers, cross-team stakeholders
**Update Frequency**: 5-15 minutes
```yaml
sections:
- title: "Business Health"
panels:
- service_availability_summary
- revenue_per_hour
- active_users
- conversion_rate
- title: "System Health"
panels:
- critical_alerts_count
- slo_achievement_summary
- error_budget_remaining
- deployment_status
```
### The SRE Operational Dashboard
**Purpose**: Real-time monitoring and incident response
**Audience**: SRE, on-call engineers
**Update Frequency**: 15-30 seconds
```yaml
sections:
- title: "Service Status"
panels:
- service_up_status
- active_incidents
- recent_deployments
- title: "Golden Signals"
panels:
- latency_percentiles
- request_rate
- error_rate
- resource_saturation
- title: "Infrastructure"
panels:
- cpu_memory_utilization
- network_io
- disk_space
```
### The Developer Debug Dashboard
**Purpose**: Deep-dive troubleshooting and performance analysis
**Audience**: Development teams
**Update Frequency**: 30 seconds - 2 minutes
```yaml
sections:
- title: "Application Performance"
panels:
- endpoint_latency_breakdown
- database_query_performance
- cache_hit_rates
- queue_depths
- title: "Errors and Logs"
panels:
- error_rate_by_endpoint
- log_volume_by_level
- exception_types
- slow_queries
```
## Layout Patterns
### The F-Pattern Layout
Based on eye-tracking studies, users scan in an F-pattern:
```
[Critical Status] [SLO Summary ] [Error Budget ]
[Latency ] [Traffic ] [Errors ]
[Saturation ] [Resource Use ] [Detailed View]
[Historical ] [Dependencies ] [Debug Info ]
```
### The Z-Pattern Layout
For executive dashboards, follow the Z-pattern:
```
[Business KPIs ] → [System Status]
↓ ↓
[Trend Analysis ] ← [Key Metrics ]
```
### Responsive Design
#### Desktop (1920x1080)
- 24-column grid
- Panels can be 6, 8, 12, or 24 units wide
- 4-6 rows visible without scrolling
#### Laptop (1366x768)
- Stack wider panels vertically
- Reduce panel heights
- Prioritize most critical information
#### Mobile (768px width)
- Single column layout
- Simplified panels
- Touch-friendly controls
## Effective Panel Design
### Stat Panels
```yaml
# Good: Clear value with context
- title: "API Availability"
type: stat
targets:
- expr: avg(up{service="api"}) * 100
field_config:
unit: percent
thresholds:
steps:
- color: red
value: 0
- color: yellow
value: 99
- color: green
value: 99.9
options:
color_mode: background
text_mode: value_and_name
```
### Time Series Panels
```yaml
# Good: Multiple related metrics with clear legend
- title: "Request Latency"
type: timeseries
targets:
- expr: histogram_quantile(0.50, rate(http_duration_bucket[5m]))
legend: "P50"
- expr: histogram_quantile(0.95, rate(http_duration_bucket[5m]))
legend: "P95"
- expr: histogram_quantile(0.99, rate(http_duration_bucket[5m]))
legend: "P99"
field_config:
unit: ms
custom:
draw_style: line
fill_opacity: 10
options:
legend:
display_mode: table
placement: bottom
values: [min, max, mean, last]
```
### Table Panels
```yaml
# Good: Top N with relevant columns
- title: "Slowest Endpoints"
type: table
targets:
- expr: topk(10, histogram_quantile(0.95, sum by (handler)(rate(http_duration_bucket[5m]))))
format: table
instant: true
transformations:
- id: organize
options:
exclude_by_name:
Time: true
rename_by_name:
Value: "P95 Latency (ms)"
handler: "Endpoint"
```
## Color and Visualization Best Practices
### Threshold Configuration
```yaml
# Traffic light system with meaningful boundaries
thresholds:
steps:
- color: green # Good performance
value: null # Default
- color: yellow # Degraded performance
value: 95 # 95th percentile of historical normal
- color: orange # Poor performance
value: 99 # 99th percentile of historical normal
- color: red # Critical performance
value: 99.9 # Worst case scenario
```
### Color Blind Friendly Palettes
```yaml
# Use patterns and shapes in addition to color
field_config:
overrides:
- matcher:
id: byName
options: "Critical"
properties:
- id: color
value:
mode: fixed
fixed_color: "#d73027" # Red-orange for protanopia
- id: custom.draw_style
value: "points" # Different shape
```
### Consistent Color Semantics
- **Success/Health**: Green (#28a745)
- **Warning/Degraded**: Yellow (#ffc107)
- **Error/Critical**: Red (#dc3545)
- **Information**: Blue (#007bff)
- **Neutral**: Gray (#6c757d)
## Time Range Strategy
### Default Time Ranges by Dashboard Type
#### Real-time Operational
- **Default**: Last 15 minutes
- **Quick options**: 5m, 15m, 1h, 4h
- **Auto-refresh**: 15-30 seconds
#### Troubleshooting
- **Default**: Last 1 hour
- **Quick options**: 15m, 1h, 4h, 12h, 1d
- **Auto-refresh**: 1 minute
#### Business Review
- **Default**: Last 24 hours
- **Quick options**: 1d, 7d, 30d, 90d
- **Auto-refresh**: 5 minutes
#### Capacity Planning
- **Default**: Last 7 days
- **Quick options**: 7d, 30d, 90d, 1y
- **Auto-refresh**: 15 minutes
### Time Range Annotations
```yaml
# Add context for time-based events
annotations:
- name: "Deployments"
datasource: "Prometheus"
expr: "deployment_timestamp"
title_format: "Deploy {{ version }}"
text_format: "Deployed version {{ version }} to {{ environment }}"
- name: "Incidents"
datasource: "Incident API"
query: "incidents.json?service={{ service }}"
color: "red"
```
## Interactive Features
### Template Variables
```yaml
# Service selector
- name: service
type: query
query: label_values(up, service)
current:
text: All
value: $__all
include_all: true
multi: true
# Environment selector
- name: environment
type: query
query: label_values(up{service="$service"}, environment)
current:
text: production
value: production
```
### Drill-Down Links
```yaml
# Panel-level drill-downs
- title: "Error Rate"
type: timeseries
# ... other config ...
options:
data_links:
- title: "View Error Logs"
url: "/d/logs-dashboard?var-service=${__field.labels.service}&from=${__from}&to=${__to}"
- title: "Error Traces"
url: "/d/traces-dashboard?var-service=${__field.labels.service}"
```
### Dynamic Panel Titles
```yaml
- title: "${service} - Request Rate" # Uses template variable
type: timeseries
# Title updates automatically when service variable changes
```
## Performance Optimization
### Query Optimization
#### Use Recording Rules
```yaml
# Instead of complex queries in dashboards
groups:
- name: http_requests
rules:
- record: http_request_rate_5m
expr: sum(rate(http_requests_total[5m])) by (service, method, handler)
- record: http_request_latency_p95_5m
expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le))
```
#### Limit Data Points
```yaml
# Good: Reasonable resolution for dashboard
- expr: http_request_rate_5m[1h]
interval: 15s # One point every 15 seconds
# Bad: Too many points for visualization
- expr: http_request_rate_1s[1h] # 3600 points!
```
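The point count is just the time range divided by the query step, which makes the budget easy to sanity-check; a trivial helper:

```python
# Data points per series = time range / query step.
# Keep this in the low hundreds for responsive dashboards.

def points_per_series(range_seconds, step_seconds):
    return range_seconds // step_seconds

print(points_per_series(3600, 15))  # 240 -- fine for a panel
print(points_per_series(3600, 1))   # 3600 -- far more than a panel can render
```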
### Dashboard Performance
#### Panel Limits
- **Maximum panels per dashboard**: 20-30
- **Maximum queries per panel**: 10
- **Maximum time series per panel**: 50
#### Caching Strategy
```yaml
# Use appropriate cache headers
cache_timeout: 30 # Cache for 30 seconds on fast-changing panels
cache_timeout: 300 # Cache for 5 minutes on slow-changing panels
```
## Accessibility
### Screen Reader Support
```yaml
# Provide text alternatives for visual elements
- title: "Service Health Status"
type: stat
options:
text_mode: value_and_name # Includes both value and description
field_config:
mappings:
- options:
"1":
text: "Healthy"
color: "green"
"0":
text: "Unhealthy"
color: "red"
```
### Keyboard Navigation
- Ensure all interactive elements are keyboard accessible
- Provide logical tab order
- Include skip links for complex dashboards
### High Contrast Mode
```yaml
# Test dashboards work in high contrast mode
theme: high_contrast
colors:
- "#000000" # Pure black
- "#ffffff" # Pure white
- "#ffff00" # Pure yellow
- "#ff0000" # Pure red
```
## Testing and Validation
### Dashboard Testing Checklist
#### Functional Testing
- [ ] All panels load without errors
- [ ] Template variables filter correctly
- [ ] Time range changes update all panels
- [ ] Drill-down links work as expected
- [ ] Auto-refresh functions properly
#### Visual Testing
- [ ] Dashboard renders correctly on different screen sizes
- [ ] Colors are distinguishable and meaningful
- [ ] Text is readable at normal zoom levels
- [ ] Legends and labels are clear
#### Performance Testing
- [ ] Dashboard loads in < 5 seconds
- [ ] No queries timeout under normal load
- [ ] Auto-refresh doesn't cause browser lag
- [ ] Memory usage remains reasonable
#### Usability Testing
- [ ] New team members can understand the dashboard
- [ ] Action items are clear during incidents
- [ ] Key information is quickly discoverable
- [ ] Dashboard supports common troubleshooting workflows
## Maintenance and Governance
### Dashboard Lifecycle
#### Creation
1. Define dashboard purpose and audience
2. Identify key metrics and success criteria
3. Design layout following established patterns
4. Implement with consistent styling
5. Test with real data and user scenarios
#### Maintenance
- **Weekly**: Check for broken panels or queries
- **Monthly**: Review dashboard usage analytics
- **Quarterly**: Gather user feedback and iterate
- **Annually**: Major review and potential redesign
#### Retirement
- Archive dashboards that are no longer used
- Migrate users to replacement dashboards
- Document lessons learned
### Dashboard Standards
```yaml
# Organization dashboard standards
standards:
naming_convention: "[Team] [Service] - [Purpose]"
tags: [team, service_type, environment, purpose]
refresh_intervals: [15s, 30s, 1m, 5m, 15m]
time_ranges: [5m, 15m, 1h, 4h, 1d, 7d, 30d]
color_scheme: "company_standard"
max_panels_per_dashboard: 25
```
## Advanced Patterns
### Composite Dashboards
```yaml
# Dashboard that includes panels from other dashboards
- title: "Service Overview"
type: dashlist
targets:
- "service-health"
- "service-performance"
- "service-business-metrics"
options:
show_headings: true
max_items: 10
```
### Dynamic Dashboard Generation
```python
# Generate dashboards from service definitions
def generate_service_dashboard(service_config):
panels = []
# Always include golden signals
panels.extend(generate_golden_signals_panels(service_config))
# Add service-specific panels
if service_config.type == 'database':
panels.extend(generate_database_panels(service_config))
elif service_config.type == 'queue':
panels.extend(generate_queue_panels(service_config))
return {
'title': f"{service_config.name} - Operational Dashboard",
'panels': panels,
'variables': generate_variables(service_config)
}
```
### A/B Testing for Dashboards
```yaml
# Test different dashboard designs with different teams
experiment:
name: "dashboard_layout_test"
variants:
- name: "traditional_layout"
weight: 50
config: "dashboard_v1.json"
- name: "f_pattern_layout"
weight: 50
config: "dashboard_v2.json"
success_metrics:
- "time_to_insight"
- "user_satisfaction"
- "troubleshooting_efficiency"
```
Remember: A dashboard should tell a story about your system's health and guide users toward the right actions. Focus on clarity over complexity, and always optimize for the person who will use it during a stressful incident.

# SLO Cookbook: A Practical Guide to Service Level Objectives
## Introduction
Service Level Objectives (SLOs) are a key tool for managing service reliability. This cookbook provides practical guidance for implementing SLOs that actually improve system reliability rather than just creating meaningless metrics.
## Fundamentals
### The SLI/SLO/SLA Hierarchy
- **SLI (Service Level Indicator)**: A quantifiable measure of service quality
- **SLO (Service Level Objective)**: A target range of values for an SLI
- **SLA (Service Level Agreement)**: A business agreement with consequences for missing SLO targets
### Golden Rule of SLOs
**Start simple, iterate based on learning.** Your first SLOs won't be perfect, and that's okay.
## Choosing Good SLIs
### The Four Golden Signals
1. **Latency**: How long requests take to complete
2. **Traffic**: How many requests are coming in
3. **Errors**: How many requests are failing
4. **Saturation**: How "full" your service is
### SLI Selection Criteria
A good SLI should be:
- **Measurable**: You can collect data for it
- **Meaningful**: It reflects user experience
- **Controllable**: You can take action to improve it
- **Proportional**: Changes in the SLI reflect changes in user happiness
### Service Type Specific SLIs
#### HTTP APIs
- **Request latency**: P95 or P99 response time
- **Availability**: Proportion of successful requests (non-5xx)
- **Throughput**: Requests per second capacity
```prometheus
# Availability SLI
sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Latency SLI
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```
#### Batch Jobs
- **Freshness**: Age of the last successful run
- **Correctness**: Proportion of jobs completing successfully
- **Throughput**: Items processed per unit time
#### Data Pipelines
- **Data freshness**: Time since last successful update
- **Data quality**: Proportion of records passing validation
- **Processing latency**: Time from ingestion to availability
### Anti-Patterns in SLI Selection
**Don't use**: CPU usage, memory usage, disk space as primary SLIs
- These are symptoms, not user-facing impacts
**Don't use**: Counts instead of rates or proportions
- "Number of errors" vs "Error rate"
**Don't use**: Internal metrics that users don't care about
- Queue depth, cache hit rate (unless they directly impact user experience)
## Setting SLO Targets
### The Art of Target Setting
Setting SLO targets is a balancing act between:
- **User happiness**: Targets should reflect acceptable user experience
- **Business value**: Tighter SLOs cost more to maintain
- **Current performance**: Targets should be achievable but aspirational
### Target Setting Strategies
#### Historical Performance Method
1. Collect 4-6 weeks of historical data
2. Identify the worst user-visible performance in that period that users still tolerated
3. Set the SLO target just tighter than that level, so it is achievable today but still protects users
#### User Journey Mapping
1. Map critical user journeys
2. Identify acceptable performance for each step
3. Work backwards to component SLOs
#### Error Budget Approach
1. Decide how much unreliability you can afford
2. Set SLO targets based on acceptable error budget consumption
3. Example: 99.9% availability = 43.2 minutes of downtime over a 30-day month
### SLO Target Examples by Service Criticality
#### Critical Services (Revenue Impact)
- **Availability**: 99.95% - 99.99%
- **Latency (P95)**: 100-200ms
- **Error Rate**: < 0.1%
#### High Priority Services
- **Availability**: 99.9% - 99.95%
- **Latency (P95)**: 200-500ms
- **Error Rate**: < 0.5%
#### Standard Services
- **Availability**: 99.5% - 99.9%
- **Latency (P95)**: 500ms - 1s
- **Error Rate**: < 1%
## Error Budget Management
### What is an Error Budget?
Your error budget is the maximum amount of unreliability you can accumulate while still meeting your SLO. It's calculated as:
```
Error Budget = (1 - SLO) × Time Window
```
For a 99.9% availability SLO over 30 days:
```
Error Budget = (1 - 0.999) × 30 days = 0.001 × 30 days = 43.2 minutes
```
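That arithmetic generalizes to any target; a small helper for comparing budgets across candidate SLOs, assuming a 30-day window:

```python
# Error budget in minutes for a given availability SLO over a window.

def error_budget_minutes(slo, window_days=30):
    return (1 - slo) * window_days * 24 * 60

for slo in (0.999, 0.9995, 0.9999):
    print(slo, round(error_budget_minutes(slo), 1))
# 0.999  -> 43.2 minutes
# 0.9995 -> 21.6 minutes
# 0.9999 -> 4.3 minutes
```

The steep drop-off is why each extra "nine" costs so much more to defend.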
### Error Budget Policies
Define what happens when you consume your error budget:
#### Conservative Policy (High-Risk Services)
- **> 50% consumed**: Freeze non-critical feature releases
- **> 75% consumed**: Focus entirely on reliability improvements
- **> 90% consumed**: Consider emergency measures (traffic shaping, etc.)
#### Balanced Policy (Standard Services)
- **> 75% consumed**: Increase focus on reliability work
- **> 90% consumed**: Pause feature work, focus on reliability
#### Aggressive Policy (Early Stage Services)
- **> 90% consumed**: Review but continue normal operations
- **100% consumed**: Evaluate SLO appropriateness
### Burn Rate Alerting
Multi-window burn rate alerts help you catch SLO violations before they become critical:
```yaml
# Fast burn: 2% budget consumed in 1 hour
- alert: FastBurnSLOViolation
expr: (
(1 - (sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m])))) > (14.4 * 0.001)
and
(1 - (sum(rate(http_requests_total{code!~"5.."}[1h])) / sum(rate(http_requests_total[1h])))) > (14.4 * 0.001)
)
for: 2m
# Slow burn: 10% budget consumed in 3 days
- alert: SlowBurnSLOViolation
expr: (
(1 - (sum(rate(http_requests_total{code!~"5.."}[6h])) / sum(rate(http_requests_total[6h])))) > (1.0 * 0.001)
and
(1 - (sum(rate(http_requests_total{code!~"5.."}[3d])) / sum(rate(http_requests_total[3d])))) > (1.0 * 0.001)
)
for: 15m
```
## Implementation Patterns
### The SLO Implementation Ladder
#### Level 1: Basic SLOs
- Choose 1-2 SLIs that matter most to users
- Set aspirational but achievable targets
- Implement basic alerting when SLOs are missed
#### Level 2: Operational SLOs
- Add burn rate alerting
- Create error budget dashboards
- Establish error budget policies
- Regular SLO review meetings
#### Level 3: Advanced SLOs
- Multi-window burn rate alerts
- Automated error budget policy enforcement
- SLO-driven incident prioritization
- Integration with CI/CD for deployment decisions
### SLO Measurement Architecture
#### Push vs Pull Metrics
- **Pull** (Prometheus): Good for infrastructure metrics, real-time alerting
- **Push** (StatsD): Good for application metrics, business events
#### Measurement Points
- **Server-side**: More reliable, easier to implement
- **Client-side**: Better reflects user experience
- **Synthetic**: Consistent, predictable, may not reflect real user experience
### SLO Dashboard Design
Essential elements for SLO dashboards:
1. **Current SLO Achievement**: Large, prominent display
2. **Error Budget Remaining**: Visual indicator (gauge, progress bar)
3. **Burn Rate**: Time series showing error budget consumption rate
4. **Historical Trends**: 4-week view of SLO achievement
5. **Alerts**: Current and recent SLO-related alerts
## Advanced Topics
### Dependency SLOs
For services with dependencies:
```
SLO_service ≤ min(SLO_inherent, ∏SLO_dependencies)
```
If your service depends on 3 other services each with 99.9% SLO:
```
Maximum_SLO = 0.999³ = 0.997 = 99.7%
```
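The dependency ceiling is worth computing whenever a new SLO is proposed; a small sketch:

```python
# Availability ceiling for a service with serial dependencies: the product
# of dependency SLOs bounds what the service itself can credibly promise.
import math

def max_slo(dependency_slos, inherent_slo=1.0):
    return min(inherent_slo, math.prod(dependency_slos))

# Three dependencies at 99.9% each cap the composite at ~99.7%:
print(round(max_slo([0.999, 0.999, 0.999]), 4))  # 0.997
```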
### User Journey SLOs
Track end-to-end user experiences:
```prometheus
# Registration success rate
sum(rate(user_registration_success_total[5m])) / sum(rate(user_registration_attempts_total[5m]))
# Purchase completion latency
histogram_quantile(0.95, rate(purchase_completion_duration_seconds_bucket[5m]))
```
### SLOs for Batch Systems
Special considerations for non-request/response systems:
#### Freshness SLO
```prometheus
# Data should be no more than 4 hours old
(time() - last_successful_update_timestamp) < (4 * 3600)
```
#### Throughput SLO
```prometheus
# Should process at least 1000 items per hour
rate(items_processed_total[1h]) >= 1000
```
#### Quality SLO
```prometheus
# At least 99.5% of records should pass validation
sum(rate(records_valid_total[5m])) / sum(rate(records_processed_total[5m])) >= 0.995
```
## Common Mistakes and How to Avoid Them
### Mistake 1: Too Many SLOs
**Problem**: Drowning in metrics, losing focus
**Solution**: Start with 1-2 SLOs per service, add more only when needed
### Mistake 2: Internal Metrics as SLIs
**Problem**: Optimizing for metrics that don't impact users
**Solution**: Always ask "If this metric changes, do users notice?"
### Mistake 3: Perfectionist SLOs
**Problem**: 99.99% SLO when 99.9% would be fine
**Solution**: Higher SLOs cost exponentially more; pick the minimum acceptable level
### Mistake 4: Ignoring Error Budgets
**Problem**: Treating any SLO miss as an emergency
**Solution**: Error budgets exist to be spent; use them to balance feature velocity and reliability
### Mistake 5: Static SLOs
**Problem**: Setting SLOs once and never updating them
**Solution**: Review SLOs quarterly; adjust based on user feedback and business changes
## SLO Review Process
### Monthly SLO Review Agenda
1. **SLO Achievement Review**: Did we meet our SLOs?
2. **Error Budget Analysis**: How did we spend our error budget?
3. **Incident Correlation**: Which incidents impacted our SLOs?
4. **SLI Quality Assessment**: Are our SLIs still meaningful?
5. **Target Adjustment**: Should we change any targets?
### Quarterly SLO Health Check
1. **User Impact Validation**: Survey users about acceptable performance
2. **Business Alignment**: Do SLOs still reflect business priorities?
3. **Measurement Quality**: Are we measuring the right things?
4. **Cost/Benefit Analysis**: Are tighter SLOs worth the investment?
## Tooling and Automation
### Essential Tools
1. **Metrics Collection**: Prometheus, InfluxDB, CloudWatch
2. **Alerting**: Alertmanager, PagerDuty, OpsGenie
3. **Dashboards**: Grafana, DataDog, New Relic
4. **SLO Platforms**: Sloth, Pyrra, OpenSLO
### Automation Opportunities
- **Burn rate alert generation** from SLO definitions
- **Dashboard creation** from SLO specifications
- **Error budget calculation** and tracking
- **Release blocking** based on error budget consumption
## Getting Started Checklist
- [ ] Identify your service's critical user journeys
- [ ] Choose 1-2 SLIs that best reflect user experience
- [ ] Collect 4-6 weeks of baseline data
- [ ] Set initial SLO targets based on historical performance
- [ ] Implement basic SLO monitoring and alerting
- [ ] Create an SLO dashboard
- [ ] Define error budget policies
- [ ] Schedule monthly SLO reviews
- [ ] Plan for quarterly SLO health checks
Remember: SLOs are a journey, not a destination. Start simple, learn from experience, and iterate toward better reliability management.