Observability Designer

A toolkit for designing production observability: SLI/SLO frameworks, alert optimization, and dashboard generation.

Overview

The Observability Designer skill provides three Python scripts for creating, optimizing, and maintaining observability systems:

  • SLO Designer: Generate complete SLI/SLO frameworks with error budgets and burn rate alerts
  • Alert Optimizer: Analyze and optimize existing alert configurations to reduce noise and improve effectiveness
  • Dashboard Generator: Create comprehensive dashboard specifications with role-based layouts and drill-down paths

Quick Start

Prerequisites

  • Python 3.7+
  • No external runtime dependencies (the scripts use the Python standard library only)

Basic Usage

# Generate SLO framework for a service
python3 scripts/slo_designer.py --service-type api --criticality critical --user-facing true --service-name payment-service

# Optimize existing alerts
python3 scripts/alert_optimizer.py --input assets/sample_alerts.json --analyze-only

# Generate a dashboard specification
python3 scripts/dashboard_generator.py --service-type web --name "Customer Portal" --role sre

Scripts Documentation

SLO Designer (slo_designer.py)

Generates an SLO framework from a service's type, criticality, and whether it is user-facing.

Features

  • Automatic SLI Selection: Recommends appropriate SLIs based on service type
  • Target Setting: Suggests SLO targets based on service criticality
  • Error Budget Calculation: Computes error budgets and burn rate thresholds
  • Multi-Window Burn Rate Alerts: Generates 4-window burn rate alerting rules
  • SLA Recommendations: Provides customer-facing SLA guidance
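
To make the burn rate idea above concrete, here is a minimal sketch of how a burn rate threshold relates to an error budget. It assumes a 30-day SLO period and uses window and budget pairs from the commonly cited multi-window pattern (2% of the budget in 1 hour, 5% in 6 hours); the four windows that slo_designer.py actually emits are not documented here and may differ. The function names are hypothetical.

```python
PERIOD_HOURS = 30 * 24  # assumption: 30-day SLO period


def burn_rate_threshold(budget_fraction, window_hours, period_hours=PERIOD_HOURS):
    """Burn rate that would consume `budget_fraction` of the error budget
    within `window_hours`."""
    return budget_fraction * period_hours / window_hours


def should_page(long_ratio, short_ratio, slo_target, threshold):
    """Multi-window check: fire only when BOTH the long and the short window
    show an error ratio above threshold x the allowed error ratio."""
    allowed = 1.0 - slo_target
    limit = threshold * allowed
    return long_ratio > limit and short_ratio > limit


if __name__ == "__main__":
    fast = burn_rate_threshold(0.02, 1)   # 2% of budget in 1 hour
    slow = burn_rate_threshold(0.05, 6)   # 5% of budget in 6 hours
    print(f"fast-burn threshold: {fast:.1f}, slow-burn threshold: {slow:.1f}")
    # 2% errors against a 99.9% SLO exceeds 14.4 x 0.1% = 1.44%
    print(should_page(0.02, 0.02, slo_target=0.999, threshold=fast))
```

Requiring both windows to exceed the limit lets the alert clear quickly once the error ratio recovers, because the short window drops below the limit first.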

Usage Examples

# From service definition file
python3 scripts/slo_designer.py --input assets/sample_service_api.json --output slo_framework.json

# From command line parameters
python3 scripts/slo_designer.py \
    --service-type api \
    --criticality critical \
    --user-facing true \
    --service-name payment-service \
    --output payment_slos.json

# Generate and display summary only
python3 scripts/slo_designer.py --input assets/sample_service_web.json --summary-only

Service Definition Format

{
  "name": "payment-service",
  "type": "api",
  "criticality": "critical",
  "user_facing": true,
  "description": "Handles payment processing",
  "team": "payments",
  "environment": "production",
  "dependencies": [
    {
      "name": "user-service",
      "type": "api",
      "criticality": "high"
    }
  ]
}

Supported Service Types

  • api: REST APIs, GraphQL services
  • web: Web applications, SPAs
  • database: Database services, data stores
  • queue: Message queues, event streams
  • batch: Batch processing jobs
  • ml: Machine learning services

Criticality Levels

  • critical: 99.99% availability, <100ms P95 latency, <0.1% error rate
  • high: 99.9% availability, <200ms P95 latency, <0.5% error rate
  • medium: 99.5% availability, <500ms P95 latency, <1% error rate
  • low: 99% availability, <1s P95 latency, <2% error rate
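
As a worked example of what these availability targets mean in practice, the sketch below converts each target into an error budget expressed as minutes of allowed downtime. The 30-day period is an assumption; the script may use a different window.

```python
# Availability targets copied from the criticality levels listed above.
AVAILABILITY_TARGETS = {
    "critical": 0.9999,
    "high": 0.999,
    "medium": 0.995,
    "low": 0.99,
}


def error_budget_minutes(target, period_days=30):
    """Minutes of full unavailability allowed per period for a given target."""
    return (1.0 - target) * period_days * 24 * 60


if __name__ == "__main__":
    for level, target in AVAILABILITY_TARGETS.items():
        minutes = error_budget_minutes(target)
        print(f"{level:>8}: {target:.2%} -> {minutes:.2f} min per 30 days")
```

Over 30 days, a critical service gets roughly 4.3 minutes of budget and a low-criticality service roughly 432 minutes (7.2 hours).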

Alert Optimizer (alert_optimizer.py)

Analyzes existing alert configurations and provides optimization recommendations.

Features

  • Noise Detection: Identifies alerts with high false positive rates
  • Coverage Analysis: Finds gaps in monitoring coverage
  • Duplicate Detection: Locates redundant or overlapping alerts
  • Threshold Analysis: Reviews alert thresholds for appropriateness
  • Fatigue Assessment: Evaluates alert volume and routing

Usage Examples

# Analyze existing alerts
python3 scripts/alert_optimizer.py --input assets/sample_alerts.json --analyze-only

# Generate optimized configuration
python3 scripts/alert_optimizer.py \
    --input assets/sample_alerts.json \
    --output optimized_alerts.json

# Generate HTML report
python3 scripts/alert_optimizer.py \
    --input assets/sample_alerts.json \
    --report alert_analysis.html \
    --format html

Alert Configuration Format

{
  "alerts": [
    {
      "alert": "HighLatency",
      "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5",
      "for": "5m",
      "labels": {
        "severity": "warning",
        "service": "payment-service"
      },
      "annotations": {
        "summary": "High request latency detected",
        "runbook_url": "https://runbooks.company.com/high-latency"
      },
      "historical_data": {
        "fires_per_day": 2.5,
        "false_positive_rate": 0.15
      }
    }
  ],
  "services": [
    {
      "name": "payment-service",
      "criticality": "critical"
    }
  ]
}
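
The historical_data block is what makes noise detection possible. The sketch below shows one way such a check could work on a configuration in the format above. The thresholds (10% false positives, 5 fires per day) and the function name are illustrative assumptions, not the values used by alert_optimizer.py.

```python
def find_noisy_alerts(config, max_false_positive_rate=0.10, max_fires_per_day=5.0):
    """Return (alert_name, reasons) for alerts whose history looks noisy.

    Alerts without historical_data are skipped, since it is optional.
    """
    noisy = []
    for alert in config.get("alerts", []):
        history = alert.get("historical_data") or {}
        reasons = []
        fp_rate = history.get("false_positive_rate")
        fires = history.get("fires_per_day")
        if fp_rate is not None and fp_rate > max_false_positive_rate:
            reasons.append(f"false_positive_rate {fp_rate} > {max_false_positive_rate}")
        if fires is not None and fires > max_fires_per_day:
            reasons.append(f"fires_per_day {fires} > {max_fires_per_day}")
        if reasons:
            noisy.append((alert.get("alert", "<unnamed>"), reasons))
    return noisy


# Trimmed-down version of the example configuration above.
SAMPLE_CONFIG = {
    "alerts": [
        {
            "alert": "HighLatency",
            "historical_data": {"fires_per_day": 2.5, "false_positive_rate": 0.15},
        },
        {"alert": "NoHistory"},
    ]
}

if __name__ == "__main__":
    for name, reasons in find_noisy_alerts(SAMPLE_CONFIG):
        print(name, "->", "; ".join(reasons))
```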

Analysis Categories

  • Golden Signals: Latency, traffic, errors, saturation
  • Resource Utilization: CPU, memory, disk, network
  • Business Metrics: Revenue, conversion, user engagement
  • Security: Auth failures, suspicious activity
  • Availability: Uptime, health checks

Dashboard Generator (dashboard_generator.py)

Creates dashboard specifications tailored to a target role.

Features

  • Role-Based Layouts: Optimized for SRE, Developer, Executive, and Ops personas
  • Golden Signals Coverage: Automatic inclusion of key monitoring metrics
  • Service-Type Specific Panels: Tailored panels based on service characteristics
  • Interactive Elements: Template variables, drill-down paths, time range controls
  • Grafana Compatibility: Generates Grafana-compatible JSON

Usage Examples

# From service definition
python3 scripts/dashboard_generator.py \
    --input assets/sample_service_web.json \
    --output dashboard.json

# With specific role optimization
python3 scripts/dashboard_generator.py \
    --service-type api \
    --name "Payment Service" \
    --role developer \
    --output payment_dev_dashboard.json

# Generate Grafana-compatible JSON
python3 scripts/dashboard_generator.py \
    --input assets/sample_service_api.json \
    --output dashboard.json \
    --format grafana

# With documentation
python3 scripts/dashboard_generator.py \
    --service-type web \
    --name "Customer Portal" \
    --output portal_dashboard.json \
    --doc-output portal_docs.md

Target Roles

  • sre: Focus on availability, latency, errors, resource utilization
  • developer: Emphasize latency, errors, throughput, business metrics
  • executive: Highlight availability, business metrics, user experience
  • ops: Prioritize resource utilization, capacity, alerts, deployments

Panel Types

  • Stat: Single value displays with thresholds
  • Gauge: Resource utilization and capacity metrics
  • Timeseries: Trend analysis and historical data
  • Table: Top N lists and detailed breakdowns
  • Heatmap: Distribution and correlation analysis
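
For orientation, the sketch below builds a few panels in the general shape of Grafana's dashboard JSON (type, title, gridPos on a 24-column grid, and a targets list). It is an illustration only: the exact structure that dashboard_generator.py produces is not shown in this document, and the http_requests_total metric and the layout helper are hypothetical.

```python
import json

GRID_WIDTH = 24  # Grafana lays panels out on a 24-column grid


def layout_panels(specs, width=8, height=6):
    """Place panels left to right, wrapping to a new row when the grid is full."""
    panels = []
    x = y = 0
    for panel_id, (panel_type, title, expr) in enumerate(specs, start=1):
        if x + width > GRID_WIDTH:
            x, y = 0, y + height
        panels.append({
            "id": panel_id,
            "type": panel_type,
            "title": title,
            "gridPos": {"x": x, "y": y, "w": width, "h": height},
            "targets": [{"refId": "A", "expr": expr}],
        })
        x += width
    return panels


# Hypothetical queries; only the latency expression appears elsewhere in this README.
EXAMPLE_SPECS = [
    ("stat", "Request rate", "sum(rate(http_requests_total[5m]))"),
    ("timeseries", "P95 latency",
     "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))"),
    ("gauge", "Error ratio",
     'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'),
    ("table", "Top endpoints", "topk(10, sum by (path) (rate(http_requests_total[5m])))"),
]

if __name__ == "__main__":
    print(json.dumps({"title": "Example", "panels": layout_panels(EXAMPLE_SPECS)}, indent=2))
```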

Sample Data

The assets/ directory contains sample configurations for testing:

  • sample_service_api.json: Critical API service definition
  • sample_service_web.json: High-priority web application definition
  • sample_alerts.json: Alert configuration with optimization opportunities

The expected_outputs/ directory shows example outputs from each script:

  • sample_slo_framework.json: Complete SLO framework for API service
  • optimized_alerts.json: Optimized alert configuration
  • sample_dashboard.json: SRE dashboard specification

Best Practices

SLO Design

  • Start with 1-2 SLOs per service and iterate
  • Choose SLIs that directly impact user experience
  • Set targets based on user needs, not technical capabilities
  • Use error budgets to balance reliability and velocity

Alert Optimization

  • Every alert must be actionable
  • Alert on symptoms, not causes
  • Use multi-window burn rate alerts for SLO protection
  • Implement proper escalation and routing policies

Dashboard Design

  • Follow the F-pattern for visual hierarchy
  • Use consistent color semantics across dashboards
  • Include drill-down paths for effective troubleshooting
  • Optimize for the target role's specific needs

Integration Patterns

CI/CD Integration

# Generate SLOs during service onboarding
python3 scripts/slo_designer.py --input service-config.json --output slos.json

# Validate alert configurations in pipeline
python3 scripts/alert_optimizer.py --input alerts.json --analyze-only --report validation.html

# Auto-generate dashboards for new services
python3 scripts/dashboard_generator.py --input service-config.json --format grafana --output dashboard.json

Monitoring Stack Integration

  • Prometheus: Generated alert rules and recording rules
  • Grafana: Dashboard JSON for direct import
  • Alertmanager: Routing and escalation policies
  • PagerDuty: Escalation configuration

GitOps Workflow

  1. Store service definitions in version control
  2. Generate observability configurations in CI/CD
  3. Deploy configurations via GitOps
  4. Monitor effectiveness and iterate

Advanced Usage

Custom SLO Targets

Override the default targets by adding a custom_slos object to the service definition:

{
  "name": "special-service",
  "type": "api",
  "criticality": "high",
  "custom_slos": {
    "availability_target": 0.9995,
    "latency_p95_target_ms": 150,
    "error_rate_target": 0.002
  }
}
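
The override behaviour can be pictured as a simple merge: start from the defaults for the service's criticality, then apply any keys present in custom_slos. The sketch below assumes that model. The default values come from the Criticality Levels list, but the mapping of those defaults onto the custom_slos key names is an assumption, and resolve_targets is a hypothetical helper.

```python
# Defaults per criticality, expressed with the custom_slos key names (assumed mapping).
DEFAULT_TARGETS = {
    "critical": {"availability_target": 0.9999, "latency_p95_target_ms": 100, "error_rate_target": 0.001},
    "high": {"availability_target": 0.999, "latency_p95_target_ms": 200, "error_rate_target": 0.005},
    "medium": {"availability_target": 0.995, "latency_p95_target_ms": 500, "error_rate_target": 0.01},
    "low": {"availability_target": 0.99, "latency_p95_target_ms": 1000, "error_rate_target": 0.02},
}


def resolve_targets(definition):
    """Merge custom_slos over the defaults for the definition's criticality."""
    targets = dict(DEFAULT_TARGETS[definition["criticality"]])
    targets.update(definition.get("custom_slos", {}))
    return targets


if __name__ == "__main__":
    service = {
        "name": "special-service",
        "type": "api",
        "criticality": "high",
        "custom_slos": {"availability_target": 0.9995, "latency_p95_target_ms": 150},
    }
    # error_rate_target is not overridden, so it falls back to the "high" default.
    print(resolve_targets(service))
```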

Alert Rule Templates

Use template variables for reusable alert rules:

# Generated Prometheus alert rule
- alert: {{ service_name }}_HighLatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service="{{ service_name }}"}[5m])) > {{ latency_threshold }}
  for: 5m
  labels:
    severity: warning
    service: "{{ service_name }}"
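
To show how the {{ ... }} placeholders might be filled in, here is a minimal standard-library sketch that substitutes values with a regular expression. The placeholder syntax resembles Jinja2, but whether the scripts use a template engine is not stated here, so treat render_template as a hypothetical stand-in.

```python
import re

# Matches "{{ name }}" with optional whitespace around the identifier.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(template, values):
    """Replace every {{ name }} placeholder; fail loudly on missing values."""
    def substitute(match):
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value for placeholder: {key}")
        return str(values[key])

    return PLACEHOLDER.sub(substitute, template)


RULE_TEMPLATE = (
    "- alert: {{ service_name }}_HighLatency\n"
    "  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket"
    '{service="{{ service_name }}"}[5m])) > {{ latency_threshold }}\n'
    "  for: 5m\n"
)

if __name__ == "__main__":
    print(render_template(RULE_TEMPLATE, {"service_name": "payment", "latency_threshold": 0.5}))
```

Raising on a missing value is deliberate: a rule with an unfilled placeholder would otherwise reach Prometheus as an invalid expression.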

Dashboard Variants

Generate multiple dashboard variants for different use cases:

# SRE operational dashboard
python3 scripts/dashboard_generator.py --input service.json --role sre --output sre-dashboard.json

# Developer debugging dashboard
python3 scripts/dashboard_generator.py --input service.json --role developer --output dev-dashboard.json

# Executive business dashboard
python3 scripts/dashboard_generator.py --input service.json --role executive --output exec-dashboard.json
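
When generating variants for every role, a small driver script avoids repeating the command by hand. The sketch below only builds and prints the command lines, using the --input, --role, and --output flags shown above; the output file naming scheme and the dashboard_commands helper are assumptions.

```python
import shlex

ROLES = ("sre", "developer", "executive", "ops")


def dashboard_commands(service_file, roles=ROLES):
    """Build one dashboard_generator.py invocation per role (not executed here)."""
    return [
        [
            "python3", "scripts/dashboard_generator.py",
            "--input", service_file,
            "--role", role,
            "--output", f"{role}-dashboard.json",  # hypothetical naming scheme
        ]
        for role in roles
    ]


if __name__ == "__main__":
    for command in dashboard_commands("service.json"):
        # To run for real: subprocess.run(command, check=True)
        print(" ".join(shlex.quote(part) for part in command))
```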

Troubleshooting

Common Issues

Script Execution Errors

  • Ensure Python 3.7+ is installed
  • Check file paths and permissions
  • Validate JSON syntax in input files

Invalid Service Definitions

  • Required fields: name, type, criticality
  • Valid service types: api, web, database, queue, batch, ml
  • Valid criticality levels: critical, high, medium, low
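
These rules are easy to check before running any of the scripts. The sketch below is a standalone pre-flight validator based only on the three rules listed above; validate_service_definition is a hypothetical helper, and the scripts may perform additional checks of their own.

```python
import json
import sys

REQUIRED_FIELDS = ("name", "type", "criticality")
VALID_TYPES = {"api", "web", "database", "queue", "batch", "ml"}
VALID_CRITICALITY = {"critical", "high", "medium", "low"}


def validate_service_definition(definition):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = [f"missing required field: {field}"
              for field in REQUIRED_FIELDS if field not in definition]
    if "type" in definition and definition["type"] not in VALID_TYPES:
        errors.append(f"invalid type: {definition['type']!r}")
    if "criticality" in definition and definition["criticality"] not in VALID_CRITICALITY:
        errors.append(f"invalid criticality: {definition['criticality']!r}")
    return errors


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 validate.py service.json
    with open(sys.argv[1], encoding="utf-8") as handle:
        problems = validate_service_definition(json.load(handle))
    for problem in problems:
        print(problem)
    sys.exit(1 if problems else 0)
```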

Missing Historical Data

  • Historical alert data is optional but improves the analysis
  • Include fires_per_day and false_positive_rate when available
  • Use monitoring system APIs to populate historical metrics

Debug Mode

Enable verbose logging by setting the DEBUG environment variable:

export DEBUG=1
python3 scripts/slo_designer.py --input service.json
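
For anyone extending the scripts, the sketch below shows one common way an environment flag like DEBUG can be mapped to a logging level. How the existing scripts interpret DEBUG is not documented here, so the accepted values and the log_level_from_env helper are assumptions.

```python
import logging
import os

FALSE_VALUES = {"", "0", "false", "no"}  # assumed spellings for "off"


def log_level_from_env(environ=None):
    """Return logging.DEBUG when DEBUG is set to a truthy value, else logging.INFO."""
    env = os.environ if environ is None else environ
    value = env.get("DEBUG", "").strip().lower()
    return logging.INFO if value in FALSE_VALUES else logging.DEBUG


if __name__ == "__main__":
    logging.basicConfig(level=log_level_from_env(),
                        format="%(levelname)s %(name)s: %(message)s")
    logging.getLogger("slo_designer").debug("verbose logging enabled")
```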

Contributing

Development Setup

# Clone the repository
git clone <repository-url>
cd engineering/observability-designer

# Run tests
python3 -m pytest tests/

# Lint code
python3 -m flake8 scripts/

Adding New Features

  1. Follow existing code patterns and error handling
  2. Include comprehensive docstrings and type hints
  3. Add test cases for new functionality
  4. Update documentation and examples

Support

For questions, issues, or feature requests:

  • Check existing documentation and examples
  • Review the reference materials in references/
  • Open an issue with detailed reproduction steps
  • Include sample configurations when reporting bugs

This skill is part of the Claude Skills marketplace. For more information about observability best practices, see the reference documentation in the references/ directory.