2026-03-12 15:17:52 +07:00

Observability Designer

A toolkit for designing production-ready observability strategies, including SLI/SLO frameworks, alert optimization, and dashboard generation.

Overview

The Observability Designer skill provides three Python scripts for creating, optimizing, and maintaining observability systems:

SLO Designer: Generate complete SLI/SLO frameworks with error budgets and burn rate alerts
Alert Optimizer: Analyze and optimize existing alert configurations to reduce noise and improve effectiveness
Dashboard Generator: Create comprehensive dashboard specifications with role-based layouts and drill-down paths

Quick Start

Prerequisites

Python 3.7+
No external dependencies required (uses Python standard library only)

Basic Usage

# Generate SLO framework for a service
python3 scripts/slo_designer.py --service-type api --criticality critical --user-facing true --service-name payment-service

# Optimize existing alerts
python3 scripts/alert_optimizer.py --input assets/sample_alerts.json --analyze-only

# Generate a dashboard specification
python3 scripts/dashboard_generator.py --service-type web --name "Customer Portal" --role sre

Scripts Documentation

SLO Designer (`slo_designer.py`)

Generates an SLO framework (SLIs, targets, error budgets, and burn rate alerts) from a service's type, criticality, and user-facing status.

Features

Automatic SLI Selection: Recommends appropriate SLIs based on service type
Target Setting: Suggests SLO targets based on service criticality
Error Budget Calculation: Computes error budgets and burn rate thresholds
Multi-Window Burn Rate Alerts: Generates 4-window burn rate alerting rules
SLA Recommendations: Provides customer-facing SLA guidance
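
The error budget and burn rate features listed above rest on simple arithmetic, sketched here. The 30-day period and the budget-fraction and window pairs are assumptions drawn from commonly used multi-window burn rate guidance; they are not confirmed to match what slo_designer.py generates, and all function names are illustrative.

```python
# Illustrative sketch only; names and window choices are assumptions,
# not the actual slo_designer.py implementation.

PERIOD_HOURS = 30 * 24  # assumed 30-day SLO period


def error_budget(slo_target: float) -> float:
    """Fraction of events allowed to fail, e.g. 0.001 for a 99.9% target."""
    return 1.0 - slo_target


def burn_rate_threshold(budget_fraction: float, window_hours: float) -> float:
    """Burn rate that spends `budget_fraction` of the budget in `window_hours`."""
    return budget_fraction * PERIOD_HOURS / window_hours


def error_rate_threshold(slo_target: float, budget_fraction: float,
                         window_hours: float) -> float:
    """Observed error rate at which the alert for this window would fire."""
    return burn_rate_threshold(budget_fraction, window_hours) * error_budget(slo_target)


if __name__ == "__main__":
    # (budget fraction consumed, window in hours): assumed example pairs
    for fraction, hours in [(0.02, 1), (0.05, 6), (0.10, 72)]:
        rate = burn_rate_threshold(fraction, hours)
        limit = error_rate_threshold(0.999, fraction, hours)
        print(f"{hours:>3}h window: burn rate {rate:.1f}, error rate > {limit:.4%}")
```

A burn rate of 1 spends the budget exactly over the full period; higher rates over shorter windows catch fast outages, while lower rates over longer windows catch slow leaks.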

Usage Examples

# From service definition file
python3 scripts/slo_designer.py --input assets/sample_service_api.json --output slo_framework.json

# From command line parameters
python3 scripts/slo_designer.py \
    --service-type api \
    --criticality critical \
    --user-facing true \
    --service-name payment-service \
    --output payment_slos.json

# Generate and display summary only
python3 scripts/slo_designer.py --input assets/sample_service_web.json --summary-only

Service Definition Format

{
  "name": "payment-service",
  "type": "api",
  "criticality": "critical",
  "user_facing": true,
  "description": "Handles payment processing",
  "team": "payments",
  "environment": "production",
  "dependencies": [
    {
      "name": "user-service",
      "type": "api",
      "criticality": "high"
    }
  ]
}

Supported Service Types

api: REST APIs, GraphQL services
web: Web applications, SPAs
database: Database services, data stores
queue: Message queues, event streams
batch: Batch processing jobs
ml: Machine learning services

Criticality Levels

critical: 99.99% availability, <100ms P95 latency, <0.1% error rate
high: 99.9% availability, <200ms P95 latency, <0.5% error rate
medium: 99.5% availability, <500ms P95 latency, <1% error rate
low: 99% availability, <1s P95 latency, <2% error rate
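
To make these availability targets concrete, the sketch below converts each one into allowed downtime. The 30-day period and the function name are assumptions for illustration; the availability figures are the ones listed above.

```python
# Illustrative sketch: availability targets copied from the list above,
# the 30-day period is an assumption.

AVAILABILITY_TARGETS = {
    "critical": 0.9999,
    "high": 0.999,
    "medium": 0.995,
    "low": 0.99,
}


def allowed_downtime_minutes(availability: float, period_days: int = 30) -> float:
    """Minutes of full outage the target tolerates over the period."""
    return (1.0 - availability) * period_days * 24 * 60


if __name__ == "__main__":
    for level, target in AVAILABILITY_TARGETS.items():
        minutes = allowed_downtime_minutes(target)
        print(f"{level:<8} {target:.2%} -> {minutes:.1f} min per 30 days")
```

Under these assumptions a critical service may be down for roughly 4.3 minutes per 30 days, against 7.2 hours for a low-criticality one.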

Alert Optimizer (`alert_optimizer.py`)

Analyzes existing alert configurations and provides optimization recommendations.

Features

Noise Detection: Identifies alerts with high false positive rates
Coverage Analysis: Finds gaps in monitoring coverage
Duplicate Detection: Locates redundant or overlapping alerts
Threshold Analysis: Reviews alert thresholds for appropriateness
Fatigue Assessment: Evaluates alert volume and routing
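
As one way to picture duplicate detection, the sketch below groups alerts whose expressions are identical once whitespace is ignored. How alert_optimizer.py actually detects duplicates is not documented here, so the function name and the normalization rule are assumptions.

```python
# Illustrative sketch; not the alert_optimizer.py implementation.
from collections import defaultdict
from typing import Dict, List


def find_duplicate_alerts(alerts: List[dict]) -> List[List[str]]:
    """Return groups of alert names that share the same normalized expression."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for alert in alerts:
        # Crude normalization: drop all whitespace. This would also alter
        # spaces inside quoted label values, so treat matches as candidates.
        key = "".join(alert["expr"].split())
        groups[key].append(alert["alert"])
    return [names for names in groups.values() if len(names) > 1]


if __name__ == "__main__":
    sample = [
        {"alert": "ServiceDown", "expr": "up == 0"},
        {"alert": "InstanceDown", "expr": "up==0"},
        {"alert": "HighErrors", "expr": "rate(errors_total[5m]) > 1"},
    ]
    print(find_duplicate_alerts(sample))
```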

Usage Examples

# Analyze existing alerts
python3 scripts/alert_optimizer.py --input assets/sample_alerts.json --analyze-only

# Generate optimized configuration
python3 scripts/alert_optimizer.py \
    --input assets/sample_alerts.json \
    --output optimized_alerts.json

# Generate HTML report
python3 scripts/alert_optimizer.py \
    --input assets/sample_alerts.json \
    --report alert_analysis.html \
    --format html

Alert Configuration Format

{
  "alerts": [
    {
      "alert": "HighLatency",
      "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5",
      "for": "5m",
      "labels": {
        "severity": "warning",
        "service": "payment-service"
      },
      "annotations": {
        "summary": "High request latency detected",
        "runbook_url": "https://runbooks.company.com/high-latency"
      },
      "historical_data": {
        "fires_per_day": 2.5,
        "false_positive_rate": 0.15
      }
    }
  ],
  "services": [
    {
      "name": "payment-service",
      "criticality": "critical"
    }
  ]
}
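
The historical_data block is what makes noise analysis possible. The sketch below reads a configuration in the format above and estimates false positives per day for each alert. The 10% noise threshold and both function names are assumptions for illustration, not values taken from alert_optimizer.py.

```python
# Illustrative sketch; threshold and names are assumptions.
from typing import List, Optional


def false_positives_per_day(alert: dict) -> Optional[float]:
    """Estimate daily false positives, or None when history is missing."""
    history = alert.get("historical_data") or {}
    fires = history.get("fires_per_day")
    rate = history.get("false_positive_rate")
    if fires is None or rate is None:
        return None
    return fires * rate


def noisy_alerts(config: dict, max_false_positive_rate: float = 0.10) -> List[str]:
    """Names of alerts whose false positive rate exceeds the assumed threshold."""
    names = []
    for alert in config.get("alerts", []):
        rate = (alert.get("historical_data") or {}).get("false_positive_rate")
        if rate is not None and rate > max_false_positive_rate:
            names.append(alert["alert"])
    return names


if __name__ == "__main__":
    config = {
        "alerts": [
            {
                "alert": "HighLatency",
                "historical_data": {"fires_per_day": 2.5, "false_positive_rate": 0.15},
            }
        ]
    }
    print(noisy_alerts(config))
    print(false_positives_per_day(config["alerts"][0]))
```

Alerts without historical_data are skipped rather than treated as quiet, since missing history says nothing about their noise level.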

Analysis Categories

Golden Signals: Latency, traffic, errors, saturation
Resource Utilization: CPU, memory, disk, network
Business Metrics: Revenue, conversion, user engagement
Security: Auth failures, suspicious activity
Availability: Uptime, health checks

Dashboard Generator (`dashboard_generator.py`)

Creates comprehensive dashboard specifications with role-based optimization.

Features

Role-Based Layouts: Optimized for SRE, Developer, Executive, and Ops personas
Golden Signals Coverage: Automatic inclusion of key monitoring metrics
Service-Type Specific Panels: Tailored panels based on service characteristics
Interactive Elements: Template variables, drill-down paths, time range controls
Grafana Compatibility: Generates Grafana-compatible JSON

Usage Examples

# From service definition
python3 scripts/dashboard_generator.py \
    --input assets/sample_service_web.json \
    --output dashboard.json

# With specific role optimization
python3 scripts/dashboard_generator.py \
    --service-type api \
    --name "Payment Service" \
    --role developer \
    --output payment_dev_dashboard.json

# Generate Grafana-compatible JSON
python3 scripts/dashboard_generator.py \
    --input assets/sample_service_api.json \
    --output dashboard.json \
    --format grafana

# With documentation
python3 scripts/dashboard_generator.py \
    --service-type web \
    --name "Customer Portal" \
    --output portal_dashboard.json \
    --doc-output portal_docs.md

Target Roles

sre: Focus on availability, latency, errors, resource utilization
developer: Emphasize latency, errors, throughput, business metrics
executive: Highlight availability, business metrics, user experience
ops: Prioritize resource utilization, capacity, alerts, deployments

Panel Types

Stat: Single value displays with thresholds
Gauge: Resource utilization and capacity metrics
Timeseries: Trend analysis and historical data
Table: Top N lists and detailed breakdowns
Heatmap: Distribution and correlation analysis

Sample Data

The assets/ directory contains sample configurations for testing:

sample_service_api.json: Critical API service definition
sample_service_web.json: High-priority web application definition
sample_alerts.json: Alert configuration with optimization opportunities

The expected_outputs/ directory shows example outputs from each script:

sample_slo_framework.json: Complete SLO framework for API service
optimized_alerts.json: Optimized alert configuration
sample_dashboard.json: SRE dashboard specification

Best Practices

SLO Design

Start with 1-2 SLOs per service and iterate
Choose SLIs that directly impact user experience
Set targets based on user needs, not technical capabilities
Use error budgets to balance reliability and velocity

Alert Optimization

Every alert must be actionable
Alert on symptoms, not causes
Use multi-window burn rate alerts for SLO protection
Implement proper escalation and routing policies

Dashboard Design

Follow the F-pattern for visual hierarchy
Use consistent color semantics across dashboards
Include drill-down paths for effective troubleshooting
Optimize for the target role's specific needs

Integration Patterns

CI/CD Integration

# Generate SLOs during service onboarding
python3 scripts/slo_designer.py --input service-config.json --output slos.json

# Validate alert configurations in pipeline
python3 scripts/alert_optimizer.py --input alerts.json --analyze-only --report validation.html

# Auto-generate dashboards for new services
python3 scripts/dashboard_generator.py --input service-config.json --format grafana --output dashboard.json

Monitoring Stack Integration

Prometheus: Generated alert rules and recording rules
Grafana: Dashboard JSON for direct import
Alertmanager: Routing and escalation policies
PagerDuty: Escalation configuration

GitOps Workflow

Store service definitions in version control
Generate observability configurations in CI/CD
Deploy configurations via GitOps
Monitor effectiveness and iterate

Advanced Usage

Custom SLO Targets

Override default targets by including them in service definitions:

{
  "name": "special-service",
  "type": "api",
  "criticality": "high",
  "custom_slos": {
    "availability_target": 0.9995,
    "latency_p95_target_ms": 150,
    "error_rate_target": 0.002
  }
}
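
One plausible reading of this override is a per-key merge: each key in custom_slos replaces the default for the service's criticality, and missing keys keep their defaults. The sketch below assumes that behavior, and assumes the defaults mirror the Criticality Levels list; neither is confirmed against slo_designer.py.

```python
# Illustrative sketch; merge semantics and default table are assumptions.

DEFAULT_TARGETS = {
    "critical": {"availability_target": 0.9999, "latency_p95_target_ms": 100, "error_rate_target": 0.001},
    "high": {"availability_target": 0.999, "latency_p95_target_ms": 200, "error_rate_target": 0.005},
    "medium": {"availability_target": 0.995, "latency_p95_target_ms": 500, "error_rate_target": 0.01},
    "low": {"availability_target": 0.99, "latency_p95_target_ms": 1000, "error_rate_target": 0.02},
}


def resolve_targets(service: dict) -> dict:
    """Start from the criticality defaults, then apply any custom_slos keys."""
    targets = dict(DEFAULT_TARGETS[service["criticality"]])
    targets.update(service.get("custom_slos", {}))
    return targets


if __name__ == "__main__":
    service = {
        "name": "special-service",
        "type": "api",
        "criticality": "high",
        "custom_slos": {"availability_target": 0.9995},
    }
    # Only availability is overridden; latency and error rate keep "high" defaults.
    print(resolve_targets(service))
```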

Alert Rule Templates

Use template variables for reusable alert rules:

# Generated Prometheus alert rule
- alert: {{ service_name }}_HighLatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service="{{ service_name }}"}[5m])) > {{ latency_threshold }}
  for: 5m
  labels:
    severity: warning
    service: "{{ service_name }}"
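
The placeholders in the template above can be filled with a small substitution helper. This is a minimal sketch using only the standard library; the render function is hypothetical, and the skill's scripts may use a different templating mechanism.

```python
# Illustrative sketch; `render` is a hypothetical helper, not part of the skill.
import re

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, values: dict) -> str:
    """Replace each {{ name }} placeholder with the matching value."""
    def substitute(match):
        key = match.group(1)
        if key not in values:
            # Fail loudly instead of emitting a rule with an unfilled placeholder.
            raise KeyError(f"no value supplied for placeholder: {key}")
        return str(values[key])

    return PLACEHOLDER.sub(substitute, template)


if __name__ == "__main__":
    template = (
        "- alert: {{ service_name }}_HighLatency\n"
        '  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket'
        '{service="{{ service_name }}"}[5m])) > {{ latency_threshold }}\n'
    )
    print(render(template, {"service_name": "payment-service", "latency_threshold": 0.5}))
```

Single braces, such as the PromQL label matcher, are left untouched because the pattern only matches doubled braces around a plain identifier.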

Dashboard Variants

Generate multiple dashboard variants for different use cases:

# SRE operational dashboard
python3 scripts/dashboard_generator.py --input service.json --role sre --output sre-dashboard.json

# Developer debugging dashboard
python3 scripts/dashboard_generator.py --input service.json --role developer --output dev-dashboard.json

# Executive business dashboard
python3 scripts/dashboard_generator.py --input service.json --role executive --output exec-dashboard.json
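
When many variants are needed, the same commands can be built in a loop. The sketch below only uses the --input, --role, and --output flags shown above; the helper name and the "<role>-dashboard.json" naming scheme are assumptions, so output names differ slightly from the examples.

```python
# Illustrative sketch; helper name and output naming scheme are assumptions.
import shlex
from typing import List

ROLES = ("sre", "developer", "executive", "ops")


def build_commands(service_file: str, roles=ROLES) -> List[List[str]]:
    """Build one dashboard_generator.py invocation per role."""
    return [
        [
            "python3", "scripts/dashboard_generator.py",
            "--input", service_file,
            "--role", role,
            "--output", f"{role}-dashboard.json",
        ]
        for role in roles
    ]


if __name__ == "__main__":
    # Print the commands; pass each list to subprocess.run(cmd, check=True)
    # to execute them from the repository root.
    for cmd in build_commands("service.json"):
        print(" ".join(shlex.quote(part) for part in cmd))
```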

Troubleshooting

Common Issues

Script Execution Errors

Ensure Python 3.7+ is installed
Check file paths and permissions
Validate JSON syntax in input files

Invalid Service Definitions

Required fields: name, type, criticality
Valid service types: api, web, database, queue, batch, ml
Valid criticality levels: critical, high, medium, low
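
These rules can be checked before running any script. The sketch below is a hypothetical pre-flight validator built from the three rules above; it is not the validation code the scripts themselves use.

```python
# Illustrative sketch; a hypothetical pre-flight check, not the scripts' own validation.
import json
from typing import List

REQUIRED_FIELDS = ("name", "type", "criticality")
VALID_TYPES = {"api", "web", "database", "queue", "batch", "ml"}
VALID_CRITICALITY = {"critical", "high", "medium", "low"}


def validate_service_definition(definition: dict) -> List[str]:
    """Return a list of problems; an empty list means the definition passes."""
    errors = [f"missing required field: {f}" for f in REQUIRED_FIELDS if f not in definition]
    if "type" in definition and definition["type"] not in VALID_TYPES:
        errors.append(f"invalid service type: {definition['type']}")
    if "criticality" in definition and definition["criticality"] not in VALID_CRITICALITY:
        errors.append(f"invalid criticality: {definition['criticality']}")
    return errors


if __name__ == "__main__":
    raw = '{"name": "payment-service", "type": "api", "criticality": "urgent"}'
    try:
        definition = json.loads(raw)  # also surfaces JSON syntax errors
    except json.JSONDecodeError as exc:
        print(f"invalid JSON: {exc}")
    else:
        for problem in validate_service_definition(definition):
            print(problem)
```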

Missing Historical Data

Alert historical data is optional but improves analysis
Include fires_per_day and false_positive_rate when available
Use monitoring system APIs to populate historical metrics

Debug Mode

Enable verbose logging by setting the DEBUG environment variable:

export DEBUG=1
python3 scripts/slo_designer.py --input service.json
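
For readers extending the scripts, the sketch below shows one common way a Python tool maps such a variable to a log level. How the skill's scripts actually read DEBUG is not shown in this document, so the function name and the exact check against "1" are assumptions.

```python
# Illustrative sketch; the scripts' real handling of DEBUG may differ.
import logging
import os


def log_level_from_env(environ=None) -> int:
    """Return logging.DEBUG when DEBUG=1 is set, otherwise logging.INFO."""
    environ = os.environ if environ is None else environ
    return logging.DEBUG if environ.get("DEBUG") == "1" else logging.INFO


if __name__ == "__main__":
    logging.basicConfig(level=log_level_from_env())
    logging.debug("verbose logging enabled")
    logging.info("starting")
```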

Contributing

Development Setup

# Clone the repository
git clone <repository-url>
cd engineering/observability-designer

# Run tests
python3 -m pytest tests/

# Lint code
python3 -m flake8 scripts/

Adding New Features

Follow existing code patterns and error handling
Include comprehensive docstrings and type hints
Add test cases for new functionality
Update documentation and examples

Support

For questions, issues, or feature requests:

Check existing documentation and examples
Review the reference materials in references/
Open an issue with detailed reproduction steps
Include sample configurations when reporting bugs

This skill is part of the Claude Skills marketplace. For more information about observability best practices, see the reference documentation in the references/ directory.
