# Observability Designer

A toolkit for designing production-ready observability strategies, including SLI/SLO frameworks, alert optimization, and dashboard generation.

## Overview

The Observability Designer skill provides three Python scripts for creating, optimizing, and maintaining observability systems:

- **SLO Designer**: Generate complete SLI/SLO frameworks with error budgets and burn rate alerts
- **Alert Optimizer**: Analyze and optimize existing alert configurations to reduce noise and improve effectiveness
- **Dashboard Generator**: Create comprehensive dashboard specifications with role-based layouts and drill-down paths

## Quick Start

### Prerequisites

- Python 3.7+
- No external dependencies required (uses Python standard library only)

### Basic Usage

```bash
# Generate SLO framework for a service
python3 scripts/slo_designer.py --service-type api --criticality critical --user-facing true --service-name payment-service

# Optimize existing alerts
python3 scripts/alert_optimizer.py --input assets/sample_alerts.json --analyze-only

# Generate a dashboard specification
python3 scripts/dashboard_generator.py --service-type web --name "Customer Portal" --role sre
```

## Scripts Documentation

### SLO Designer (`slo_designer.py`)

Generates comprehensive SLO frameworks based on service characteristics.

#### Features
- **Automatic SLI Selection**: Recommends appropriate SLIs based on service type
- **Target Setting**: Suggests SLO targets based on service criticality
- **Error Budget Calculation**: Computes error budgets and burn rate thresholds
- **Multi-Window Burn Rate Alerts**: Generates 4-window burn rate alerting rules
- **SLA Recommendations**: Provides customer-facing SLA guidance
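
To make the burn rate feature more concrete, here is a minimal sketch of the arithmetic behind multi-window burn rate thresholds. It is not taken from `slo_designer.py`: the function names, the 30-day SLO period, and the window/budget pairs (which follow commonly cited SRE guidance) are assumptions for illustration, and the script's own windows may differ.

```python
# Illustrative sketch only; slo_designer.py may compute these differently.

def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        period_hours: float = 30 * 24) -> float:
    """Burn rate at which `budget_fraction` of the error budget is consumed
    within `window_hours`, for an SLO measured over `period_hours`."""
    return budget_fraction * period_hours / window_hours


# Hypothetical (window_hours, fraction of budget consumed) pairs.
WINDOWS = [(1, 0.02), (6, 0.05), (72, 0.10)]


def alert_error_rates(slo_target: float) -> dict:
    """Map each window (in hours) to the error rate that should trigger its alert."""
    budget = 1.0 - slo_target  # e.g. 0.999 -> 0.001
    return {w: burn_rate_threshold(f, w) * budget for w, f in WINDOWS}


if __name__ == "__main__":
    for window, rate in alert_error_rates(0.999).items():
        print(f"{window:>3}h window: alert when error rate exceeds {rate:.4%}")
```

A burn rate of 1 means the budget lasts exactly the full SLO period; higher rates over shorter windows catch fast-moving incidents, lower rates over longer windows catch slow leaks.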

#### Usage Examples

```bash
# From service definition file
python3 scripts/slo_designer.py --input assets/sample_service_api.json --output slo_framework.json

# From command line parameters
python3 scripts/slo_designer.py \
    --service-type api \
    --criticality critical \
    --user-facing true \
    --service-name payment-service \
    --output payment_slos.json

# Generate and display summary only
python3 scripts/slo_designer.py --input assets/sample_service_web.json --summary-only
```

#### Service Definition Format

```json
{
  "name": "payment-service",
  "type": "api",
  "criticality": "critical",
  "user_facing": true,
  "description": "Handles payment processing",
  "team": "payments",
  "environment": "production",
  "dependencies": [
    {
      "name": "user-service",
      "type": "api",
      "criticality": "high"
    }
  ]
}
```

#### Supported Service Types
- **api**: REST APIs, GraphQL services
- **web**: Web applications, SPAs
- **database**: Database services, data stores
- **queue**: Message queues, event streams
- **batch**: Batch processing jobs
- **ml**: Machine learning services

#### Criticality Levels
- **critical**: 99.99% availability, <100ms P95 latency, <0.1% error rate
- **high**: 99.9% availability, <200ms P95 latency, <0.5% error rate
- **medium**: 99.5% availability, <500ms P95 latency, <1% error rate
- **low**: 99% availability, <1s P95 latency, <2% error rate
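
As a rough guide to what these availability targets mean in practice, the sketch below converts each one into the downtime its error budget allows. The 30-day period and the function name are assumptions; the targets are copied from the list above.

```python
# Availability targets copied from the criticality levels above.
AVAILABILITY = {"critical": 0.9999, "high": 0.999, "medium": 0.995, "low": 0.99}


def allowed_downtime_minutes(availability: float, period_days: int = 30) -> float:
    """Minutes of full downtime the error budget permits over the period."""
    return (1.0 - availability) * period_days * 24 * 60


for level, target in AVAILABILITY.items():
    print(f"{level:>8}: {allowed_downtime_minutes(target):6.1f} min per 30 days")
```

Over 30 days this works out to roughly 4.3 minutes for `critical`, 43 minutes for `high`, 3.6 hours for `medium`, and 7.2 hours for `low`.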

### Alert Optimizer (`alert_optimizer.py`)

Analyzes existing alert configurations and provides optimization recommendations.

#### Features
- **Noise Detection**: Identifies alerts with high false positive rates
- **Coverage Analysis**: Finds gaps in monitoring coverage
- **Duplicate Detection**: Locates redundant or overlapping alerts
- **Threshold Analysis**: Reviews alert thresholds for appropriateness
- **Fatigue Assessment**: Evaluates alert volume and routing
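
The sketch below shows one simple way duplicate detection could work: group alerts whose expressions are identical once whitespace is normalized. It assumes each alert is a dict with `alert` and `expr` keys; the helper names are hypothetical, and `alert_optimizer.py` may use a more sophisticated comparison.

```python
import re
from collections import defaultdict


def normalize_expr(expr: str) -> str:
    """Collapse runs of whitespace so cosmetically different expressions compare equal."""
    return re.sub(r"\s+", " ", expr).strip()


def find_duplicates(alerts: list) -> list:
    """Return lists of alert names that share the same normalized expression."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[normalize_expr(alert["expr"])].append(alert["alert"])
    return [names for names in groups.values() if len(names) > 1]


alerts = [
    {"alert": "HighLatency", "expr": "latency_p95 > 0.5"},
    {"alert": "SlowRequests", "expr": "latency_p95   >  0.5 "},
    {"alert": "HighErrors", "expr": "error_rate > 0.01"},
]
print(find_duplicates(alerts))
```

This only catches textually equivalent expressions. Overlapping alerts with different thresholds or label selectors would need a deeper comparison.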

#### Usage Examples

```bash
# Analyze existing alerts
python3 scripts/alert_optimizer.py --input assets/sample_alerts.json --analyze-only

# Generate optimized configuration
python3 scripts/alert_optimizer.py \
    --input assets/sample_alerts.json \
    --output optimized_alerts.json

# Generate HTML report
python3 scripts/alert_optimizer.py \
    --input assets/sample_alerts.json \
    --report alert_analysis.html \
    --format html
```

#### Alert Configuration Format

```json
{
  "alerts": [
    {
      "alert": "HighLatency",
      "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5",
      "for": "5m",
      "labels": {
        "severity": "warning",
        "service": "payment-service"
      },
      "annotations": {
        "summary": "High request latency detected",
        "runbook_url": "https://runbooks.company.com/high-latency"
      },
      "historical_data": {
        "fires_per_day": 2.5,
        "false_positive_rate": 0.15
      }
    }
  ],
  "services": [
    {
      "name": "payment-service",
      "criticality": "critical"
    }
  ]
}
```
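
To illustrate how the optional `historical_data` block can feed noise detection, here is a minimal sketch that flags alerts from a configuration in the format above. The thresholds and function name are hypothetical and not the values `alert_optimizer.py` uses.

```python
# Hypothetical thresholds; tune them to your own tolerance for noise.
MAX_FALSE_POSITIVE_RATE = 0.10
MAX_FIRES_PER_DAY = 5.0


def noisy_alerts(config: dict) -> list:
    """Return (alert name, reason) pairs for alerts whose history looks noisy."""
    findings = []
    for alert in config.get("alerts", []):
        # historical_data is optional, so fall back to an empty dict.
        history = alert.get("historical_data") or {}
        fp_rate = history.get("false_positive_rate")
        fires = history.get("fires_per_day")
        if fp_rate is not None and fp_rate > MAX_FALSE_POSITIVE_RATE:
            findings.append((alert["alert"], f"false positive rate {fp_rate:.0%}"))
        if fires is not None and fires > MAX_FIRES_PER_DAY:
            findings.append((alert["alert"], f"fires {fires} times per day"))
    return findings


config = {
    "alerts": [
        {
            "alert": "HighLatency",
            "historical_data": {"fires_per_day": 2.5, "false_positive_rate": 0.15},
        },
        {"alert": "DiskFull"},  # no history: skipped rather than flagged
    ]
}
print(noisy_alerts(config))
```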

#### Analysis Categories
- **Golden Signals**: Latency, traffic, errors, saturation
- **Resource Utilization**: CPU, memory, disk, network
- **Business Metrics**: Revenue, conversion, user engagement
- **Security**: Auth failures, suspicious activity
- **Availability**: Uptime, health checks

### Dashboard Generator (`dashboard_generator.py`)

Creates comprehensive dashboard specifications with role-based optimization.

#### Features
- **Role-Based Layouts**: Optimized for SRE, Developer, Executive, and Ops personas
- **Golden Signals Coverage**: Automatic inclusion of key monitoring metrics
- **Service-Type Specific Panels**: Tailored panels based on service characteristics
- **Interactive Elements**: Template variables, drill-down paths, time range controls
- **Grafana Compatibility**: Generates Grafana-compatible JSON

#### Usage Examples

```bash
# From service definition
python3 scripts/dashboard_generator.py \
    --input assets/sample_service_web.json \
    --output dashboard.json

# With specific role optimization
python3 scripts/dashboard_generator.py \
    --service-type api \
    --name "Payment Service" \
    --role developer \
    --output payment_dev_dashboard.json

# Generate Grafana-compatible JSON
python3 scripts/dashboard_generator.py \
    --input assets/sample_service_api.json \
    --output dashboard.json \
    --format grafana

# With documentation
python3 scripts/dashboard_generator.py \
    --service-type web \
    --name "Customer Portal" \
    --output portal_dashboard.json \
    --doc-output portal_docs.md
```

#### Target Roles

- **sre**: Focuses on availability, latency, errors, and resource utilization
- **developer**: Emphasizes latency, errors, throughput, and business metrics
- **executive**: Highlights availability, business metrics, and user experience
- **ops**: Prioritizes resource utilization, capacity, alerts, and deployments

#### Panel Types
- **Stat**: Single value displays with thresholds
- **Gauge**: Resource utilization and capacity metrics
- **Timeseries**: Trend analysis and historical data
- **Table**: Top N lists and detailed breakdowns
- **Heatmap**: Distribution and correlation analysis
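
For orientation, the sketch below builds a small Grafana-style dashboard dict with a stat panel and a timeseries panel. It is not the output of `dashboard_generator.py`: the helper names, panel titles, and the `http_requests_total` metric are assumptions. The `gridPos` and `targets` fields follow the Grafana dashboard JSON model.

```python
import json


def panel(panel_id, title, panel_type, expr, x, y, w=12, h=8):
    """Build one panel; gridPos uses Grafana's 24-column layout grid."""
    return {
        "id": panel_id,
        "title": title,
        "type": panel_type,
        "gridPos": {"x": x, "y": y, "w": w, "h": h},
        "targets": [{"expr": expr, "refId": "A"}],
    }


def build_dashboard(service: str) -> dict:
    """Assemble a two-panel dashboard for one service (hypothetical layout)."""
    selector = f'service="{service}"'
    return {
        "title": f"{service} overview",
        "panels": [
            panel(1, "Request rate", "stat",
                  f"sum(rate(http_requests_total{{{selector}}}[5m]))", x=0, y=0),
            panel(2, "P95 latency", "timeseries",
                  "histogram_quantile(0.95, sum by (le) "
                  f"(rate(http_request_duration_seconds_bucket{{{selector}}}[5m])))",
                  x=12, y=0),
        ],
    }


print(json.dumps(build_dashboard("payment-service"), indent=2))
```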

## Sample Data

The `assets/` directory contains sample configurations for testing:

- `sample_service_api.json`: Critical API service definition
- `sample_service_web.json`: High-priority web application definition
- `sample_alerts.json`: Alert configuration with optimization opportunities

The `expected_outputs/` directory shows example outputs from each script:

- `sample_slo_framework.json`: Complete SLO framework for API service
- `optimized_alerts.json`: Optimized alert configuration
- `sample_dashboard.json`: SRE dashboard specification

## Best Practices

### SLO Design
- Start with 1-2 SLOs per service and iterate
- Choose SLIs that directly impact user experience
- Set targets based on user needs, not technical capabilities
- Use error budgets to balance reliability and velocity

### Alert Optimization
- Every alert must be actionable
- Alert on symptoms, not causes
- Use multi-window burn rate alerts for SLO protection
- Implement proper escalation and routing policies

### Dashboard Design
- Follow the F-pattern for visual hierarchy
- Use consistent color semantics across dashboards
- Include drill-down paths for effective troubleshooting
- Optimize for the target role's specific needs

## Integration Patterns

### CI/CD Integration
```bash
# Generate SLOs during service onboarding
python3 scripts/slo_designer.py --input service-config.json --output slos.json

# Validate alert configurations in pipeline
python3 scripts/alert_optimizer.py --input alerts.json --analyze-only --report validation.html --format html

# Auto-generate dashboards for new services
python3 scripts/dashboard_generator.py --input service-config.json --format grafana --output dashboard.json
```

### Monitoring Stack Integration
- **Prometheus**: Generated alert rules and recording rules
- **Grafana**: Dashboard JSON for direct import
- **Alertmanager**: Routing and escalation policies
- **PagerDuty**: Escalation configuration

### GitOps Workflow
1. Store service definitions in version control
2. Generate observability configurations in CI/CD
3. Deploy configurations via GitOps
4. Monitor effectiveness and iterate

## Advanced Usage

### Custom SLO Targets
Override the default targets by adding a `custom_slos` block to the service definition:

```json
{
  "name": "special-service",
  "type": "api",
  "criticality": "high",
  "custom_slos": {
    "availability_target": 0.9995,
    "latency_p95_target_ms": 150,
    "error_rate_target": 0.002
  }
}
```
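
One plausible way to apply such overrides is to start from the defaults for the service's criticality and layer `custom_slos` on top. The sketch below assumes this merge behavior and uses a hypothetical `resolve_targets` helper; the default values mirror the Criticality Levels section, and the key names come from the example above.

```python
# Defaults taken from the Criticality Levels section of this README.
DEFAULT_TARGETS = {
    "critical": {"availability_target": 0.9999, "latency_p95_target_ms": 100, "error_rate_target": 0.001},
    "high": {"availability_target": 0.999, "latency_p95_target_ms": 200, "error_rate_target": 0.005},
    "medium": {"availability_target": 0.995, "latency_p95_target_ms": 500, "error_rate_target": 0.01},
    "low": {"availability_target": 0.99, "latency_p95_target_ms": 1000, "error_rate_target": 0.02},
}


def resolve_targets(service: dict) -> dict:
    """Return the criticality defaults with any custom_slos overrides applied."""
    targets = dict(DEFAULT_TARGETS[service["criticality"]])
    overrides = service.get("custom_slos", {})
    unknown = set(overrides) - set(targets)
    if unknown:
        # Reject typos instead of silently ignoring them.
        raise ValueError(f"Unknown custom_slos keys: {sorted(unknown)}")
    targets.update(overrides)
    return targets


service = {
    "name": "special-service",
    "type": "api",
    "criticality": "high",
    "custom_slos": {"availability_target": 0.9995, "latency_p95_target_ms": 150},
}
print(resolve_targets(service))
```

In this sketch, keys omitted from `custom_slos` (here `error_rate_target`) keep the default for the declared criticality.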

### Alert Rule Templates
Use template variables for reusable alert rules:

```yaml
# Generated Prometheus alert rule
- alert: {{ service_name }}_HighLatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service="{{ service_name }}"}[5m])) > {{ latency_threshold }}
  for: 5m
  labels:
    severity: warning
    service: "{{ service_name }}"
```
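
If you want to fill in such a template yourself without a templating library, a small regex substitution is enough for simple `{{ name }}` placeholders. This is a sketch under that assumption; the `render` helper is hypothetical and does not reflect how the scripts render rules internally.

```python
import re

# Matches {{ name }} with optional inner whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, values: dict) -> str:
    """Replace each {{ name }} placeholder; raises KeyError if a value is missing."""
    return PLACEHOLDER.sub(lambda m: str(values[m.group(1)]), template)


RULE_TEMPLATE = """\
- alert: {{ service_name }}_HighLatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service="{{ service_name }}"}[5m])) > {{ latency_threshold }}
  for: 5m
  labels:
    severity: warning
    service: "{{ service_name }}"
"""

print(render(RULE_TEMPLATE, {"service_name": "checkout", "latency_threshold": 0.5}))
```

The single braces in the PromQL label selector are left untouched because the pattern only matches doubled braces around a bare identifier.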

### Dashboard Variants
Generate multiple dashboard variants for different use cases:

```bash
# SRE operational dashboard
python3 scripts/dashboard_generator.py --input service.json --role sre --output sre-dashboard.json

# Developer debugging dashboard
python3 scripts/dashboard_generator.py --input service.json --role developer --output dev-dashboard.json

# Executive business dashboard
python3 scripts/dashboard_generator.py --input service.json --role executive --output exec-dashboard.json
```

## Troubleshooting

### Common Issues

#### Script Execution Errors
- Ensure Python 3.7+ is installed
- Check file paths and permissions
- Validate JSON syntax in input files

#### Invalid Service Definitions
- Required fields: `name`, `type`, `criticality`
- Valid service types: `api`, `web`, `database`, `queue`, `batch`, `ml`
- Valid criticality levels: `critical`, `high`, `medium`, `low`
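
A quick pre-flight check along these lines can catch invalid definitions before running the scripts. This is a minimal sketch based only on the rules listed above; `validate_service` is a hypothetical helper, not part of the toolkit.

```python
REQUIRED_FIELDS = ("name", "type", "criticality")
VALID_TYPES = {"api", "web", "database", "queue", "batch", "ml"}
VALID_CRITICALITY = {"critical", "high", "medium", "low"}


def validate_service(service: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = [f"missing required field: {field}"
              for field in REQUIRED_FIELDS if field not in service]
    if "type" in service and service["type"] not in VALID_TYPES:
        errors.append(f"invalid type: {service['type']!r}")
    if "criticality" in service and service["criticality"] not in VALID_CRITICALITY:
        errors.append(f"invalid criticality: {service['criticality']!r}")
    return errors


print(validate_service({"name": "payment-service", "type": "api", "criticality": "critical"}))
print(validate_service({"name": "cache", "type": "redis"}))
```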

#### Missing Historical Data
- Alert historical data is optional but improves analysis
- Include `fires_per_day` and `false_positive_rate` when available
- Use monitoring system APIs to populate historical metrics
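
If your monitoring system can export a list of past firings, the two `historical_data` fields can be derived from it. The sketch below assumes a hypothetical record format in which each firing carries a boolean `actionable` flag; adapt it to whatever your system actually returns.

```python
def summarize_history(firings: list, days: float) -> dict:
    """Derive historical_data from firing records observed over `days` days.

    Each record is assumed to be a dict with a boolean 'actionable' key.
    """
    if days <= 0:
        raise ValueError("days must be positive")
    total = len(firings)
    false_positives = sum(1 for firing in firings if not firing["actionable"])
    return {
        "fires_per_day": round(total / days, 2),
        "false_positive_rate": round(false_positives / total, 2) if total else 0.0,
    }


# 20 firings over 8 days, 3 of which needed no action.
firings = [{"actionable": i >= 3} for i in range(20)]
print(summarize_history(firings, days=8))
```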

### Debug Mode
Enable verbose logging by setting the `DEBUG` environment variable:

```bash
export DEBUG=1
python3 scripts/slo_designer.py --input service.json
```
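
When extending the scripts, new code can honor the same variable with the standard `logging` module. How the existing scripts read `DEBUG` is not documented here, so treat the following as an assumed pattern; `configure_logging` and the accepted truthy values are hypothetical.

```python
import logging
import os


def configure_logging(env=None) -> int:
    """Set the root log level from DEBUG and return the chosen level."""
    env = os.environ if env is None else env
    debug = env.get("DEBUG", "").strip().lower() in {"1", "true", "yes"}
    level = logging.DEBUG if debug else logging.INFO
    logging.basicConfig(level=level, format="%(levelname)s %(name)s: %(message)s")
    return level


if __name__ == "__main__":
    configure_logging()
    logging.getLogger("slo_designer").debug("verbose logging enabled")
```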

## Contributing

### Development Setup
```bash
# Clone the repository
git clone <repository-url>
cd engineering/observability-designer

# Run tests
python3 -m pytest tests/

# Lint code
python3 -m flake8 scripts/
```

### Adding New Features
1. Follow existing code patterns and error handling
2. Include comprehensive docstrings and type hints
3. Add test cases for new functionality
4. Update documentation and examples

## Support

For questions, issues, or feature requests:
- Check existing documentation and examples
- Review the reference materials in `references/`
- Open an issue with detailed reproduction steps
- Include sample configurations when reporting bugs

---

*This skill is part of the Claude Skills marketplace. For more information about observability best practices, see the reference documentation in the `references/` directory.*