# Observability Designer
A comprehensive toolkit for designing production-ready observability strategies including SLI/SLO frameworks, alert optimization, and dashboard generation.
## Overview
The Observability Designer skill provides three powerful Python scripts that help you create, optimize, and maintain observability systems:
- **SLO Designer**: Generate complete SLI/SLO frameworks with error budgets and burn rate alerts
- **Alert Optimizer**: Analyze and optimize existing alert configurations to reduce noise and improve effectiveness
- **Dashboard Generator**: Create comprehensive dashboard specifications with role-based layouts and drill-down paths
## Quick Start
### Prerequisites
- Python 3.7+
- No external dependencies required (uses Python standard library only)
### Basic Usage
```bash
# Generate SLO framework for a service
python3 scripts/slo_designer.py --service-type api --criticality critical --user-facing true --service-name payment-service
# Optimize existing alerts
python3 scripts/alert_optimizer.py --input assets/sample_alerts.json --analyze-only
# Generate a dashboard specification
python3 scripts/dashboard_generator.py --service-type web --name "Customer Portal" --role sre
```
## Scripts Documentation
### SLO Designer (`slo_designer.py`)
Generates comprehensive SLO frameworks based on service characteristics.
#### Features
- **Automatic SLI Selection**: Recommends appropriate SLIs based on service type
- **Target Setting**: Suggests SLO targets based on service criticality
- **Error Budget Calculation**: Computes error budgets and burn rate thresholds
- **Multi-Window Burn Rate Alerts**: Generates 4-window burn rate alerting rules
- **SLA Recommendations**: Provides customer-facing SLA guidance
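The multi-window burn-rate pattern follows the Google SRE Workbook approach: a burn rate of 1 consumes the error budget exactly over the SLO period, and higher multipliers over shorter windows catch fast budget burn. The specific windows and multipliers below are illustrative assumptions, not necessarily the script's exact output:

```python
# Sketch of multi-window burn-rate thresholds. Burn rate = observed
# error rate divided by the error budget (1 - SLO target). The window
# and multiplier choices here are assumptions for illustration.
SLO_TARGET = 0.999           # 99.9% availability over 30 days
ERROR_BUDGET = 1 - SLO_TARGET

# (long window, short window, burn-rate multiplier, severity)
WINDOWS = [
    ("1h", "5m",  14.4, "page"),    # ~2% of the 30-day budget in 1h
    ("6h", "30m",  6.0, "page"),    # ~5% of the budget in 6h
    ("1d", "2h",   3.0, "ticket"),  # ~10% of the budget in 1 day
    ("3d", "6h",   1.0, "ticket"),  # ~30% of the budget in 3 days
]

for long_w, short_w, rate, severity in WINDOWS:
    threshold = rate * ERROR_BUDGET  # error-rate level that triggers the alert
    print(f"{severity}: error rate > {threshold:.4%} over {long_w} AND {short_w}")
```

Alerting only when both the long and short window exceed the threshold keeps the alert from re-firing long after the burst has stopped.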
#### Usage Examples
```bash
# From service definition file
python3 scripts/slo_designer.py --input assets/sample_service_api.json --output slo_framework.json
# From command line parameters
python3 scripts/slo_designer.py \
  --service-type api \
  --criticality critical \
  --user-facing true \
  --service-name payment-service \
  --output payment_slos.json
# Generate and display summary only
python3 scripts/slo_designer.py --input assets/sample_service_web.json --summary-only
```
#### Service Definition Format
```json
{
  "name": "payment-service",
  "type": "api",
  "criticality": "critical",
  "user_facing": true,
  "description": "Handles payment processing",
  "team": "payments",
  "environment": "production",
  "dependencies": [
    {
      "name": "user-service",
      "type": "api",
      "criticality": "high"
    }
  ]
}
```
#### Supported Service Types
- **api**: REST APIs, GraphQL services
- **web**: Web applications, SPAs
- **database**: Database services, data stores
- **queue**: Message queues, event streams
- **batch**: Batch processing jobs
- **ml**: Machine learning services
#### Criticality Levels
- **critical**: 99.99% availability, <100ms P95 latency, <0.1% error rate
- **high**: 99.9% availability, <200ms P95 latency, <0.5% error rate
- **medium**: 99.5% availability, <500ms P95 latency, <1% error rate
- **low**: 99% availability, <1s P95 latency, <2% error rate
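The allowed downtime behind each availability target follows directly from the arithmetic; a quick back-of-the-envelope check (not code from the scripts):

```python
# Convert an availability target into an allowed-downtime error budget
# over a 30-day window. Pure arithmetic, shown for illustration.
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200

for name, target in [("critical", 0.9999), ("high", 0.999),
                     ("medium", 0.995), ("low", 0.99)]:
    budget_min = (1 - target) * MINUTES_PER_30_DAYS
    print(f"{name:>8}: {budget_min:,.1f} min of downtime per 30 days")
# critical ~= 4.3 min, high ~= 43.2 min, medium ~= 216 min, low ~= 432 min
```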
### Alert Optimizer (`alert_optimizer.py`)
Analyzes existing alert configurations and provides optimization recommendations.
#### Features
- **Noise Detection**: Identifies alerts with high false positive rates
- **Coverage Analysis**: Finds gaps in monitoring coverage
- **Duplicate Detection**: Locates redundant or overlapping alerts
- **Threshold Analysis**: Reviews alert thresholds for appropriateness
- **Fatigue Assessment**: Evaluates alert volume and routing
#### Usage Examples
```bash
# Analyze existing alerts
python3 scripts/alert_optimizer.py --input assets/sample_alerts.json --analyze-only
# Generate optimized configuration
python3 scripts/alert_optimizer.py \
  --input assets/sample_alerts.json \
  --output optimized_alerts.json
# Generate HTML report
python3 scripts/alert_optimizer.py \
  --input assets/sample_alerts.json \
  --report alert_analysis.html \
  --format html
```
#### Alert Configuration Format
```json
{
  "alerts": [
    {
      "alert": "HighLatency",
      "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5",
      "for": "5m",
      "labels": {
        "severity": "warning",
        "service": "payment-service"
      },
      "annotations": {
        "summary": "High request latency detected",
        "runbook_url": "https://runbooks.company.com/high-latency"
      },
      "historical_data": {
        "fires_per_day": 2.5,
        "false_positive_rate": 0.15
      }
    }
  ],
  "services": [
    {
      "name": "payment-service",
      "criticality": "critical"
    }
  ]
}
```
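One way a noise-detection pass might flag alerts from this format (a sketch, not the optimizer's actual logic; the cutoff values are assumptions):

```python
import json

# Flag alerts whose historical data suggests noise. The 0.1 false-positive
# and 10 fires/day cutoffs are illustrative assumptions, not the
# optimizer's real thresholds.
def find_noisy_alerts(config, max_fp_rate=0.1, max_fires_per_day=10):
    noisy = []
    for alert in config.get("alerts", []):
        hist = alert.get("historical_data", {})
        if (hist.get("false_positive_rate", 0) > max_fp_rate
                or hist.get("fires_per_day", 0) > max_fires_per_day):
            noisy.append(alert["alert"])
    return noisy

config = json.loads("""
{"alerts": [
  {"alert": "HighLatency",
   "historical_data": {"fires_per_day": 2.5, "false_positive_rate": 0.15}},
  {"alert": "DiskFull",
   "historical_data": {"fires_per_day": 0.1, "false_positive_rate": 0.01}}
]}
""")
print(find_noisy_alerts(config))  # HighLatency exceeds the FP-rate cutoff
```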
#### Analysis Categories
- **Golden Signals**: Latency, traffic, errors, saturation
- **Resource Utilization**: CPU, memory, disk, network
- **Business Metrics**: Revenue, conversion, user engagement
- **Security**: Auth failures, suspicious activity
- **Availability**: Uptime, health checks
### Dashboard Generator (`dashboard_generator.py`)
Creates comprehensive dashboard specifications with role-based optimization.
#### Features
- **Role-Based Layouts**: Optimized for SRE, Developer, Executive, and Ops personas
- **Golden Signals Coverage**: Automatic inclusion of key monitoring metrics
- **Service-Type Specific Panels**: Tailored panels based on service characteristics
- **Interactive Elements**: Template variables, drill-down paths, time range controls
- **Grafana Compatibility**: Generates Grafana-compatible JSON
#### Usage Examples
```bash
# From service definition
python3 scripts/dashboard_generator.py \
  --input assets/sample_service_web.json \
  --output dashboard.json
# With specific role optimization
python3 scripts/dashboard_generator.py \
  --service-type api \
  --name "Payment Service" \
  --role developer \
  --output payment_dev_dashboard.json
# Generate Grafana-compatible JSON
python3 scripts/dashboard_generator.py \
  --input assets/sample_service_api.json \
  --output dashboard.json \
  --format grafana
# With documentation
python3 scripts/dashboard_generator.py \
  --service-type web \
  --name "Customer Portal" \
  --output portal_dashboard.json \
  --doc-output portal_docs.md
```
#### Target Roles
- **sre**: Focus on availability, latency, errors, resource utilization
- **developer**: Emphasize latency, errors, throughput, business metrics
- **executive**: Highlight availability, business metrics, user experience
- **ops**: Priority on resource utilization, capacity, alerts, deployments
#### Panel Types
- **Stat**: Single value displays with thresholds
- **Gauge**: Resource utilization and capacity metrics
- **Timeseries**: Trend analysis and historical data
- **Table**: Top N lists and detailed breakdowns
- **Heatmap**: Distribution and correlation analysis
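To make the panel types concrete, here is a minimal stat panel in the Grafana-compatible shape the generator targets. Field names follow Grafana's dashboard JSON model, but treat the exact structure (and the PromQL query) as an assumption about what the script emits:

```python
import json

# Hypothetical minimal "stat" panel in Grafana's dashboard JSON shape.
# The generator's actual output may differ; the metric name and query
# are illustrative.
panel = {
    "type": "stat",
    "title": "Availability (30d)",
    "gridPos": {"x": 0, "y": 0, "w": 6, "h": 4},
    "targets": [{
        "expr": 'sum(rate(http_requests_total{code!~"5.."}[30d]))'
                ' / sum(rate(http_requests_total[30d]))',
        "refId": "A",
    }],
    "fieldConfig": {
        "defaults": {
            "unit": "percentunit",
            "thresholds": {
                "mode": "absolute",
                "steps": [
                    {"color": "red", "value": None},    # base color
                    {"color": "green", "value": 0.999}, # above SLO target
                ],
            },
        }
    },
}
print(json.dumps(panel, indent=2))
```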
## Sample Data
The `assets/` directory contains sample configurations for testing:
- `sample_service_api.json`: Critical API service definition
- `sample_service_web.json`: High-priority web application definition
- `sample_alerts.json`: Alert configuration with optimization opportunities
The `expected_outputs/` directory shows example outputs from each script:
- `sample_slo_framework.json`: Complete SLO framework for API service
- `optimized_alerts.json`: Optimized alert configuration
- `sample_dashboard.json`: SRE dashboard specification
## Best Practices
### SLO Design
- Start with 1-2 SLOs per service and iterate
- Choose SLIs that directly impact user experience
- Set targets based on user needs, not technical capabilities
- Use error budgets to balance reliability and velocity
### Alert Optimization
- Every alert must be actionable
- Alert on symptoms, not causes
- Use multi-window burn rate alerts for SLO protection
- Implement proper escalation and routing policies
### Dashboard Design
- Follow the F-pattern for visual hierarchy
- Use consistent color semantics across dashboards
- Include drill-down paths for effective troubleshooting
- Optimize for the target role's specific needs
## Integration Patterns
### CI/CD Integration
```bash
# Generate SLOs during service onboarding
python3 scripts/slo_designer.py --input service-config.json --output slos.json
# Validate alert configurations in pipeline
python3 scripts/alert_optimizer.py --input alerts.json --analyze-only --report validation.html
# Auto-generate dashboards for new services
python3 scripts/dashboard_generator.py --input service-config.json --format grafana --output dashboard.json
```
### Monitoring Stack Integration
- **Prometheus**: Generated alert rules and recording rules
- **Grafana**: Dashboard JSON for direct import
- **Alertmanager**: Routing and escalation policies
- **PagerDuty**: Escalation configuration
### GitOps Workflow
1. Store service definitions in version control
2. Generate observability configurations in CI/CD
3. Deploy configurations via GitOps
4. Monitor effectiveness and iterate
## Advanced Usage
### Custom SLO Targets
Override default targets by including them in service definitions:
```json
{
  "name": "special-service",
  "type": "api",
  "criticality": "high",
  "custom_slos": {
    "availability_target": 0.9995,
    "latency_p95_target_ms": 150,
    "error_rate_target": 0.002
  }
}
```
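Conceptually the override is a simple merge: per-service `custom_slos` take precedence over the defaults implied by the criticality level. A sketch of that behavior (assumed, not the script's actual code; the default values mirror the "high" tier above):

```python
# Assumed merge behavior: custom_slos override criticality defaults.
DEFAULTS_BY_CRITICALITY = {
    "high": {
        "availability_target": 0.999,
        "latency_p95_target_ms": 200,
        "error_rate_target": 0.005,
    },
}

def resolve_slo_targets(service):
    targets = dict(DEFAULTS_BY_CRITICALITY[service["criticality"]])
    targets.update(service.get("custom_slos", {}))  # overrides win
    return targets

service = {
    "name": "special-service",
    "criticality": "high",
    "custom_slos": {
        "availability_target": 0.9995,
        "latency_p95_target_ms": 150,
        "error_rate_target": 0.002,
    },
}
print(resolve_slo_targets(service))
```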
### Alert Rule Templates
Use template variables for reusable alert rules:
```yaml
# Generated Prometheus alert rule
- alert: {{ service_name }}_HighLatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service="{{ service_name }}"}[5m])) > {{ latency_threshold }}
  for: 5m
  labels:
    severity: warning
    service: "{{ service_name }}"
```
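The `{{ ... }}` placeholders can be filled with a small stdlib substitution pass; a real pipeline might use a template engine such as Jinja2 instead, so treat this as a dependency-free sketch:

```python
import re

# Minimal {{ var }} substitution using only the standard library.
# KeyError is raised if a placeholder has no value, which is usually
# what you want in a CI pipeline.
def render(template, **values):
    return re.sub(r"\{\{\s*(\w+)\s*\}\}",
                  lambda m: str(values[m.group(1)]), template)

rule = "alert: {{ service_name }}_HighLatency, threshold: {{ latency_threshold }}"
print(render(rule, service_name="payment-service", latency_threshold=0.5))
# alert: payment-service_HighLatency, threshold: 0.5
```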
### Dashboard Variants
Generate multiple dashboard variants for different use cases:
```bash
# SRE operational dashboard
python3 scripts/dashboard_generator.py --input service.json --role sre --output sre-dashboard.json
# Developer debugging dashboard
python3 scripts/dashboard_generator.py --input service.json --role developer --output dev-dashboard.json
# Executive business dashboard
python3 scripts/dashboard_generator.py --input service.json --role executive --output exec-dashboard.json
```
## Troubleshooting
### Common Issues
#### Script Execution Errors
- Ensure Python 3.7+ is installed
- Check file paths and permissions
- Validate JSON syntax in input files
#### Invalid Service Definitions
- Required fields: `name`, `type`, `criticality`
- Valid service types: `api`, `web`, `database`, `queue`, `batch`, `ml`
- Valid criticality levels: `critical`, `high`, `medium`, `low`
#### Missing Historical Data
- Alert historical data is optional but improves analysis
- Include `fires_per_day` and `false_positive_rate` when available
- Use monitoring system APIs to populate historical metrics
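Both historical fields are simple ratios over an observation window, so they are easy to derive once you have raw counts from your monitoring system. The numbers below are made up for illustration:

```python
# Derive the optional historical_data fields from raw counts. How you
# obtain the counts depends on your monitoring system; these values
# are invented for the example.
total_fires = 75              # alerts fired over the observation window
days_observed = 30
resolved_without_action = 11  # fired but needed no human intervention

historical_data = {
    "fires_per_day": total_fires / days_observed,                   # 2.5
    "false_positive_rate": resolved_without_action / total_fires,   # ~0.147
}
print(historical_data)
```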
### Debug Mode
Enable verbose logging by setting an environment variable:
```bash
export DEBUG=1
python3 scripts/slo_designer.py --input service.json
```
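One common way a script honors a `DEBUG=1` variable is by gating the log level on it; this is a sketch of the pattern, not the scripts' actual logging setup:

```python
import logging
import os

# Pick the log level from the DEBUG environment variable, matching the
# `export DEBUG=1` convention above. Sketch only; the real scripts may
# configure logging differently.
def log_level():
    return logging.DEBUG if os.environ.get("DEBUG") == "1" else logging.INFO

logging.basicConfig(level=log_level(),
                    format="%(asctime)s %(levelname)s %(message)s")
logging.debug("only visible when DEBUG=1")
logging.info("always visible")
```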
## Contributing
### Development Setup
```bash
# Clone the repository
git clone <repository-url>
cd engineering/observability-designer
# Run tests
python3 -m pytest tests/
# Lint code
python3 -m flake8 scripts/
```
### Adding New Features
1. Follow existing code patterns and error handling
2. Include comprehensive docstrings and type hints
3. Add test cases for new functionality
4. Update documentation and examples
## Support
For questions, issues, or feature requests:
- Check existing documentation and examples
- Review the reference materials in `references/`
- Open an issue with detailed reproduction steps
- Include sample configurations when reporting bugs
---
*This skill is part of the Claude Skills marketplace. For more information about observability best practices, see the reference documentation in the `references/` directory.*