# Observability Designer

A comprehensive toolkit for designing production-ready observability strategies including SLI/SLO frameworks, alert optimization, and dashboard generation.

## Overview

The Observability Designer skill provides three powerful Python scripts that help you create, optimize, and maintain observability systems:

- **SLO Designer**: Generate complete SLI/SLO frameworks with error budgets and burn rate alerts
- **Alert Optimizer**: Analyze and optimize existing alert configurations to reduce noise and improve effectiveness
- **Dashboard Generator**: Create comprehensive dashboard specifications with role-based layouts and drill-down paths

## Quick Start

### Prerequisites

- Python 3.7+
- No external dependencies required (uses Python standard library only)

### Basic Usage

```bash
# Generate SLO framework for a service
python3 scripts/slo_designer.py --service-type api --criticality critical --user-facing true --service-name payment-service

# Analyze existing alerts
python3 scripts/alert_optimizer.py --input assets/sample_alerts.json --analyze-only

# Generate a dashboard specification
python3 scripts/dashboard_generator.py --service-type web --name "Customer Portal" --role sre
```

## Scripts Documentation

### SLO Designer (`slo_designer.py`)

Generates comprehensive SLO frameworks based on service characteristics.

#### Features
- **Automatic SLI Selection**: Recommends appropriate SLIs based on service type
- **Target Setting**: Suggests SLO targets based on service criticality
- **Error Budget Calculation**: Computes error budgets and burn rate thresholds
- **Multi-Window Burn Rate Alerts**: Generates 4-window burn rate alerting rules
- **SLA Recommendations**: Provides customer-facing SLA guidance
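
The relationship between an error budget and a burn rate threshold can be sketched in a few lines of Python. This is an illustration only, not code from `slo_designer.py`: the 30-day period, the four windows, and the budget fractions are assumptions modeled on a common multi-window pattern, and `burn_rate_threshold` and `error_rate_threshold` are hypothetical helper names.

```python
from datetime import timedelta

SLO_PERIOD = timedelta(days=30)  # assumed rolling SLO period

# (alert window, fraction of the error budget consumed within it).
# Assumed values for illustration; the script's real windows may differ.
WINDOWS = [
    (timedelta(hours=1), 0.02),
    (timedelta(hours=6), 0.05),
    (timedelta(days=1), 0.10),
    (timedelta(days=3), 0.10),
]


def burn_rate_threshold(window, budget_fraction, period=SLO_PERIOD):
    """Burn rate at which `budget_fraction` of the budget is spent within `window`."""
    return budget_fraction * (period / window)


def error_rate_threshold(slo_target, burn_rate):
    """Error ratio that corresponds to `burn_rate` for the given SLO target."""
    return burn_rate * (1.0 - slo_target)


for window, fraction in WINDOWS:
    rate = burn_rate_threshold(window, fraction)
    ratio = error_rate_threshold(0.999, rate)
    print(f"{window}: burn rate {rate:.1f}, alert when error ratio > {ratio:.4f}")
```

A burn rate of 1 means the budget would be exactly exhausted at the end of the period, so shorter windows need higher thresholds to be worth paging on.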

#### Usage Examples

```bash
# From service definition file
python3 scripts/slo_designer.py --input assets/sample_service_api.json --output slo_framework.json

# From command line parameters
python3 scripts/slo_designer.py \
  --service-type api \
  --criticality critical \
  --user-facing true \
  --service-name payment-service \
  --output payment_slos.json

# Generate and display summary only
python3 scripts/slo_designer.py --input assets/sample_service_web.json --summary-only
```

#### Service Definition Format

```json
{
  "name": "payment-service",
  "type": "api",
  "criticality": "critical",
  "user_facing": true,
  "description": "Handles payment processing",
  "team": "payments",
  "environment": "production",
  "dependencies": [
    {
      "name": "user-service",
      "type": "api",
      "criticality": "high"
    }
  ]
}
```

#### Supported Service Types
- **api**: REST APIs, GraphQL services
- **web**: Web applications, SPAs
- **database**: Database services, data stores
- **queue**: Message queues, event streams
- **batch**: Batch processing jobs
- **ml**: Machine learning services

#### Criticality Levels
- **critical**: 99.99% availability, <100ms P95 latency, <0.1% error rate
- **high**: 99.9% availability, <200ms P95 latency, <0.5% error rate
- **medium**: 99.5% availability, <500ms P95 latency, <1% error rate
- **low**: 99% availability, <1s P95 latency, <2% error rate
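
To make the availability figures above concrete, the following sketch converts each target into the downtime it permits. It assumes a 30-day period, and `allowed_downtime_minutes` is a hypothetical helper, not a function exposed by the scripts.

```python
# Availability targets copied from the criticality levels listed above.
AVAILABILITY_TARGETS = {"critical": 99.99, "high": 99.9, "medium": 99.5, "low": 99.0}


def allowed_downtime_minutes(availability_pct, period_days=30):
    """Minutes of full downtime permitted by an availability target over the period."""
    budget_fraction = 1.0 - availability_pct / 100.0
    return budget_fraction * period_days * 24 * 60


for level, target in AVAILABILITY_TARGETS.items():
    print(f"{level}: {allowed_downtime_minutes(target):.1f} minutes per 30 days")
```

Under that assumption the budgets are roughly 4.3, 43.2, 216, and 432 minutes, which shows how much each additional "nine" tightens the error budget.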

### Alert Optimizer (`alert_optimizer.py`)

Analyzes existing alert configurations and provides optimization recommendations.

#### Features
- **Noise Detection**: Identifies alerts with high false positive rates
- **Coverage Analysis**: Finds gaps in monitoring coverage
- **Duplicate Detection**: Locates redundant or overlapping alerts
- **Threshold Analysis**: Reviews alert thresholds for appropriateness
- **Fatigue Assessment**: Evaluates alert volume and routing

#### Usage Examples

```bash
# Analyze existing alerts
python3 scripts/alert_optimizer.py --input assets/sample_alerts.json --analyze-only

# Generate optimized configuration
python3 scripts/alert_optimizer.py \
  --input assets/sample_alerts.json \
  --output optimized_alerts.json

# Generate HTML report
python3 scripts/alert_optimizer.py \
  --input assets/sample_alerts.json \
  --report alert_analysis.html \
  --format html
```

#### Alert Configuration Format

```json
{
  "alerts": [
    {
      "alert": "HighLatency",
      "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5",
      "for": "5m",
      "labels": {
        "severity": "warning",
        "service": "payment-service"
      },
      "annotations": {
        "summary": "High request latency detected",
        "runbook_url": "https://runbooks.company.com/high-latency"
      },
      "historical_data": {
        "fires_per_day": 2.5,
        "false_positive_rate": 0.15
      }
    }
  ],
  "services": [
    {
      "name": "payment-service",
      "criticality": "critical"
    }
  ]
}
```
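
As a rough illustration of how the `historical_data` fields in this format could drive noise detection, the sketch below flags alerts whose history looks noisy. The cut-off values and the `find_noisy_alerts` helper are assumptions made for this example; the thresholds that `alert_optimizer.py` actually applies are not documented here.

```python
import json

# Assumed cut-offs for illustration; the optimizer's real thresholds may differ.
MAX_FALSE_POSITIVE_RATE = 0.10
MAX_FIRES_PER_DAY = 5.0


def find_noisy_alerts(config):
    """Return (alert name, reasons) pairs for alerts whose history suggests noise."""
    noisy = []
    for alert in config.get("alerts", []):
        history = alert.get("historical_data") or {}  # the field is optional
        reasons = []
        if history.get("false_positive_rate", 0.0) > MAX_FALSE_POSITIVE_RATE:
            reasons.append("high false positive rate")
        if history.get("fires_per_day", 0.0) > MAX_FIRES_PER_DAY:
            reasons.append("fires too often")
        if reasons:
            noisy.append((alert.get("alert", "<unnamed>"), reasons))
    return noisy


sample = json.loads("""
{"alerts": [{"alert": "HighLatency",
             "historical_data": {"fires_per_day": 2.5, "false_positive_rate": 0.15}}]}
""")
print(find_noisy_alerts(sample))
```

With the sample alert from the format above, only the false positive rate exceeds its assumed cut-off, so `HighLatency` is reported with that single reason.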

#### Analysis Categories
- **Golden Signals**: Latency, traffic, errors, saturation
- **Resource Utilization**: CPU, memory, disk, network
- **Business Metrics**: Revenue, conversion, user engagement
- **Security**: Auth failures, suspicious activity
- **Availability**: Uptime, health checks

### Dashboard Generator (`dashboard_generator.py`)

Creates comprehensive dashboard specifications with role-based optimization.

#### Features
- **Role-Based Layouts**: Optimized for SRE, Developer, Executive, and Ops personas
- **Golden Signals Coverage**: Automatic inclusion of key monitoring metrics
- **Service-Type Specific Panels**: Tailored panels based on service characteristics
- **Interactive Elements**: Template variables, drill-down paths, time range controls
- **Grafana Compatibility**: Generates Grafana-compatible JSON

#### Usage Examples

```bash
# From service definition
python3 scripts/dashboard_generator.py \
  --input assets/sample_service_web.json \
  --output dashboard.json

# With specific role optimization
python3 scripts/dashboard_generator.py \
  --service-type api \
  --name "Payment Service" \
  --role developer \
  --output payment_dev_dashboard.json

# Generate Grafana-compatible JSON
python3 scripts/dashboard_generator.py \
  --input assets/sample_service_api.json \
  --output dashboard.json \
  --format grafana

# With documentation
python3 scripts/dashboard_generator.py \
  --service-type web \
  --name "Customer Portal" \
  --output portal_dashboard.json \
  --doc-output portal_docs.md
```

#### Target Roles

- **sre**: Focus on availability, latency, errors, resource utilization
- **developer**: Emphasize latency, errors, throughput, business metrics
- **executive**: Highlight availability, business metrics, user experience
- **ops**: Priority on resource utilization, capacity, alerts, deployments

#### Panel Types
- **Stat**: Single value displays with thresholds
- **Gauge**: Resource utilization and capacity metrics
- **Timeseries**: Trend analysis and historical data
- **Table**: Top N lists and detailed breakdowns
- **Heatmap**: Distribution and correlation analysis

## Sample Data

The `assets/` directory contains sample configurations for testing:

- `sample_service_api.json`: Critical API service definition
- `sample_service_web.json`: High-priority web application definition
- `sample_alerts.json`: Alert configuration with optimization opportunities

The `expected_outputs/` directory shows example outputs from each script:

- `sample_slo_framework.json`: Complete SLO framework for API service
- `optimized_alerts.json`: Optimized alert configuration
- `sample_dashboard.json`: SRE dashboard specification

## Best Practices

### SLO Design
- Start with 1-2 SLOs per service and iterate
- Choose SLIs that directly impact user experience
- Set targets based on user needs, not technical capabilities
- Use error budgets to balance reliability and velocity

### Alert Optimization
- Every alert must be actionable
- Alert on symptoms, not causes
- Use multi-window burn rate alerts for SLO protection
- Implement proper escalation and routing policies

### Dashboard Design
- Follow the F-pattern for visual hierarchy
- Use consistent color semantics across dashboards
- Include drill-down paths for effective troubleshooting
- Optimize for the target role's specific needs

## Integration Patterns

### CI/CD Integration
```bash
# Generate SLOs during service onboarding
python3 scripts/slo_designer.py --input service-config.json --output slos.json

# Validate alert configurations in pipeline
python3 scripts/alert_optimizer.py --input alerts.json --analyze-only --report validation.html

# Auto-generate dashboards for new services
python3 scripts/dashboard_generator.py --input service-config.json --format grafana --output dashboard.json
```

### Monitoring Stack Integration
- **Prometheus**: Generated alert rules and recording rules
- **Grafana**: Dashboard JSON for direct import
- **Alertmanager**: Routing and escalation policies
- **PagerDuty**: Escalation configuration

### GitOps Workflow
1. Store service definitions in version control
2. Generate observability configurations in CI/CD
3. Deploy configurations via GitOps
4. Monitor effectiveness and iterate

## Advanced Usage

### Custom SLO Targets
Override default targets by including them in service definitions:

```json
{
  "name": "special-service",
  "type": "api",
  "criticality": "high",
  "custom_slos": {
    "availability_target": 0.9995,
    "latency_p95_target_ms": 150,
    "error_rate_target": 0.002
  }
}
```
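
One way to picture the override behavior is a merge of `custom_slos` on top of the per-criticality defaults. The sketch below assumes that merge semantics and reuses the `custom_slos` key names for the defaults; `DEFAULT_TARGETS` and `resolve_targets` are hypothetical and the script's internal logic may differ.

```python
# Defaults transcribed from the Criticality Levels list (upper bounds for
# latency and error rate). The key names mirror `custom_slos` and are assumed.
DEFAULT_TARGETS = {
    "critical": {"availability_target": 0.9999, "latency_p95_target_ms": 100, "error_rate_target": 0.001},
    "high": {"availability_target": 0.999, "latency_p95_target_ms": 200, "error_rate_target": 0.005},
    "medium": {"availability_target": 0.995, "latency_p95_target_ms": 500, "error_rate_target": 0.01},
    "low": {"availability_target": 0.99, "latency_p95_target_ms": 1000, "error_rate_target": 0.02},
}


def resolve_targets(service):
    """Merge `custom_slos` overrides on top of the defaults for the criticality."""
    targets = dict(DEFAULT_TARGETS[service["criticality"]])  # copy, do not mutate defaults
    targets.update(service.get("custom_slos", {}))
    return targets


service = {
    "name": "special-service",
    "type": "api",
    "criticality": "high",
    "custom_slos": {"availability_target": 0.9995},
}
print(resolve_targets(service))
```

Here only the availability target is overridden, so the latency and error rate targets fall back to the `high` defaults.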

### Alert Rule Templates
Use template variables for reusable alert rules:

```yaml
# Generated Prometheus alert rule
- alert: {{ service_name }}_HighLatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service="{{ service_name }}"}[5m])) > {{ latency_threshold }}
  for: 5m
  labels:
    severity: warning
    service: "{{ service_name }}"
```
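
How such a template might be filled in can be sketched with the standard library alone. The document does not say which template engine the scripts use, so the `render` helper below is a hypothetical stand-in that only handles plain `{{ name }}` placeholders, and the values passed in are example data.

```python
import re

TEMPLATE = '''- alert: {{ service_name }}_HighLatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service="{{ service_name }}"}[5m])) > {{ latency_threshold }}
  for: 5m
'''

# Matches {{ name }} with optional surrounding whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, values):
    """Replace each {{ name }} placeholder; raises KeyError if a value is missing."""
    return PLACEHOLDER.sub(lambda match: str(values[match.group(1)]), template)


print(render(TEMPLATE, {"service_name": "payments", "latency_threshold": 0.5}))
```

Failing loudly on a missing variable is a deliberate choice in this sketch, since a silently empty label matcher would produce a rule that matches every service.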

### Dashboard Variants
Generate multiple dashboard variants for different use cases:

```bash
# SRE operational dashboard
python3 scripts/dashboard_generator.py --input service.json --role sre --output sre-dashboard.json

# Developer debugging dashboard
python3 scripts/dashboard_generator.py --input service.json --role developer --output dev-dashboard.json

# Executive business dashboard
python3 scripts/dashboard_generator.py --input service.json --role executive --output exec-dashboard.json
```
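
When many services need the same set of variants, the commands above can be generated rather than typed. The following sketch only builds the argument lists, using the flags already shown in this section; `variant_commands` is a hypothetical helper and the `<role>-dashboard.json` naming is an assumption that differs slightly from the file names used above.

```python
ROLES = ["sre", "developer", "executive", "ops"]  # roles listed under Target Roles


def variant_commands(service_file, roles=ROLES):
    """Build one dashboard_generator.py invocation per role (nothing is executed)."""
    return [
        [
            "python3", "scripts/dashboard_generator.py",
            "--input", service_file,
            "--role", role,
            "--output", f"{role}-dashboard.json",
        ]
        for role in roles
    ]


for cmd in variant_commands("service.json"):
    # To run a command, pass the list to subprocess.run(cmd, check=True).
    print(" ".join(cmd))
```

Keeping each command as a list avoids shell quoting problems if a service file path contains spaces.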

## Troubleshooting

### Common Issues

#### Script Execution Errors
- Ensure Python 3.7+ is installed
- Check file paths and permissions
- Validate JSON syntax in input files

#### Invalid Service Definitions
- Required fields: `name`, `type`, `criticality`
- Valid service types: `api`, `web`, `database`, `queue`, `batch`, `ml`
- Valid criticality levels: `critical`, `high`, `medium`, `low`
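
The rules above are simple enough to check before invoking any script. The sketch below is a standalone pre-flight check built only from the constraints listed in this section; `validate_service` is a hypothetical helper, and the scripts may report problems differently.

```python
REQUIRED_FIELDS = ("name", "type", "criticality")
VALID_TYPES = {"api", "web", "database", "queue", "batch", "ml"}
VALID_CRITICALITY = {"critical", "high", "medium", "low"}


def validate_service(definition):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = [
        f"missing required field: {field}"
        for field in REQUIRED_FIELDS
        if field not in definition
    ]
    if "type" in definition and definition["type"] not in VALID_TYPES:
        problems.append(f"invalid type: {definition['type']!r}")
    if "criticality" in definition and definition["criticality"] not in VALID_CRITICALITY:
        problems.append(f"invalid criticality: {definition['criticality']!r}")
    return problems


print(validate_service({"name": "payment-service", "type": "api", "criticality": "critical"}))
print(validate_service({"name": "jobs", "type": "cron"}))
```

The first call prints an empty list; the second reports the missing `criticality` field and the unsupported `cron` type.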

#### Missing Historical Data
- Alert historical data is optional but improves analysis
- Include `fires_per_day` and `false_positive_rate` when available
- Use monitoring system APIs to populate historical metrics
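
If the monitoring system can export a list of alert firings, the two historical fields can be derived with simple arithmetic. The input shape below (one boolean per firing, `True` when the firing was actionable) is an assumption for this sketch, and `summarize_history` is a hypothetical helper.

```python
def summarize_history(firings, days):
    """Derive `historical_data` values from a list of alert firings.

    `firings` holds one boolean per firing: True if it was actionable,
    False if it was a false positive (assumed input shape).
    """
    if days <= 0:
        raise ValueError("days must be positive")
    total = len(firings)
    false_positives = sum(1 for actionable in firings if not actionable)
    return {
        "fires_per_day": round(total / days, 2),
        "false_positive_rate": round(false_positives / total, 2) if total else 0.0,
    }


# 20 firings over 8 days, 5 of which were false positives.
print(summarize_history([True, True, False, True] * 5, days=8))
```

The resulting dictionary can be placed under `historical_data` for the corresponding alert in the configuration file.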

### Debug Mode
Enable verbose logging by setting the `DEBUG` environment variable:

```bash
export DEBUG=1
python3 scripts/slo_designer.py --input service.json
```

## Contributing

### Development Setup
```bash
# Clone the repository
git clone <repository-url>
cd engineering/observability-designer

# Run tests
python3 -m pytest tests/

# Lint code
python3 -m flake8 scripts/
```

### Adding New Features
1. Follow existing code patterns and error handling
2. Include comprehensive docstrings and type hints
3. Add test cases for new functionality
4. Update documentation and examples

## Support

For questions, issues, or feature requests:
- Check existing documentation and examples
- Review the reference materials in `references/`
- Open an issue with detailed reproduction steps
- Include sample configurations when reporting bugs

---

*This skill is part of the Claude Skills marketplace. For more information about observability best practices, see the reference documentation in the `references/` directory.*