# Observability Designer
A comprehensive toolkit for designing production-ready observability strategies including SLI/SLO frameworks, alert optimization, and dashboard generation.
## Overview

The Observability Designer skill provides three Python scripts for creating, optimizing, and maintaining observability systems:
- SLO Designer: Generate complete SLI/SLO frameworks with error budgets and burn rate alerts
- Alert Optimizer: Analyze and optimize existing alert configurations to reduce noise and improve effectiveness
- Dashboard Generator: Create comprehensive dashboard specifications with role-based layouts and drill-down paths
## Quick Start

### Prerequisites

- Python 3.7+
- No external dependencies required (uses the Python standard library only)

### Basic Usage

```bash
# Generate an SLO framework for a service
python3 scripts/slo_designer.py --service-type api --criticality critical --user-facing true --service-name payment-service

# Optimize existing alerts
python3 scripts/alert_optimizer.py --input assets/sample_alerts.json --analyze-only

# Generate a dashboard specification
python3 scripts/dashboard_generator.py --service-type web --name "Customer Portal" --role sre
```
## Scripts Documentation

### SLO Designer (`slo_designer.py`)

Generates comprehensive SLO frameworks based on service characteristics.

#### Features
- Automatic SLI Selection: Recommends appropriate SLIs based on service type
- Target Setting: Suggests SLO targets based on service criticality
- Error Budget Calculation: Computes error budgets and burn rate thresholds
- Multi-Window Burn Rate Alerts: Generates 4-window burn rate alerting rules
- SLA Recommendations: Provides customer-facing SLA guidance
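Multi-window burn-rate alerting rests on a simple relationship between burn rate and budget exhaustion time: a burn rate of 1 consumes the error budget exactly over the SLO window, while a fast-burn rate like 14.4 drains a 30-day budget in about two days. A minimal sketch of that arithmetic (the actual thresholds generated by `slo_designer.py` may differ):

```python
# Burn rate = (observed error rate) / (error rate allowed by the SLO).
# At a sustained burn rate, the error budget runs out in window/rate days.

def budget_exhaustion_days(burn_rate: float, window_days: int = 30) -> float:
    """Days until the error budget is fully consumed at a sustained burn rate."""
    return window_days / burn_rate

print(round(budget_exhaustion_days(14.4), 2))  # fast-burn alert: ~2.08 days
print(round(budget_exhaustion_days(1.0), 2))   # on-plan burn: 30.0 days
```

This is why common multi-window setups pair a high-threshold short window (to page quickly on fast burns) with a low-threshold long window (to catch slow leaks).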
#### Usage Examples

```bash
# From a service definition file
python3 scripts/slo_designer.py --input assets/sample_service_api.json --output slo_framework.json

# From command-line parameters
python3 scripts/slo_designer.py \
  --service-type api \
  --criticality critical \
  --user-facing true \
  --service-name payment-service \
  --output payment_slos.json

# Generate and display the summary only
python3 scripts/slo_designer.py --input assets/sample_service_web.json --summary-only
```
#### Service Definition Format

```json
{
  "name": "payment-service",
  "type": "api",
  "criticality": "critical",
  "user_facing": true,
  "description": "Handles payment processing",
  "team": "payments",
  "environment": "production",
  "dependencies": [
    {
      "name": "user-service",
      "type": "api",
      "criticality": "high"
    }
  ]
}
```
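A definition like the one above can be sanity-checked before it is fed to the scripts. This is an illustrative pre-flight helper, not the scripts' own validation (which may be stricter); the field names and allowed values come from the format documented here:

```python
# Minimal pre-flight validation of a service definition (illustrative only).
import json

REQUIRED_FIELDS = {"name", "type", "criticality"}
SERVICE_TYPES = {"api", "web", "database", "queue", "batch", "ml"}
CRITICALITY_LEVELS = {"critical", "high", "medium", "low"}

def validate_definition(definition):
    """Return a list of validation errors; an empty list means valid."""
    errors = []
    missing = REQUIRED_FIELDS - definition.keys()
    if missing:
        errors.append("missing fields: %s" % sorted(missing))
    if definition.get("type") not in SERVICE_TYPES:
        errors.append("unknown service type: %r" % definition.get("type"))
    if definition.get("criticality") not in CRITICALITY_LEVELS:
        errors.append("unknown criticality: %r" % definition.get("criticality"))
    return errors

doc = json.loads('{"name": "payment-service", "type": "api", "criticality": "critical"}')
print(validate_definition(doc))  # []
```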
#### Supported Service Types

- `api`: REST APIs, GraphQL services
- `web`: Web applications, SPAs
- `database`: Database services, data stores
- `queue`: Message queues, event streams
- `batch`: Batch processing jobs
- `ml`: Machine learning services
#### Criticality Levels

- `critical`: 99.99% availability, <100ms P95 latency, <0.1% error rate
- `high`: 99.9% availability, <200ms P95 latency, <0.5% error rate
- `medium`: 99.5% availability, <500ms P95 latency, <1% error rate
- `low`: 99% availability, <1s P95 latency, <2% error rate
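The availability tiers above translate directly into concrete downtime budgets. A quick sketch of the math (illustrative; `slo_designer.py` computes its own error budgets):

```python
# Allowed downtime = window length x (1 - availability target).

def downtime_budget_minutes(availability_target, window_days=28):
    """Allowed minutes of downtime in a rolling window."""
    return window_days * 24 * 60 * (1 - availability_target)

for level, target in [("critical", 0.9999), ("high", 0.999),
                      ("medium", 0.995), ("low", 0.99)]:
    print("%-8s %.1f min / 28 days" % (level, downtime_budget_minutes(target)))
```

A "critical" 99.99% target leaves only about 4 minutes of downtime per 28-day window, which is why that tier pairs with aggressive burn-rate alerting.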
### Alert Optimizer (`alert_optimizer.py`)

Analyzes existing alert configurations and provides optimization recommendations.

#### Features
- Noise Detection: Identifies alerts with high false positive rates
- Coverage Analysis: Finds gaps in monitoring coverage
- Duplicate Detection: Locates redundant or overlapping alerts
- Threshold Analysis: Reviews alert thresholds for appropriateness
- Fatigue Assessment: Evaluates alert volume and routing
#### Usage Examples

```bash
# Analyze existing alerts
python3 scripts/alert_optimizer.py --input assets/sample_alerts.json --analyze-only

# Generate an optimized configuration
python3 scripts/alert_optimizer.py \
  --input assets/sample_alerts.json \
  --output optimized_alerts.json

# Generate an HTML report
python3 scripts/alert_optimizer.py \
  --input assets/sample_alerts.json \
  --report alert_analysis.html \
  --format html
```
#### Alert Configuration Format

```json
{
  "alerts": [
    {
      "alert": "HighLatency",
      "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5",
      "for": "5m",
      "labels": {
        "severity": "warning",
        "service": "payment-service"
      },
      "annotations": {
        "summary": "High request latency detected",
        "runbook_url": "https://runbooks.company.com/high-latency"
      },
      "historical_data": {
        "fires_per_day": 2.5,
        "false_positive_rate": 0.15
      }
    }
  ],
  "services": [
    {
      "name": "payment-service",
      "criticality": "critical"
    }
  ]
}
```
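The optional `historical_data` fields above give the optimizer something to rank alerts by. One simple noisiness heuristic is the expected number of false pages per day; this is purely illustrative, and `alert_optimizer.py` may score alerts differently:

```python
# Expected daily false pages as a noise metric for triaging alerts.

def expected_false_fires_per_day(fires_per_day, false_positive_rate):
    """How many of an alert's daily fires are expected to be false positives."""
    return fires_per_day * false_positive_rate

# The sample HighLatency alert: 2.5 fires/day at a 15% false-positive rate.
print(round(expected_false_fires_per_day(2.5, 0.15), 3))  # 0.375
```

Sorting a fleet of alerts by this score surfaces the rules producing the most wasted pages, which is usually where threshold or `for:` duration tuning pays off first.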
#### Analysis Categories
- Golden Signals: Latency, traffic, errors, saturation
- Resource Utilization: CPU, memory, disk, network
- Business Metrics: Revenue, conversion, user engagement
- Security: Auth failures, suspicious activity
- Availability: Uptime, health checks
### Dashboard Generator (`dashboard_generator.py`)

Creates comprehensive dashboard specifications with role-based optimization.

#### Features
- Role-Based Layouts: Optimized for SRE, Developer, Executive, and Ops personas
- Golden Signals Coverage: Automatic inclusion of key monitoring metrics
- Service-Type Specific Panels: Tailored panels based on service characteristics
- Interactive Elements: Template variables, drill-down paths, time range controls
- Grafana Compatibility: Generates Grafana-compatible JSON
#### Usage Examples

```bash
# From a service definition
python3 scripts/dashboard_generator.py \
  --input assets/sample_service_web.json \
  --output dashboard.json

# With a specific role optimization
python3 scripts/dashboard_generator.py \
  --service-type api \
  --name "Payment Service" \
  --role developer \
  --output payment_dev_dashboard.json

# Generate Grafana-compatible JSON
python3 scripts/dashboard_generator.py \
  --input assets/sample_service_api.json \
  --output dashboard.json \
  --format grafana

# With documentation
python3 scripts/dashboard_generator.py \
  --service-type web \
  --name "Customer Portal" \
  --output portal_dashboard.json \
  --doc-output portal_docs.md
```
#### Target Roles

- `sre`: Focus on availability, latency, errors, and resource utilization
- `developer`: Emphasize latency, errors, throughput, and business metrics
- `executive`: Highlight availability, business metrics, and user experience
- `ops`: Prioritize resource utilization, capacity, alerts, and deployments
#### Panel Types
- Stat: Single value displays with thresholds
- Gauge: Resource utilization and capacity metrics
- Timeseries: Trend analysis and historical data
- Table: Top N lists and detailed breakdowns
- Heatmap: Distribution and correlation analysis
## Sample Data

The `assets/` directory contains sample configurations for testing:

- `sample_service_api.json`: Critical API service definition
- `sample_service_web.json`: High-priority web application definition
- `sample_alerts.json`: Alert configuration with optimization opportunities

The `expected_outputs/` directory shows example outputs from each script:

- `sample_slo_framework.json`: Complete SLO framework for an API service
- `optimized_alerts.json`: Optimized alert configuration
- `sample_dashboard.json`: SRE dashboard specification
## Best Practices

### SLO Design
- Start with 1-2 SLOs per service and iterate
- Choose SLIs that directly impact user experience
- Set targets based on user needs, not technical capabilities
- Use error budgets to balance reliability and velocity
### Alert Optimization
- Every alert must be actionable
- Alert on symptoms, not causes
- Use multi-window burn rate alerts for SLO protection
- Implement proper escalation and routing policies
### Dashboard Design
- Follow the F-pattern for visual hierarchy
- Use consistent color semantics across dashboards
- Include drill-down paths for effective troubleshooting
- Optimize for the target role's specific needs
## Integration Patterns

### CI/CD Integration

```bash
# Generate SLOs during service onboarding
python3 scripts/slo_designer.py --input service-config.json --output slos.json

# Validate alert configurations in the pipeline
python3 scripts/alert_optimizer.py --input alerts.json --analyze-only --report validation.html

# Auto-generate dashboards for new services
python3 scripts/dashboard_generator.py --input service-config.json --format grafana --output dashboard.json
```
### Monitoring Stack Integration
- Prometheus: Generated alert rules and recording rules
- Grafana: Dashboard JSON for direct import
- Alertmanager: Routing and escalation policies
- PagerDuty: Escalation configuration
### GitOps Workflow
- Store service definitions in version control
- Generate observability configurations in CI/CD
- Deploy configurations via GitOps
- Monitor effectiveness and iterate
## Advanced Usage

### Custom SLO Targets

Override the default targets by including them in the service definition:

```json
{
  "name": "special-service",
  "type": "api",
  "criticality": "high",
  "custom_slos": {
    "availability_target": 0.9995,
    "latency_p95_target_ms": 150,
    "error_rate_target": 0.002
  }
}
```
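Conceptually, the override is a shallow merge of `custom_slos` onto the tier defaults. This sketch uses the key names from the example above and the "high" tier's documented defaults (99.9%, 200ms P95, 0.5%); the real merge logic lives inside `slo_designer.py` and may differ:

```python
# Tier defaults for the "high" criticality level, per the table above.
TIER_DEFAULTS = {
    "high": {
        "availability_target": 0.999,
        "latency_p95_target_ms": 200,
        "error_rate_target": 0.005,
    },
}

def effective_slos(definition):
    """Tier defaults, overridden by any custom_slos in the definition."""
    targets = dict(TIER_DEFAULTS[definition["criticality"]])
    targets.update(definition.get("custom_slos", {}))
    return targets

svc = {"name": "special-service", "criticality": "high",
       "custom_slos": {"availability_target": 0.9995}}
print(effective_slos(svc)["availability_target"])    # 0.9995 (overridden)
print(effective_slos(svc)["latency_p95_target_ms"])  # 200 (tier default)
```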
### Alert Rule Templates

Use template variables for reusable alert rules:

```yaml
# Generated Prometheus alert rule
- alert: {{ service_name }}_HighLatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service="{{ service_name }}"}[5m])) > {{ latency_threshold }}
  for: 5m
  labels:
    severity: warning
    service: "{{ service_name }}"
```
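Placeholders of the `{{ name }}` form can be filled with a few lines of plain Python. This is an illustrative renderer, not necessarily how the skill expands its templates (a real templating engine such as Jinja2 would also work):

```python
# Substitute {{ name }} placeholders from a dict of values.
import re

TEMPLATE = "alert: {{ service_name }}_HighLatency, threshold: {{ latency_threshold }}"

def render(template, values):
    """Replace each {{ name }} with str(values[name])."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}",
                  lambda m: str(values[m.group(1)]), template)

print(render(TEMPLATE, {"service_name": "payment-service",
                        "latency_threshold": 0.5}))
# alert: payment-service_HighLatency, threshold: 0.5
```

Note the regex only matches doubled braces, so single-brace PromQL label selectors like `{service="..."}` pass through untouched.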
### Dashboard Variants

Generate multiple dashboard variants for different use cases:

```bash
# SRE operational dashboard
python3 scripts/dashboard_generator.py --input service.json --role sre --output sre-dashboard.json

# Developer debugging dashboard
python3 scripts/dashboard_generator.py --input service.json --role developer --output dev-dashboard.json

# Executive business dashboard
python3 scripts/dashboard_generator.py --input service.json --role executive --output exec-dashboard.json
```
## Troubleshooting

### Common Issues

#### Script Execution Errors

- Ensure Python 3.7+ is installed
- Check file paths and permissions
- Validate the JSON syntax of input files

#### Invalid Service Definitions

- Required fields: `name`, `type`, `criticality`
- Valid service types: `api`, `web`, `database`, `queue`, `batch`, `ml`
- Valid criticality levels: `critical`, `high`, `medium`, `low`

#### Missing Historical Data

- Alert historical data is optional but improves the analysis
- Include `fires_per_day` and `false_positive_rate` when available
- Use your monitoring system's APIs to populate historical metrics
### Debug Mode

Enable verbose logging by setting an environment variable:

```bash
export DEBUG=1
python3 scripts/slo_designer.py --input service.json
```
## Contributing

### Development Setup

```bash
# Clone the repository
git clone <repository-url>
cd engineering/observability-designer

# Run the tests
python3 -m pytest tests/

# Lint the code
python3 -m flake8 scripts/
```
### Adding New Features
- Follow existing code patterns and error handling
- Include comprehensive docstrings and type hints
- Add test cases for new functionality
- Update documentation and examples
## Support

For questions, issues, or feature requests:

- Check the existing documentation and examples
- Review the reference materials in `references/`
- Open an issue with detailed reproduction steps
- Include sample configurations when reporting bugs
This skill is part of the Claude Skills marketplace. For more information about observability best practices, see the reference documentation in the `references/` directory.