add brain

2026-03-12 15:17:52 +07:00
parent fd9f558fa1
commit e7821a7a9d
355 changed files with 93784 additions and 24 deletions
--- a/.brain/.agent/skills/engineering-advanced-skills/observability-designer/README.md
+++ b/.brain/.agent/skills/engineering-advanced-skills/observability-designer/README.md
@@ -0,0 +1,384 @@
+# Observability Designer
+
+A toolkit for designing production observability: SLI/SLO frameworks, alert optimization, and dashboard generation.
+
+## Overview
+
+The Observability Designer skill provides three Python scripts for creating, optimizing, and maintaining observability configurations:
+
+- **SLO Designer**: Generate complete SLI/SLO frameworks with error budgets and burn rate alerts
+- **Alert Optimizer**: Analyze and optimize existing alert configurations to reduce noise and improve effectiveness
+- **Dashboard Generator**: Create comprehensive dashboard specifications with role-based layouts and drill-down paths
+
+## Quick Start
+
+### Prerequisites
+
+- Python 3.7+
+- No external dependencies required (uses Python standard library only)
+
+### Basic Usage
+
+```bash
+# Generate SLO framework for a service
+python3 scripts/slo_designer.py --service-type api --criticality critical --user-facing true --service-name payment-service
+
+# Optimize existing alerts
+python3 scripts/alert_optimizer.py --input assets/sample_alerts.json --analyze-only
+
+# Generate a dashboard specification
+python3 scripts/dashboard_generator.py --service-type web --name "Customer Portal" --role sre
+```
+
+## Scripts Documentation
+
+### SLO Designer (`slo_designer.py`)
+
+Generates comprehensive SLO frameworks based on service characteristics.
+
+#### Features
+- **Automatic SLI Selection**: Recommends appropriate SLIs based on service type
+- **Target Setting**: Suggests SLO targets based on service criticality
+- **Error Budget Calculation**: Computes error budgets and burn rate thresholds
+- **Multi-Window Burn Rate Alerts**: Generates 4-window burn rate alerting rules
+- **SLA Recommendations**: Provides customer-facing SLA guidance
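
The thresholds behind multi-window burn rate alerts follow from simple arithmetic. The sketch below is not taken from `slo_designer.py`. It assumes a 30-day SLO period and, purely for illustration, uses the budget/window pairs popularized by the Google SRE Workbook (2% of the budget in 1 hour, 5% in 6 hours, 10% in 3 days). The four windows the script actually generates may differ.

```python
# Illustrative only: not the actual logic of slo_designer.py.
SLO_PERIOD_HOURS = 30 * 24  # assumed 30-day SLO period


def burn_rate_threshold(budget_fraction: float, window_hours: float) -> float:
    """Burn rate that consumes `budget_fraction` of the error budget in `window_hours`."""
    return budget_fraction * SLO_PERIOD_HOURS / window_hours


# (fraction of budget consumed, window in hours): assumed example pairs
WINDOWS = [(0.02, 1), (0.05, 6), (0.10, 72)]

for fraction, hours in WINDOWS:
    rate = burn_rate_threshold(fraction, hours)
    print(f"{hours:>3}h window: alert when burn rate exceeds {rate:.1f}x")
```

A burn rate of 1 means the budget is consumed exactly at the end of the SLO period. Higher rates over shorter windows catch fast-moving incidents.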
+
+#### Usage Examples
+
+```bash
+# From service definition file
+python3 scripts/slo_designer.py --input assets/sample_service_api.json --output slo_framework.json
+
+# From command line parameters
+python3 scripts/slo_designer.py \
+    --service-type api \
+    --criticality critical \
+    --user-facing true \
+    --service-name payment-service \
+    --output payment_slos.json
+
+# Generate and display summary only
+python3 scripts/slo_designer.py --input assets/sample_service_web.json --summary-only
+```
+
+#### Service Definition Format
+
+```json
+{
+  "name": "payment-service",
+  "type": "api",
+  "criticality": "critical",
+  "user_facing": true,
+  "description": "Handles payment processing",
+  "team": "payments",
+  "environment": "production",
+  "dependencies": [
+    {
+      "name": "user-service",
+      "type": "api",
+      "criticality": "high"
+    }
+  ]
+}
+```
+
+#### Supported Service Types
+- **api**: REST APIs, GraphQL services
+- **web**: Web applications, SPAs
+- **database**: Database services, data stores
+- **queue**: Message queues, event streams
+- **batch**: Batch processing jobs
+- **ml**: Machine learning services
+
+#### Criticality Levels
+- **critical**: 99.99% availability, <100ms P95 latency, <0.1% error rate
+- **high**: 99.9% availability, <200ms P95 latency, <0.5% error rate
+- **medium**: 99.5% availability, <500ms P95 latency, <1% error rate
+- **low**: 99% availability, <1s P95 latency, <2% error rate
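
To make the availability targets above concrete, the following sketch converts each one into an allowed-downtime budget. The 30-day period is an assumption, and the function name is hypothetical rather than part of `slo_designer.py`.

```python
# Illustrative arithmetic; the availability targets come from the list above.
AVAILABILITY_TARGETS = {"critical": 0.9999, "high": 0.999, "medium": 0.995, "low": 0.99}


def error_budget_minutes(target: float, period_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability target over the period."""
    return (1.0 - target) * period_days * 24 * 60


for level, target in AVAILABILITY_TARGETS.items():
    budget = error_budget_minutes(target)
    print(f"{level:<8} {target:.2%} -> {budget:.1f} min per 30 days")
```

Each step down in criticality increases the available budget, which is why the tighter targets are reserved for services that justify the operational cost.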
+
+### Alert Optimizer (`alert_optimizer.py`)
+
+Analyzes existing alert configurations and provides optimization recommendations.
+
+#### Features
+- **Noise Detection**: Identifies alerts with high false positive rates
+- **Coverage Analysis**: Finds gaps in monitoring coverage
+- **Duplicate Detection**: Locates redundant or overlapping alerts
+- **Threshold Analysis**: Reviews alert thresholds for appropriateness
+- **Fatigue Assessment**: Evaluates alert volume and routing
+
+#### Usage Examples
+
+```bash
+# Analyze existing alerts
+python3 scripts/alert_optimizer.py --input assets/sample_alerts.json --analyze-only
+
+# Generate optimized configuration
+python3 scripts/alert_optimizer.py \
+    --input assets/sample_alerts.json \
+    --output optimized_alerts.json
+
+# Generate HTML report
+python3 scripts/alert_optimizer.py \
+    --input assets/sample_alerts.json \
+    --report alert_analysis.html \
+    --format html
+```
+
+#### Alert Configuration Format
+
+```json
+{
+  "alerts": [
+    {
+      "alert": "HighLatency",
+      "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5",
+      "for": "5m",
+      "labels": {
+        "severity": "warning",
+        "service": "payment-service"
+      },
+      "annotations": {
+        "summary": "High request latency detected",
+        "runbook_url": "https://runbooks.company.com/high-latency"
+      },
+      "historical_data": {
+        "fires_per_day": 2.5,
+        "false_positive_rate": 0.15
+      }
+    }
+  ],
+  "services": [
+    {
+      "name": "payment-service",
+      "criticality": "critical"
+    }
+  ]
+}
+```
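
As a rough sketch of how the `historical_data` fields could drive noise detection, the snippet below flags alerts that exceed either of two thresholds. Both thresholds and the `HighErrorRate` alert are hypothetical, and the real `alert_optimizer.py` logic may be quite different.

```python
import json

# Hypothetical thresholds, chosen only for illustration.
MAX_FALSE_POSITIVE_RATE = 0.10
MAX_FIRES_PER_DAY = 5.0


def find_noisy_alerts(config: dict) -> list:
    """Return names of alerts whose historical data exceeds either threshold."""
    noisy = []
    for alert in config.get("alerts", []):
        history = alert.get("historical_data") or {}  # historical data is optional
        if (history.get("false_positive_rate", 0.0) > MAX_FALSE_POSITIVE_RATE
                or history.get("fires_per_day", 0.0) > MAX_FIRES_PER_DAY):
            noisy.append(alert["alert"])
    return noisy


sample = json.loads("""
{"alerts": [
  {"alert": "HighLatency",
   "historical_data": {"fires_per_day": 2.5, "false_positive_rate": 0.15}},
  {"alert": "HighErrorRate"}
]}
""")
print(find_noisy_alerts(sample))
```

Alerts without historical data are never flagged here, so the result is only as good as the history supplied.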
+
+#### Analysis Categories
+- **Golden Signals**: Latency, traffic, errors, saturation
+- **Resource Utilization**: CPU, memory, disk, network
+- **Business Metrics**: Revenue, conversion, user engagement
+- **Security**: Auth failures, suspicious activity
+- **Availability**: Uptime, health checks
+
+### Dashboard Generator (`dashboard_generator.py`)
+
+Creates comprehensive dashboard specifications with role-based optimization.
+
+#### Features
+- **Role-Based Layouts**: Optimized for SRE, Developer, Executive, and Ops personas
+- **Golden Signals Coverage**: Automatic inclusion of key monitoring metrics
+- **Service-Type Specific Panels**: Tailored panels based on service characteristics
+- **Interactive Elements**: Template variables, drill-down paths, time range controls
+- **Grafana Compatibility**: Generates Grafana-compatible JSON
+
+#### Usage Examples
+
+```bash
+# From service definition
+python3 scripts/dashboard_generator.py \
+    --input assets/sample_service_web.json \
+    --output dashboard.json
+
+# With specific role optimization
+python3 scripts/dashboard_generator.py \
+    --service-type api \
+    --name "Payment Service" \
+    --role developer \
+    --output payment_dev_dashboard.json
+
+# Generate Grafana-compatible JSON
+python3 scripts/dashboard_generator.py \
+    --input assets/sample_service_api.json \
+    --output dashboard.json \
+    --format grafana
+
+# With documentation
+python3 scripts/dashboard_generator.py \
+    --service-type web \
+    --name "Customer Portal" \
+    --output portal_dashboard.json \
+    --doc-output portal_docs.md
+```
+
+#### Target Roles
+
+- **sre**: Focus on availability, latency, errors, resource utilization
+- **developer**: Emphasize latency, errors, throughput, business metrics
+- **executive**: Highlight availability, business metrics, user experience
+- **ops**: Prioritize resource utilization, capacity, alerts, deployments
+
+#### Panel Types
+- **Stat**: Single value displays with thresholds
+- **Gauge**: Resource utilization and capacity metrics
+- **Timeseries**: Trend analysis and historical data
+- **Table**: Top N lists and detailed breakdowns
+- **Heatmap**: Distribution and correlation analysis
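
For orientation, here is a minimal sketch of what a single time series panel looks like in Grafana's dashboard JSON model. The helper name and layout values are assumptions. The actual output of `dashboard_generator.py` is not shown in this README and will contain more fields.

```python
import json


def timeseries_panel(title: str, expr: str, panel_id: int, x: int, y: int) -> dict:
    """Build one Grafana-style time series panel backed by a PromQL expression."""
    return {
        "id": panel_id,
        "type": "timeseries",
        "title": title,
        "gridPos": {"h": 8, "w": 12, "x": x, "y": y},  # Grafana's grid is 24 columns wide
        "targets": [{"expr": expr, "refId": "A"}],
    }


dashboard = {
    "title": "Customer Portal",
    "panels": [
        timeseries_panel(
            "P95 latency",
            "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
            panel_id=1, x=0, y=0,
        ),
    ],
}
print(json.dumps(dashboard, indent=2))
```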
+
+## Sample Data
+
+The `assets/` directory contains sample configurations for testing:
+
+- `sample_service_api.json`: Critical API service definition
+- `sample_service_web.json`: High-priority web application definition
+- `sample_alerts.json`: Alert configuration with optimization opportunities
+
+The `expected_outputs/` directory shows example outputs from each script:
+
+- `sample_slo_framework.json`: Complete SLO framework for API service
+- `optimized_alerts.json`: Optimized alert configuration
+- `sample_dashboard.json`: SRE dashboard specification
+
+## Best Practices
+
+### SLO Design
+- Start with 1-2 SLOs per service and iterate
+- Choose SLIs that directly impact user experience
+- Set targets based on user needs, not technical capabilities
+- Use error budgets to balance reliability and velocity
+
+### Alert Optimization
+- Every alert must be actionable
+- Alert on symptoms, not causes
+- Use multi-window burn rate alerts for SLO protection
+- Implement proper escalation and routing policies
+
+### Dashboard Design
+- Follow the F-pattern for visual hierarchy: place the most important panels at the top left
+- Use consistent color semantics across dashboards
+- Include drill-down paths for effective troubleshooting
+- Optimize for the target role's specific needs
+
+## Integration Patterns
+
+### CI/CD Integration
+```bash
+# Generate SLOs during service onboarding
+python3 scripts/slo_designer.py --input service-config.json --output slos.json
+
+# Validate alert configurations in pipeline
+python3 scripts/alert_optimizer.py --input alerts.json --analyze-only --report validation.html --format html
+
+# Auto-generate dashboards for new services
+python3 scripts/dashboard_generator.py --input service-config.json --format grafana --output dashboard.json
+```
+
+### Monitoring Stack Integration
+- **Prometheus**: Generated alert rules and recording rules
+- **Grafana**: Dashboard JSON for direct import
+- **Alertmanager**: Routing and escalation policies
+- **PagerDuty**: Escalation configuration
+
+### GitOps Workflow
+1. Store service definitions in version control
+2. Generate observability configurations in CI/CD
+3. Deploy configurations via GitOps
+4. Monitor effectiveness and iterate
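
Step 2 of this workflow could be driven by a small helper like the one below, which only builds the command lines and leaves execution to the pipeline. The flags are the ones documented in this README. The `services/` and `generated/` directories and the output file names are hypothetical.

```python
from pathlib import Path


def build_commands(service_file: Path, out_dir: Path) -> list:
    """Return the argv lists for generating SLOs and a dashboard for one service."""
    stem = service_file.stem
    return [
        ["python3", "scripts/slo_designer.py",
         "--input", str(service_file),
         "--output", str(out_dir / f"{stem}_slos.json")],
        ["python3", "scripts/dashboard_generator.py",
         "--input", str(service_file),
         "--format", "grafana",
         "--output", str(out_dir / f"{stem}_dashboard.json")],
    ]


# Hypothetical repository layout: services/<name>.json, outputs under generated/
for cmd in build_commands(Path("services/payment-service.json"), Path("generated")):
    print(" ".join(cmd))  # in CI, pass each list to subprocess.run(cmd, check=True)
```

Keeping command construction separate from execution makes the step easy to review in a pull request and to dry-run locally.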
+
+## Advanced Usage
+
+### Custom SLO Targets
+Override default targets by including them in service definitions:
+
+```json
+{
+  "name": "special-service",
+  "type": "api",
+  "criticality": "high",
+  "custom_slos": {
+    "availability_target": 0.9995,
+    "latency_p95_target_ms": 150,
+    "error_rate_target": 0.002
+  }
+}
+```
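
One plausible way to apply such overrides is to start from the defaults for the service's criticality and let `custom_slos` replace individual keys. The defaults below mirror the criticality levels listed in this README. The merge logic itself is an assumption, not the script's confirmed behavior.

```python
# Defaults mirror the criticality levels in this README; the merge is an assumption.
DEFAULT_TARGETS = {
    "critical": {"availability_target": 0.9999, "latency_p95_target_ms": 100, "error_rate_target": 0.001},
    "high": {"availability_target": 0.999, "latency_p95_target_ms": 200, "error_rate_target": 0.005},
    "medium": {"availability_target": 0.995, "latency_p95_target_ms": 500, "error_rate_target": 0.01},
    "low": {"availability_target": 0.99, "latency_p95_target_ms": 1000, "error_rate_target": 0.02},
}


def resolve_targets(service: dict) -> dict:
    """Start from the criticality defaults, then apply any custom_slos overrides."""
    targets = dict(DEFAULT_TARGETS[service["criticality"]])
    targets.update(service.get("custom_slos", {}))
    return targets


service = {
    "name": "special-service",
    "type": "api",
    "criticality": "high",
    "custom_slos": {
        "availability_target": 0.9995,
        "latency_p95_target_ms": 150,
        "error_rate_target": 0.002,
    },
}
print(resolve_targets(service))
```

A partial `custom_slos` block would override only the keys it names and keep the remaining defaults.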
+
+### Alert Rule Templates
+Use template variables for reusable alert rules:
+
+```yaml
+# Generated Prometheus alert rule
+- alert: {{ service_name }}_HighLatency
+  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service="{{ service_name }}"}[5m])) > {{ latency_threshold }}
+  for: 5m
+  labels:
+    severity: warning
+    service: "{{ service_name }}"
+```
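
The README does not say which template engine renders these placeholders. As a standard-library stand-in, the sketch below substitutes `{{ name }}` variables with a regular expression. The `render` helper and the sample values are hypothetical.

```python
import re

TEMPLATE = """\
- alert: {{ service_name }}_HighLatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service="{{ service_name }}"}[5m])) > {{ latency_threshold }}
  for: 5m
"""


def render(template: str, values: dict) -> str:
    """Replace each {{ name }} placeholder; raise KeyError on an unknown name."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: str(values[m.group(1)]), template)


print(render(TEMPLATE, {"service_name": "payment_service", "latency_threshold": 0.5}))
```

Failing loudly on an unknown placeholder is deliberate: a silently empty threshold would produce an alert rule that is syntactically broken or, worse, always firing.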
+
+### Dashboard Variants
+Generate multiple dashboard variants for different use cases:
+
+```bash
+# SRE operational dashboard
+python3 scripts/dashboard_generator.py --input service.json --role sre --output sre-dashboard.json
+
+# Developer debugging dashboard
+python3 scripts/dashboard_generator.py --input service.json --role developer --output dev-dashboard.json
+
+# Executive business dashboard
+python3 scripts/dashboard_generator.py --input service.json --role executive --output exec-dashboard.json
+```
+
+## Troubleshooting
+
+### Common Issues
+
+#### Script Execution Errors
+- Ensure Python 3.7+ is installed
+- Check file paths and permissions
+- Validate JSON syntax in input files
+
+#### Invalid Service Definitions
+- Required fields: `name`, `type`, `criticality`
+- Valid service types: `api`, `web`, `database`, `queue`, `batch`, `ml`
+- Valid criticality levels: `critical`, `high`, `medium`, `low`
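
The rules above are easy to check before invoking any script. The following pre-flight validator is a sketch based only on the fields and values listed here. It is not part of the toolkit, and the scripts may enforce additional constraints.

```python
REQUIRED_FIELDS = ("name", "type", "criticality")
VALID_TYPES = {"api", "web", "database", "queue", "batch", "ml"}
VALID_CRITICALITY = {"critical", "high", "medium", "low"}


def validate_service(definition: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = [f"missing required field: {f}" for f in REQUIRED_FIELDS if f not in definition]
    if "type" in definition and definition["type"] not in VALID_TYPES:
        problems.append(f"invalid type: {definition['type']!r}")
    if "criticality" in definition and definition["criticality"] not in VALID_CRITICALITY:
        problems.append(f"invalid criticality: {definition['criticality']!r}")
    return problems


print(validate_service({"name": "payment-service", "type": "api", "criticality": "critical"}))
print(validate_service({"name": "orders", "type": "grpc"}))
```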
+
+#### Missing Historical Data
+- Alert historical data is optional but improves analysis
+- Include `fires_per_day` and `false_positive_rate` when available
+- Use monitoring system APIs to populate historical metrics
+
+### Debug Mode
+Enable verbose logging by setting the `DEBUG` environment variable:
+
+```bash
+export DEBUG=1
+python3 scripts/slo_designer.py --input service.json
+```
+
+## Contributing
+
+### Development Setup
+```bash
+# Clone the repository
+git clone <repository-url>
+cd engineering/observability-designer
+
+# Run tests
+python3 -m pytest tests/
+
+# Lint code
+python3 -m flake8 scripts/
+```
+
+### Adding New Features
+1. Follow existing code patterns and error handling
+2. Include comprehensive docstrings and type hints
+3. Add test cases for new functionality
+4. Update documentation and examples
+
+## Support
+
+For questions, issues, or feature requests:
+- Check existing documentation and examples
+- Review the reference materials in `references/`
+- Open an issue with detailed reproduction steps
+- Include sample configurations when reporting bugs
+
+---
+
+*This skill is part of the Claude Skills marketplace. For more information about observability best practices, see the reference documentation in the `references/` directory.*