add brain

2026-03-12 15:17:52 +07:00
parent fd9f558fa1
commit e7821a7a9d
355 changed files with 93784 additions and 24 deletions
--- a/.brain/.agent/skills/engineering-advanced-skills/runbook-generator/SKILL.md
+++ b/.brain/.agent/skills/engineering-advanced-skills/runbook-generator/SKILL.md
@@ -0,0 +1,415 @@
+---
+name: "runbook-generator"
+description: "Runbook Generator"
+---
+
+# Runbook Generator
+
+**Tier:** POWERFUL  
+**Category:** Engineering  
+**Domain:** DevOps / Site Reliability Engineering  
+
+---
+
+## Overview
+
+Analyze a codebase and generate production-grade operational runbooks. Detects your stack (CI/CD, database, hosting, containers), then produces step-by-step runbooks with copy-paste commands, verification checks, rollback procedures, escalation paths, and time estimates. Keeps runbooks fresh with staleness detection linked to config file modification dates.
+
+---
+
+## Core Capabilities
+
+- **Stack detection** — auto-identify CI/CD, database, hosting, orchestration from repo files
+- **Runbook types** — deployment, incident response, database maintenance, scaling, monitoring setup
+- **Format discipline** — numbered steps, copy-paste commands, ✅ verification checks, time estimates
+- **Escalation paths** — L1 → L2 → L3 with contact info and decision criteria
+- **Rollback procedures** — every deployment step has a corresponding undo
+- **Staleness detection** — runbook sections reference config files; flag when source changes
+- **Testing methodology** — dry-run framework for staging validation, quarterly review cadence
+
+---
+
+## When to Use
+
+Use when:
+- A codebase has no runbooks and you need to bootstrap them fast
+- Existing runbooks are outdated or incomplete (point at the repo, regenerate)
+- Onboarding a new engineer who needs clear operational procedures
+- Preparing for an incident response drill or audit
+- Setting up monitoring and on-call rotation from scratch
+
+Skip when:
+- The system is too early-stage to have stable operational patterns
+- Runbooks already exist and only need minor updates (edit directly)
+
+---
+
+## Stack Detection
+
+When given a repo, scan for these signals before writing a single runbook line:
+
+```bash
+# CI/CD
+ls .github/workflows/     → GitHub Actions
+ls .gitlab-ci.yml         → GitLab CI
+ls Jenkinsfile            → Jenkins
+ls .circleci/             → CircleCI
+ls bitbucket-pipelines.yml → Bitbucket Pipelines
+
+# Database
+grep -r "postgresql\|postgres\|pg" package.json pyproject.toml → PostgreSQL
+grep -r "mysql\|mariadb"           package.json               → MySQL
+grep -r "mongodb\|mongoose"        package.json               → MongoDB
+grep -r "redis"                    package.json               → Redis
+ls prisma/schema.prisma            → Prisma ORM (check provider field)
+ls drizzle.config.*                → Drizzle ORM
+
+# Hosting
+ls vercel.json                     → Vercel
+ls railway.toml                    → Railway
+ls fly.toml                        → Fly.io
+ls .ebextensions/                  → AWS Elastic Beanstalk
+ls terraform/  ls *.tf             → Custom AWS/GCP/Azure (check provider)
+ls kubernetes/ ls k8s/             → Kubernetes
+ls docker-compose.yml              → Docker Compose
+
+# Framework
+ls next.config.*                   → Next.js
+ls nuxt.config.*                   → Nuxt
+ls svelte.config.*                 → SvelteKit
+cat package.json | jq '.scripts'   → Check build/start commands
+```
+
+Map detected stack → runbook templates. A Next.js + PostgreSQL + Vercel + GitHub Actions repo needs:
+- Deployment runbook (Vercel + GitHub Actions)
+- Database runbook (PostgreSQL backup, migration, vacuum)
+- Incident response (with Vercel logs + pg query debugging)
+- Monitoring setup (Vercel Analytics, pg_stat, alerting)
+
+---
+
+## Runbook Types
+
+### 1. Deployment Runbook
+
+```markdown
+# Deployment Runbook — [App Name]
+**Stack:** Next.js 14 + PostgreSQL 15 + Vercel  
+**Last verified:** 2025-03-01  
+**Source configs:** vercel.json (modified: git log -1 --format=%ci -- vercel.json)  
+**Owner:** Platform Team  
+**Est. total time:** 15–25 min  
+
+---
+
+## Pre-deployment Checklist
+- [ ] All PRs merged to main
+- [ ] CI passing on main (GitHub Actions green)
+- [ ] Database migrations tested in staging
+- [ ] Rollback plan confirmed
+
+## Steps
+
+### Step 1 — Run CI checks locally (3 min)
+```bash
+pnpm test
+pnpm lint
+pnpm build
+```
+✅ Expected: All pass with 0 errors. Build output in `.next/`
+
+### Step 2 — Apply database migrations (5 min)
+```bash
+# Staging first
+DATABASE_URL=$STAGING_DATABASE_URL npx prisma migrate deploy
+```
+✅ Expected: `All migrations have been successfully applied.`
+
+```bash
+# Verify migration applied
+psql $STAGING_DATABASE_URL -c "\d" | grep -i migration
+```
+✅ Expected: Migration table shows new entry with today's date
+
+### Step 3 — Deploy to production (5 min)
+```bash
+git push origin main
+# OR trigger manually:
+vercel --prod
+```
+✅ Expected: Vercel dashboard shows deployment in progress. URL format:
+`https://app-name-<hash>-team.vercel.app`
+
+### Step 4 — Smoke test production (5 min)
+```bash
+# Health check
+curl -sf https://your-app.vercel.app/api/health | jq .
+
+# Critical path
+curl -sf https://your-app.vercel.app/api/users/me \
+  -H "Authorization: Bearer $TEST_TOKEN" | jq '.id'
+```
+✅ Expected: health returns `{"status":"ok","db":"connected"}`. Users API returns valid ID.
+
+### Step 5 — Monitor for 10 min
+- Check Vercel Functions log for errors: `vercel logs --since=10m`
+- Check error rate in Vercel Analytics: < 1% 5xx
+- Check DB connection pool: `SELECT count(*) FROM pg_stat_activity;` (< 80% of max_connections)
+
+---
+
+## Rollback
+
+If smoke tests fail or error rate spikes:
+
+```bash
+# Instant rollback via Vercel (preferred — < 30 sec)
+vercel rollback [previous-deployment-url]
+
+# Database rollback (only if migration was applied)
+DATABASE_URL=$PROD_DATABASE_URL npx prisma migrate reset --skip-seed
+# WARNING: This resets to previous migration. Confirm data impact first.
+```
+
+✅ Expected after rollback: Previous deployment URL becomes active. Verify with smoke test.
+
+---
+
+## Escalation
+- **L1 (on-call engineer):** Check Vercel logs, run smoke tests, attempt rollback
+- **L2 (platform lead):** DB issues, data loss risk, rollback failed — Slack: @platform-lead
+- **L3 (CTO):** Production down > 30 min, data breach — PagerDuty: #critical-incidents
+```
+
+---
+
+### 2. Incident Response Runbook
+
+```markdown
+# Incident Response Runbook
+**Severity levels:** P1 (down), P2 (degraded), P3 (minor)  
+**Est. total time:** P1: 30–60 min, P2: 1–4 hours  
+
+## Phase 1 — Triage (5 min)
+
+### Confirm the incident
+```bash
+# Is the app responding?
+curl -sw "%{http_code}" https://your-app.vercel.app/api/health -o /dev/null
+
+# Check Vercel function errors (last 15 min)
+vercel logs --since=15m | grep -i "error\|exception\|5[0-9][0-9]"
+```
+✅ 200 = app up. 5xx or timeout = incident confirmed.
+
+Declare severity:
+- Site completely down → P1 — page L2/L3 immediately
+- Partial degradation / slow responses → P2 — notify team channel
+- Single feature broken → P3 — create ticket, fix in business hours
+
+---
+
+## Phase 2 — Diagnose (10–15 min)
+
+```bash
+# Recent deployments — did something just ship?
+vercel ls --limit=5
+
+# Database health
+psql $DATABASE_URL -c "SELECT pid, state, wait_event, query FROM pg_stat_activity WHERE state != 'idle' LIMIT 20;"
+
+# Long-running queries (> 30 sec)
+psql $DATABASE_URL -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '30 seconds';"
+
+# Connection pool saturation
+psql $DATABASE_URL -c "SELECT count(*), max_conn FROM pg_stat_activity, (SELECT setting::int AS max_conn FROM pg_settings WHERE name='max_connections') t GROUP BY max_conn;"
+```
+
+Diagnostic decision tree:
+- Recent deploy + new errors → rollback (see Deployment Runbook)
+- DB query timeout / pool saturation → kill long queries, scale connections
+- External dependency failing → check status pages, add circuit breaker
+- Memory/CPU spike → check Vercel function logs for infinite loops
+
+---
+
+## Phase 3 — Mitigate (variable)
+
+```bash
+# Kill a runaway DB query
+psql $DATABASE_URL -c "SELECT pg_terminate_backend(<pid>);"
+
+# Scale DB connections (Supabase/Neon — adjust pool size)
+# Vercel → Settings → Environment Variables → update DATABASE_POOL_MAX
+
+# Enable maintenance mode (if you have a feature flag)
+vercel env add MAINTENANCE_MODE true production
+vercel --prod  # redeploy with flag
+```
+
+---
+
+## Phase 4 — Resolve & Postmortem
+
+After incident is resolved, within 24 hours:
+
+1. Write incident timeline (what happened, when, who noticed, what fixed it)
+2. Identify root cause (5-Whys)
+3. Define action items with owners and due dates
+4. Update this runbook if a step was missing or wrong
+5. Add monitoring/alert that would have caught this earlier
+
+**Postmortem template:** `docs/postmortems/YYYY-MM-DD-incident-title.md`
+
+---
+
+## Escalation Path
+
+| Level | Who | When | Contact |
+|-------|-----|------|---------|
+| L1 | On-call engineer | Always first | PagerDuty rotation |
+| L2 | Platform lead | DB issues, rollback needed | Slack @platform-lead |
+| L3 | CTO/VP Eng | P1 > 30 min, data loss | Phone + PagerDuty |
+```
+
+---
+
+### 3. Database Maintenance Runbook
+
+```markdown
+# Database Maintenance Runbook — PostgreSQL
+**Schedule:** Weekly vacuum (automated), monthly manual review  
+
+## Backup
+
+```bash
+# Full backup
+pg_dump $DATABASE_URL \
+  --format=custom \
+  --compress=9 \
+  --file="backup-$(date +%Y%m%d-%H%M%S).dump"
+```
+✅ Expected: File created, size > 0. `pg_restore --list backup.dump | head -20` shows tables.
+
+Verify backup is restorable (test monthly):
+```bash
+pg_restore --dbname=$STAGING_DATABASE_URL backup.dump
+psql $STAGING_DATABASE_URL -c "SELECT count(*) FROM users;"
+```
+✅ Expected: Row count matches production.
+
+## Migration
+
+```bash
+# Always test in staging first
+DATABASE_URL=$STAGING_DATABASE_URL npx prisma migrate deploy
+# Verify, then:
+DATABASE_URL=$PROD_DATABASE_URL npx prisma migrate deploy
+```
+✅ Expected: `All migrations have been successfully applied.`
+
+⚠️ For large table migrations (> 1M rows), use `pg_repack` or add column with DEFAULT separately to avoid table locks.
+
+## Vacuum & Reindex
+
+```bash
+# Check bloat before deciding
+psql $DATABASE_URL -c "
+SELECT schemaname, tablename, 
+       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size,
+       n_dead_tup, n_live_tup,
+       ROUND(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 1) AS dead_ratio
+FROM pg_stat_user_tables
+ORDER BY n_dead_tup DESC LIMIT 10;"
+
+# Vacuum high-bloat tables (non-blocking)
+psql $DATABASE_URL -c "VACUUM ANALYZE users;"
+psql $DATABASE_URL -c "VACUUM ANALYZE events;"
+
+# Reindex (use CONCURRENTLY to avoid locks)
+psql $DATABASE_URL -c "REINDEX INDEX CONCURRENTLY users_email_idx;"
+```
+✅ Expected: dead_ratio drops below 5% after vacuum.
+```
+
+---
+
+## Staleness Detection
+
+Add a staleness header to every runbook:
+
+```markdown
+## Staleness Check
+This runbook references the following config files. If they've changed since the
+"Last verified" date, review the affected steps.
+
+| Config File | Last Modified | Affects Steps |
+|-------------|--------------|---------------|
+| vercel.json | `git log -1 --format=%ci -- vercel.json` | Step 3, Rollback |
+| prisma/schema.prisma | `git log -1 --format=%ci -- prisma/schema.prisma` | Step 2, DB Maintenance |
+| .github/workflows/deploy.yml | `git log -1 --format=%ci -- .github/workflows/deploy.yml` | Step 1, Step 3 |
+| docker-compose.yml | `git log -1 --format=%ci -- docker-compose.yml` | All scaling steps |
+```
+
+**Automation:** Add a CI job that runs weekly and comments on the runbook doc if any referenced file was modified more recently than the runbook's "Last verified" date.
+
+---
+
+## Runbook Testing Methodology
+
+### Dry-Run in Staging
+
+Before trusting a runbook in production, validate every step in staging:
+
+```bash
+# 1. Create a staging environment mirror
+vercel env pull .env.staging
+source .env.staging
+
+# 2. Run each step with staging credentials
+# Replace all $DATABASE_URL with $STAGING_DATABASE_URL
+# Replace all production URLs with staging URLs
+
+# 3. Verify expected outputs match
+# Document any discrepancies and update the runbook
+
+# 4. Time each step — update estimates in the runbook
+time npx prisma migrate deploy
+```
+
+### Quarterly Review Cadence
+
+Schedule a 1-hour review every quarter:
+
+1. **Run each command** in staging — does it still work?
+2. **Check config drift** — compare "Last Modified" dates vs "Last verified"
+3. **Test rollback procedures** — actually roll back in staging
+4. **Update contact info** — L1/L2/L3 may have changed
+5. **Add new failure modes** discovered in the past quarter
+6. **Update "Last verified" date** at top of runbook
+
+---
+
+## Common Pitfalls
+
+| Pitfall | Fix |
+|---|---|
+| Commands that require manual copy of dynamic values | Use env vars — `$DATABASE_URL` not `postgres://user:pass@host/db` |
+| No expected output specified | Add ✅ with exact expected string after every verification step |
+| Rollback steps missing | Every destructive step needs a corresponding undo |
+| Runbooks that never get tested | Schedule quarterly staging dry-runs in team calendar |
+| L3 escalation contact is the former CTO | Review contact info every quarter |
+| Migration runbook doesn't mention table locks | Call out lock risk for large table operations explicitly |
+
+---
+
+## Best Practices
+
+1. **Every command must be copy-pasteable** — no placeholder text, use env vars
+2. **✅ after every step** — explicit expected output, not "it should work"
+3. **Time estimates are mandatory** — engineers need to know if they have time to fix before SLA breach
+4. **Rollback before you deploy** — plan the undo before executing
+5. **Runbooks live in the repo** — `docs/runbooks/`, versioned with the code they describe
+6. **Postmortem → runbook update** — every incident should improve a runbook
+7. **Link, don't duplicate** — reference the canonical config file, don't copy its contents into the runbook
+8. **Test runbooks like you test code** — untested runbooks are worse than no runbooks (false confidence)