add brain

2026-03-12 15:17:52 +07:00
parent fd9f558fa1
commit e7821a7a9d
355 changed files with 93784 additions and 24 deletions
--- a/.brain/.agent/skills/engineering-advanced-skills/performance-profiler/SKILL.md
+++ b/.brain/.agent/skills/engineering-advanced-skills/performance-profiler/SKILL.md
@@ -0,0 +1,155 @@
+---
+name: "performance-profiler"
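+description: "Profiles Node.js, Python, and Go applications to find CPU, memory, and I/O bottlenecks, and verifies fixes with before/after measurements"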
+---
+
+# Performance Profiler
+
+**Tier:** POWERFUL  
+**Category:** Engineering  
+**Domain:** Performance Engineering  
+
+---
+
+## Overview
+
+Systematic performance profiling for Node.js, Python, and Go applications. Identifies CPU, memory, and I/O bottlenecks; generates flamegraphs; analyzes bundle sizes; optimizes database queries; detects memory leaks; and runs load tests with k6 and Artillery. Always measures before and after.
+
+## Core Capabilities
+
+- **CPU profiling** — flamegraphs for Node.js, py-spy for Python, pprof for Go
+- **Memory profiling** — heap snapshots, leak detection, GC pressure
+- **Bundle analysis** — webpack-bundle-analyzer, Next.js bundle analyzer
+- **Database optimization** — EXPLAIN ANALYZE, slow query log, N+1 detection
+- **Load testing** — k6 scripts, Artillery scenarios, ramp-up patterns
+- **Before/after measurement** — establish baseline, profile, optimize, verify
+
+---
+
+## When to Use
+
+- App is slow and you don't know where the bottleneck is
+- P99 latency exceeds SLA before a release
+- Memory usage grows over time (suspected leak)
+- Bundle size increased after adding dependencies
+- Preparing for a traffic spike (load test before launch)
+- Database queries taking >100ms
+
+---
+
+## Golden Rule: Measure First
+
+```bash
+# Establish baseline BEFORE any optimization
+# Record: P50, P95, P99 latency | RPS | error rate | memory usage
+
+# Wrong: "I think the N+1 query is slow, let me fix it"
+# Right: Profile → confirm bottleneck → fix → measure again → verify improvement
+```
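A baseline is only useful if the percentiles are computed the same way before and after the change. The following is a minimal sketch, assuming latencies have already been collected as a list of milliseconds; `percentile`, `summarize`, and the sample data are hypothetical names and values, not part of any tool mentioned here.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(latencies_ms, errors, duration_s):
    """Collect the baseline numbers listed above into one record."""
    total = len(latencies_ms)
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "rps": total / duration_s,
        "error_rate": errors / total,
    }

# Hypothetical run: 100 requests over 10 seconds with latencies of 1..100 ms
baseline = summarize(list(range(1, 101)), errors=2, duration_s=10)
print(baseline)
```

Saving this record next to the profiler output makes the later comparison mechanical rather than a matter of memory.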
+
+---
+
+## Profiling Recipes
+→ See references/profiling-recipes.md for Node.js, Python, and Go profiling, bundle analysis, database queries, and k6 load tests
+
+## Before/After Measurement Template
+
+```markdown
+## Performance Optimization: [What You Fixed]
+
+**Date:** 2026-03-01  
+**Engineer:** @username  
+**Ticket:** PROJ-123  
+
+### Problem
+[1-2 sentences: what was slow, how was it observed]
+
+### Root Cause
+[What the profiler revealed]
+
+### Baseline (Before)
+| Metric | Value |
+|--------|-------|
+| P50 latency | 480ms |
+| P95 latency | 1,240ms |
+| P99 latency | 3,100ms |
+| RPS @ 50 VUs | 42 |
+| Error rate | 0.8% |
+| DB queries/req | 23 (N+1) |
+
+Profiler evidence: [link to flamegraph or screenshot]
+
+### Fix Applied
+[What changed — code diff or description]
+
+### After
+| Metric | Before | After | Delta |
+|--------|--------|-------|-------|
+| P50 latency | 480ms | 48ms | -90% |
+| P95 latency | 1,240ms | 120ms | -90% |
+| P99 latency | 3,100ms | 280ms | -91% |
+| RPS @ 50 VUs | 42 | 380 | +805% |
+| Error rate | 0.8% | 0% | -100% |
+| DB queries/req | 23 | 1 | -96% |
+
+### Verification
+Load test run: [link to k6 output]
+```
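The Delta column is easy to get wrong by hand. As a sketch, the percentages can be derived from the before/after pairs; `percent_change` is a hypothetical helper and the rows reuse the example figures from the template.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent."""
    if before == 0:
        raise ValueError("percent change is undefined for a zero baseline")
    return (after - before) / before * 100

# (metric, before, after) taken from the example table
rows = [
    ("P50 latency (ms)", 480, 48),
    ("P95 latency (ms)", 1240, 120),
    ("P99 latency (ms)", 3100, 280),
    ("RPS @ 50 VUs", 42, 380),
    ("Error rate (%)", 0.8, 0),
    ("DB queries/req", 23, 1),
]

for name, before, after in rows:
    # The + format flag prints an explicit sign for both improvements and regressions
    print(f"| {name} | {before} | {after} | {percent_change(before, after):+.0f}% |")
```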
+
+---
+
+## Optimization Checklist
+
+### Quick wins (check these first)
+
+```
+Database
+□ Missing indexes on WHERE/ORDER BY columns
+□ N+1 queries (check query count per request)
+□ Loading all columns when only 2-3 needed (SELECT *)
+□ No LIMIT on unbounded queries
+□ Missing connection pool (creating new connection per request)
+
+Node.js
+□ Sync I/O (fs.readFileSync) in hot path
+□ JSON.parse/stringify of large objects in hot loop
+□ Missing caching for expensive computations
+□ No compression (gzip/brotli) on responses
+□ Dependencies loaded in request handler (move to module level)
+
+Bundle
+□ Moment.js → dayjs/date-fns
+□ Lodash (full) → lodash/function imports
+□ Static imports of heavy components → dynamic imports
+□ Images not optimized / not using next/image
+□ No code splitting on routes
+
+API
+□ No pagination on list endpoints
+□ No response caching (Cache-Control headers)
+□ Serial awaits that could be parallel (Promise.all)
+□ Fetching related data in a loop instead of JOIN
+```
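The "serial awaits" item in the checklist is framed around JavaScript's `Promise.all`. The same effect can be sketched in Python, where `asyncio.gather` plays the equivalent role. `fetch` is a hypothetical stand-in for any awaitable I/O call, and the 0.1 s delay is arbitrary.

```python
import asyncio
import time

async def fetch(name, delay_s=0.1):
    # Stand-in for an awaitable I/O call such as a database or HTTP request
    await asyncio.sleep(delay_s)
    return name

async def serial():
    # Each await finishes before the next starts: total time is the sum of the delays
    return [await fetch("user"), await fetch("tasks"), await fetch("stats")]

async def parallel():
    # Independent calls run concurrently: total time is roughly the longest delay
    return list(await asyncio.gather(fetch("user"), fetch("tasks"), fetch("stats")))

def timed(coro_fn):
    start = time.perf_counter()
    result = asyncio.run(coro_fn())
    return result, time.perf_counter() - start

serial_result, serial_s = timed(serial)
parallel_result, parallel_s = timed(parallel)
print(f"serial {serial_s:.2f}s, parallel {parallel_s:.2f}s")
```

This only helps when the calls do not depend on each other's results; dependent calls still have to run in order.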
+
+---
+
+## Common Pitfalls
+
+- **Optimizing without measuring** — you'll optimize the wrong thing
+- **Testing in development** — profile against production-like data volumes
+- **Ignoring P99** — P50 can look fine while P99 is catastrophic
+- **Premature optimization** — fix correctness first, then performance
+- **Not re-measuring** — always verify the fix actually improved things
+- **Load testing production** — use staging with production-size data
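To illustrate the P99 pitfall, here is a small sketch with made-up latency samples in which 2% of requests are very slow. The data is hypothetical; `statistics.quantiles` is from the Python standard library.

```python
import statistics

# Hypothetical latency samples: 980 fast requests and 20 very slow ones
latencies_ms = [50] * 980 + [3000] * 20

# n=100 yields 99 cut points; cuts[i - 1] is the i-th percentile
cuts = statistics.quantiles(latencies_ms, n=100)
p50, p99 = cuts[49], cuts[98]
mean = statistics.fmean(latencies_ms)

print(f"P50={p50:.0f}ms mean={mean:.0f}ms P99={p99:.0f}ms")
```

With these samples the median stays at 50 ms and the mean is about 109 ms, while P99 sits at 3000 ms, so a dashboard that shows only P50 or the average would hide the problem.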
+
+---
+
+## Best Practices
+
+1. **Baseline first, always** — record metrics before touching anything
+2. **One change at a time** — isolate the variable to confirm causation
+3. **Profile with realistic data** — 10 rows in dev, millions in prod — different bottlenecks
+4. **Set performance budgets** — `p(95) < 200ms` in CI thresholds with k6
+5. **Monitor continuously** — add Datadog/Prometheus metrics for key paths
+6. **Cache invalidation strategy** — cache aggressively, invalidate precisely
+7. **Document the win** — before/after in the PR description motivates the team
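A performance budget can also be checked outside k6, for example in a CI step that reads measured percentiles. This is a minimal sketch under that assumption; `BUDGETS_MS`, `budget_violations`, and the measured values are hypothetical.

```python
# Hypothetical budgets: each metric must stay strictly under its limit (milliseconds)
BUDGETS_MS = {"p95": 200, "p99": 1000}

def budget_violations(measured_ms, budgets_ms):
    """Return readable violations; an empty list means every budget is met."""
    violations = []
    for metric, limit in budgets_ms.items():
        value = measured_ms.get(metric)
        if value is None:
            violations.append(f"{metric}: no measurement")
        elif value >= limit:
            violations.append(f"{metric}: {value}ms is not under the {limit}ms budget")
    return violations

problems = budget_violations({"p95": 240, "p99": 800}, BUDGETS_MS)
for line in problems:
    print(line)
# A CI wrapper could exit non-zero when problems is non-empty.
```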
--- a/.brain/.agent/skills/engineering-advanced-skills/performance-profiler/references/profiling-recipes.md
+++ b/.brain/.agent/skills/engineering-advanced-skills/performance-profiler/references/profiling-recipes.md
@@ -0,0 +1,475 @@
+# performance-profiler reference
+
+## Node.js Profiling
+
+### CPU Flamegraph
+
+```bash
+# Method 1: clinic.js (best for development)
+npm install -g clinic
+
+# CPU flamegraph
+clinic flame -- node dist/server.js
+
+# Heap profiler
+clinic heapprofiler -- node dist/server.js
+
+# Bubbleprof (async activity and event loop delays)
+clinic bubbleprof -- node dist/server.js
+
+# Load with autocannon while profiling (--on-port runs once the server is listening)
+clinic flame --on-port 'autocannon -c 50 -d 30 http://localhost:$PORT/api/tasks' -- node dist/server.js
+```
+
+```bash
+# Method 2: Node.js built-in profiler
+node --prof dist/server.js
+# After running some load:
+node --prof-process isolate-*.log | head -100
+```
+
+```bash
+# Method 3: V8 CPU profiler via inspector
+node --inspect dist/server.js
+# Open Chrome DevTools → Performance → Record
+```
+
+### Heap Snapshot / Memory Leak Detection
+
+```javascript
+// Add to your server for on-demand heap snapshots
+import v8 from 'v8'
+import fs from 'fs'
+
+// Endpoint: POST /debug/heap-snapshot (protect with auth!)
+app.post('/debug/heap-snapshot', (req, res) => {
+  const filename = `heap-${Date.now()}.heapsnapshot`
+  const snapshot = v8.writeHeapSnapshot(filename)
+  res.json({ snapshot })
+})
+```
+
+```bash
+# Take snapshots over time and compare in Chrome DevTools
+curl -X POST http://localhost:3000/debug/heap-snapshot
+# Wait 5 minutes of load
+curl -X POST http://localhost:3000/debug/heap-snapshot
+# Open both snapshots in Chrome → Memory → Compare
+```
+
+### Detect Event Loop Blocking
+
+```javascript
+// Add blocked-at to detect synchronous blocking
+import blocked from 'blocked-at'
+
+blocked((time, stack) => {
+  console.warn(`Event loop blocked for ${time}ms`)
+  console.warn(stack.join('\n'))
+}, { threshold: 100 }) // Alert if blocked > 100ms
+```
+
+### Node.js Memory Profiling Script
+
+```javascript
+// scripts/memory-profile.mjs
+// Run: node --expose-gc scripts/memory-profile.mjs
+
+function formatBytes(bytes) {
+  return (bytes / 1024 / 1024).toFixed(2) + ' MB'
+}
+
+function measureMemory(label) {
+  const mem = process.memoryUsage()
+  console.log(`\n[${label}]`)
+  console.log(`  RSS:       ${formatBytes(mem.rss)}`)
+  console.log(`  Heap Used: ${formatBytes(mem.heapUsed)}`)
+  console.log(`  Heap Total:${formatBytes(mem.heapTotal)}`)
+  console.log(`  External:  ${formatBytes(mem.external)}`)
+  return mem
+}
+
+const baseline = measureMemory('Baseline')
+
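+// Placeholder workload: replace with the operation you want to profile
+async function someOperation() {
+  return JSON.parse(JSON.stringify({ items: new Array(1000).fill('x') }))
+}
+
+for (let i = 0; i < 1000; i++) {
+  await someOperation()
+}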
+
+const after = measureMemory('After 1000 operations')
+
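+const delta = after.heapUsed - baseline.heapUsed
+console.log(`\n[Delta]`)
+console.log(`  Heap Used: ${delta >= 0 ? '+' : '-'}${formatBytes(Math.abs(delta))}`)
+
+// Heap that stays well above baseline after a forced GC suggests retained objects.
+// global.gc is only defined when Node runs with --expose-gc.
+if (typeof global.gc !== 'function') {
+  console.warn('global.gc unavailable: rerun with --expose-gc for a meaningful result')
+}
+global.gc?.()
+const afterGC = measureMemory('After GC')
+if (afterGC.heapUsed > baseline.heapUsed * 1.1) {
+  console.warn('⚠️  Possible memory leak detected (>10% growth after GC)')
+}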
+```
+
+---
+
+## Python Profiling
+
+### CPU Profiling with py-spy
+
+```bash
+# Install
+pip install py-spy
+
+# Profile a running process (no code changes needed)
+py-spy top --pid $(pgrep -f "uvicorn")
+
+# Generate flamegraph SVG
+py-spy record -o flamegraph.svg --pid $(pgrep -f "uvicorn") --duration 30
+
+# Profile from the start
+py-spy record -o flamegraph.svg -- python -m uvicorn app.main:app
+
+# Open flamegraph.svg in browser — look for wide bars = hot code paths
+```
+
+### cProfile for function-level profiling
+
+```python
+# scripts/profile_endpoint.py
+import cProfile
+import pstats
+import io
+from app.services.task_service import TaskService
+
+def run():
+    service = TaskService()
+    for _ in range(100):
+        service.list_tasks(user_id="user_1", page=1, limit=20)
+
+profiler = cProfile.Profile()
+profiler.enable()
+run()
+profiler.disable()
+
+# Print top 20 functions by cumulative time
+stream = io.StringIO()
+stats = pstats.Stats(profiler, stream=stream)
+stats.sort_stats('cumulative')
+stats.print_stats(20)
+print(stream.getvalue())
+```
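The script above depends on an application-specific `TaskService`. For trying the workflow in isolation, here is a self-contained variant using only the standard library. `slow_sum` and `handler` are hypothetical stand-ins for real code, and the context-manager form of `cProfile.Profile` assumes Python 3.8 or newer.

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # Deliberately CPU-bound so it dominates the profile
    total = 0
    for i in range(n):
        total += i * i
    return total

def handler():
    return [slow_sum(20_000) for _ in range(50)]

# The context manager enables the profiler on entry and disables it on exit
with cProfile.Profile() as profiler:
    handler()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.strip_dirs().sort_stats(pstats.SortKey.CUMULATIVE).print_stats(5)
report = stream.getvalue()
print(report)
```

In the report, `cumtime` includes time spent in callees while `tottime` covers only the function's own body, so a function with high `cumtime` and low `tottime` is usually just a caller of the real hotspot.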
+
+### Memory profiling with memory_profiler
+
+```python
+# pip install memory-profiler
+from memory_profiler import profile
+
+@profile
+def my_function():
+    # Function to profile
+    data = load_large_dataset()
+    result = process(data)
+    return result
+```
+
+```bash
+# Run with line-by-line memory tracking
+python -m memory_profiler scripts/profile_function.py
+
+# Output:
+# Line #    Mem usage    Increment   Line Contents
+# ================================================
+#     10   45.3 MiB   45.3 MiB   def my_function():
+#     11   78.1 MiB   32.8 MiB       data = load_large_dataset()
+#     12  156.2 MiB   78.1 MiB       result = process(data)
+```
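When installing `memory_profiler` is not an option, the standard library's `tracemalloc` can answer a similar question: which lines allocated the most between two points. This is a sketch of that alternative, not the `memory_profiler` API; `build_rows` is a hypothetical workload.

```python
import tracemalloc

def build_rows(n):
    # Hypothetical workload that allocates many small objects
    return [{"id": i, "payload": "x" * 100} for i in range(n)]

tracemalloc.start()
before = tracemalloc.take_snapshot()
rows = build_rows(50_000)
after = tracemalloc.take_snapshot()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Largest allocation differences between the snapshots, grouped by source line
top = after.compare_to(before, "lineno")[:3]
for stat in top:
    print(stat)
print(f"current={current / 1024 / 1024:.1f} MiB, peak={peak / 1024 / 1024:.1f} MiB")
```

Unlike the RSS-based numbers in the output above, `tracemalloc` only sees allocations made through Python's allocator, so memory held by C extensions may not appear.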
+
+---
+
+## Go Profiling with pprof
+
+```go
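+// main.go: expose pprof endpoints
+package main
+
+import (
+    "log"
+    "net/http"
+    _ "net/http/pprof" // registers /debug/pprof/ handlers on http.DefaultServeMux
+)
+
+func main() {
+    // pprof endpoints at /debug/pprof/ (bound to localhost so they are not public)
+    go func() {
+        log.Println(http.ListenAndServe("localhost:6060", nil))
+    }()
+    // ... rest of your app
+}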
+
+```bash
+# CPU profile (30s); URLs are quoted so the shell does not interpret "?"
+go tool pprof -http=:8080 "http://localhost:6060/debug/pprof/profile?seconds=30"
+
+# Memory profile
+go tool pprof -http=:8080 http://localhost:6060/debug/pprof/heap
+
+# Goroutine leak detection
+curl "http://localhost:6060/debug/pprof/goroutine?debug=1"
+
+# In pprof UI: "Flame Graph" view → find the widest bars (width = share of samples)
+```
+
+---
+
+## Bundle Size Analysis
+
+### Next.js Bundle Analyzer
+
+```bash
+# Install
+pnpm add -D @next/bundle-analyzer
+
+# next.config.js (JavaScript: save these lines in the config file, do not paste them into the shell)
+const withBundleAnalyzer = require('@next/bundle-analyzer')({
+  enabled: process.env.ANALYZE === 'true',
+})
+module.exports = withBundleAnalyzer({})
+
+# Run analyzer
+ANALYZE=true pnpm build
+# Opens browser with treemap of bundle
+```
+
+### What to look for
+
+```bash
+# Find the largest chunks
+pnpm build 2>&1 | grep -E "^\s+(λ|○|●)" | sort -k4 -rh | head -20
+
+# Check if a specific package is too large
+# Visit: https://bundlephobia.com/package/moment@2.29.4
+# moment: 67.9kB gzipped → replace with date-fns (13.8kB) or dayjs (6.9kB)
+
+# Find duplicate packages
+pnpm dedupe --check
+
+# Visualize what's in a chunk
+npx source-map-explorer .next/static/chunks/*.js
+```
+
+### Common bundle wins
+
+```typescript
+// Before: import entire lodash
+import _ from 'lodash'  // 71kB
+
+// After: import only what you need
+import debounce from 'lodash/debounce'  // 2kB
+
+// Before: moment.js
+import moment from 'moment'  // 67kB
+
+// After: dayjs
+import dayjs from 'dayjs'  // 7kB
+
+// Before: static import (always in bundle)
+import HeavyChart from '@/components/HeavyChart'
+
+// After: dynamic import (loaded on demand)
+import dynamic from 'next/dynamic'
+
+const HeavyChart = dynamic(() => import('@/components/HeavyChart'), {
+  loading: () => <Skeleton />,
+})
+```
+
+---
+
+## Database Query Optimization
+
+### Find slow queries
+
+```sql
+-- PostgreSQL: enable pg_stat_statements
+CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
+
+-- Top 20 slowest queries
+SELECT
+  round(mean_exec_time::numeric, 2) AS mean_ms,
+  calls,
+  round(total_exec_time::numeric, 2) AS total_ms,
+  round(stddev_exec_time::numeric, 2) AS stddev_ms,
+  left(query, 80) AS query
+FROM pg_stat_statements
+WHERE calls > 10
+ORDER BY mean_exec_time DESC
+LIMIT 20;
+
+-- Reset stats
+SELECT pg_stat_statements_reset();
+```
+
+```bash
+# MySQL slow query log
+mysql -e "SET GLOBAL slow_query_log = 'ON'; SET GLOBAL long_query_time = 0.1;"
+tail -f /var/log/mysql/slow-query.log
+```
+
+### EXPLAIN ANALYZE
+
+```sql
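+-- EXPLAIN (ANALYZE, BUFFERS) executes the query and reports real timing and buffer usage.
+-- For INSERT/UPDATE/DELETE, run it inside a transaction and roll back.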
+EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
+SELECT t.*, u.name as assignee_name
+FROM tasks t
+LEFT JOIN users u ON u.id = t.assignee_id
+WHERE t.project_id = 'proj_123'
+  AND t.deleted_at IS NULL
+ORDER BY t.created_at DESC
+LIMIT 20;
+
+-- Look for:
+-- Seq Scan on large table → likely needs an index
+-- Nested Loop with a high loops count → inner side is rescanned per outer row; check the join-key index
+-- Sort → can an index provide the ordering?
+-- Hash Join → fine for moderate sizes
+```
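The effect of an index on a plan can be tried without a PostgreSQL server. The sketch below uses SQLite's `EXPLAIN QUERY PLAN` from the Python standard library as a stand-in. Its output format differs from PostgreSQL's `EXPLAIN ANALYZE` and it does not execute the query, so it only illustrates the scan-versus-index distinction. The table and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tasks (id INTEGER PRIMARY KEY, project_id TEXT, title TEXT, created_at TEXT)"
)

QUERY = (
    "SELECT id, title FROM tasks "
    "WHERE project_id = ? ORDER BY created_at DESC LIMIT 20"
)

def query_plan(sql, params):
    # Each EXPLAIN QUERY PLAN row is (id, parent, notused, detail); keep the detail text
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]

plan_before = query_plan(QUERY, ("proj_123",))

# Composite index: the equality column first, then the ORDER BY column
conn.execute("CREATE INDEX idx_tasks_project_created ON tasks (project_id, created_at)")
plan_after = query_plan(QUERY, ("proj_123",))

print("before:", plan_before)
print("after: ", plan_after)
```

Before the index, the plan is a full table scan plus a separate sort step; afterwards the index serves both the filter and the ordering.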
+
+### Detect N+1 Queries
+
+```typescript
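+// Count queries in dev/tests with a custom Drizzle logger
+import { drizzle } from 'drizzle-orm/node-postgres'
+import type { Logger } from 'drizzle-orm/logger'
+import { pool } from './client' // assumes ./client exports a pg Pool
+
+let queryCount = 0
+const countingLogger: Logger = {
+  logQuery(query: string, params: unknown[]) {
+    queryCount++
+  },
+}
+
+// Pass { logger: true } instead to print every query
+const db = drizzle(pool, { logger: countingLogger })
+
+// In tests:
+queryCount = 0
+const tasks = await getTasksWithAssignees(projectId)
+expect(queryCount).toBe(1)  // Fail if it's 21 (1 + 20 N+1s)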
+
+```python
+# Django settings.py: detect N+1 with nplusone (django-silk is an alternative)
+# Assumes INSTALLED_APPS and MIDDLEWARE are lists defined earlier in the file
+INSTALLED_APPS += ['nplusone.ext.django']
+MIDDLEWARE += ['nplusone.ext.django.NPlusOneMiddleware']
+NPLUSONE_RAISE = True  # Raise exception on N+1 in tests
+```
+
+### Fix N+1 — Before/After
+
+```typescript
+// Before: N+1 (1 query for tasks + N queries for assignees)
+const tasks = await db.select().from(tasksTable)
+for (const task of tasks) {
+  task.assignee = await db.select().from(usersTable)
+    .where(eq(usersTable.id, task.assigneeId))
+    .then(r => r[0])
+}
+
+// After: 1 query with JOIN
+const tasks = await db
+  .select({
+    id: tasksTable.id,
+    title: tasksTable.title,
+    assigneeName: usersTable.name,
+    assigneeEmail: usersTable.email,
+  })
+  .from(tasksTable)
+  .leftJoin(usersTable, eq(usersTable.id, tasksTable.assigneeId))
+  .where(eq(tasksTable.projectId, projectId))
+```
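The query-count difference can be reproduced end to end with an in-memory SQLite database. This is a sketch using Python's standard `sqlite3` module as a stand-in for the Drizzle code above; the schema and data are hypothetical, and `set_trace_callback` is used to count executed statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE tasks (id INTEGER PRIMARY KEY, title TEXT, assignee_id INTEGER);
""")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(i, f"user{i}") for i in range(1, 6)])
conn.executemany(
    "INSERT INTO tasks VALUES (?, ?, ?)",
    [(i, f"task{i}", (i % 5) + 1) for i in range(1, 21)],
)
conn.commit()

queries = []
conn.set_trace_callback(queries.append)  # called once per executed SQL statement

def n_plus_one():
    # 1 query for the tasks, then 1 query per task for its assignee
    tasks = conn.execute("SELECT title, assignee_id FROM tasks ORDER BY id").fetchall()
    return [
        (title, conn.execute("SELECT name FROM users WHERE id = ?", (uid,)).fetchone()[0])
        for title, uid in tasks
    ]

def joined():
    # Same data in a single query
    return conn.execute(
        "SELECT t.title, u.name FROM tasks t "
        "LEFT JOIN users u ON u.id = t.assignee_id ORDER BY t.id"
    ).fetchall()

def count_queries(fn):
    queries.clear()
    result = fn()
    return result, len(queries)

slow_rows, slow_count = count_queries(n_plus_one)
fast_rows, fast_count = count_queries(joined)
print(f"N+1: {slow_count} queries, JOIN: {fast_count} query")
```

Both functions return the same rows; only the number of round trips changes, which is why the gap grows with network latency and row count.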
+
+---
+
+## Load Testing with k6
+
+```javascript
+// tests/load/api-load-test.js
+import http from 'k6/http'
+import { check, sleep } from 'k6'
+import { Rate, Trend } from 'k6/metrics'
+
+const errorRate = new Rate('errors')
+const taskListDuration = new Trend('task_list_duration')
+
+export const options = {
+  stages: [
+    { duration: '30s', target: 10 },   // Ramp up to 10 VUs
+    { duration: '1m',  target: 50 },   // Ramp to 50 VUs
+    { duration: '2m',  target: 50 },   // Sustain 50 VUs
+    { duration: '30s', target: 100 },  // Spike to 100 VUs
+    { duration: '1m',  target: 50 },   // Back to 50
+    { duration: '30s', target: 0 },    // Ramp down
+  ],
+  thresholds: {
+    // One key per metric: a repeated key would silently overwrite the earlier threshold
+    http_req_duration: ['p(95)<500', 'p(99)<1000'],  // 95% < 500ms, 99% < 1s
+    errors: ['rate<0.01'],              // Error rate < 1%
+    task_list_duration: ['p(95)<200'],  // Task list specifically < 200ms
+  },
+}
+
+const BASE_URL = __ENV.BASE_URL || 'http://localhost:3000'
+
+export function setup() {
+  // Get auth token once
+  const loginRes = http.post(`${BASE_URL}/api/auth/login`, JSON.stringify({
+    email: 'loadtest@example.com',
+    password: 'loadtest123',
+  }), { headers: { 'Content-Type': 'application/json' } })
+  
+  return { token: loginRes.json('token') }
+}
+
+export default function(data) {
+  const headers = {
+    'Authorization': `Bearer ${data.token}`,
+    'Content-Type': 'application/json',
+  }
+  
+  // Scenario 1: List tasks
+  const listRes = http.get(`${BASE_URL}/api/tasks?limit=20`, { headers })
+  taskListDuration.add(listRes.timings.duration)  // request time as measured by k6
+  
+  const listOk = check(listRes, {
+    'list tasks: status 200': (r) => r.status === 200,
+    'list tasks: has items': (r) => r.json('items') !== undefined,
+  })
+  errorRate.add(!listOk)  // record successes too, so the rate is failures / total
+  
+  sleep(0.5)
+  
+  // Scenario 2: Create task
+  const createRes = http.post(
+    `${BASE_URL}/api/tasks`,
+    JSON.stringify({ title: `Load test task ${Date.now()}`, priority: 'medium' }),
+    { headers }
+  )
+  
+  const createOk = check(createRes, {
+    'create task: status 201': (r) => r.status === 201,
+  })
+  errorRate.add(!createOk)
+  
+  sleep(1)
+}
+
+export function teardown(data) {
+  // Cleanup: delete load test tasks
+}
+```
+
+```bash
+# Run load test
+k6 run tests/load/api-load-test.js \
+  --env BASE_URL=https://staging.myapp.com
+
+# With Grafana output
+k6 run --out influxdb=http://localhost:8086/k6 tests/load/api-load-test.js
+```
+
+---