# Migration Patterns Catalog
## Overview
This catalog provides detailed descriptions of proven migration patterns, their use cases, implementation guidelines, and best practices. Each pattern includes code examples and lessons learned from real-world implementations.
## Database Migration Patterns
### 1. Expand-Contract Pattern
**Use Case:** Schema evolution with zero downtime
**Complexity:** Medium
**Risk Level:** Low-Medium
#### Description
The Expand-Contract pattern allows for schema changes without downtime by following a three-phase approach:
1. **Expand:** Add new schema elements alongside existing ones
2. **Migrate:** Dual-write to both old and new schema during transition
3. **Contract:** Remove old schema elements after validation
#### Implementation Steps
```sql
-- Phase 1: Expand
ALTER TABLE users ADD COLUMN email_new VARCHAR(255);
CREATE INDEX CONCURRENTLY idx_users_email_new ON users(email_new);
-- Phase 2: Migrate (Application Code)
-- Write to both columns during transition period
INSERT INTO users (name, email, email_new) VALUES (?, ?, ?);
-- Backfill existing data
UPDATE users SET email_new = email WHERE email_new IS NULL;
-- Phase 3: Contract (after validation)
ALTER TABLE users DROP COLUMN email;
ALTER TABLE users RENAME COLUMN email_new TO email;
```
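On large tables, the single backfill `UPDATE` above can hold locks for a long time. A common refinement is to backfill in small batches; the sketch below illustrates the idea, where the `execute` callable is a stand-in for any DB-API-style helper that runs a statement and returns the affected row count (the subquery-with-`LIMIT` form shown is PostgreSQL-friendly and may need adjusting for other databases):

```python
def backfill_in_batches(execute, batch_size=1000):
    """Backfill users.email_new in batches; returns total rows updated.

    `execute(sql, params)` is assumed to run the statement and return
    the number of rows it changed (as DB-API cursors report via rowcount).
    """
    total = 0
    while True:
        updated = execute(
            "UPDATE users SET email_new = email "
            "WHERE id IN (SELECT id FROM users "
            "WHERE email_new IS NULL LIMIT %s)",
            (batch_size,),
        )
        if updated == 0:
            return total
        total += updated
```

Each iteration touches at most `batch_size` rows, so locks stay short and the backfill can be paused or resumed safely.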
#### Pros and Cons
**Pros:**
- Zero downtime deployments
- Safe rollback at any point
- Gradual transition with validation
**Cons:**
- Increased storage during transition
- More complex application logic
- Extended migration timeline
### 2. Parallel Schema Pattern
**Use Case:** Major database restructuring
**Complexity:** High
**Risk Level:** Medium
#### Description
Run new and old schemas in parallel, using feature flags to gradually route traffic to the new schema while maintaining the ability to rollback quickly.
#### Implementation Example
```python
class DatabaseRouter:
    def __init__(self, feature_flag_service):
        self.feature_flags = feature_flag_service
        self.old_db = OldDatabaseConnection()
        self.new_db = NewDatabaseConnection()

    def route_query(self, user_id, query_type):
        if self.feature_flags.is_enabled("new_schema", user_id):
            return self.new_db.execute(query_type)
        else:
            return self.old_db.execute(query_type)

    def dual_write(self, data):
        # Write to both databases for consistency
        success_old = self.old_db.write(data)
        success_new = self.new_db.write(transform_data(data))
        if not (success_old and success_new):
            # Handle partial failures
            self.handle_dual_write_failure(data, success_old, success_new)
```
#### Best Practices
- Implement data consistency checks between schemas
- Use circuit breakers for automatic failover
- Monitor performance impact of dual writes
- Plan for data reconciliation processes
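The first best practice above, consistency checks between schemas, can be sketched as a periodic comparison of snapshots from both stores. A minimal illustrative version follows; the `fetch_old`/`fetch_new` callables are assumptions (any function returning rows as dicts), not part of a specific library:

```python
def check_consistency(fetch_old, fetch_new, key_field="id"):
    """Compare two schema snapshots; report missing and differing rows.

    fetch_old/fetch_new are assumed callables returning lists of dicts
    keyed by `key_field`.
    """
    old_rows = {r[key_field]: r for r in fetch_old()}
    new_rows = {r[key_field]: r for r in fetch_new()}
    mismatched = sorted(
        k for k in set(old_rows) & set(new_rows)
        if old_rows[k] != new_rows[k]
    )
    return {
        "missing_in_new": sorted(set(old_rows) - set(new_rows)),
        "missing_in_old": sorted(set(new_rows) - set(old_rows)),
        "mismatched": mismatched,
    }
```

In practice the snapshots would be sampled or checksummed rather than fully materialized, but the reconciliation report shape stays the same.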
### 3. Event Sourcing Migration
**Use Case:** Migrating systems with complex business logic
**Complexity:** High
**Risk Level:** Medium-High
#### Description
Capture all changes as events during migration, enabling replay and reconciliation capabilities.
#### Event Store Schema
```sql
CREATE TABLE migration_events (
    event_id UUID PRIMARY KEY,
    aggregate_id UUID NOT NULL,
    event_type VARCHAR(100) NOT NULL,
    event_data JSONB NOT NULL,
    event_version INTEGER NOT NULL,
    occurred_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    processed_at TIMESTAMP WITH TIME ZONE
);
```
#### Migration Event Handler
```python
from datetime import datetime

class MigrationEventHandler:
    def __init__(self, old_store, new_store):
        self.old_store = old_store
        self.new_store = new_store
        self.event_log = []

    def handle_update(self, entity_id, old_data, new_data):
        # Log the change as an event
        event = MigrationEvent(
            entity_id=entity_id,
            event_type="entity_migrated",
            old_data=old_data,
            new_data=new_data,
            timestamp=datetime.now()
        )
        self.event_log.append(event)
        # Apply to new store
        success = self.new_store.update(entity_id, new_data)
        if not success:
            # Mark for retry
            event.status = "failed"
            self.schedule_retry(event)
        return success

    def replay_events(self, from_timestamp=None):
        """Replay events for reconciliation"""
        events = self.get_events_since(from_timestamp)
        for event in events:
            self.apply_event(event)
```
## Service Migration Patterns
### 1. Strangler Fig Pattern
**Use Case:** Legacy system replacement
**Complexity:** Medium-High
**Risk Level:** Medium
#### Description
Gradually replace legacy functionality by intercepting calls and routing them to new services, eventually "strangling" the legacy system.
#### Implementation Architecture
```yaml
# API Gateway Configuration
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: user-service-migration
spec:
  http:
  - match:
    - headers:
        migration-flag:
          exact: "new"
    route:
    - destination:
        host: user-service-v2
  - route:
    - destination:
        host: user-service-v1
```
#### Strangler Proxy Implementation
```python
class StranglerProxy:
    def __init__(self):
        self.legacy_service = LegacyUserService()
        self.new_service = NewUserService()
        self.feature_flags = FeatureFlagService()

    def handle_request(self, request):
        route = self.determine_route(request)
        if route == "new":
            return self.handle_with_new_service(request)
        elif route == "both":
            return self.handle_with_both_services(request)
        else:
            return self.handle_with_legacy_service(request)

    def determine_route(self, request):
        user_id = request.get('user_id')
        if self.feature_flags.is_enabled("new_user_service", user_id):
            if self.feature_flags.is_enabled("dual_write", user_id):
                return "both"
            else:
                return "new"
        else:
            return "legacy"
```
### 2. Parallel Run Pattern
**Use Case:** Risk mitigation for critical services
**Complexity:** Medium
**Risk Level:** Low-Medium
#### Description
Run both old and new services simultaneously, comparing outputs to validate correctness before switching traffic.
#### Implementation
```python
import asyncio

class ParallelRunManager:
    def __init__(self):
        self.primary_service = PrimaryService()
        self.candidate_service = CandidateService()
        self.comparator = ResponseComparator()
        self.metrics = MetricsCollector()

    async def parallel_execute(self, request):
        # Execute both services concurrently
        primary_task = asyncio.create_task(
            self.primary_service.process(request)
        )
        candidate_task = asyncio.create_task(
            self.candidate_service.process(request)
        )
        # Always wait for primary
        primary_result = await primary_task
        try:
            # Wait for candidate with timeout
            candidate_result = await asyncio.wait_for(
                candidate_task, timeout=5.0
            )
            # Compare results
            comparison = self.comparator.compare(
                primary_result, candidate_result
            )
            # Record metrics
            self.metrics.record_comparison(comparison)
        except asyncio.TimeoutError:
            self.metrics.record_timeout("candidate")
        except Exception as e:
            self.metrics.record_error("candidate", str(e))
        # Always return primary result
        return primary_result
```
### 3. Blue-Green Deployment Pattern
**Use Case:** Zero-downtime service updates
**Complexity:** Low-Medium
**Risk Level:** Low
#### Description
Maintain two identical production environments (blue and green), switching traffic between them for deployments.
#### Kubernetes Implementation
```yaml
# Blue Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-blue
  labels:
    version: blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: blue
  template:
    metadata:
      labels:
        app: myapp
        version: blue
    spec:
      containers:
      - name: app
        image: myapp:v1.0.0
---
# Green Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-green
  labels:
    version: green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green
    spec:
      containers:
      - name: app
        image: myapp:v2.0.0
---
# Service (switches between blue and green)
apiVersion: v1
kind: Service
metadata:
  name: app-service
spec:
  selector:
    app: myapp
    version: blue  # Change to green for deployment
  ports:
  - port: 80
    targetPort: 8080
```
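The cutover itself is a one-line change to the Service's selector. As an illustrative sketch, the helper below builds the JSON merge-patch body for that change; applying it (for example with `kubectl patch service app-service -p '<patch>'` or the Kubernetes Python client) is left as a commented manual step:

```python
def build_switch_patch(target_version):
    """Return the JSON merge patch that repoints app-service's selector."""
    if target_version not in ("blue", "green"):
        raise ValueError("target_version must be 'blue' or 'green'")
    return {"spec": {"selector": {"app": "myapp", "version": target_version}}}

# Equivalent manual step:
#   kubectl patch service app-service \
#     -p '{"spec":{"selector":{"app":"myapp","version":"green"}}}'
```

Because only the selector changes, rollback is the same operation with the previous color.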
## Infrastructure Migration Patterns
### 1. Lift and Shift Pattern
**Use Case:** Quick cloud migration with minimal changes
**Complexity:** Low-Medium
**Risk Level:** Low
#### Description
Migrate applications to cloud infrastructure with minimal or no code changes, focusing on infrastructure compatibility.
#### Migration Checklist
```yaml
Pre-Migration Assessment:
- inventory_current_infrastructure:
- servers_and_specifications
- network_configuration
- storage_requirements
- security_configurations
- identify_dependencies:
- database_connections
- external_service_integrations
- file_system_dependencies
- assess_compatibility:
- operating_system_versions
- runtime_dependencies
- license_requirements
Migration Execution:
- provision_target_infrastructure:
- compute_instances
- storage_volumes
- network_configuration
- security_groups
- migrate_data:
- database_backup_restore
- file_system_replication
- configuration_files
- update_configurations:
- connection_strings
- environment_variables
- dns_records
- validate_functionality:
- application_health_checks
- end_to_end_testing
- performance_validation
```
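The final validation step in the checklist can be automated as a simple gate that runs every check and refuses sign-off on any failure. A minimal sketch, where the check names and callables are placeholders for real probes (HTTP health endpoints, smoke tests, latency thresholds):

```python
def run_validation(checks):
    """Run (name, callable) health checks; each callable returns truthy on pass.

    Returns (passed, failures) so the cutover can be gated on `passed`.
    A check that raises is counted as a failure rather than aborting the run.
    """
    failures = []
    for name, check in checks:
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            failures.append(name)
    return (len(failures) == 0, failures)
```

Recording the failure list, not just a boolean, keeps the go/no-go decision auditable.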
### 2. Hybrid Cloud Migration
**Use Case:** Gradual cloud adoption with on-premises integration
**Complexity:** High
**Risk Level:** Medium-High
#### Description
Maintain some components on-premises while migrating others to cloud, requiring secure connectivity and data synchronization.
#### Network Architecture
```hcl
# Terraform configuration for hybrid connectivity
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
}

resource "aws_vpn_gateway" "main" {
  vpc_id = aws_vpc.main.id

  tags = {
    Name = "hybrid-vpn-gateway"
  }
}

resource "aws_customer_gateway" "main" {
  bgp_asn    = 65000
  ip_address = var.on_premises_public_ip
  type       = "ipsec.1"

  tags = {
    Name = "on-premises-gateway"
  }
}

resource "aws_vpn_connection" "main" {
  vpn_gateway_id      = aws_vpn_gateway.main.id
  customer_gateway_id = aws_customer_gateway.main.id
  type                = "ipsec.1"
  static_routes_only  = true
}
```
#### Data Synchronization Pattern
```python
class HybridDataSync:
    def __init__(self):
        self.on_prem_db = OnPremiseDatabase()
        self.cloud_db = CloudDatabase()
        self.sync_log = SyncLogManager()

    async def bidirectional_sync(self):
        """Synchronize data between on-premises and cloud"""
        # Get last sync timestamp
        last_sync = self.sync_log.get_last_sync_time()
        # Sync on-prem changes to cloud
        on_prem_changes = self.on_prem_db.get_changes_since(last_sync)
        for change in on_prem_changes:
            await self.apply_change_to_cloud(change)
        # Sync cloud changes to on-prem
        cloud_changes = self.cloud_db.get_changes_since(last_sync)
        for change in cloud_changes:
            await self.apply_change_to_on_prem(change)
        # Handle conflicts
        conflicts = self.detect_conflicts(on_prem_changes, cloud_changes)
        for conflict in conflicts:
            await self.resolve_conflict(conflict)
        # Update sync timestamp
        self.sync_log.record_sync_completion()

    async def apply_change_to_cloud(self, change):
        """Apply on-premises change to cloud database"""
        try:
            if change.operation == "INSERT":
                await self.cloud_db.insert(change.table, change.data)
            elif change.operation == "UPDATE":
                await self.cloud_db.update(change.table, change.key, change.data)
            elif change.operation == "DELETE":
                await self.cloud_db.delete(change.table, change.key)
            self.sync_log.record_success(change.id, "cloud")
        except Exception as e:
            self.sync_log.record_failure(change.id, "cloud", str(e))
            raise
```
### 3. Multi-Cloud Migration
**Use Case:** Avoiding vendor lock-in or regulatory requirements
**Complexity:** Very High
**Risk Level:** High
#### Description
Distribute workloads across multiple cloud providers for resilience, compliance, or cost optimization.
#### Service Mesh Configuration
```yaml
# Istio configuration for multi-cloud service mesh
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: aws-service
spec:
  hosts:
  - aws-service.company.com
  ports:
  - number: 443
    name: https
    protocol: HTTPS
  location: MESH_EXTERNAL
  resolution: DNS
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: multi-cloud-routing
spec:
  hosts:
  - user-service
  http:
  - match:
    - headers:
        region:
          exact: "us-east"
    route:
    - destination:
        host: aws-service.company.com
      weight: 100
  - match:
    - headers:
        region:
          exact: "eu-west"
    route:
    - destination:
        host: gcp-service.company.com
      weight: 100
  - route:  # Default routing
    - destination:
        host: user-service
        subset: local
      weight: 80
    - destination:
        host: aws-service.company.com
      weight: 20
```
## Feature Flag Patterns
### 1. Progressive Rollout Pattern
**Use Case:** Gradual feature deployment with risk mitigation
**Implementation:**
```python
import hashlib
import time

class ProgressiveRollout:
    def __init__(self, feature_name):
        self.feature_name = feature_name
        self.rollout_percentage = 0
        self.user_buckets = {}

    def is_enabled_for_user(self, user_id):
        # Consistent user bucketing: the same user always lands in the
        # same bucket for a given feature name
        user_hash = hashlib.md5(f"{self.feature_name}:{user_id}".encode()).hexdigest()
        bucket = int(user_hash, 16) % 100
        return bucket < self.rollout_percentage

    def increase_rollout(self, target_percentage, step_size=10):
        """Gradually increase rollout percentage"""
        while self.rollout_percentage < target_percentage:
            self.rollout_percentage = min(
                self.rollout_percentage + step_size,
                target_percentage
            )
            # Monitor metrics before next increase
            yield self.rollout_percentage
            time.sleep(300)  # Wait 5 minutes between increases
```
### 2. Circuit Breaker Pattern
**Use Case:** Automatic fallback during migration issues
```python
import time

class MigrationCircuitBreaker:
    def __init__(self, new_service, legacy_service,
                 failure_threshold=5, timeout=60):
        self.new_service = new_service
        self.legacy_service = legacy_service
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.last_failure_time = None
        self.state = 'CLOSED'  # CLOSED, OPEN, HALF_OPEN

    def call_new_service(self, request):
        if self.state == 'OPEN':
            if self.should_attempt_reset():
                self.state = 'HALF_OPEN'
            else:
                return self.fallback_to_legacy(request)
        try:
            response = self.new_service.process(request)
            self.on_success()
            return response
        except Exception:
            self.on_failure()
            return self.fallback_to_legacy(request)

    def fallback_to_legacy(self, request):
        return self.legacy_service.process(request)

    def on_success(self):
        self.failure_count = 0
        self.state = 'CLOSED'

    def on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = 'OPEN'

    def should_attempt_reset(self):
        return (time.time() - self.last_failure_time) >= self.timeout
```
## Migration Anti-Patterns
### 1. Big Bang Migration (Anti-Pattern)
**Why to Avoid:**
- High risk of complete system failure
- Difficult to rollback
- Extended downtime
- All-or-nothing deployment
**Better Alternative:** Use incremental migration patterns like Strangler Fig or Parallel Run.
### 2. No Rollback Plan (Anti-Pattern)
**Why to Avoid:**
- Cannot recover from failures
- Increases business risk
- Panic-driven decisions during issues
**Better Alternative:** Always implement comprehensive rollback procedures before migration.
### 3. Insufficient Testing (Anti-Pattern)
**Why to Avoid:**
- Unknown compatibility issues
- Performance degradation
- Data corruption risks
**Better Alternative:** Implement comprehensive testing at each migration phase.
## Pattern Selection Matrix
| Migration Type | Complexity | Downtime Tolerance | Recommended Pattern |
|---------------|------------|-------------------|-------------------|
| Schema Change | Low | Zero | Expand-Contract |
| Schema Change | High | Zero | Parallel Schema |
| Service Replace | Medium | Zero | Strangler Fig |
| Service Update | Low | Zero | Blue-Green |
| Data Migration | High | Some | Event Sourcing |
| Infrastructure | Low | Some | Lift and Shift |
| Infrastructure | High | Zero | Hybrid Cloud |
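The matrix above can be encoded as a simple lookup, useful as a starting point for planning tooling. This is only a sketch mirroring the table rows; combinations not listed in the matrix return `None` rather than guessing:

```python
# Keys mirror the matrix rows: (migration type, complexity, downtime tolerance).
PATTERN_MATRIX = {
    ("schema change", "low", "zero"): "Expand-Contract",
    ("schema change", "high", "zero"): "Parallel Schema",
    ("service replace", "medium", "zero"): "Strangler Fig",
    ("service update", "low", "zero"): "Blue-Green",
    ("data migration", "high", "some"): "Event Sourcing",
    ("infrastructure", "low", "some"): "Lift and Shift",
    ("infrastructure", "high", "zero"): "Hybrid Cloud",
}

def recommend_pattern(migration_type, complexity, downtime_tolerance):
    """Case-insensitive lookup; None means no direct match in the matrix."""
    key = (migration_type.lower(), complexity.lower(),
           downtime_tolerance.lower())
    return PATTERN_MATRIX.get(key)
```

A `None` result signals that the combination needs case-by-case analysis against risk tolerance and constraints, as the closing section notes.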
## Success Metrics
### Technical Metrics
- Migration completion rate
- System availability during migration
- Performance impact (response time, throughput)
- Error rate changes
- Rollback execution time
### Business Metrics
- Customer impact score
- Revenue protection
- Time to value realization
- Stakeholder satisfaction
### Operational Metrics
- Team efficiency
- Knowledge transfer effectiveness
- Post-migration support requirements
- Documentation completeness
## Lessons Learned
### Common Pitfalls
1. **Underestimating data dependencies** - Always map all data relationships
2. **Insufficient monitoring** - Implement comprehensive observability before migration
3. **Poor communication** - Keep all stakeholders informed throughout the process
4. **Rushed timelines** - Allow adequate time for testing and validation
5. **Ignoring performance impact** - Benchmark before and after migration
### Best Practices
1. **Start with low-risk migrations** - Build confidence and experience
2. **Automate everything possible** - Reduce human error and increase repeatability
3. **Test rollback procedures** - Ensure you can recover from any failure
4. **Monitor continuously** - Use real-time dashboards and alerting
5. **Document everything** - Create comprehensive runbooks and documentation
This catalog serves as a reference for selecting appropriate migration patterns based on specific requirements, risk tolerance, and technical constraints.