# Zero-Downtime Migration Techniques

## Overview

Zero-downtime migrations are critical for maintaining business continuity and user experience during system changes. This guide covers techniques, patterns, and implementation strategies for zero-downtime migrations across databases, applications, and traffic-management layers.

## Core Principles

### 1. Backward Compatibility

Every change must be backward compatible until all clients have migrated to the new version.

### 2. Incremental Changes

Break large changes into smaller, independent increments that can be deployed and validated separately.

### 3. Feature Flags

Use feature toggles to control the rollout of new functionality without code deployments.

### 4. Graceful Degradation

Ensure systems continue to function even when some components are unavailable or degraded.

## Database Zero-Downtime Techniques

### Schema Evolution Without Downtime

#### 1. Additive Changes Only

**Principle:** Only add new elements; never remove or modify existing ones directly.

```sql
-- ✅ Good: Additive change
ALTER TABLE users ADD COLUMN middle_name VARCHAR(50);

-- ❌ Bad: Breaking change
ALTER TABLE users DROP COLUMN email;
```

#### 2. Multi-Phase Schema Evolution

**Phase 1: Expand**

```sql
-- Add new column alongside existing one
ALTER TABLE users ADD COLUMN email_address VARCHAR(255);

-- Add index concurrently (PostgreSQL)
CREATE INDEX CONCURRENTLY idx_users_email_address ON users(email_address);
```

**Phase 2: Dual Write (Application Code)**

```python
# Assumes a Django-style `User` model with both columns mapped.
class UserService:
    def create_user(self, name, email):
        # Write to both old and new columns
        user = User(
            name=name,
            email=email,          # Old column
            email_address=email   # New column
        )
        user.save()
        return user

    def update_email(self, user_id, new_email):
        # Update both columns
        user = User.objects.get(id=user_id)
        user.email = new_email
        user.email_address = new_email
        user.save()
        return user
```

**Phase 3: Backfill Data**

```sql
-- Backfill existing data (in batches)
UPDATE users
SET email_address = email
WHERE email_address IS NULL
  AND id BETWEEN ? AND ?;
```

**Phase 4: Switch Reads**

```python
class UserService:
    def get_user_email(self, user_id):
        user = User.objects.get(id=user_id)
        # Read from the new column, falling back to the old one
        # for rows that have not been backfilled yet
        return user.email_address or user.email
```

**Phase 5: Contract**

```sql
-- After validation, and once no code reads or writes the old column, remove it
ALTER TABLE users DROP COLUMN email;

-- Rename new column if needed
ALTER TABLE users RENAME COLUMN email_address TO email;
```

### 3. Online Schema Changes

#### PostgreSQL Techniques

```sql
-- Safe column addition (on PostgreSQL 11+, a constant DEFAULT does not rewrite the table)
ALTER TABLE orders ADD COLUMN status_new VARCHAR(20) DEFAULT 'pending';

-- Safe index creation
CREATE INDEX CONCURRENTLY idx_orders_status_new ON orders(status_new);

-- Safe constraint addition: add as NOT VALID so existing rows are not scanned
-- while the exclusive lock is held, then validate in a separate step
ALTER TABLE orders ADD CONSTRAINT check_status_new
    CHECK (status_new IN ('pending', 'processing', 'completed', 'cancelled')) NOT VALID;
ALTER TABLE orders VALIDATE CONSTRAINT check_status_new;
```

#### MySQL Techniques

```sql
-- Use pt-online-schema-change for large tables.
-- This is a shell command (Percona Toolkit), not SQL:
--   pt-online-schema-change \
--     --alter "ADD COLUMN status VARCHAR(20) DEFAULT 'pending'" \
--     --execute \
--     D=mydb,t=orders

-- Online DDL (MySQL 5.6+)
ALTER TABLE orders
ADD COLUMN priority INT DEFAULT 1,
ALGORITHM=INPLACE, LOCK=NONE;
```

### 4. Data Migration Strategies

#### Chunked Data Migration

```python
import time

# `db` is assumed to be a database client whose execute() returns a list of dict rows.
# transform_record, insert_or_update, and log_progress are left to the concrete migration.
class DataMigrator:
    def __init__(self, source_table, target_table, chunk_size=1000):
        self.source_table = source_table
        self.target_table = target_table
        self.chunk_size = chunk_size

    def migrate_data(self):
        last_id = 0
        total_migrated = 0

        while True:
            # Get next chunk (keyset pagination on id)
            chunk = self.get_chunk(last_id, self.chunk_size)
            if not chunk:
                break

            # Transform and migrate chunk
            for record in chunk:
                transformed = self.transform_record(record)
                self.insert_or_update(transformed)

            last_id = chunk[-1]['id']
            total_migrated += len(chunk)

            # Brief pause to avoid overwhelming the database
            time.sleep(0.1)
            self.log_progress(total_migrated)

        return total_migrated

    def get_chunk(self, last_id, limit):
        # The table name is interpolated, so it must come from trusted code, never user input
        return db.execute(f"""
            SELECT * FROM {self.source_table}
            WHERE id > %s
            ORDER BY id
            LIMIT %s
        """, (last_id, limit))
```

#### Change Data Capture (CDC)

```python
import json

from kafka import KafkaConsumer  # kafka-python

# TargetDatabase and transform_data are assumed to be provided by the migration.
class CDCProcessor:
    def __init__(self):
        self.kafka_consumer = KafkaConsumer('db_changes')
        self.target_db = TargetDatabase()

    def process_changes(self):
        for message in self.kafka_consumer:
            change = json.loads(message.value)
            if change['operation'] == 'INSERT':
                self.handle_insert(change)
            elif change['operation'] == 'UPDATE':
                self.handle_update(change)
            elif change['operation'] == 'DELETE':
                self.handle_delete(change)

    def handle_insert(self, change):
        transformed_data = self.transform_data(change['after'])
        self.target_db.insert(change['table'], transformed_data)

    def handle_update(self, change):
        key = change['key']
        transformed_data = self.transform_data(change['after'])
        self.target_db.update(change['table'], key, transformed_data)

    def handle_delete(self, change):
        # Deletes carry no 'after' image; remove the row by key
        self.target_db.delete(change['table'], change['key'])
```

## Application Zero-Downtime Techniques

### 1. Blue-Green Deployments

#### Infrastructure Setup

```yaml
# Blue Environment (Current Production)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-blue
  labels:
    version: blue
    app: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: blue
  template:
    metadata:
      labels:
        app: myapp
        version: blue
    spec:
      containers:
      - name: app
        image: myapp:1.0.0
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 10
---
# Green Environment (New Version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-green
  labels:
    version: green
    app: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green
    spec:
      containers:
      - name: app
        image: myapp:2.0.0
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
```

#### Service Switching

```yaml
# Service (switches between blue and green)
apiVersion: v1
kind: Service
metadata:
  name: app-service
spec:
  selector:
    app: myapp
    version: blue  # Switch to 'green' for deployment
  ports:
  - port: 80
    targetPort: 8080
  type: LoadBalancer
```

#### Automated Deployment Script

```bash
#!/bin/bash
# Blue-Green Deployment Script
set -euo pipefail

NAMESPACE="production"
APP_NAME="app"  # Matches the app-blue / app-green / app-service manifests above
NEW_IMAGE="myapp:2.0.0"

# Determine current and target environments
CURRENT_VERSION=$(kubectl -n "$NAMESPACE" get service "$APP_NAME-service" -o jsonpath='{.spec.selector.version}')

if [ "$CURRENT_VERSION" = "blue" ]; then
    TARGET_VERSION="green"
else
    TARGET_VERSION="blue"
fi

echo "Current version: $CURRENT_VERSION"
echo "Target version: $TARGET_VERSION"

# Update target environment with new image
kubectl -n "$NAMESPACE" set image "deployment/$APP_NAME-$TARGET_VERSION" app="$NEW_IMAGE"

# Wait for rollout to complete
kubectl -n "$NAMESPACE" rollout status "deployment/$APP_NAME-$TARGET_VERSION" --timeout=300s

# Run health checks
# Assumes a per-color Service (app-blue / app-green) exists for pre-switch checks;
# it is not part of the manifests above
echo "Running health checks..."
TARGET_IP=$(kubectl -n "$NAMESPACE" get service "$APP_NAME-$TARGET_VERSION" -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

for i in {1..30}; do
    if curl -f "http://$TARGET_IP/health"; then
        echo "Health check passed"
        break
    fi
    if [ "$i" -eq 30 ]; then
        echo "Health check failed after 30 attempts"
        exit 1
    fi
    sleep 2
done

# Switch traffic to new version
kubectl -n "$NAMESPACE" patch service "$APP_NAME-service" \
    -p "{\"spec\":{\"selector\":{\"version\":\"$TARGET_VERSION\"}}}"
echo "Traffic switched to $TARGET_VERSION"

# Monitor for 5 minutes
echo "Monitoring new version..."
sleep 300

# Check if rollback is needed
ERROR_RATE=$(curl -s "http://monitoring.company.com/api/error_rate?service=$APP_NAME" | jq -r '.error_rate')

if (( $(echo "$ERROR_RATE > 0.05" | bc -l) )); then
    echo "Error rate too high ($ERROR_RATE), rolling back..."
    kubectl -n "$NAMESPACE" patch service "$APP_NAME-service" \
        -p "{\"spec\":{\"selector\":{\"version\":\"$CURRENT_VERSION\"}}}"
    exit 1
fi

echo "Deployment successful!"
```

### 2. Canary Deployments

#### Progressive Canary with Istio

```yaml
# Destination Rule
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: myapp-destination
spec:
  host: myapp
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
---
# Virtual Service for Canary
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp-canary
spec:
  hosts:
  - myapp
  http:
  - match:
    - headers:
        canary:
          exact: "true"
    route:
    - destination:
        host: myapp
        subset: v2
  - route:
    - destination:
        host: myapp
        subset: v1
      weight: 95
    - destination:
        host: myapp
        subset: v2
      weight: 5
```

#### Automated Canary Controller

```python
import asyncio

# istio_client and prometheus_client are assumed async wrappers;
# prometheus.query() is assumed to return a single float.
class CanaryController:
    def __init__(self, istio_client, prometheus_client):
        self.istio = istio_client
        self.prometheus = prometheus_client
        self.canary_weight = 5
        self.max_weight = 100
        self.weight_increment = 5
        self.validation_window = 300  # 5 minutes

    async def deploy_canary(self, app_name, new_version):
        """Deploy new version using canary strategy"""
        # Start with small percentage
        await self.update_traffic_split(app_name, self.canary_weight)

        while self.canary_weight < self.max_weight:
            # Monitor metrics for validation window
            await asyncio.sleep(self.validation_window)

            # Check canary health
            if not await self.is_canary_healthy(app_name, new_version):
                await self.rollback_canary(app_name)
                raise Exception("Canary deployment failed health checks")

            # Increase traffic to canary
            self.canary_weight = min(
                self.canary_weight + self.weight_increment,
                self.max_weight
            )
            await self.update_traffic_split(app_name, self.canary_weight)
            print(f"Canary traffic increased to {self.canary_weight}%")

        print("Canary deployment completed successfully")

    async def is_canary_healthy(self, app_name, version):
        """Check if canary version is healthy"""
        # Check error rate (sum() aggregates across series so the ratio is a single value)
        error_rate = await self.prometheus.query(
            f'sum(rate(http_requests_total{{app="{app_name}", version="{version}", status=~"5.."}}[5m]))'
            f' / sum(rate(http_requests_total{{app="{app_name}", version="{version}"}}[5m]))'
        )
        if error_rate > 0.05:  # 5% error rate threshold
            return False

        # Check response time
        p95_latency = await self.prometheus.query(
            f'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket'
            f'{{app="{app_name}", version="{version}"}}[5m]))'
        )
        if p95_latency > 2.0:  # 2 second p95 threshold
            return False

        return True

    async def update_traffic_split(self, app_name, canary_weight):
        """Update Istio virtual service with new traffic split"""
        stable_weight = 100 - canary_weight

        # Subset names must match the DestinationRule above (v1 = stable, v2 = canary).
        # Note: this replaces the http routes, including the header-match rule shown above.
        virtual_service = {
            "apiVersion": "networking.istio.io/v1beta1",
            "kind": "VirtualService",
            "metadata": {"name": f"{app_name}-canary"},
            "spec": {
                "hosts": [app_name],
                "http": [{
                    "route": [
                        {
                            "destination": {"host": app_name, "subset": "v1"},
                            "weight": stable_weight
                        },
                        {
                            "destination": {"host": app_name, "subset": "v2"},
                            "weight": canary_weight
                        }
                    ]
                }]
            }
        }
        await self.istio.apply_virtual_service(virtual_service)
```

### 3. Rolling Updates

#### Kubernetes Rolling Update Strategy

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rolling-update-app
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2        # Can have 2 extra pods during update
      maxUnavailable: 1  # At most 1 pod can be unavailable
  selector:
    matchLabels:
      app: rolling-update-app
  template:
    metadata:
      labels:
        app: rolling-update-app
    spec:
      containers:
      - name: app
        image: myapp:2.0.0
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 2
          timeoutSeconds: 1
          successThreshold: 1
          failureThreshold: 3
        livenessProbe:
          httpGet:
            path: /live
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 10
```

#### Custom Rolling Update Controller

```python
import asyncio

# k8s_client and the pod helper methods used below are assumed to exist.
class RollingUpdateController:
    def __init__(self, k8s_client):
        self.k8s = k8s_client
        self.max_surge = 2
        self.max_unavailable = 1

    async def rolling_update(self, deployment_name, new_image):
        """Perform rolling update with custom logic"""
        deployment = await self.k8s.get_deployment(deployment_name)
        total_replicas = deployment.spec.replicas

        # Batch size: about 20% at a time, capped by max_surge, and never below 1
        # (a batch size of 0 would make range() raise for small deployments)
        batch_size = max(1, min(self.max_surge, total_replicas // 5))

        for i in range(0, total_replicas, batch_size):
            batch_end = min(i + batch_size, total_replicas)
            batch_pods = []

            # Update batch of pods
            for pod_index in range(i, batch_end):
                old_pod = await self.get_pod_by_index(deployment_name, pod_index)

                # Create new pod with new image
                new_pod = await self.create_updated_pod(old_pod, new_image)

                # Wait for new pod to be ready
                await self.wait_for_pod_ready(new_pod.metadata.name)

                # Remove old pod
                await self.k8s.delete_pod(old_pod.metadata.name)
                batch_pods.append(new_pod)

                # Brief pause between pod updates
                await asyncio.sleep(2)

            # Validate only the pods from this batch before continuing
            if not await self.validate_batch_health(batch_pods):
                # Rollback batch
                await self.rollback_batch(batch_pods)
                raise Exception("Rolling update failed validation")

            print(f"Updated {batch_end}/{total_replicas} pods")

        print("Rolling update completed successfully")
```

## Load Balancer and Traffic Management

### 1. Weighted Routing

#### NGINX Configuration

```nginx
upstream backend {
    # Old version - 80% traffic
    server old-app-1:8080 weight=4;
    server old-app-2:8080 weight=4;

    # New version - 20% traffic
    server new-app-1:8080 weight=1;
    server new-app-2:8080 weight=1;
}

server {
    listen 80;

    location / {
        proxy_pass http://backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # Health check headers
        proxy_set_header X-Health-Check-Timeout 5s;
    }
}
```

#### HAProxy Configuration

```haproxy
backend app_servers
    balance roundrobin
    option httpchk GET /health

    # Old version servers
    server old-app-1 old-app-1:8080 check weight 80
    server old-app-2 old-app-2:8080 check weight 80

    # New version servers
    server new-app-1 new-app-1:8080 check weight 20
    server new-app-2 new-app-2:8080 check weight 20

frontend app_frontend
    bind *:80
    default_backend app_servers

    # Custom health check endpoint
    acl health_check path_beg /health
    http-request return status 200 content-type text/plain string "OK" if health_check
```

### 2. Circuit Breaker Implementation

```python
import functools
import time


class CircuitBreakerOpenException(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60, expected_exception=Exception):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.expected_exception = expected_exception
        self.failure_count = 0
        self.last_failure_time = None
        self.state = 'CLOSED'  # CLOSED, OPEN, HALF_OPEN

    def __call__(self, func):
        """Allow use as a decorator: @CircuitBreaker(...)"""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return self.call(func, *args, **kwargs)
        return wrapper

    def call(self, func, *args, **kwargs):
        """Execute function with circuit breaker protection"""
        if self.state == 'OPEN':
            if self._should_attempt_reset():
                self.state = 'HALF_OPEN'
            else:
                raise CircuitBreakerOpenException("Circuit breaker is OPEN")

        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except self.expected_exception:
            self._on_failure()
            raise

    def _should_attempt_reset(self):
        return (
            self.last_failure_time is not None and
            time.time() - self.last_failure_time >= self.recovery_timeout
        )

    def _on_success(self):
        self.failure_count = 0
        self.state = 'CLOSED'

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = 'OPEN'


# Usage with service migration (new_service and old_service are assumed clients)
@CircuitBreaker(failure_threshold=3, recovery_timeout=30)
def call_new_service(request):
    return new_service.process(request)


def handle_request(request):
    try:
        return call_new_service(request)
    except CircuitBreakerOpenException:
        # Fallback to old service
        return old_service.process(request)
```

## Monitoring and Validation

### 1. Health Check Implementation

```python
import asyncio
from datetime import datetime, timezone

import psutil


class HealthChecker:
    def __init__(self):
        self.checks = []

    def add_check(self, name, check_func, timeout=5):
        self.checks.append({
            'name': name,
            'func': check_func,
            'timeout': timeout
        })

    async def run_checks(self):
        """Run all health checks and return status"""
        results = {}
        overall_status = 'healthy'

        for check in self.checks:
            try:
                result = await asyncio.wait_for(
                    check['func'](),
                    timeout=check['timeout']
                )
                results[check['name']] = {
                    'status': 'healthy',
                    'result': result
                }
            except asyncio.TimeoutError:
                results[check['name']] = {
                    'status': 'unhealthy',
                    'error': 'timeout'
                }
                overall_status = 'unhealthy'
            except Exception as e:
                results[check['name']] = {
                    'status': 'unhealthy',
                    'error': str(e)
                }
                overall_status = 'unhealthy'

        return {
            'status': overall_status,
            'checks': results,
            'timestamp': datetime.now(timezone.utc).isoformat()
        }


# Example health checks (`db` and `http_client` are assumed async clients)
health_checker = HealthChecker()

async def database_check():
    """Check database connectivity"""
    result = await db.execute("SELECT 1")
    return result is not None

async def external_api_check():
    """Check external API availability"""
    response = await http_client.get("https://api.example.com/health")
    return response.status_code == 200

async def memory_check():
    """Check memory usage"""
    memory_usage = psutil.virtual_memory().percent
    if memory_usage > 90:
        raise Exception(f"Memory usage too high: {memory_usage}%")
    return f"Memory usage: {memory_usage}%"

health_checker.add_check("database", database_check)
health_checker.add_check("external_api", external_api_check)
health_checker.add_check("memory", memory_check)
```

### 2. Readiness vs Liveness Probes

```yaml
# Kubernetes Pod with proper health checks
apiVersion: v1
kind: Pod
metadata:
  name: app-pod
spec:
  containers:
  - name: app
    image: myapp:2.0.0
    ports:
    - containerPort: 8080

    # Readiness probe - determines if pod should receive traffic
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 3
      timeoutSeconds: 2
      successThreshold: 1
      failureThreshold: 3

    # Liveness probe - determines if pod should be restarted
    livenessProbe:
      httpGet:
        path: /live
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 10
      timeoutSeconds: 5
      successThreshold: 1
      failureThreshold: 3

    # Startup probe - gives app time to start before other probes
    startupProbe:
      httpGet:
        path: /startup
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 5
      timeoutSeconds: 3
      successThreshold: 1
      failureThreshold: 30  # 30 failures x 5s period = up to 150 seconds after the initial delay
```

### 3. Metrics and Alerting

```python
import time

from prometheus_client import Counter, Gauge, Histogram


class MigrationMetrics:
    def __init__(self, prometheus_client):
        self.prometheus = prometheus_client

        # Define custom metrics
        self.migration_progress = Counter(
            'migration_progress_total',
            'Total migration operations completed',
            ['operation', 'status']
        )
        self.migration_duration = Histogram(
            'migration_operation_duration_seconds',
            'Time spent on migration operations',
            ['operation']
        )
        self.system_health = Gauge(
            'system_health_score',
            'Overall system health score (0-1)',
            ['component']
        )
        self.traffic_split = Gauge(
            'traffic_split_percentage',
            'Percentage of traffic going to each version',
            ['version']
        )

    def record_migration_step(self, operation, status, duration=None):
        """Record completion of a migration step"""
        self.migration_progress.labels(operation=operation, status=status).inc()
        # Compare against None so a measured duration of 0 is still recorded
        if duration is not None:
            self.migration_duration.labels(operation=operation).observe(duration)

    def update_health_score(self, component, score):
        """Update health score for a component"""
        self.system_health.labels(component=component).set(score)

    def update_traffic_split(self, version_weights):
        """Update traffic split metrics"""
        for version, weight in version_weights.items():
            self.traffic_split.labels(version=version).set(weight)


# Usage in migration
# (`prometheus_client` here is assumed to be an existing client handle,
# and execute_migration_operation an existing function)
metrics = MigrationMetrics(prometheus_client)

def perform_migration_step(operation):
    start_time = time.time()
    try:
        # Perform migration operation
        result = execute_migration_operation(operation)

        # Record success
        duration = time.time() - start_time
        metrics.record_migration_step(operation, 'success', duration)
        return result
    except Exception:
        # Record failure
        duration = time.time() - start_time
        metrics.record_migration_step(operation, 'failure', duration)
        raise
```

## Rollback Strategies

### 1. Immediate Rollback Triggers

```python
import asyncio

# metrics_client, deployment_client, and the alert/notify helpers are assumed.
class AutoRollbackSystem:
    def __init__(self, metrics_client, deployment_client):
        self.metrics = metrics_client
        self.deployment = deployment_client
        self.rollback_triggers = {
            'error_rate_spike': {
                'threshold': 0.05,      # 5% error rate
                'window': 300,          # 5 minutes
                'auto_rollback': True
            },
            'latency_increase': {
                'threshold': 2.0,       # 2x baseline latency
                'window': 600,          # 10 minutes
                'auto_rollback': False  # Manual confirmation required
            },
            'availability_drop': {
                'threshold': 0.95,      # Below 95% availability
                'window': 120,          # 2 minutes
                'auto_rollback': True
            }
        }

    async def monitor_and_rollback(self, deployment_name):
        """Monitor deployment and trigger rollback if needed"""
        while True:
            for trigger_name, config in self.rollback_triggers.items():
                if await self.check_trigger(trigger_name, config):
                    if config['auto_rollback']:
                        await self.execute_rollback(deployment_name, trigger_name)
                        # Stop monitoring so the rollback is not repeated every cycle
                        return
                    else:
                        await self.alert_for_manual_rollback(deployment_name, trigger_name)

            await asyncio.sleep(30)  # Check every 30 seconds

    async def check_trigger(self, trigger_name, config):
        """Check if rollback trigger condition is met"""
        current_value = await self.metrics.get_current_value(trigger_name)
        baseline_value = await self.metrics.get_baseline_value(trigger_name)

        if trigger_name == 'error_rate_spike':
            return current_value > config['threshold']
        elif trigger_name == 'latency_increase':
            return current_value > baseline_value * config['threshold']
        elif trigger_name == 'availability_drop':
            return current_value < config['threshold']

        return False

    async def execute_rollback(self, deployment_name, reason):
        """Execute automatic rollback"""
        print(f"Executing automatic rollback for {deployment_name}. Reason: {reason}")

        # Get previous revision
        previous_revision = await self.deployment.get_previous_revision(deployment_name)

        # Perform rollback
        await self.deployment.rollback_to_revision(deployment_name, previous_revision)

        # Notify stakeholders
        await self.notify_rollback_executed(deployment_name, reason)
```

### 2. Data Rollback Strategies

```sql
-- Point-in-time recovery setup
-- Create restore point before migration
SELECT pg_create_restore_point('pre_migration_' || to_char(now(), 'YYYYMMDD_HH24MISS'));

-- Rollback using point-in-time recovery
-- (This would be executed on a separate recovery instance)
-- Recovery settings (recovery.conf before PostgreSQL 12; postgresql.conf plus a
-- recovery.signal file from PostgreSQL 12 onward):
-- recovery_target_name = 'pre_migration_20240101_120000'
-- recovery_target_action = 'promote'
```

```python
import time
from datetime import datetime, timezone

# database_client, backup_service, and the helper methods used below are assumed.
class DataRollbackManager:
    def __init__(self, database_client, backup_service):
        self.db = database_client
        self.backup = backup_service

    async def create_rollback_point(self, migration_id):
        """Create a rollback point before migration"""
        rollback_point = {
            'migration_id': migration_id,
            'timestamp': datetime.now(timezone.utc),
            'backup_location': None,
            'schema_snapshot': None
        }

        # Create database backup
        backup_path = await self.backup.create_backup(
            f"pre_migration_{migration_id}_{int(time.time())}"
        )
        rollback_point['backup_location'] = backup_path

        # Capture schema snapshot
        schema_snapshot = await self.capture_schema_snapshot()
        rollback_point['schema_snapshot'] = schema_snapshot

        # Store rollback point metadata
        await self.store_rollback_metadata(rollback_point)

        return rollback_point

    async def execute_rollback(self, migration_id):
        """Execute data rollback to specified point"""
        rollback_point = await self.get_rollback_metadata(migration_id)

        if not rollback_point:
            raise Exception(f"No rollback point found for migration {migration_id}")

        # Stop application traffic (a full restore interrupts service, so treat it as a last resort)
        await self.stop_application_traffic()

        try:
            # Restore from backup
            await self.backup.restore_from_backup(
                rollback_point['backup_location']
            )

            # Validate data integrity
            await self.validate_data_integrity(
                rollback_point['schema_snapshot']
            )

            # Update application configuration
            await self.update_application_config(rollback_point)

            # Resume application traffic
            await self.resume_application_traffic()

            print(f"Data rollback completed successfully for migration {migration_id}")

        except Exception as e:
            # If rollback fails, we have a serious problem
            await self.escalate_rollback_failure(migration_id, str(e))
            raise
```

## Best Practices Summary

### 1. Pre-Migration Checklist

- [ ] Comprehensive backup strategy in place
- [ ] Rollback procedures tested in staging
- [ ] Monitoring and alerting configured
- [ ] Health checks implemented
- [ ] Feature flags configured
- [ ] Team communication plan established
- [ ] Load balancer configuration prepared
- [ ] Database connection pooling optimized

### 2. During Migration

- [ ] Monitor key metrics continuously
- [ ] Validate each phase before proceeding
- [ ] Maintain detailed logs of all actions
- [ ] Keep stakeholders informed of progress
- [ ] Have rollback trigger ready
- [ ] Monitor user experience metrics
- [ ] Watch for performance degradation
- [ ] Validate data consistency

### 3. Post-Migration

- [ ] Continue monitoring for 24-48 hours
- [ ] Validate all business processes
- [ ] Update documentation
- [ ] Conduct post-migration retrospective
- [ ] Archive migration artifacts
- [ ] Update disaster recovery procedures
- [ ] Plan for legacy system decommissioning

### 4. Common Pitfalls to Avoid

- Don't skip testing rollback procedures
- Don't ignore performance impact
- Don't rush through validation phases
- Don't forget to communicate with stakeholders
- Don't assume health checks are sufficient
- Don't neglect data consistency validation
- Don't underestimate time requirements
- Don't overlook dependency impacts

This guide provides a foundation for implementing zero-downtime migrations across system components while maintaining high availability and data integrity.
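The checklists above call for validating data consistency, and the multi-phase schema evolution depends on the old and new columns agreeing before the contract phase. The following is a minimal sketch of such a check, not a definitive implementation. It assumes an in-memory SQLite database standing in for the production store, reuses the `users`, `email`, and `email_address` names from the schema example, and introduces a hypothetical helper `find_mismatches` that walks the table in id-ordered chunks, as the chunked migrator does.

```python
import sqlite3


def find_mismatches(conn, table, old_col, new_col, chunk_size=1000):
    """Yield (id, old_value, new_value) for rows where the two columns disagree.

    Table and column names are interpolated into the query, so they must come
    from trusted code, never from user input.
    """
    last_id = 0
    while True:
        # Keyset pagination: resume after the last id seen in the previous chunk
        rows = conn.execute(
            f"SELECT id, {old_col}, {new_col} FROM {table} "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            return
        for row_id, old_value, new_value in rows:
            # A NULL new column next to a non-NULL old column means the row
            # was never backfilled or dual-written
            if old_value != new_value:
                yield row_id, old_value, new_value
        last_id = rows[-1][0]


# Demo data: row 2 has not been backfilled yet
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, email_address TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        (1, "a@example.com", "a@example.com"),
        (2, "b@example.com", None),
        (3, "c@example.com", "c@example.com"),
    ],
)

mismatches = list(
    find_mismatches(conn, "users", "email", "email_address", chunk_size=2)
)
print(mismatches)  # [(2, 'b@example.com', None)]
```

An empty result is a reasonable gate before dropping the old column. On a busy table, rows written between chunks can produce transient mismatches, so a second pass over the reported ids is worth considering before treating them as real.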