Hotfix Procedures
Overview
Hotfixes are emergency releases designed to address critical production issues that cannot wait for the regular release cycle. This document outlines classification, procedures, and best practices for managing hotfixes across different development workflows.
Severity Classification
P0 - Critical (Production Down)
Definition: Complete system outage, data corruption, or security breach affecting all users.
Examples:
- Server crashes preventing any user access
- Database corruption causing data loss
- Security vulnerability being actively exploited
- Payment system completely non-functional
- Authentication system failure preventing all logins
Response Requirements:
- Timeline: Fix deployed within 2 hours
- Approval: Engineering Lead + On-call Manager (verbal approval acceptable)
- Process: Emergency deployment bypassing normal gates
- Communication: Immediate notification to all stakeholders
- Documentation: Post-incident review required within 24 hours
Escalation:
- Page on-call engineer immediately
- Escalate to Engineering Lead within 15 minutes
- Notify CEO/CTO if resolution exceeds 4 hours
P1 - High (Major Feature Broken)
Definition: Critical functionality broken, affecting a significant portion of users.
Examples:
- Core user workflow completely broken
- Payment processing failures affecting >50% of transactions
- Search functionality returning no results
- Mobile app crashes on startup
- API returning 500 errors for main endpoints
Response Requirements:
- Timeline: Fix deployed within 24 hours
- Approval: Engineering Lead + Product Manager
- Process: Expedited review and testing
- Communication: Stakeholder notification within 1 hour
- Documentation: Root cause analysis within 48 hours
Escalation:
- Notify on-call engineer within 30 minutes
- Escalate to Engineering Lead within 2 hours
- Daily updates to Product/Business stakeholders
P2 - Medium (Minor Feature Issues)
Definition: Non-critical functionality issues with limited user impact.
Examples:
- Cosmetic UI issues affecting user experience
- Non-essential features not working properly
- Performance degradation not affecting core workflows
- Minor API inconsistencies
- Reporting/analytics data inaccuracies
Response Requirements:
- Timeline: Include in next regular release
- Approval: Standard pull request review process
- Process: Normal development and testing cycle
- Communication: Include in regular release notes
- Documentation: Standard issue tracking
Escalation:
- Create ticket in normal backlog
- No special escalation required
- Include in release planning discussions
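As a rough illustration, the severity matrix above can be encoded as a small decision function. This is a sketch: the field names and the >50% threshold are taken from the examples above, not from any real system.

```python
def classify_severity(outage: bool, data_corruption: bool,
                      users_affected_pct: float,
                      core_workflow_broken: bool) -> str:
    """Map incident signals to a severity label per the matrix above."""
    if outage or data_corruption:
        return "P0"  # production down: 2-hour fix window
    if core_workflow_broken or users_affected_pct > 50:
        return "P1"  # major feature broken: 24-hour fix window
    return "P2"  # minor issue: wait for the next regular release
```

Encoding the matrix like this also makes it easy to build the decision trees mentioned later for onboarding and training.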
Hotfix Workflows by Development Model
Git Flow Hotfix Process
Branch Structure
main (v1.2.3) ← hotfix/security-patch → main (v1.2.4)
→ develop
Step-by-Step Process
1. Create Hotfix Branch

```bash
git checkout main
git pull origin main
git checkout -b hotfix/security-patch
```

2. Implement Fix

- Make minimal changes addressing only the specific issue
- Include tests to prevent regression
- Update version number (patch increment)

```bash
# Fix the issue
git add .
git commit -m "fix: resolve SQL injection vulnerability"

# Version bump
echo "1.2.4" > VERSION
git add VERSION
git commit -m "chore: bump version to 1.2.4"
```

3. Test Fix

- Run automated test suite
- Manual testing of affected functionality
- Security review if applicable

```bash
# Run tests
npm test
python -m pytest

# Security scan
npm audit
bandit -r src/
```

4. Deploy to Staging

```bash
# Deploy hotfix branch to staging
git push origin hotfix/security-patch
# Trigger staging deployment via CI/CD
```

5. Merge to Production

```bash
# Merge to main
git checkout main
git merge --no-ff hotfix/security-patch
git tag -a v1.2.4 -m "Hotfix: Security vulnerability patch"
git push origin main --tags

# Merge back to develop
git checkout develop
git merge --no-ff hotfix/security-patch
git push origin develop

# Clean up
git branch -d hotfix/security-patch
git push origin --delete hotfix/security-patch
```
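The patch-increment convention used in the version bump above can be sketched as follows. This is a simplified take on semantic versioning; real projects typically rely on tooling such as `npm version patch` instead.

```python
def bump_patch(version: str) -> str:
    """Increment the patch component of a MAJOR.MINOR.PATCH version string."""
    major, minor, patch = (int(part) for part in version.split("."))
    return f"{major}.{minor}.{patch + 1}"
```

For example, `bump_patch("1.2.3")` yields `"1.2.4"`, matching the VERSION bump in step 2.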
GitHub Flow Hotfix Process
Branch Structure
main ← hotfix/critical-fix → main (immediate deploy)
Step-by-Step Process
1. Create Fix Branch

```bash
git checkout main
git pull origin main
git checkout -b hotfix/payment-gateway-fix
```

2. Implement and Test

```bash
# Make the fix
git add .
git commit -m "fix(payment): resolve gateway timeout issue"
git push origin hotfix/payment-gateway-fix
```

3. Create Emergency PR

```bash
# Use GitHub CLI or web interface
gh pr create --title "HOTFIX: Payment gateway timeout" \
  --body "Critical fix for payment processing failures" \
  --reviewer engineering-team \
  --label hotfix
```

4. Deploy Branch for Testing

```bash
# Deploy branch to staging for validation
./deploy.sh hotfix/payment-gateway-fix staging
# Quick smoke tests
```

5. Emergency Merge and Deploy

```bash
# After approval, merge and deploy
gh pr merge --squash
# Automatic deployment to production via CI/CD
```
Trunk-based Hotfix Process
Direct Commit Approach
```bash
# For small fixes, commit directly to main
git checkout main
git pull origin main
# Make fix
git add .
git commit -m "fix: resolve memory leak in user session handling"
git push origin main
# Automatic deployment triggers
```
Feature Flag Rollback
```bash
# For feature-related issues, disable via feature flag
curl -X POST https://api.ourapp.com/feature-flags/new-search/disable
# Verify issue resolved
# Plan proper fix for next deployment
```
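Behind an endpoint like the one above there is typically a flag store with a kill switch. A minimal in-process sketch (class and flag names are illustrative, not from any real flag service):

```python
class FeatureFlags:
    """Minimal in-memory flag store with a kill switch for rollbacks."""

    def __init__(self, defaults=None):
        self._flags = dict(defaults or {})

    def disable(self, name):
        # The "rollback" for feature-related issues: flip the flag off
        self._flags[name] = False

    def is_enabled(self, name):
        # Unknown flags default to off, so a typo fails safe
        return self._flags.get(name, False)
```

Disabling `new-search` through this store mirrors the curl call above; the broken code path stays deployed but dormant until the proper fix ships.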
Emergency Response Procedures
Incident Declaration Process
1. Detection and Assessment (0-5 minutes)

- Monitoring alerts or user reports identify the issue
- Assess severity using the classification matrix
- Determine whether a hotfix is required

2. Team Assembly (5-10 minutes)

- Page the appropriate on-call engineer
- Assemble the incident response team
- Establish a communication channel (Slack, Teams)

3. Initial Response (10-30 minutes)

- Create an incident ticket/document
- Begin investigating the root cause
- Implement immediate mitigations if possible

4. Hotfix Development (30 minutes - 2 hours)

- Create a hotfix branch
- Implement the minimal fix
- Test the fix in isolation

5. Deployment (15-30 minutes)

- Deploy to staging for validation
- Deploy to production
- Monitor for successful resolution

6. Verification (15-30 minutes)

- Confirm the issue is resolved
- Monitor system stability
- Update stakeholders
Communication Templates
P0 Initial Alert
🚨 CRITICAL INCIDENT - Production Down
Status: Investigating
Impact: Complete service outage
Affected Users: All users
Started: 2024-01-15 14:30 UTC
Incident Commander: @john.doe
Current Actions:
- Investigating root cause
- Preparing emergency fix
- Will update every 15 minutes
Status Page: https://status.ourapp.com
Incident Channel: #incident-2024-001
P0 Resolution Notice
✅ RESOLVED - Production Restored
Status: Resolved
Resolution Time: 1h 23m
Root Cause: Database connection pool exhaustion
Fix: Increased connection limits and restarted services
Timeline:
14:30 UTC - Issue detected
14:45 UTC - Root cause identified
15:20 UTC - Fix deployed
15:35 UTC - Full functionality restored
Post-incident review scheduled for tomorrow 10:00 AM.
Thank you for your patience.
P1 Status Update
⚠️ Issue Update - Payment Processing
Status: Fix deployed, monitoring
Impact: Payment failures reduced from 45% to <2%
ETA: Complete resolution within 2 hours
Actions taken:
- Deployed hotfix to address timeout issues
- Increased monitoring on payment gateway
- Contacting affected customers
Next update in 30 minutes or when resolved.
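Templates like these can be filled programmatically so that every update carries the same fields. A sketch using `str.format`; the template constant and field names are illustrative:

```python
P1_UPDATE = (
    "⚠️ Issue Update - {component}\n"
    "Status: {status}\n"
    "Impact: {impact}\n"
    "Next update in {next_update_minutes} minutes or when resolved."
)

def render_update(**fields):
    """Fill the status-update template; raises KeyError if a field is missing."""
    return P1_UPDATE.format(**fields)
```

Failing loudly on a missing field is deliberate: an incomplete status update is worse than a delayed one.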
Rollback Procedures
When to Rollback
- Fix doesn't resolve the issue
- Fix introduces new problems
- System stability is compromised
- Data corruption is detected
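The criteria above amount to a simple predicate; encoding them (with hypothetical field names) makes the rollback decision explicit and auditable rather than a judgment call under pressure:

```python
def should_rollback(issue_resolved: bool, new_problems: bool,
                    system_stable: bool, data_corruption: bool) -> bool:
    """Return True if any rollback criterion from the list above is met."""
    return (not issue_resolved
            or new_problems
            or not system_stable
            or data_corruption)
```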
Rollback Process
1. Immediate Assessment (2-5 minutes)

```bash
# Check system health
curl -f https://api.ourapp.com/health
# Review error logs
kubectl logs deployment/app --tail=100
# Check key metrics
```

2. Rollback Execution (5-15 minutes)

```bash
# Git-based rollback
git checkout main
git revert HEAD
git push origin main

# Or container-based rollback
kubectl rollout undo deployment/app

# Or load balancer switch: point the listener back at the previous target group
aws elbv2 modify-listener --listener-arn <listener-arn> \
  --default-actions Type=forward,TargetGroupArn=<previous-target-group-arn>
```

3. Verification (5-10 minutes)

- Confirm rollback successful
- Check system health endpoints
- Verify core functionality working
- Monitor error rates and performance

4. Communication

🔄 ROLLBACK COMPLETE

The hotfix has been rolled back due to [reason]. The system is now stable on the previous version. We are investigating the issue and will provide updates.
Testing Strategies for Hotfixes
Pre-deployment Testing
Automated Testing
```bash
# Run full test suite
npm test
pytest tests/
go test ./...

# Security scanning
npm audit --audit-level high
bandit -r src/
gosec ./...

# Integration tests
./run_integration_tests.sh

# Load testing (if performance-related)
artillery quick --count 100 --num 10 https://staging.ourapp.com
```
Manual Testing Checklist
- Core user workflow functions correctly
- Authentication and authorization working
- Payment processing (if applicable)
- Data integrity maintained
- No new error logs or exceptions
- Performance within acceptable range
- Mobile app functionality (if applicable)
- Third-party integrations working
Staging Validation
```bash
# Deploy to staging
./deploy.sh hotfix/critical-fix staging

# Run smoke tests
curl -f https://staging.ourapp.com/api/health
./smoke_tests.sh

# Manual verification of the specific issue
# Document test results
```
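Smoke checks usually retry a few times before declaring failure, since a single probe can fail transiently right after a deploy. A generic sketch, independent of the transport: the `probe` callable stands in for whatever check you run (e.g. an HTTP GET against the health endpoint).

```python
import time

def smoke_check(probe, attempts=3, delay_seconds=0):
    """Run probe() up to `attempts` times; True as soon as one call succeeds."""
    for i in range(attempts):
        if probe():
            return True
        if delay_seconds and i < attempts - 1:
            time.sleep(delay_seconds)
    return False
```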
Post-deployment Monitoring
Immediate Monitoring (First 30 minutes)
- Error rate and count
- Response time and latency
- CPU and memory usage
- Database connection counts
- Key business metrics
Extended Monitoring (First 24 hours)
- User activity patterns
- Feature usage statistics
- Customer support tickets
- Performance trends
- Security log analysis
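Whatever the metrics source, the check itself reduces to comparing readings against thresholds. A sketch; the metric names and limits below are illustrative:

```python
def breached_thresholds(metrics, thresholds):
    """Return the names of metrics whose readings exceed their thresholds."""
    return sorted(name for name, value in metrics.items()
                  if name in thresholds and value > thresholds[name])
```

For example, readings of `{"error_rate": 0.03, "p95_latency_ms": 450}` against limits of `{"error_rate": 0.01, "p95_latency_ms": 500}` flag only `error_rate`, which would prompt the rollback assessment above.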
Monitoring Scripts
```bash
#!/bin/bash
# monitor_hotfix.sh - Post-deployment monitoring

echo "=== Hotfix Deployment Monitoring ==="
echo "Deployment time: $(date)"
echo

# Check application health
echo "--- Application Health ---"
curl -s https://api.ourapp.com/health | jq '.'

# Check error rates
echo "--- Error Rates (last 30min) ---"
curl -s "https://api.datadog.com/api/v1/query?query=sum:application.errors{*}" \
  -H "DD-API-KEY: $DATADOG_API_KEY" | jq '.series[0].pointlist[-1][1]'

# Check response times
echo "--- Response Times ---"
curl -s "https://api.datadog.com/api/v1/query?query=avg:application.response_time{*}" \
  -H "DD-API-KEY: $DATADOG_API_KEY" | jq '.series[0].pointlist[-1][1]'

# Check database connections
echo "--- Database Status ---"
psql -h db.ourapp.com -U readonly -c "SELECT count(*) AS active_connections FROM pg_stat_activity;"

echo "=== Monitoring Complete ==="
```
Documentation and Learning
Incident Documentation Template
```markdown
# Incident Report: [Brief Description]

## Summary
- **Incident ID:** INC-2024-001
- **Severity:** P0/P1/P2
- **Start Time:** 2024-01-15 14:30 UTC
- **End Time:** 2024-01-15 15:45 UTC
- **Duration:** 1h 15m
- **Impact:** [Description of user/business impact]

## Root Cause
[Detailed explanation of what went wrong and why]

## Timeline
| Time | Event |
|------|-------|
| 14:30 | Issue detected via monitoring alert |
| 14:35 | Incident team assembled |
| 14:45 | Root cause identified |
| 15:00 | Fix developed and tested |
| 15:20 | Fix deployed to production |
| 15:45 | Issue confirmed resolved |

## Resolution
[What was done to fix the issue]

## Lessons Learned

### What went well
- Quick detection through monitoring
- Effective team coordination
- Minimal user impact

### What could be improved
- Earlier detection possible with better alerting
- Testing could have caught this issue
- Communication could be more proactive

## Action Items
- [ ] Improve monitoring for [specific area]
- [ ] Add automated test for [specific scenario]
- [ ] Update documentation for [specific process]
- [ ] Training on [specific topic] for team

## Prevention Measures
[How we'll prevent this from happening again]
```
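The Duration field in the summary can be derived from the start and end times rather than computed by hand. A sketch assuming the template's `YYYY-MM-DD HH:MM UTC` timestamp format:

```python
from datetime import datetime

def incident_duration(start: str, end: str) -> str:
    """Compute an 'Xh Ym' duration from 'YYYY-MM-DD HH:MM UTC' timestamps."""
    fmt = "%Y-%m-%d %H:%M UTC"
    minutes = int((datetime.strptime(end, fmt)
                   - datetime.strptime(start, fmt)).total_seconds() // 60)
    return f"{minutes // 60}h {minutes % 60}m"
```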
Post-Incident Review Process
1. Schedule Review (within 24-48 hours)

- Involve all key participants
- Book a 60-90 minute session
- Prepare the incident timeline

2. Blameless Analysis

- Focus on systems and processes, not individuals
- Understand contributing factors
- Identify improvement opportunities

3. Action Plan

- Concrete, assignable tasks
- Realistic timelines
- Clear success criteria

4. Follow-up

- Track action item completion
- Share learnings with the broader team
- Update procedures based on insights
Knowledge Sharing
Runbook Updates
After each hotfix, update relevant runbooks:
- Add new troubleshooting steps
- Update contact information
- Refine escalation procedures
- Document new tools or processes
Team Training
- Share incident learnings in team meetings
- Conduct tabletop exercises for common scenarios
- Update onboarding materials with hotfix procedures
- Create decision trees for severity classification
Automation Improvements
- Add alerts for new failure modes
- Automate manual steps where possible
- Improve deployment and rollback processes
- Enhance monitoring and observability
Common Pitfalls and Best Practices
Common Pitfalls
❌ Over-engineering the fix
- Making broad changes instead of minimal targeted fix
- Adding features while fixing bugs
- Refactoring unrelated code
❌ Insufficient testing
- Skipping automated tests due to time pressure
- Not testing the exact scenario that caused the issue
- Deploying without staging validation
❌ Poor communication
- Not notifying stakeholders promptly
- Unclear or infrequent status updates
- Forgetting to announce resolution
❌ Inadequate monitoring
- Not watching system health after deployment
- Missing secondary effects of the fix
- Failing to verify the issue is actually resolved
Best Practices
✅ Keep fixes minimal and focused
- Address only the specific issue
- Avoid scope creep or improvements
- Save refactoring for regular releases
✅ Maintain clear communication
- Set up dedicated incident channel
- Provide regular status updates
- Use clear, non-technical language for business stakeholders
✅ Test thoroughly but efficiently
- Focus testing on affected functionality
- Use automated tests where possible
- Validate in staging before production
✅ Document everything
- Maintain timeline of events
- Record decisions and rationale
- Share lessons learned with team
✅ Plan for rollback
- Always have a rollback plan ready
- Test rollback procedure in advance
- Monitor closely after deployment
By following these procedures and continuously improving based on experience, teams can handle production emergencies effectively while minimizing impact and learning from each incident.