Hotfix Procedures
Overview
Hotfixes are emergency releases designed to address critical production issues that cannot wait for the regular release cycle. This document outlines classification, procedures, and best practices for managing hotfixes across different development workflows.
Severity Classification
P0 - Critical (Production Down)
Definition: Complete system outage, data corruption, or security breach affecting all users.
Examples:
- Server crashes preventing any user access
- Database corruption causing data loss
- Security vulnerability being actively exploited
- Payment system completely non-functional
- Authentication system failure preventing all logins
Response Requirements:
- Timeline: Fix deployed within 2 hours
- Approval: Engineering Lead + On-call Manager (verbal approval acceptable)
- Process: Emergency deployment bypassing normal gates
- Communication: Immediate notification to all stakeholders
- Documentation: Post-incident review required within 24 hours
Escalation:
- Page on-call engineer immediately
- Escalate to Engineering Lead within 15 minutes
- Notify CEO/CTO if resolution exceeds 4 hours
P1 - High (Major Feature Broken)
Definition: Critical functionality broken, affecting a significant portion of users.
Examples:
- Core user workflow completely broken
- Payment processing failures affecting >50% of transactions
- Search functionality returning no results
- Mobile app crashes on startup
- API returning 500 errors for main endpoints
Response Requirements:
- Timeline: Fix deployed within 24 hours
- Approval: Engineering Lead + Product Manager
- Process: Expedited review and testing
- Communication: Stakeholder notification within 1 hour
- Documentation: Root cause analysis within 48 hours
Escalation:
- Notify on-call engineer within 30 minutes
- Escalate to Engineering Lead within 2 hours
- Daily updates to Product/Business stakeholders
P2 - Medium (Minor Feature Issues)
Definition: Non-critical functionality issues with limited user impact.
Examples:
- Cosmetic UI issues affecting user experience
- Non-essential features not working properly
- Performance degradation not affecting core workflows
- Minor API inconsistencies
- Reporting/analytics data inaccuracies
Response Requirements:
- Timeline: Include in next regular release
- Approval: Standard pull request review process
- Process: Normal development and testing cycle
- Communication: Include in regular release notes
- Documentation: Standard issue tracking
Escalation:
- Create ticket in normal backlog
- No special escalation required
- Include in release planning discussions
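As a rough illustration, the severity matrix above can be encoded as a small decision function. This is a sketch: the field names and the >50% threshold are taken from the examples above, not from any real system.

```python
def classify_severity(outage: bool, data_corruption: bool,
                      users_affected_pct: float,
                      core_workflow_broken: bool) -> str:
    """Map incident signals to a severity label per the matrix above."""
    if outage or data_corruption:
        return "P0"  # production down: 2-hour fix window
    if core_workflow_broken or users_affected_pct > 50:
        return "P1"  # major feature broken: 24-hour fix window
    return "P2"  # minor issue: wait for the next regular release
```

Encoding the matrix like this also makes it easy to build the decision trees mentioned later for onboarding and training.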
Hotfix Workflows by Development Model
Git Flow Hotfix Process
Branch Structure
main (v1.2.3) ← hotfix/security-patch → main (v1.2.4)
→ develop
Step-by-Step Process
1. Create Hotfix Branch

```bash
git checkout main
git pull origin main
git checkout -b hotfix/security-patch
```

2. Implement Fix

- Make minimal changes addressing only the specific issue
- Include tests to prevent regression
- Update version number (patch increment)

```bash
# Fix the issue
git add .
git commit -m "fix: resolve SQL injection vulnerability"

# Version bump
echo "1.2.4" > VERSION
git add VERSION
git commit -m "chore: bump version to 1.2.4"
```

3. Test Fix

- Run automated test suite
- Manual testing of affected functionality
- Security review if applicable

```bash
# Run tests
npm test
python -m pytest

# Security scan
npm audit
bandit -r src/
```

4. Deploy to Staging

```bash
# Deploy hotfix branch to staging
git push origin hotfix/security-patch
# Trigger staging deployment via CI/CD
```

5. Merge to Production

```bash
# Merge to main
git checkout main
git merge --no-ff hotfix/security-patch
git tag -a v1.2.4 -m "Hotfix: Security vulnerability patch"
git push origin main --tags

# Merge back to develop
git checkout develop
git merge --no-ff hotfix/security-patch
git push origin develop

# Clean up
git branch -d hotfix/security-patch
git push origin --delete hotfix/security-patch
```
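The patch-increment convention used in the version bump above can be sketched as follows. This is a simplified take on semantic versioning; real projects typically rely on tooling such as `npm version patch` instead.

```python
def bump_patch(version: str) -> str:
    """Increment the patch component of a MAJOR.MINOR.PATCH version string."""
    major, minor, patch = (int(part) for part in version.split("."))
    return f"{major}.{minor}.{patch + 1}"
```

For example, `bump_patch("1.2.3")` yields `"1.2.4"`, matching the VERSION bump in step 2.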
GitHub Flow Hotfix Process
Branch Structure
main ← hotfix/critical-fix → main (immediate deploy)
Step-by-Step Process
1. Create Fix Branch

```bash
git checkout main
git pull origin main
git checkout -b hotfix/payment-gateway-fix
```

2. Implement and Test

```bash
# Make the fix
git add .
git commit -m "fix(payment): resolve gateway timeout issue"
git push origin hotfix/payment-gateway-fix
```

3. Create Emergency PR

```bash
# Use GitHub CLI or web interface
gh pr create --title "HOTFIX: Payment gateway timeout" \
  --body "Critical fix for payment processing failures" \
  --reviewer engineering-team \
  --label hotfix
```

4. Deploy Branch for Testing

```bash
# Deploy branch to staging for validation
./deploy.sh hotfix/payment-gateway-fix staging
# Quick smoke tests
```

5. Emergency Merge and Deploy

```bash
# After approval, merge and deploy
gh pr merge --squash
# Automatic deployment to production via CI/CD
```
Trunk-based Hotfix Process
Direct Commit Approach
```bash
# For small fixes, commit directly to main
git checkout main
git pull origin main
# Make fix
git add .
git commit -m "fix: resolve memory leak in user session handling"
git push origin main
# Automatic deployment triggers
```
Feature Flag Rollback
```bash
# For feature-related issues, disable via feature flag
curl -X POST https://api.ourapp.com/feature-flags/new-search/disable
# Verify issue resolved
# Plan proper fix for next deployment
```
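Behind an endpoint like the one above there is typically a flag store with a kill switch. A minimal in-process sketch (class and flag names are illustrative, not from any real flag service):

```python
class FeatureFlags:
    """Minimal in-memory flag store with a kill switch for rollbacks."""

    def __init__(self, defaults=None):
        self._flags = dict(defaults or {})

    def disable(self, name):
        # The "rollback" for feature-related issues: flip the flag off
        self._flags[name] = False

    def is_enabled(self, name):
        # Unknown flags default to off, so a typo fails safe
        return self._flags.get(name, False)
```

Disabling `new-search` through this store mirrors the curl call above; the broken code path stays deployed but dormant until the proper fix ships.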
Emergency Response Procedures
Incident Declaration Process
1. Detection and Assessment (0-5 minutes)

- Monitoring alerts or user reports identify the issue
- Assess severity using the classification matrix
- Determine whether a hotfix is required

2. Team Assembly (5-10 minutes)

- Page the appropriate on-call engineer
- Assemble the incident response team
- Establish a communication channel (Slack, Teams)

3. Initial Response (10-30 minutes)

- Create an incident ticket/document
- Begin investigating the root cause
- Implement immediate mitigations if possible

4. Hotfix Development (30 minutes - 2 hours)

- Create a hotfix branch
- Implement the minimal fix
- Test the fix in isolation

5. Deployment (15-30 minutes)

- Deploy to staging for validation
- Deploy to production
- Monitor for successful resolution

6. Verification (15-30 minutes)

- Confirm the issue is resolved
- Monitor system stability
- Update stakeholders
Communication Templates
P0 Initial Alert
🚨 CRITICAL INCIDENT - Production Down
Status: Investigating
Impact: Complete service outage
Affected Users: All users
Started: 2024-01-15 14:30 UTC
Incident Commander: @john.doe
Current Actions:
- Investigating root cause
- Preparing emergency fix
- Will update every 15 minutes
Status Page: https://status.ourapp.com
Incident Channel: #incident-2024-001
P0 Resolution Notice
✅ RESOLVED - Production Restored
Status: Resolved
Resolution Time: 1h 23m
Root Cause: Database connection pool exhaustion
Fix: Increased connection limits and restarted services
Timeline:
14:30 UTC - Issue detected
14:45 UTC - Root cause identified
15:20 UTC - Fix deployed
15:35 UTC - Full functionality restored
Post-incident review scheduled for tomorrow 10:00 AM.
Thank you for your patience.
P1 Status Update
⚠️ Issue Update - Payment Processing
Status: Fix deployed, monitoring
Impact: Payment failures reduced from 45% to <2%
ETA: Complete resolution within 2 hours
Actions taken:
- Deployed hotfix to address timeout issues
- Increased monitoring on payment gateway
- Contacting affected customers
Next update in 30 minutes or when resolved.
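Templates like these can be filled programmatically so that every update carries the same fields. A sketch using `str.format`; the template constant and field names are illustrative:

```python
P1_UPDATE = (
    "⚠️ Issue Update - {component}\n"
    "Status: {status}\n"
    "Impact: {impact}\n"
    "Next update in {next_update_minutes} minutes or when resolved."
)

def render_update(**fields):
    """Fill the status-update template; raises KeyError if a field is missing."""
    return P1_UPDATE.format(**fields)
```

Failing loudly on a missing field is deliberate: an incomplete status update is worse than a delayed one.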
Rollback Procedures
When to Rollback
- Fix doesn't resolve the issue
- Fix introduces new problems
- System stability is compromised
- Data corruption is detected
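The criteria above amount to a simple predicate; encoding them (with hypothetical field names) makes the rollback decision explicit and auditable rather than a judgment call under pressure:

```python
def should_rollback(issue_resolved: bool, new_problems: bool,
                    system_stable: bool, data_corruption: bool) -> bool:
    """Return True if any rollback criterion from the list above is met."""
    return (not issue_resolved
            or new_problems
            or not system_stable
            or data_corruption)
```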
Rollback Process
1. Immediate Assessment (2-5 minutes)

```bash
# Check system health
curl -f https://api.ourapp.com/health
# Review error logs
kubectl logs deployment/app --tail=100
# Check key metrics
```

2. Rollback Execution (5-15 minutes)

```bash
# Git-based rollback
git checkout main
git revert HEAD
git push origin main

# Or container-based rollback
kubectl rollout undo deployment/app

# Or load balancer switch: point the listener back at the previous target group
aws elbv2 modify-listener --listener-arn <listener-arn> \
  --default-actions Type=forward,TargetGroupArn=<previous-target-group-arn>
```

3. Verification (5-10 minutes)

- Confirm rollback successful
- Check system health endpoints
- Verify core functionality working
- Monitor error rates and performance

4. Communication

🔄 ROLLBACK COMPLETE

The hotfix has been rolled back due to [reason]. The system is now stable on the previous version. We are investigating the issue and will provide updates.
Testing Strategies for Hotfixes
Pre-deployment Testing
Automated Testing
```bash
# Run full test suite
npm test
pytest tests/
go test ./...

# Security scanning
npm audit --audit-level high
bandit -r src/
gosec ./...

# Integration tests
./run_integration_tests.sh

# Load testing (if performance-related)
artillery quick --count 100 --num 10 https://staging.ourapp.com
```
Manual Testing Checklist
- Core user workflow functions correctly
- Authentication and authorization working
- Payment processing (if applicable)
- Data integrity maintained
- No new error logs or exceptions
- Performance within acceptable range
- Mobile app functionality (if applicable)
- Third-party integrations working
Staging Validation
```bash
# Deploy to staging
./deploy.sh hotfix/critical-fix staging

# Run smoke tests
curl -f https://staging.ourapp.com/api/health
./smoke_tests.sh

# Manual verification of the specific issue
# Document test results
```
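Smoke checks usually retry a few times before declaring failure, since a single probe can fail transiently right after a deploy. A generic sketch, independent of the transport: the `probe` callable stands in for whatever check you run (e.g. an HTTP GET against the health endpoint).

```python
import time

def smoke_check(probe, attempts=3, delay_seconds=0):
    """Run probe() up to `attempts` times; True as soon as one call succeeds."""
    for i in range(attempts):
        if probe():
            return True
        if delay_seconds and i < attempts - 1:
            time.sleep(delay_seconds)
    return False
```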
Post-deployment Monitoring
Immediate Monitoring (First 30 minutes)
- Error rate and count
- Response time and latency
- CPU and memory usage
- Database connection counts
- Key business metrics
Extended Monitoring (First 24 hours)
- User activity patterns
- Feature usage statistics
- Customer support tickets
- Performance trends
- Security log analysis
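Whatever the metrics source, the check itself reduces to comparing readings against thresholds. A sketch; the metric names and limits below are illustrative:

```python
def breached_thresholds(metrics, thresholds):
    """Return the names of metrics whose readings exceed their thresholds."""
    return sorted(name for name, value in metrics.items()
                  if name in thresholds and value > thresholds[name])
```

For example, readings of `{"error_rate": 0.03, "p95_latency_ms": 450}` against limits of `{"error_rate": 0.01, "p95_latency_ms": 500}` flag only `error_rate`, which would prompt the rollback assessment above.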
Monitoring Scripts
```bash
#!/bin/bash
# monitor_hotfix.sh - Post-deployment monitoring

echo "=== Hotfix Deployment Monitoring ==="
echo "Deployment time: $(date)"
echo

# Check application health
echo "--- Application Health ---"
curl -s https://api.ourapp.com/health | jq '.'

# Check error rates
echo "--- Error Rates (last 30min) ---"
curl -s "https://api.datadog.com/api/v1/query?query=sum:application.errors{*}" \
  -H "DD-API-KEY: $DATADOG_API_KEY" | jq '.series[0].pointlist[-1][1]'

# Check response times
echo "--- Response Times ---"
curl -s "https://api.datadog.com/api/v1/query?query=avg:application.response_time{*}" \
  -H "DD-API-KEY: $DATADOG_API_KEY" | jq '.series[0].pointlist[-1][1]'

# Check database connections
echo "--- Database Status ---"
psql -h db.ourapp.com -U readonly -c "SELECT count(*) AS active_connections FROM pg_stat_activity;"

echo "=== Monitoring Complete ==="
```
Documentation and Learning
Incident Documentation Template
```markdown
# Incident Report: [Brief Description]

## Summary
- **Incident ID:** INC-2024-001
- **Severity:** P0/P1/P2
- **Start Time:** 2024-01-15 14:30 UTC
- **End Time:** 2024-01-15 15:45 UTC
- **Duration:** 1h 15m
- **Impact:** [Description of user/business impact]

## Root Cause
[Detailed explanation of what went wrong and why]

## Timeline
| Time | Event |
|------|-------|
| 14:30 | Issue detected via monitoring alert |
| 14:35 | Incident team assembled |
| 14:45 | Root cause identified |
| 15:00 | Fix developed and tested |
| 15:20 | Fix deployed to production |
| 15:45 | Issue confirmed resolved |

## Resolution
[What was done to fix the issue]

## Lessons Learned

### What went well
- Quick detection through monitoring
- Effective team coordination
- Minimal user impact

### What could be improved
- Earlier detection possible with better alerting
- Testing could have caught this issue
- Communication could be more proactive

## Action Items
- [ ] Improve monitoring for [specific area]
- [ ] Add automated test for [specific scenario]
- [ ] Update documentation for [specific process]
- [ ] Training on [specific topic] for team

## Prevention Measures
[How we'll prevent this from happening again]
```
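The Duration field in the summary can be derived from the start and end times rather than computed by hand. A sketch assuming the template's `YYYY-MM-DD HH:MM UTC` timestamp format:

```python
from datetime import datetime

def incident_duration(start: str, end: str) -> str:
    """Compute an 'Xh Ym' duration from 'YYYY-MM-DD HH:MM UTC' timestamps."""
    fmt = "%Y-%m-%d %H:%M UTC"
    minutes = int((datetime.strptime(end, fmt)
                   - datetime.strptime(start, fmt)).total_seconds() // 60)
    return f"{minutes // 60}h {minutes % 60}m"
```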
Post-Incident Review Process
1. Schedule Review (within 24-48 hours)

- Involve all key participants
- Book a 60-90 minute session
- Prepare the incident timeline

2. Blameless Analysis

- Focus on systems and processes, not individuals
- Understand contributing factors
- Identify improvement opportunities

3. Action Plan

- Concrete, assignable tasks
- Realistic timelines
- Clear success criteria

4. Follow-up

- Track action item completion
- Share learnings with the broader team
- Update procedures based on insights
Knowledge Sharing
Runbook Updates
After each hotfix, update relevant runbooks:
- Add new troubleshooting steps
- Update contact information
- Refine escalation procedures
- Document new tools or processes
Team Training
- Share incident learnings in team meetings
- Conduct tabletop exercises for common scenarios
- Update onboarding materials with hotfix procedures
- Create decision trees for severity classification
Automation Improvements
- Add alerts for new failure modes
- Automate manual steps where possible
- Improve deployment and rollback processes
- Enhance monitoring and observability
Common Pitfalls and Best Practices
Common Pitfalls
❌ Over-engineering the fix
- Making broad changes instead of minimal targeted fix
- Adding features while fixing bugs
- Refactoring unrelated code
❌ Insufficient testing
- Skipping automated tests due to time pressure
- Not testing the exact scenario that caused the issue
- Deploying without staging validation
❌ Poor communication
- Not notifying stakeholders promptly
- Unclear or infrequent status updates
- Forgetting to announce resolution
❌ Inadequate monitoring
- Not watching system health after deployment
- Missing secondary effects of the fix
- Failing to verify the issue is actually resolved
Best Practices
✅ Keep fixes minimal and focused
- Address only the specific issue
- Avoid scope creep or improvements
- Save refactoring for regular releases
✅ Maintain clear communication
- Set up dedicated incident channel
- Provide regular status updates
- Use clear, non-technical language for business stakeholders
✅ Test thoroughly but efficiently
- Focus testing on affected functionality
- Use automated tests where possible
- Validate in staging before production
✅ Document everything
- Maintain timeline of events
- Record decisions and rationale
- Share lessons learned with team
✅ Plan for rollback
- Always have a rollback plan ready
- Test rollback procedure in advance
- Monitor closely after deployment
By following these procedures and continuously improving based on experience, teams can handle production emergencies effectively while minimizing impact and learning from each incident.