Incident Response
Standard operating procedure for handling production incidents.
Overview
This document outlines the steps to follow when a production incident occurs.
Severity Levels
| Level | Description | Response Time |
|---|---|---|
| P0 | Service down, all users affected | Immediate |
| P1 | Major feature broken, many users affected | Within 15 minutes |
| P2 | Minor feature broken, some users affected | Within 1 hour |
| P3 | Cosmetic or low-impact issue | Next business day |
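
As a rough illustration, the response times in the table could be encoded so tooling can compute a respond-by time from the alert timestamp. This is a minimal sketch, not an existing tool: the names `RESPONSE_WINDOWS`, `next_business_day`, and `response_deadline` are hypothetical, "Immediate" is modeled as a zero-length window, and "Next business day" is assumed to mean the end of the next Monday-to-Friday day with holidays ignored.

```python
from datetime import datetime, time, timedelta

# Response windows taken from the severity table.
# "Immediate" is modeled as a zero-length window (assumption).
RESPONSE_WINDOWS = {
    "P0": timedelta(0),
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=1),
}


def next_business_day(moment: datetime) -> datetime:
    """Return the end of the next weekday after `moment`.

    Assumption: business days are Monday to Friday, holidays are ignored.
    """
    candidate = moment + timedelta(days=1)
    while candidate.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        candidate += timedelta(days=1)
    return datetime.combine(candidate.date(), time.max)


def response_deadline(level: str, alerted_at: datetime) -> datetime:
    """Return the latest acceptable response time for an alert."""
    if level == "P3":
        return next_business_day(alerted_at)
    # Raises KeyError for levels that are not in the table.
    return alerted_at + RESPONSE_WINDOWS[level]
```

For example, a P1 alert raised on a Friday at 17:00 would need a response by 17:15 the same day, while a P3 raised at the same moment could wait until the following Monday.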
Response Steps
- Acknowledge the alert and join the incident channel.
- Assess the severity level based on the table above.
- Communicate status to stakeholders.
- Investigate root cause using logs and monitoring dashboards.
- Mitigate the issue (rollback, hotfix, or feature flag).
- Resolve the incident and confirm the fix is deployed.
- Hold a post-mortem within 48 hours for P0 and P1 incidents.
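
The steps above can be sketched as a simple checklist that an incident bot or script might track. This is only an illustration under stated assumptions: the step identifiers, `next_step`, and `postmortem_due` are hypothetical names, the steps are assumed to be worked through in the order listed, and the 48-hour window is assumed to start when the incident is resolved, which this procedure does not specify.

```python
from datetime import datetime, timedelta
from typing import Optional, Set

# Ordered response steps, mirroring the list above (identifiers are hypothetical).
RESPONSE_STEPS = (
    "acknowledge",
    "assess",
    "communicate",
    "investigate",
    "mitigate",
    "resolve",
)

# Severity levels that require a post-mortem, and the window for holding it.
POSTMORTEM_LEVELS = {"P0", "P1"}
POSTMORTEM_WINDOW = timedelta(hours=48)


def next_step(completed: Set[str]) -> Optional[str]:
    """Return the first step not yet completed, or None when all are done."""
    for step in RESPONSE_STEPS:
        if step not in completed:
            return step
    return None


def postmortem_due(level: str, resolved_at: datetime) -> Optional[datetime]:
    """Return the post-mortem deadline, or None if none is required.

    Assumption: the 48 hours are counted from the resolution time.
    """
    if level in POSTMORTEM_LEVELS:
        return resolved_at + POSTMORTEM_WINDOW
    return None
```

With this sketch, `next_step({"acknowledge", "assess"})` points at the communication step, and `postmortem_due("P2", ...)` returns `None` because only P0 and P1 incidents require a post-mortem.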