Runbooks
Service Recovery
Runbook for recovering a crashed or unresponsive service.
When to Use
Use this runbook when a service is unresponsive, crash-looping, or failing health checks.
Diagnostic Steps
- Check service logs for error messages.
- Verify resource usage (CPU, memory, disk).
- Check if dependent services are healthy.
- Review recent deployments or config changes.
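The first two checks above can be partly scripted. The following is a minimal sketch, not a prescribed tool: the marker strings in ERROR_MARKERS and the function names are assumptions for illustration, and the service's real log format may differ. It uses only the Python standard library.

```python
import shutil

# Hypothetical markers; adjust to the service's actual log format.
ERROR_MARKERS = ("ERROR", "FATAL", "CRITICAL")


def find_error_lines(log_text, markers=ERROR_MARKERS):
    """Return the log lines that contain any of the given markers."""
    return [
        line
        for line in log_text.splitlines()
        if any(marker in line for marker in markers)
    ]


def disk_used_percent(path="/"):
    """Return the percentage of space used on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total
```

CPU and memory checks usually need a platform-specific tool or a third-party library, so they are left out of this sketch.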
Recovery Steps
- Restart the service and monitor for recovery.
- If the restart fails, roll back to the last known good deployment.
- If the rollback fails, scale down to zero and investigate.
- Check for resource exhaustion (CPU, memory, disk) and increase limits if needed.
- Once recovered, scale back up to normal capacity.
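The restart-then-rollback order above can be sketched as a small escalation ladder. This is an illustrative outline under stated assumptions: the step actions and the is_healthy check are hypothetical callables that the operator supplies, since the runbook does not name a platform or tooling.

```python
def recover(steps, is_healthy):
    """Run each (name, action) step in order until the service is healthy.

    steps      -- list of (name, callable) pairs, e.g. restart then rollback
    is_healthy -- hypothetical callable returning True once the service recovers

    Returns the name of the step that restored health, or None if no step did.
    """
    for name, action in steps:
        action()
        if is_healthy():
            return name
    return None


# Simulated usage: the restart does nothing, the rollback restores health.
state = {"healthy": False}


def restart():
    pass  # stand-in for a real restart command


def rollback():
    state["healthy"] = True  # stand-in for a real rollback


result = recover(
    [("restart", restart), ("rollback", rollback)],
    lambda: state["healthy"],
)
print(result)  # prints "rollback"
```

A return value of None corresponds to the case where the rollback also fails, at which point the runbook says to scale down to zero and investigate. A real health check would likely poll for a period after each step instead of checking once.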
Escalation
If the service cannot be recovered within 30 minutes, escalate to the on-call lead and open a P0 incident.
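The 30-minute rule can be expressed as a simple time check. This is a minimal sketch: the 30-minute window comes from the runbook, but the function name and the choice of when the clock starts (here, a started_at timestamp recorded when recovery began) are assumptions.

```python
from datetime import datetime, timedelta, timezone

# Window taken from the runbook's escalation rule.
ESCALATION_WINDOW = timedelta(minutes=30)


def should_escalate(started_at, now=None, recovered=False):
    """Return True if recovery has run for the full window without success.

    started_at -- timezone-aware time when recovery began (assumed start point)
    now        -- current time; defaults to the current UTC time
    recovered  -- True once the service is healthy again
    """
    if recovered:
        return False
    if now is None:
        now = datetime.now(timezone.utc)
    return now - started_at >= ESCALATION_WINDOW
```

Passing now explicitly makes the check easy to test; in normal use it can be omitted.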