Service Recovery

Runbook for recovering a crashed or unresponsive service.

When to Use

Use this runbook when a service is unresponsive, crash-looping, or failing health checks.

Diagnostic Steps

  1. Check service logs for error messages.
  2. Verify resource usage (CPU, memory, disk).
  3. Check if dependent services are healthy.
  4. Review recent deployments or config changes.
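The resource check in step 2 can be partly scripted. Below is a minimal sketch using only the Python standard library; it covers disk usage and CPU load on Unix hosts. The thresholds and function names are assumptions chosen for illustration, not values from this runbook, and memory is left out because the standard library has no portable call for it.

```python
import os
import shutil

# Hypothetical thresholds; tune them for your environment.
DISK_USED_LIMIT = 0.90     # flag when more than 90% of the disk is used
LOAD_PER_CPU_LIMIT = 1.5   # flag when 1-minute load exceeds 1.5 per core


def disk_used_fraction(path="/"):
    """Return the fraction of disk space used at `path` (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def load_per_cpu():
    """Return the 1-minute load average divided by the CPU count (Unix only)."""
    one_minute, _, _ = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)


def resource_warnings(disk_fraction, load):
    """Return a list of human-readable warnings for values over the limits."""
    warnings = []
    if disk_fraction > DISK_USED_LIMIT:
        warnings.append(f"disk usage at {disk_fraction:.0%}")
    if load > LOAD_PER_CPU_LIMIT:
        warnings.append(f"load per CPU at {load:.2f}")
    return warnings


if __name__ == "__main__":
    for warning in resource_warnings(disk_used_fraction(), load_per_cpu()):
        print("WARNING:", warning)
```

Run it on the affected host; no output means neither threshold was crossed.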

Recovery Steps

  1. Restart the service and monitor for recovery.
  2. If the restart fails, roll back to the last known good deployment.
  3. If rollback fails, scale down to zero and investigate.
  4. Check for resource exhaustion (CPU, memory, disk) and increase limits if needed.
  5. Once recovered, scale back up to normal capacity.
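The first three steps form a fallback ladder: each one runs only if the previous one failed. A minimal sketch of that control flow follows. The three callables are hypothetical placeholders for whatever your platform uses to restart, roll back, and scale a service; `restart` and `rollback` are assumed to return True when the service comes back healthy.

```python
def recover(restart, rollback, scale_to_zero):
    """Walk the recovery ladder and report which step ended it.

    restart, rollback: callables returning True if the service recovered.
    scale_to_zero: callable that takes the service out of rotation.
    All three are hypothetical hooks supplied by the caller.
    """
    if restart():
        return "restarted"
    if rollback():
        return "rolled_back"
    # Neither attempt worked: stop the service so it can be investigated.
    scale_to_zero()
    return "scaled_to_zero"
```

For example, `recover(lambda: False, lambda: True, lambda: None)` returns `"rolled_back"`, because the restart failed and the rollback succeeded. Steps 4 and 5 stay manual in this sketch, since raising limits and scaling back up depend on what the investigation finds.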

Escalation

If the service cannot be recovered within 30 minutes, escalate to the on-call lead and open a P0 incident.
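The 30-minute rule can be expressed as a simple deadline check. This is a sketch under stated assumptions: the function name and the `recovered` flag are illustrative, and the clock starts at whatever time you record as the start of the recovery attempt.

```python
from datetime import datetime, timedelta, timezone

# The 30-minute window comes from the escalation rule above.
ESCALATION_WINDOW = timedelta(minutes=30)


def should_escalate(started_at, now=None, recovered=False):
    """Return True once 30 minutes have passed without recovery.

    started_at: timezone-aware datetime when recovery began.
    now: timezone-aware datetime to compare against (defaults to current UTC).
    recovered: set to True once the service is healthy again.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return (not recovered) and (now - started_at >= ESCALATION_WINDOW)
```

Passing `now` explicitly keeps the check easy to test and to replay against incident timelines.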
