Runbook

A documented set of procedures for diagnosing and resolving specific types of incidents or operational tasks.

A runbook is a collection of standardized procedures that guide an operator through diagnosing and resolving a specific type of incident or performing a routine operational task. Runbooks reduce the cognitive load on on-call engineers by providing step-by-step instructions for known failure modes.

A good runbook includes the symptoms or alert conditions that trigger its use, step-by-step diagnostic procedures, resolution steps with specific commands or actions, escalation criteria for when the runbook doesn't resolve the issue, and links to relevant dashboards and documentation.
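As a rough illustration, the components listed above can be captured in a small data structure so that incomplete runbooks are easy to spot. The sketch below is one possible shape, not a standard format: the `Runbook` class, its field names, the `missing_sections` helper, and the example service `api` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Runbook:
    """Hypothetical structure mirroring the components a good runbook includes."""

    title: str
    triggers: list[str]      # symptoms or alert conditions that call for this runbook
    diagnostics: list[str]   # ordered diagnostic steps
    resolution: list[str]    # ordered resolution steps, with specific commands or actions
    escalation: str          # what to do when the runbook does not resolve the issue
    links: list[str] = field(default_factory=list)  # dashboards and documentation

    def missing_sections(self) -> list[str]:
        """Return the names of sections that are still empty."""
        sections = {
            "triggers": self.triggers,
            "diagnostics": self.diagnostics,
            "resolution": self.resolution,
            "escalation": self.escalation,
            "links": self.links,
        }
        return [name for name, value in sections.items() if not value]


# Example runbook for an assumed service called "api" (illustrative values only).
example = Runbook(
    title="API: elevated 5xx error rate",
    triggers=["5xx rate above 2% for 5 minutes"],
    diagnostics=[
        "Check whether a deployment went out in the last hour",
        "Inspect recent application logs for stack traces",
    ],
    resolution=["Roll back the last release: kubectl rollout undo deployment/api"],
    escalation="Page the service owner if errors persist 15 minutes after rollback",
    links=["<link to error-rate dashboard>"],
)

print(example.missing_sections())
```

A check like `missing_sections` could run in review tooling so that a runbook without escalation criteria or dashboard links is flagged before an incident rather than during one.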

Runbooks are especially valuable for reducing MTTR (mean time to recover) because the responder does not have to work out the diagnosis and resolution from scratch each time. They also enable less experienced engineers to handle on-call shifts confidently. Teams should keep runbooks up to date by reviewing them during postmortems and updating procedures when systems change.
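One way to make the "keep runbooks up to date" advice concrete is to track when each runbook was last reviewed and flag the ones that have gone too long without a review. The sketch below assumes a simple mapping of runbook name to last review date; the `stale_runbooks` function, the runbook names, and the 90-day threshold are illustrative assumptions, not a recommendation from the source.

```python
from datetime import date, timedelta


def stale_runbooks(
    last_reviewed: dict[str, date],
    today: date,
    max_age_days: int = 90,  # assumed threshold; pick what suits your team
) -> list[str]:
    """Return the names of runbooks last reviewed more than max_age_days ago."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        name for name, reviewed in last_reviewed.items() if reviewed < cutoff
    )


# Hypothetical review log.
reviews = {
    "api-high-error-rate": date(2024, 1, 10),
    "db-failover": date(2024, 5, 20),
}

print(stale_runbooks(reviews, today=date(2024, 6, 1)))
# → ['api-high-error-rate']
```

A date-based check only catches neglect; it does not replace updating a runbook whenever a postmortem or a system change shows that a step is wrong.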


Related Terms

Incident Management
The process of detecting, responding to, resolving, and learning from service disruptions.
On-Call
A rotation system where team members are designated to respond to alerts and incidents outside norma...
MTTR (Mean Time to Recover)
The average time it takes to restore a system or service after a failure or incident.
Post-Mortem (Incident Review)
A structured review conducted after an incident to identify root causes and prevent recurrence.
Toil
Repetitive, manual operational work that scales linearly with service growth and can be automated.

