A runbook is a collection of standardized procedures that guide an operator through diagnosing and resolving a specific type of incident or performing a routine operational task. Runbooks reduce the cognitive load on on-call engineers by providing step-by-step instructions for known failure modes.
A good runbook includes the symptoms or alert conditions that trigger its use, step-by-step diagnostic procedures, resolution steps with specific commands or actions, escalation criteria for when the runbook doesn't resolve the issue, and links to relevant dashboards and documentation.
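As a rough illustration, the components listed above can be modeled as a small data structure and checked for completeness. This is a minimal sketch, not a standard schema: the `Runbook` class, its field names, the `missing_sections` helper, and the sample disk-usage entry are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Runbook:
    """Hypothetical container for the sections a runbook should have."""
    title: str
    triggers: list[str] = field(default_factory=list)             # symptoms or alert conditions
    diagnostic_steps: list[str] = field(default_factory=list)     # ordered diagnostic procedure
    resolution_steps: list[str] = field(default_factory=list)     # commands or actions to resolve
    escalation_criteria: list[str] = field(default_factory=list)  # when to hand off
    links: list[str] = field(default_factory=list)                # dashboards and documentation


def missing_sections(runbook: Runbook) -> list[str]:
    """Return the names of list-valued sections that are still empty."""
    return [
        f.name
        for f in fields(runbook)
        if isinstance(getattr(runbook, f.name), list) and not getattr(runbook, f.name)
    ]


# Illustrative entry only; the alert wording and steps are made up.
example = Runbook(
    title="Disk usage above 90% on an application host",
    triggers=["Alert: disk usage > 90% for 10 minutes"],
    diagnostic_steps=["Run `df -h` to find the full filesystem"],
    resolution_steps=["Remove or archive old log files"],
    escalation_criteria=["Usage still above 90% after cleanup"],
)

print(missing_sections(example))  # the example leaves `links` empty
```

A check like this could run when a runbook is added or edited, so that a page with no escalation criteria or no dashboard links is flagged before an on-call engineer needs it.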
Runbooks are especially valuable for reducing mean time to resolution (MTTR) because the responder does not have to work out the diagnosis and resolution from scratch each time. They also enable less experienced engineers to handle on-call shifts confidently. Teams should keep runbooks up to date by reviewing them during postmortems and updating procedures whenever the underlying systems change.
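One way to support regular review is to record when each runbook was last checked and flag the ones that have gone too long without a look. The sketch below assumes a simple mapping from runbook name to last review date; the `stale_runbooks` function, the 90-day threshold, and the sample names and dates are assumptions made for illustration, not a recommended policy.

```python
from datetime import date, timedelta


def stale_runbooks(
    last_reviewed: dict[str, date],
    today: date,
    max_age_days: int = 90,  # assumed review interval; pick what suits the team
) -> list[str]:
    """Return, in sorted order, the runbooks last reviewed before the cutoff date."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, reviewed in last_reviewed.items() if reviewed < cutoff)


# Hypothetical review log; names and dates are invented for the example.
reviews = {
    "disk-full": date(2024, 1, 10),
    "db-failover": date(2024, 5, 2),
}

print(stale_runbooks(reviews, today=date(2024, 6, 1)))
```

Passing `today` explicitly keeps the function deterministic and easy to test. A date-based check only catches neglect; reviews triggered by postmortems and system changes are still needed to catch procedures that are recent but wrong.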