Toil

Repetitive, manual operational work that scales linearly with service growth and can be automated.

Toil is a term from Google's SRE practice for operational work that is manual, repetitive, and automatable, that scales linearly with service size, and that provides no lasting value. Examples include restarting services by hand, manually running deployment scripts, hand-editing configuration files, and manually triaging alerts that could be auto-resolved.

Google's SRE model recommends that SRE teams spend no more than 50% of their time on toil. The remaining time should go toward engineering projects that reduce toil, improve reliability, or add features. When toil exceeds 50%, it crowds out the engineering work needed to build sustainable, scalable systems.
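As a rough illustration of the 50% guideline, the sketch below computes the share of logged hours that counts as toil and checks it against the cap. The `WorkEntry` type, the `is_toil` flag, and the sample week are hypothetical; how a team classifies and records its work is an assumption, not something the SRE model prescribes.

```python
from dataclasses import dataclass

TOIL_CAP = 0.50  # the 50% ceiling from the guideline above


@dataclass
class WorkEntry:
    """One logged piece of work (hypothetical record format)."""
    description: str
    hours: float
    is_toil: bool


def toil_fraction(entries):
    """Return toil hours divided by total hours (0.0 for an empty log)."""
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0
    toil = sum(e.hours for e in entries if e.is_toil)
    return toil / total


def over_toil_cap(entries, cap=TOIL_CAP):
    """True when the toil share is strictly above the cap."""
    return toil_fraction(entries) > cap


# Made-up sample week for one engineer.
week = [
    WorkEntry("Restart stuck worker by hand", 6.0, True),
    WorkEntry("Hand-edit config for new tenant", 10.0, True),
    WorkEntry("Triage auto-resolvable alerts", 8.0, True),
    WorkEntry("Build self-service tenant onboarding", 16.0, False),
]

if __name__ == "__main__":
    print(f"toil share: {toil_fraction(week):.0%}")
    print("over cap" if over_toil_cap(week) else "within cap")
```

In this sample, 24 of 40 hours are toil, so the week lands above the cap. Tracking the same figure week over week shows whether engineering time is being squeezed.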

Reducing toil involves identifying repetitive tasks, automating them (through scripts, self-healing systems, or better tooling), and measuring how much manual time the automation saves. Monitoring and alerting automation, such as Hyperping's automated checks, escalation policies, and status page updates, eliminates a significant category of operational toil.
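The identify-and-measure steps can be sketched with a simple task log. The example below is a minimal sketch that assumes each manual task is recorded as a `(task_name, minutes)` pair. The log format, the function names, and the sample data are all hypothetical.

```python
from collections import defaultdict


def rank_automation_candidates(log):
    """Group a log of (task_name, minutes) pairs by task and return
    (task, occurrences, total_minutes) tuples, costliest task first."""
    totals = defaultdict(lambda: [0, 0.0])
    for task, minutes in log:
        totals[task][0] += 1
        totals[task][1] += minutes
    rows = [(task, count, total) for task, (count, total) in totals.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


def toil_reduction(before_minutes, after_minutes):
    """Fraction of manual time removed, comparing two equal-length periods."""
    if before_minutes <= 0:
        raise ValueError("before_minutes must be positive")
    return (before_minutes - after_minutes) / before_minutes


# Made-up log of manual tasks from one week.
log = [
    ("restart service", 15),
    ("edit config", 30),
    ("restart service", 15),
    ("restart service", 20),
    ("edit config", 30),
]

if __name__ == "__main__":
    for task, count, total in rank_automation_candidates(log):
        print(f"{task}: {count} times, {total:.0f} min")
    # Suppose automating config edits removes 60 of the 110 minutes.
    print(f"reduction: {toil_reduction(110, 50):.0%}")
```

Ranking by total time rather than by frequency is a design choice: a task that happens rarely but takes long can cost more than a frequent quick one.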

Related Terms

SRE (Site Reliability Engineering)
A discipline that applies software engineering practices to infrastructure and operations to build s...
DevOps
A culture and set of practices that unify software development and IT operations to deliver software...
Runbook
A documented set of procedures for diagnosing and resolving specific types of incidents or operation...
Incident Management
The process of detecting, responding to, resolving, and learning from service disruptions.
