Toil is a term from Google's SRE practice for operational work that is manual, repetitive, and automatable, that scales linearly with service size, and that provides no lasting value. Examples include restarting services by hand, running deployment scripts manually, hand-editing configuration files, and triaging alerts that could be auto-resolved.
Google's SRE model recommends that SRE teams spend no more than 50% of their time on toil. The remaining time should go toward engineering projects that reduce toil, improve reliability, or add features. When toil exceeds 50%, it crowds out the engineering work needed to build sustainable, scalable systems.
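As a rough illustration of the 50% guideline, the sketch below computes a team's toil share from a simple time log. The `WorkEntry` type, the `toil_fraction` and `over_budget` helpers, and the sample week are hypothetical names and data invented for this example; how a real team classifies work as toil is a judgment call that the code does not capture.

```python
from dataclasses import dataclass

# The 50% ceiling described above, expressed as a fraction.
TOIL_BUDGET = 0.50


@dataclass
class WorkEntry:
    """One line of a (hypothetical) team time log."""
    description: str
    hours: float
    is_toil: bool


def toil_fraction(entries):
    """Return toil hours divided by total hours (0.0 for an empty log)."""
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0
    toil = sum(e.hours for e in entries if e.is_toil)
    return toil / total


def over_budget(entries, budget=TOIL_BUDGET):
    """True when the toil share exceeds the budget."""
    return toil_fraction(entries) > budget


# Illustrative data only: 22 toil hours out of 40 total.
week = [
    WorkEntry("restart stuck workers", 6.0, True),
    WorkEntry("hand-edit config files", 4.0, True),
    WorkEntry("triage auto-resolvable alerts", 12.0, True),
    WorkEntry("build deploy automation", 18.0, False),
]

if __name__ == "__main__":
    print(f"toil share: {toil_fraction(week):.0%}")
    print("over budget" if over_budget(week) else "within budget")
```

Tracking this number per week or per quarter gives a trend line, which is usually more informative than any single reading.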
Reducing toil involves identifying repetitive tasks, automating them (through scripts, self-healing systems, or better tooling), and measuring how much manual time the automation saves. Monitoring and alerting automation, such as Hyperping's automated checks, escalation policies, and status page updates, removes a significant category of operational toil: checking availability by hand, deciding whom to page, and posting status updates manually.
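To make the idea of a self-healing system concrete, here is a minimal sketch of a restart-then-escalate loop. It is not tied to any product's API: `self_heal`, `FlakyService`, and the `max_restarts` parameter are hypothetical names for this example, and the health check and restart action are passed in as plain callables so the logic can be exercised without a real service.

```python
from typing import Callable


def self_heal(is_healthy: Callable[[], bool],
              restart: Callable[[], None],
              max_restarts: int = 3) -> str:
    """Try to recover a service without human involvement.

    Returns "healthy" if no action was needed, "recovered" if a restart
    fixed the problem, or "escalate" if a human should be paged.
    """
    if is_healthy():
        return "healthy"
    for _ in range(max_restarts):
        restart()
        if is_healthy():
            return "recovered"
    return "escalate"


class FlakyService:
    """Stand-in service that becomes healthy after N restarts (test double)."""

    def __init__(self, restarts_needed: int):
        self.restarts_needed = restarts_needed
        self.restarts = 0

    def is_healthy(self) -> bool:
        return self.restarts >= self.restarts_needed

    def restart(self) -> None:
        self.restarts += 1


if __name__ == "__main__":
    svc = FlakyService(restarts_needed=2)
    print(self_heal(svc.is_healthy, svc.restart))
```

Counting how often the loop returns "recovered" versus "escalate" is one simple way to measure the reduction: each "recovered" result is an incident that no longer needed a person. A production version would likely add a delay or backoff between restarts, which is omitted here to keep the sketch short.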