MTBF, or Mean Time Between Failures, measures the average elapsed time between one failure and the next for a system that is repaired and returned to service. It is a key reliability metric for repairable systems like web services, databases, and infrastructure.
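MTBF is commonly calculated as total operating time (uptime) divided by the number of failures in the measurement period. For example, if a service accumulated 720 hours of uptime and experienced 2 failures, its MTBF would be 360 hours. Some sources instead write MTBF = MTTF + MTTR, where MTTF is taken as the mean uptime before each failure and MTTR as the mean time to repair. Under that convention MTBF covers the full failure-to-failure cycle, repair time included, so it comes out slightly higher than the uptime-only figure. The two definitions agree only when repair time is negligible, so state which one you are using when you report the number.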
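As a rough illustration of the arithmetic, the Python sketch below computes MTBF from a list of incidents in two ways: uptime divided by the number of failures, and mean uptime plus mean repair time. The `Incident` class, the function names, and the sample numbers are hypothetical and chosen only for illustration. MTTF is used loosely here to mean the mean uptime per failure.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical record of one outage, in hours since the period began."""
    start_hour: float
    end_hour: float


def total_downtime(incidents: list[Incident]) -> float:
    """Sum of the outage durations, in hours."""
    return sum(i.end_hour - i.start_hour for i in incidents)


def mtbf_uptime_only(period_hours: float, incidents: list[Incident]) -> float:
    """Uptime divided by the number of failures."""
    if not incidents:
        raise ValueError("MTBF is undefined when no failures were observed")
    uptime = period_hours - total_downtime(incidents)
    return uptime / len(incidents)


def mttr(incidents: list[Incident]) -> float:
    """Mean time to repair: downtime divided by the number of failures."""
    if not incidents:
        raise ValueError("MTTR is undefined when no failures were observed")
    return total_downtime(incidents) / len(incidents)


def mtbf_full_cycle(period_hours: float, incidents: list[Incident]) -> float:
    """Mean uptime per failure plus mean repair time."""
    return mtbf_uptime_only(period_hours, incidents) + mttr(incidents)


# Assumed sample data: a 720-hour period with two outages of 2 h and 4 h.
incidents = [Incident(100.0, 102.0), Incident(500.0, 504.0)]

print(mtbf_uptime_only(720.0, incidents))  # 357.0
print(mttr(incidents))                     # 3.0
print(mtbf_full_cycle(720.0, incidents))   # 360.0
```

With these sample numbers the two conventions differ by exactly the mean repair time (3 hours), which is why it helps to say which one a reported MTBF uses.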
A higher MTBF indicates a more reliable system. Teams improve MTBF by investing in redundancy, implementing chaos engineering practices, using canary deployments to catch issues early, and monitoring proactively with tools like Hyperping to detect degradation before it becomes a full outage.