The previous chapters covered how to detect, respond to, and communicate about incidents. This chapter focuses on the feedback loop: how to measure your reliability over time and use those measurements to improve.
Two frameworks give you the metrics you need. The SLA/SLO/SLI framework defines your reliability targets and tracks whether you meet them. DORA metrics measure how well your engineering organization delivers and maintains software.
SLA, SLO, and SLI are related but distinct terms, and mixing them up leads to confusion.
An SLI (service level indicator) is a specific, measurable metric that reflects the health of your service. Common SLIs include availability (the percentage of requests served successfully), request latency, error rate, and throughput.
SLIs are facts. They come directly from your monitoring data and tell you exactly how your service performed.
An SLO (service level objective) is an internal target for a specific SLI. If your availability SLI shows 99.97% over the past month and your SLO is 99.95%, you are meeting your objective.
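As a sketch, the SLI-versus-SLO check from the example above can be expressed in a few lines of Python. The request counts here are invented for illustration:

```python
# Compute an availability SLI from request counts and compare it to an SLO.
# The counts below are made-up sample data, not from any real service.

def availability_sli(successful: int, total: int) -> float:
    """Return availability as a percentage of successful requests."""
    return 100.0 * successful / total

SLO_TARGET = 99.95  # internal objective, in percent

sli = availability_sli(successful=2_399_280, total=2_400_000)
status = "meeting" if sli >= SLO_TARGET else "missing"
print(f"SLI: {sli:.2f}%, {status} the {SLO_TARGET}% SLO")
```

The SLI is pure measurement; the comparison against the target is the only place the SLO enters.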
SLOs are set by your team based on user expectations, technical constraints, and business priorities. They should be ambitious enough to keep users satisfied but realistic enough that your team can consistently meet them without heroic effort.
Good SLOs share a few properties: they are measurable from your existing monitoring data, tied to outcomes users actually notice, and achievable without heroic effort.
An SLA (service level agreement) is a contractual commitment to your customers. It defines the minimum service level you promise to deliver and the consequences (typically financial credits) if you fail.
SLAs should always be less strict than your internal SLOs. If your SLO is 99.95% availability, your SLA might promise 99.9%. This buffer gives your team room to address degradations before they become contractual violations.
Hyperping's SLA calculator helps you understand exactly how much downtime different availability targets allow per month and per year.
An error budget is the gap between perfect reliability (100%) and your SLO. If your SLO is 99.95% availability over 30 days, your error budget is 0.05%, which translates to roughly 21.6 minutes of allowed downtime per month.
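The conversion from an SLO to minutes of allowed downtime is simple arithmetic. A minimal sketch, with targets chosen purely for illustration:

```python
# Convert an availability SLO into the minutes of downtime the
# error budget allows over a measurement window.

def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime the error budget permits over the window."""
    error_budget_fraction = (100.0 - slo_percent) / 100.0
    return error_budget_fraction * window_days * 24 * 60

for target in (99.9, 99.95, 99.99):
    print(f"{target}%: {allowed_downtime_minutes(target):.1f} min per 30 days")
```

Each additional nine shrinks the budget by a factor of roughly ten, which is why 99.99% is so much harder to operate than 99.9%.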
Error budgets change the conversation about reliability. Instead of debating whether a deployment is "risky," you check the error budget. If you have 18 minutes of budget remaining this month, your team might delay a risky database migration until next month. If you have used only 2 minutes, there is room for more aggressive changes.
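A budget-aware release decision can be sketched as a simple gate. The 5-minute reserve threshold below is an assumption for illustration, not a policy from this chapter:

```python
# Illustrative error-budget gate: permit risky changes only while a
# comfortable reserve of budget remains. Thresholds are assumptions.

def can_ship_risky_change(budget_minutes: float, consumed_minutes: float,
                          reserve_minutes: float = 5.0) -> bool:
    """Allow a risky deploy only if the remaining budget exceeds the reserve."""
    return budget_minutes - consumed_minutes > reserve_minutes

# A 99.95% SLO over 30 days gives a 21.6-minute budget.
print(can_ship_risky_change(21.6, consumed_minutes=2.0))   # plenty of room left
print(can_ship_risky_change(21.6, consumed_minutes=18.0))  # delay the migration
```

The point is not the specific numbers but that the decision becomes mechanical: the budget, not a debate, decides.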
This framework creates a natural balance between velocity and reliability: while error budget remains, the team can ship changes aggressively; once the budget is exhausted, releases slow down and engineering effort shifts to reliability work.
Track error budget consumption over time. A team that consistently burns through its budget early in the month has a systemic reliability problem. A team that rarely touches its budget might have an SLO that is too lenient.
The DORA (DevOps Research and Assessment) metrics measure software delivery performance. Research across thousands of organizations has shown that these four metrics correlate strongly with both organizational performance and operational reliability.
How often your team deploys to production. High-performing teams deploy on demand, often multiple times per day. Lower-performing teams deploy weekly or monthly.
Frequent deployments mean smaller changesets, which are easier to review, test, and roll back if something goes wrong. A team that deploys once a month ships large, risky batches. A team that deploys 10 times a day ships small, low-risk increments.
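Deployment frequency is easy to derive from a deploy log. A minimal sketch, assuming a hypothetical list of deploy dates:

```python
from datetime import date

# Hypothetical deploy log: dates on which production deploys happened
# during a one-week window (invented sample data).
deploys = [date(2024, 5, d) for d in (1, 1, 2, 3, 3, 3, 6, 7, 8, 8)]

days_in_window = 7
per_day = len(deploys) / days_in_window
print(f"{per_day:.1f} deploys per day over the window")
```

In practice the deploy log would come from your CI/CD system rather than a hardcoded list.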
The time from code commit to code running in production. This includes code review, CI/CD pipeline execution, staging validation, and deployment.
Short lead times mean your team can fix problems quickly. If a customer reports a bug and the fix takes 15 minutes to write but 3 days to reach production, your effective response time is 3 days.
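Lead time can be computed from commit and deploy timestamps. A sketch with invented data (a median is often preferred over a mean here, since one stuck change can skew the average):

```python
from datetime import datetime, timedelta

# Hypothetical (commit time, production deploy time) pairs for three changes.
changes = [
    (datetime(2024, 5, 1, 9, 0), datetime(2024, 5, 1, 10, 30)),  # 1.5 h
    (datetime(2024, 5, 1, 14, 0), datetime(2024, 5, 2, 9, 0)),   # 19 h
    (datetime(2024, 5, 2, 11, 0), datetime(2024, 5, 2, 12, 0)),  # 1 h
]

lead_times = [deployed - committed for committed, deployed in changes]
median = sorted(lead_times)[len(lead_times) // 2]
print(f"median lead time: {median}")
```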
The percentage of deployments that cause a failure in production (an outage, a rollback, or a hotfix). High-performing teams keep this below 15%.
A high change failure rate indicates gaps in testing, code review, or deployment practices. Track which types of changes fail most often and invest in preventing those specific failure modes.
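Change failure rate is a simple ratio; here is a sketch with invented numbers:

```python
# Change failure rate: the share of deploys that caused an outage,
# a rollback, or a hotfix. Counts below are made-up sample data.

def change_failure_rate(deploys: int, failed: int) -> float:
    """Return the change failure rate as a percentage."""
    return 100.0 * failed / deploys

rate = change_failure_rate(deploys=40, failed=3)
print(f"change failure rate: {rate:.1f}%")  # 7.5%, under the 15% threshold
```

The hard part is not the arithmetic but the bookkeeping: consistently tagging which deploys required a rollback or hotfix.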
The average time from when a failure is detected to when the service is restored. This metric encompasses your entire incident response pipeline: detection, alerting, acknowledgment, investigation, and resolution.
MTTR is where all the practices from this guide come together. Fast detection (Chapter 1), reliable alerting (Chapter 4), effective on-call (Chapter 5), and structured incident response (Chapter 6) all reduce your MTTR.
High-performing teams recover from failures in under one hour. Lower-performing teams may take days.
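MTTR is the mean of your incident durations, detection to restoration. A sketch over hypothetical incidents:

```python
from datetime import timedelta

# Hypothetical incident durations, from detection to restoration.
durations = [
    timedelta(minutes=12),
    timedelta(minutes=45),
    timedelta(hours=2),
]

mttr = sum(durations, timedelta()) / len(durations)
print(f"MTTR: {mttr}")
```

In a real pipeline the durations would be derived from incident records (detected-at and resolved-at timestamps) rather than hardcoded.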
Metrics tell you where you stand. Postmortems tell you how to get better.
After every significant incident, your postmortem should produce concrete action items that address the root cause. These action items typically fall into a few categories: detection improvements (better monitoring and alerting), prevention (eliminating the failure mode itself), and mitigation (faster recovery paths such as rollbacks and runbooks).
Track these action items and measure completion rates. A postmortem process that generates action items but never completes them provides little value. Review open action items weekly and treat them with the same priority as feature work.
Start by defining SLIs for your most important services. Set initial SLOs based on your current performance data, giving yourself a small buffer. Calculate your error budgets and share them with the team.
Add DORA metrics to your team dashboard. Review all metrics monthly and look for trends. If MTTR is increasing, investigate whether the problem is in detection, alerting, or response. If change failure rate is climbing, look at your testing and review practices.
Reliability is not a destination. It is a practice that improves incrementally through measurement, feedback, and disciplined follow-through on the lessons your incidents teach you.