The previous chapters covered how to detect, respond to, and communicate about incidents. This chapter focuses on the feedback loop: how to measure your reliability over time and use those measurements to improve.
Two frameworks give you the metrics you need. The SLA/SLO/SLI framework defines your reliability targets and tracks whether you meet them. DORA metrics measure how well your engineering organization delivers and maintains software.
SLA, SLO, and SLI are related but distinct terms, and mixing them up leads to confusion.
An SLI (service level indicator) is a specific, measurable metric that reflects the health of your service. Common SLIs include availability (the percentage of requests served successfully), request latency, error rate, and throughput.
SLIs are facts. They come directly from your monitoring data and tell you exactly how your service performed.
An SLO (service level objective) is an internal target for a specific SLI. If your availability SLI shows 99.97% over the past month and your SLO is 99.95%, you are meeting your objective.
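As a sketch, the SLI-versus-SLO check from the example above can be expressed in a few lines of Python. The request counts here are invented for illustration:

```python
# Compute an availability SLI from request counts and compare it to an SLO.
# The counts below are made-up sample data, not from any real service.

def availability_sli(successful: int, total: int) -> float:
    """Return availability as a percentage of successful requests."""
    return 100.0 * successful / total

SLO_TARGET = 99.95  # internal objective, in percent

sli = availability_sli(successful=2_399_280, total=2_400_000)
status = "meeting" if sli >= SLO_TARGET else "missing"
print(f"SLI: {sli:.2f}%, {status} the {SLO_TARGET}% SLO")
```

The SLI is pure measurement; the comparison against the target is the only place the SLO enters.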
SLOs are set by your team based on user expectations, technical constraints, and business priorities. They should be ambitious enough to keep users satisfied but realistic enough that your team can consistently meet them without heroic effort.
Good SLOs share a few properties: they are measurable from your existing monitoring data, tied to outcomes users actually notice, and achievable without heroic effort.
An SLA (service level agreement) is a contractual commitment to your customers. It defines the minimum service level you promise to deliver and the consequences (typically financial credits) if you fail.
SLAs should always be less strict than your internal SLOs. If your SLO is 99.95% availability, your SLA might promise 99.9%. This buffer gives your team room to address degradations before they become contractual violations.
Hyperping's SLA calculator helps you understand exactly how much downtime different availability targets allow per month and per year.
An error budget is the gap between perfect reliability (100%) and your SLO. If your SLO is 99.95% availability over 30 days, your error budget is 0.05%, which translates to roughly 21.6 minutes of allowed downtime per month.
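The conversion from an SLO to minutes of allowed downtime is simple arithmetic. A minimal sketch, with targets chosen purely for illustration:

```python
# Convert an availability SLO into the minutes of downtime the
# error budget allows over a measurement window.

def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime the error budget permits over the window."""
    error_budget_fraction = (100.0 - slo_percent) / 100.0
    return error_budget_fraction * window_days * 24 * 60

for target in (99.9, 99.95, 99.99):
    print(f"{target}%: {allowed_downtime_minutes(target):.1f} min per 30 days")
```

Each additional nine shrinks the budget by a factor of roughly ten, which is why 99.99% is so much harder to operate than 99.9%.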
Error budgets change the conversation about reliability. Instead of debating whether a deployment is "risky," you check the error budget. If you have 18 minutes of budget remaining this month, your team might delay a risky database migration until next month. If you have used only 2 minutes, there is room for more aggressive changes.
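A budget-aware release decision can be sketched as a simple gate. The 5-minute reserve threshold below is an assumption for illustration, not a policy from this chapter:

```python
# Illustrative error-budget gate: permit risky changes only while a
# comfortable reserve of budget remains. Thresholds are assumptions.

def can_ship_risky_change(budget_minutes: float, consumed_minutes: float,
                          reserve_minutes: float = 5.0) -> bool:
    """Allow a risky deploy only if the remaining budget exceeds the reserve."""
    return budget_minutes - consumed_minutes > reserve_minutes

# A 99.95% SLO over 30 days gives a 21.6-minute budget.
print(can_ship_risky_change(21.6, consumed_minutes=2.0))   # plenty of room left
print(can_ship_risky_change(21.6, consumed_minutes=18.0))  # delay the migration
```

The point is not the specific numbers but that the decision becomes mechanical: the budget, not a debate, decides.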
This framework creates a natural balance between velocity and reliability: while error budget remains, the team can ship changes aggressively; once the budget is exhausted, releases slow down and engineering effort shifts to reliability work.
Track error budget consumption over time. A team that consistently burns through its budget early in the month has a systemic reliability problem. A team that rarely touches its budget might have an SLO that is too lenient.
The DORA (DevOps Research and Assessment) metrics measure software delivery performance. Research across thousands of organizations has shown that these four metrics correlate strongly with both organizational performance and operational reliability.
How often your team deploys to production. High-performing teams deploy on demand, often multiple times per day. Lower-performing teams deploy weekly or monthly.
Frequent deployments mean smaller changesets, which are easier to review, test, and roll back if something goes wrong. A team that deploys once a month ships large, risky batches. A team that deploys 10 times a day ships small, low-risk increments.
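Deployment frequency is easy to derive from a deploy log. A minimal sketch, assuming a hypothetical list of deploy dates:

```python
from datetime import date

# Hypothetical deploy log: dates on which production deploys happened
# during a one-week window (invented sample data).
deploys = [date(2024, 5, d) for d in (1, 1, 2, 3, 3, 3, 6, 7, 8, 8)]

days_in_window = 7
per_day = len(deploys) / days_in_window
print(f"{per_day:.1f} deploys per day over the window")
```

In practice the deploy log would come from your CI/CD system rather than a hardcoded list.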
The time from code commit to code running in production. This includes code review, CI/CD pipeline execution, staging validation, and deployment.
Short lead times mean your team can fix problems quickly. If a customer reports a bug and the fix takes 15 minutes to write but 3 days to reach production, your effective response time is 3 days.
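Lead time can be computed from commit and deploy timestamps. A sketch with invented data (a median is often preferred over a mean here, since one stuck change can skew the average):

```python
from datetime import datetime, timedelta

# Hypothetical (commit time, production deploy time) pairs for three changes.
changes = [
    (datetime(2024, 5, 1, 9, 0), datetime(2024, 5, 1, 10, 30)),  # 1.5 h
    (datetime(2024, 5, 1, 14, 0), datetime(2024, 5, 2, 9, 0)),   # 19 h
    (datetime(2024, 5, 2, 11, 0), datetime(2024, 5, 2, 12, 0)),  # 1 h
]

lead_times = [deployed - committed for committed, deployed in changes]
median = sorted(lead_times)[len(lead_times) // 2]
print(f"median lead time: {median}")
```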
The percentage of deployments that cause a failure in production (an outage, a rollback, or a hotfix). High-performing teams keep this below 15%.
A high change failure rate indicates gaps in testing, code review, or deployment practices. Track which types of changes fail most often and invest in preventing those specific failure modes.
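Change failure rate is a simple ratio; here is a sketch with invented numbers:

```python
# Change failure rate: the share of deploys that caused an outage,
# a rollback, or a hotfix. Counts below are made-up sample data.

def change_failure_rate(deploys: int, failed: int) -> float:
    """Return the change failure rate as a percentage."""
    return 100.0 * failed / deploys

rate = change_failure_rate(deploys=40, failed=3)
print(f"change failure rate: {rate:.1f}%")  # 7.5%, under the 15% threshold
```

The hard part is not the arithmetic but the bookkeeping: consistently tagging which deploys required a rollback or hotfix.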
The average time from when a failure is detected to when the service is restored. This metric encompasses your entire incident response pipeline: detection, alerting, acknowledgment, investigation, and resolution.
MTTR is where all the practices from this guide come together. Fast detection (Chapter 1), reliable alerting (Chapter 4), effective on-call (Chapter 5), and structured incident response (Chapter 6) all reduce your MTTR.
High-performing teams recover from failures in under one hour. Lower-performing teams may take days.
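MTTR is the mean of your incident durations, detection to restoration. A sketch over hypothetical incidents:

```python
from datetime import timedelta

# Hypothetical incident durations, from detection to restoration.
durations = [
    timedelta(minutes=12),
    timedelta(minutes=45),
    timedelta(hours=2),
]

mttr = sum(durations, timedelta()) / len(durations)
print(f"MTTR: {mttr}")
```

In a real pipeline the durations would be derived from incident records (detected-at and resolved-at timestamps) rather than hardcoded.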
Metrics tell you where you stand. Postmortems tell you how to get better.
After every significant incident, your postmortem should produce concrete action items that address the root cause. These action items typically fall into a few categories: detection improvements (better monitoring and alerting), prevention (eliminating the failure mode itself), and mitigation (faster recovery paths such as rollbacks and runbooks).
Track these action items and measure completion rates. A postmortem process that generates action items but never completes them provides little value. Review open action items weekly and treat them with the same priority as feature work.
Start by defining SLIs for your most important services. Set initial SLOs based on your current performance data, giving yourself a small buffer. Calculate your error budgets and share them with the team.
Add DORA metrics to your team dashboard. Review all metrics monthly and look for trends. If MTTR is increasing, investigate whether the problem is in detection, alerting, or response. If change failure rate is climbing, look at your testing and review practices.
Reliability is not a destination. It is a practice that improves incrementally through measurement, feedback, and disciplined follow-through on the lessons your incidents teach you.