Check frequency determines how quickly you detect an outage. A 30-second interval means you learn about downtime within 30 seconds. A 5-minute interval means up to 5 minutes of undetected downtime.
The tradeoff is straightforward: shorter intervals give faster detection but generate more data and, if misconfigured, more noise.
Use 30-second intervals for services where every minute of downtime has measurable business impact. Payment processing endpoints, authentication APIs, and checkout flows all fall into this category. If your SLA promises 99.99% uptime, you have less than 53 minutes of allowed downtime per year. Detecting an outage 5 minutes late could consume a significant chunk of that budget.
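The downtime-budget arithmetic is worth making explicit. A minimal sketch (the helper name is ours, not from any library) that converts an SLA percentage into allowed downtime:

```python
# Hypothetical helper: downtime budget implied by an uptime SLA.
def downtime_budget_minutes(sla_percent: float, period_days: int = 365) -> float:
    """Minutes of allowed downtime per period for a given uptime percentage."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)

print(round(downtime_budget_minutes(99.99), 1))  # ~52.6 minutes per year
print(round(downtime_budget_minutes(99.9), 1))   # ~525.6 minutes per year
```

At 99.99%, a single 5-minute detection delay burns nearly a tenth of the annual budget before anyone has even started responding.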
One-minute intervals work well for most production services. They provide a good balance between detection speed and resource usage. Primary web applications, customer-facing APIs, and internal services that other teams depend on typically justify 1-minute checks.
Five-minute checks suit services where brief interruptions are tolerable: staging environments, internal documentation sites, development tooling, and lower-priority marketing pages. They are also appropriate for monitoring third-party dependencies where you have no control over resolution time anyway.
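The three tiers above can be encoded directly in configuration. A sketch, with hypothetical tier names and a default that matches the "most production services" recommendation:

```python
# Hypothetical tier-to-interval mapping based on the guidance above (seconds).
CHECK_INTERVALS = {
    "critical": 30,   # payments, auth, checkout
    "standard": 60,   # primary apps, customer-facing APIs
    "low": 300,       # staging, docs, third-party dependencies
}

def interval_for(tier: str) -> int:
    # Unknown tiers fall back to 1-minute checks, the sensible default.
    return CHECK_INTERVALS.get(tier, 60)

print(interval_for("critical"))  # 30
```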
A single failed check does not mean your service is down. The monitoring server itself might experience a network blip, a regional ISP could have a routing issue, or a transient packet loss event could cause one check to time out while the service is perfectly healthy.
Multi-region confirmation solves this problem. When a check fails from one location, Hyperping immediately retries from additional monitoring regions before raising an alert. Only when multiple independent locations confirm the failure does the system notify your team.
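The confirmation logic can be sketched in a few lines. This is an illustration of the pattern, not Hyperping's actual implementation; the function name and signature are ours:

```python
# Sketch of multi-region confirmation: alert only when a failure seen
# from one region is corroborated by checks from other regions.
def should_alert(primary_failed: bool, confirm_results: list, required: int = 2) -> bool:
    """confirm_results: outcomes from other regions (True = that check failed)."""
    if not primary_failed:
        return False
    confirmed = 1 + sum(1 for failed in confirm_results if failed)
    return confirmed >= required

# A blip seen only from one location is suppressed:
print(should_alert(True, [False, False]))  # False
# The same failure confirmed from two more regions fires the alert:
print(should_alert(True, [True, True]))    # True
```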
This approach significantly reduces false positives. Without multi-region confirmation, teams running 30-second checks on dozens of monitors can expect several false alerts per week. With it, false alerts drop to near zero.

For most services, confirming from 2-3 regions is enough. If your service operates in a single geographic market (say, only North America), you might confirm from two North American locations. For globally distributed services, use regions on different continents.
The tradeoff is alert speed: each confirmation region adds a few seconds of delay to the alert. A check that fails in Frankfurt and gets confirmed from London and New York might take 10-15 seconds before the alert fires. For most teams, this delay is worth the reduction in false alarms.
Monitoring is not just about up-or-down. A service that responds in 8 seconds is technically available but practically broken for users. Response time thresholds let you define performance boundaries and get alerted when your service degrades.
Start by measuring your baseline response times over a normal week. If your API typically responds in 200ms, a threshold of 2 seconds catches genuine degradation without firing on normal variance. Setting it at 300ms would trigger alerts every time a garbage collection pause or a slow database query adds a few hundred milliseconds.
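Deriving the threshold from measured data keeps it honest. A minimal sketch (helper name and the 10x factor are our assumptions, consistent with the 200ms-to-2-seconds example above):

```python
import statistics

# Hypothetical helper: derive an alert threshold from baseline measurements.
def degradation_threshold(samples_ms, factor: float = 10.0) -> float:
    """Set the threshold at a multiple of the typical (median) response time."""
    return statistics.median(samples_ms) * factor

week_of_samples = [180, 195, 200, 210, 220, 205, 190]  # made-up baseline data
print(degradation_threshold(week_of_samples))  # 2000.0 ms for a ~200ms API
```

Using the median rather than the mean keeps a handful of outlier responses from inflating the baseline.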
A reasonable approach:

- Measure typical response times over at least a week of normal traffic.
- Set the threshold at roughly 5-10x the typical response time (a 200ms API gets a 1-2 second threshold).
- Revisit the threshold after major releases or infrastructure changes, since baselines drift.
Requiring multiple consecutive failures before alerting adds another layer of false positive protection. A single slow response might be a one-off. Three consecutive slow responses indicate a real problem.
Most monitoring configurations use 2-3 consecutive failures as the alert trigger. This adds a small delay (if you run 1-minute checks and require 3 failures, you will be alerted after ~3 minutes) but drastically reduces noise.
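The worst-case time-to-alert falls out of simple arithmetic: interval times required consecutive failures, plus confirmation time. A sketch, assuming a ~15-second multi-region confirmation delay as discussed earlier:

```python
# Worst-case detection delay: check interval x consecutive failures,
# plus multi-region confirmation time (assumed ~15 seconds here).
def time_to_alert(interval_s: int, consecutive: int, confirm_s: int = 15) -> int:
    return interval_s * consecutive + confirm_s

print(time_to_alert(60, 3))  # 195 seconds, roughly the ~3 minutes above
print(time_to_alert(30, 2))  # 75 seconds for a critical 30-second monitor
```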
Timeouts define how long the monitoring system waits for a response before marking a check as failed. Setting this correctly prevents two problems: timeouts that are too short cause false failures on legitimately slow responses, while timeouts that are too long delay outage detection.
For HTTP checks on web applications, a timeout of 10-30 seconds covers most scenarios. API endpoints that should respond quickly might use a 5-second timeout. Health check endpoints, which typically return a simple JSON response, can use even shorter timeouts of 3-5 seconds.
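A check with a per-endpoint-type timeout can be sketched with the standard library. The timeout tiers mirror the recommendations above; the function and tier names are ours:

```python
import time
import urllib.error
import urllib.request

# Hypothetical timeout tiers from the guidance above (seconds).
TIMEOUTS = {"web": 30, "api": 5, "health": 3}

def run_check(url: str, kind: str = "web"):
    """Return (ok, elapsed_seconds); a timeout or connection error is a failed check."""
    timeout = TIMEOUTS.get(kind, 30)
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 400
    except (urllib.error.URLError, TimeoutError):
        ok = False
    return ok, time.monotonic() - start
```

In a real monitor you would also record the HTTP status and elapsed time for the response-time thresholds discussed earlier.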
If a service legitimately takes 20 seconds to respond, you have a performance problem, not a monitoring configuration problem. Fix the service first.
For a standard production web application, start with:

- Check interval: 1 minute
- Timeout: 10 seconds
- Multi-region confirmation: 2-3 locations
- Consecutive failures before alerting: 2-3
- Response time threshold: roughly 10x your measured baseline
Adjust from there based on your SLA commitments and how much noise your team can tolerate. The next chapter covers what happens after a check fails: alerting and escalation policies.