Check frequency determines how quickly you detect an outage. A 30-second interval means you learn about downtime within 30 seconds. A 5-minute interval means up to 5 minutes of undetected downtime.
The tradeoff is straightforward: shorter intervals give faster detection but generate more data and, if misconfigured, more noise.
Use 30-second intervals for services where every minute of downtime has measurable business impact. Payment processing endpoints, authentication APIs, and checkout flows all fall into this category. If your SLA promises 99.99% uptime, you have less than 53 minutes of allowed downtime per year. Detecting an outage 5 minutes late could consume a significant chunk of that budget.
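The downtime-budget arithmetic is worth making explicit. A minimal sketch (the helper name is ours, not from any library) that converts an SLA percentage into allowed downtime:

```python
# Hypothetical helper: downtime budget implied by an uptime SLA.
def downtime_budget_minutes(sla_percent: float, period_days: int = 365) -> float:
    """Minutes of allowed downtime per period for a given uptime percentage."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)

print(round(downtime_budget_minutes(99.99), 1))  # ~52.6 minutes per year
print(round(downtime_budget_minutes(99.9), 1))   # ~525.6 minutes per year
```

At 99.99%, a single 5-minute detection delay burns nearly a tenth of the annual budget before anyone has even started responding.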
One-minute intervals work well for most production services. They provide a good balance between detection speed and resource usage. Primary web applications, customer-facing APIs, and internal services that other teams depend on typically justify 1-minute checks.
Five-minute checks suit services where brief interruptions are tolerable: staging environments, internal documentation sites, development tooling, and lower-priority marketing pages. They are also appropriate for monitoring third-party dependencies where you have no control over resolution time anyway.
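The three tiers above can be encoded directly in configuration. A sketch, with hypothetical tier names and a default that matches the "most production services" recommendation:

```python
# Hypothetical tier-to-interval mapping based on the guidance above (seconds).
CHECK_INTERVALS = {
    "critical": 30,   # payments, auth, checkout
    "standard": 60,   # primary apps, customer-facing APIs
    "low": 300,       # staging, docs, third-party dependencies
}

def interval_for(tier: str) -> int:
    # Unknown tiers fall back to 1-minute checks, the sensible default.
    return CHECK_INTERVALS.get(tier, 60)

print(interval_for("critical"))  # 30
```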
A single failed check does not mean your service is down. The monitoring server itself might experience a network blip, a regional ISP could have a routing issue, or a transient packet loss event could cause one check to time out while the service is perfectly healthy.
Multi-region confirmation solves this problem. When a check fails from one location, Hyperping immediately retries from additional monitoring regions before raising an alert. Only when multiple independent locations confirm the failure does the system notify your team.
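The confirmation logic can be sketched in a few lines. This is an illustration of the pattern, not Hyperping's actual implementation; the function name and signature are ours:

```python
# Sketch of multi-region confirmation: alert only when a failure seen
# from one region is corroborated by checks from other regions.
def should_alert(primary_failed: bool, confirm_results: list, required: int = 2) -> bool:
    """confirm_results: outcomes from other regions (True = that check failed)."""
    if not primary_failed:
        return False
    confirmed = 1 + sum(1 for failed in confirm_results if failed)
    return confirmed >= required

# A blip seen only from one location is suppressed:
print(should_alert(True, [False, False]))  # False
# The same failure confirmed from two more regions fires the alert:
print(should_alert(True, [True, True]))    # True
```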
This approach significantly reduces false positives. Without multi-region confirmation, teams running 30-second checks on dozens of monitors can expect several false alerts per week. With it, false alerts drop to near zero.

For most services, confirming from 2-3 regions is enough. If your service operates in a single geographic market (say, only North America), you might confirm from two North American locations. For globally distributed services, use regions on different continents.
The tradeoff is alert speed: each confirmation region adds a few seconds of delay to the alert. A check that fails in Frankfurt and gets confirmed from London and New York might take 10-15 seconds before the alert fires. For most teams, this delay is worth the reduction in false alarms.
Monitoring is not just about up-or-down. A service that responds in 8 seconds is technically available but practically broken for users. Response time thresholds let you define performance boundaries and get alerted when your service degrades.
Start by measuring your baseline response times over a normal week. If your API typically responds in 200ms, a threshold of 2 seconds catches genuine degradation without firing on normal variance. Setting it at 300ms would trigger alerts every time a garbage collection pause or a slow database query adds a few hundred milliseconds.
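Deriving the threshold from measured data keeps it honest. A minimal sketch (helper name and the 10x factor are our assumptions, consistent with the 200ms-to-2-seconds example above):

```python
import statistics

# Hypothetical helper: derive an alert threshold from baseline measurements.
def degradation_threshold(samples_ms, factor: float = 10.0) -> float:
    """Set the threshold at a multiple of the typical (median) response time."""
    return statistics.median(samples_ms) * factor

week_of_samples = [180, 195, 200, 210, 220, 205, 190]  # made-up baseline data
print(degradation_threshold(week_of_samples))  # 2000.0 ms for a ~200ms API
```

Using the median rather than the mean keeps a handful of outlier responses from inflating the baseline.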
A reasonable approach:

- Measure typical response times over at least a week of normal traffic.
- Set the threshold at roughly 5-10x the typical response time (a 200ms API gets a 1-2 second threshold).
- Revisit the threshold after major releases or infrastructure changes, since baselines drift.
Requiring multiple consecutive failures before alerting adds another layer of false positive protection. A single slow response might be a one-off. Three consecutive slow responses indicate a real problem.
Most monitoring configurations use 2-3 consecutive failures as the alert trigger. This adds a small delay (if you run 1-minute checks and require 3 failures, you will be alerted after ~3 minutes) but drastically reduces noise.
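The worst-case time-to-alert falls out of simple arithmetic: interval times required consecutive failures, plus confirmation time. A sketch, assuming a ~15-second multi-region confirmation delay as discussed earlier:

```python
# Worst-case detection delay: check interval x consecutive failures,
# plus multi-region confirmation time (assumed ~15 seconds here).
def time_to_alert(interval_s: int, consecutive: int, confirm_s: int = 15) -> int:
    return interval_s * consecutive + confirm_s

print(time_to_alert(60, 3))  # 195 seconds, roughly the ~3 minutes above
print(time_to_alert(30, 2))  # 75 seconds for a critical 30-second monitor
```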
Timeouts define how long the monitoring system waits for a response before marking a check as failed. Setting this correctly prevents two problems: timeouts that are too short cause false failures on legitimately slow responses, while timeouts that are too long delay outage detection.
For HTTP checks on web applications, a timeout of 10-30 seconds covers most scenarios. API endpoints that should respond quickly might use a 5-second timeout. Health check endpoints, which typically return a simple JSON response, can use even shorter timeouts of 3-5 seconds.
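A check with a per-endpoint-type timeout can be sketched with the standard library. The timeout tiers mirror the recommendations above; the function and tier names are ours:

```python
import time
import urllib.error
import urllib.request

# Hypothetical timeout tiers from the guidance above (seconds).
TIMEOUTS = {"web": 30, "api": 5, "health": 3}

def run_check(url: str, kind: str = "web"):
    """Return (ok, elapsed_seconds); a timeout or connection error is a failed check."""
    timeout = TIMEOUTS.get(kind, 30)
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 400
    except (urllib.error.URLError, TimeoutError):
        ok = False
    return ok, time.monotonic() - start
```

In a real monitor you would also record the HTTP status and elapsed time for the response-time thresholds discussed earlier.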
If a service legitimately takes 20 seconds to respond, you have a performance problem, not a monitoring configuration problem. Fix the service first.
For a standard production web application, start with:

- Check interval: 1 minute
- Timeout: 10 seconds
- Multi-region confirmation: 2-3 locations
- Consecutive failures before alerting: 2-3
- Response time threshold: roughly 10x your measured baseline
Adjust from there based on your SLA commitments and how much noise your team can tolerate. The next chapter covers what happens after a check fails: alerting and escalation policies.