Alerting and Escalation Policies

Getting alerts to the right people at the right time

Detection without notification is useless. Your monitoring system might know your API went down 30 seconds ago, but if that information sits in a dashboard nobody is watching, you have the same outcome as having no monitoring at all.

Good alerting answers three questions: Who should know? How urgently? And through which channel?

Alert channels

Different situations call for different delivery methods. A minor performance degradation does not need a phone call at 2 AM, and a complete production outage should not rely on an email that gets read three hours later.

Email

Email works for low-urgency notifications and daily or weekly summary reports. It is the weakest alert channel because most people do not check email in real time. Use email for informational alerts like SSL certificates expiring in 30 days or scheduled maintenance reminders.

SMS

Text messages reach people faster than email and work even when the recipient is away from their computer. SMS is a good secondary channel for urgent alerts. Keep SMS messages short and include the essential information: what failed, when, and a link to the incident.
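As a rough illustration, a compact SMS body can be assembled from exactly those three fields. The helper below is hypothetical (not a Hyperping API), and the 160-character cap simply keeps the message within a single standard SMS segment.

```python
def format_sms_alert(monitor: str, failed_at: str, incident_url: str) -> str:
    """Build a short SMS alert: what failed, when, and a link to the incident."""
    body = f"DOWN: {monitor} at {failed_at} {incident_url}"
    # Keep the message inside one 160-character SMS segment.
    return body[:160]

# format_sms_alert("api.example.com", "02:14 UTC", "https://status.example.com/i/42")
# -> "DOWN: api.example.com at 02:14 UTC https://status.example.com/i/42"
```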

Push notifications

Mobile push notifications through a monitoring app provide a balance between urgency and convenience. They are less intrusive than phone calls but more immediate than email. Push notifications work well as the primary channel during business hours.

Phone calls

Phone call alerts are the last resort for critical incidents. A ringing phone wakes people up and demands immediate attention. Reserve phone calls for situations that require human intervention within minutes: complete service outages, security incidents, and SLA-threatening events.

Slack and team messaging

Integrating alerts into Slack or Microsoft Teams channels keeps the whole team informed and creates a space for real-time incident coordination. Channel alerts work well for visibility, but they should not be the only notification method for critical issues. Messages in busy channels get buried.
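For reference, posting an alert into a channel through a Slack incoming webhook is a single HTTP POST with a JSON body. The webhook URL below is a placeholder, and the message is a minimal sketch rather than a full incident-coordination workflow.

```python
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder URL

def post_alert_to_slack(message: str) -> None:
    """Send a plain-text alert message to a Slack channel via an incoming webhook."""
    payload = json.dumps({"text": message}).encode("utf-8")
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)

# post_alert_to_slack("api.example.com is DOWN - https://status.example.com/i/42")
```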

Webhooks

Webhooks let you pipe alerts into any system: PagerDuty, Opsgenie, custom dashboards, ticketing systems, or internal automation pipelines. For teams with existing incident management workflows, webhooks are often the primary integration point.
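To sketch the receiving side, here is a minimal webhook handler that accepts an alert payload and hands it to whatever comes next in your workflow. The payload fields (`monitor`, `status`, `url`) are assumptions for illustration, not Hyperping's actual webhook schema; adapt them to your provider.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class AlertWebhookHandler(BaseHTTPRequestHandler):
    """Receive alert webhooks and hand them to an internal workflow."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        alert = json.loads(self.rfile.read(length))
        # Field names here are illustrative; match them to your provider's payload.
        print(f"Alert for {alert.get('monitor')}: {alert.get('status')} ({alert.get('url')})")
        # From here you might open a ticket, page someone, or trigger automation.
        self.send_response(204)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), AlertWebhookHandler).serve_forever()
```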

Designing escalation policies

An escalation policy defines what happens when an alert goes unacknowledged. If the first responder does not acknowledge the alert within a defined window, the system escalates to the next person or team.

Escalation tiers

A typical three-tier escalation policy:

  1. Tier 1: The on-call engineer receives an SMS and push notification. They have 5 minutes to acknowledge.
  2. Tier 2: If unacknowledged, the alert escalates to the secondary on-call engineer and the team lead via SMS and phone call. They have 10 minutes.
  3. Tier 3: If still unacknowledged, the engineering manager and VP of Engineering receive a phone call. At this point, the incident is treated as a high-severity event.

The acknowledgment window at each tier depends on the expected response time and the severity of the incident. Five minutes is reasonable for a critical production outage at Tier 1. Fifteen minutes might work for a degraded-but-functional service.
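Expressed as data, the policy above is just an ordered list of tiers, each with contacts, channels, and an acknowledgment window. The sketch below shows one hypothetical way a paging system might walk through it; it is not any vendor's configuration format, and the `notify` and `acknowledged` callables are placeholders for your own integrations.

```python
import time
from dataclasses import dataclass

@dataclass
class Tier:
    contacts: list[str]
    channels: list[str]        # e.g. "sms", "push", "phone"
    ack_window_minutes: int

# The three-tier policy described above, as data.
ESCALATION_POLICY = [
    Tier(["oncall-primary"], ["sms", "push"], 5),
    Tier(["oncall-secondary", "team-lead"], ["sms", "phone"], 10),
    Tier(["eng-manager", "vp-engineering"], ["phone"], 10),
]

def escalate(alert_id: str, notify, acknowledged) -> bool:
    """Walk the tiers until someone acknowledges or the policy is exhausted."""
    for tier in ESCALATION_POLICY:
        for contact in tier.contacts:
            for channel in tier.channels:
                notify(contact, channel, alert_id)
        deadline = time.time() + tier.ack_window_minutes * 60
        while time.time() < deadline:
            if acknowledged(alert_id):
                return True
            time.sleep(10)
    return False  # unacknowledged through every tier: treat as high-severity
```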

Severity levels

Not every alert deserves the same escalation path. Define severity levels that map to your business impact:

| Severity | Criteria | Response target | Example |
| --- | --- | --- | --- |
| Critical | Complete outage, data loss risk | 5 minutes | Primary database unreachable |
| High | Major feature broken, significant user impact | 15 minutes | Payment processing failing |
| Medium | Degraded performance, partial functionality | 1 hour | Response times 3x above normal |
| Low | Minor issue, no immediate user impact | Next business day | Staging environment down |

Map each severity level to specific alert channels and escalation timelines. Critical alerts get phone calls and aggressive escalation. Low-severity alerts get a Slack message and an email.
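One way to keep that mapping explicit and reviewable is a small lookup table in code or configuration. The channel names and timings below simply restate the severity table above and are illustrative, not a prescribed format.

```python
# Illustrative routing table: severity -> channels and escalation behaviour.
SEVERITY_ROUTING = {
    "critical": {"channels": ["phone", "sms", "push"], "ack_window_minutes": 5, "escalates": True},
    "high": {"channels": ["sms", "push", "slack"], "ack_window_minutes": 15, "escalates": True},
    "medium": {"channels": ["slack", "email"], "ack_window_minutes": 60, "escalates": False},
    "low": {"channels": ["email"], "ack_window_minutes": None, "escalates": False},
}

def channels_for(severity: str) -> list[str]:
    """Look up which channels an alert of this severity should use."""
    return SEVERITY_ROUTING[severity]["channels"]
```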

Alert grouping and deduplication

When a major outage takes down 15 monitors simultaneously, your team does not need 15 separate alert notifications. Alert grouping consolidates related alerts into a single incident notification.

Group alerts by:

  • Service or project: All monitors for the same application grouped together
  • Root cause: A DNS failure affecting multiple endpoints treated as one incident
  • Time window: Alerts firing within the same 60-second window merged into a single notification

Deduplication prevents repeated notifications for the same ongoing issue. If a service stays down for 30 minutes, the on-call engineer should receive one alert (with optional periodic reminders), not 30 separate notifications.
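A minimal deduplication scheme keys each incoming alert on its service (or suspected root cause): the first alert opens an incident and notifies, later alerts inside the grouping window join it silently, and an ongoing incident only re-notifies as a periodic reminder. The 60-second window and reminder interval below are placeholder values, and the whole sketch assumes an in-memory store rather than a production-grade one.

```python
import time
from typing import Optional

GROUP_WINDOW_SECONDS = 60         # monitors failing within this window join the same incident
REMINDER_INTERVAL_SECONDS = 1800  # optional reminder cadence for an ongoing incident

# incident key (e.g. service name) -> (opened_at, last_notified_at)
_open_incidents: dict[str, tuple[float, float]] = {}

def should_notify(service: str, now: Optional[float] = None) -> bool:
    """Return True if this alert should notify, False if it deduplicates into an open incident."""
    now = now if now is not None else time.time()
    if service not in _open_incidents:
        # First alert for this service: open an incident and notify.
        _open_incidents[service] = (now, now)
        return True
    opened_at, last_notified_at = _open_incidents[service]
    if now - opened_at <= GROUP_WINDOW_SECONDS:
        # Another alert in the same window joins the existing incident silently.
        return False
    if now - last_notified_at >= REMINDER_INTERVAL_SECONDS:
        # Ongoing incident: send a periodic reminder instead of a fresh alert.
        _open_incidents[service] = (opened_at, now)
        return True
    return False
```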

Avoiding alert fatigue

Alert fatigue happens when teams receive so many notifications that they start ignoring them. This is one of the most dangerous patterns in operations: the team that ignores alerts because they are usually false positives is the team that misses a real outage.

To prevent alert fatigue:

  • Tune your monitoring thresholds regularly (see Chapter 3)
  • Treat every false positive as a bug in your monitoring configuration
  • Review alert volume monthly and look for monitors that trigger frequently without requiring action
  • Use severity levels strictly, so critical alerts remain rare and meaningful
  • Route low-priority alerts to dashboards or digest emails, not real-time channels

Hyperping's escalation policies let you configure multi-tier notification flows, ensuring alerts reach the right person through the right channel at each stage.

The next chapter covers how to structure on-call schedules so your team can respond to these alerts effectively.
