When an alert fires at 3 AM, someone specific needs to be responsible for responding. Without a clear on-call schedule, alerts get ignored, response times balloon, and the same few people end up handling every incident because they feel personally responsible.
A well-designed on-call schedule distributes this responsibility fairly across the team, sets clear expectations about who responds when, and gives everyone predictable time off.
In a weekly rotation, one engineer is on-call for an entire week before handing off to the next person. This is the most common pattern and works well for teams of 4-8 engineers.
The advantage of weekly rotations is simplicity. Anyone can see who is on-call by checking the schedule. Handoffs happen at a predictable time (typically Monday morning), and the on-call person has a full week to build context on any ongoing issues.
The downside: a week of on-call can be draining, especially if it includes a weekend and the service is noisy. Teams with high alert volumes may find that weekly rotations lead to burnout faster than shorter cycles.
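As a rough illustration, the weekly pattern can be computed from nothing more than an ordered list of engineers and the date the rotation began. The sketch below is a minimal example, not a description of any particular tool: the function name, the team names, and the Monday start date are all assumptions made for illustration.

```python
from datetime import date

def weekly_on_call(engineers, rotation_start, day):
    """Return who is on-call on `day` in a simple weekly rotation.

    `rotation_start` is assumed to be the Monday on which
    engineers[0] began the first shift.
    """
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    weeks_elapsed = (day - rotation_start).days // 7
    # Wrap around to the first engineer once everyone has had a turn.
    return engineers[weeks_elapsed % len(engineers)]

team = ["ana", "ben", "chloe", "dev"]  # hypothetical team
start = date(2024, 1, 1)               # a Monday

print(weekly_on_call(team, start, date(2024, 1, 10)))  # ben
```

With four engineers, each person is on-call one week in four and the cycle repeats every 28 days.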
Daily rotations assign on-call duty for a single day at a time. This reduces the burden of any single shift but increases the frequency of handoffs. Each handoff is a potential point of lost context if the previous on-call was investigating an ongoing issue.
Daily rotations work best for larger teams (8+ engineers) where the reduced per-person burden outweighs the handoff overhead.
Follow-the-sun rotations assign on-call based on geographic timezone. The team in Europe handles alerts during European business hours, then hands off to the US team, which later hands off to an Asia-Pacific team.
This pattern eliminates nighttime on-call entirely for each region. Nobody gets woken up at 3 AM because it is always business hours somewhere in the rotation. The tradeoff is that you need team members in at least two, and ideally three, widely separated timezones (with only two regions, each one covers a 12-hour shift that extends beyond normal business hours), and handoffs between regions require clear communication about ongoing incidents.
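One way to picture follow-the-sun routing is as a lookup from the current UTC hour to the region on shift. The sketch below assumes three equal 8-hour shifts; the shift boundaries and region labels are hypothetical, and a real schedule would choose them to match each team's actual working hours.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical 8-hour shifts in UTC: (start_hour, end_hour, region).
SHIFTS = [
    (0, 8, "asia-pacific"),
    (8, 16, "europe"),
    (16, 24, "us"),
]

def region_on_call(moment):
    """Return the region covering `moment`, which must be timezone-aware."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    hour = moment.astimezone(timezone.utc).hour
    for start, end, region in SHIFTS:
        if start <= hour < end:
            return region
    raise RuntimeError("shift table does not cover all 24 hours")

# 3 AM at UTC-5 is 08:00 UTC, so the alert goes to Europe,
# not to someone asleep in the US.
us_east = timezone(timedelta(hours=-5))
print(region_on_call(datetime(2024, 1, 1, 3, 0, tzinfo=us_east)))  # europe
```

Converting to UTC before the lookup keeps the routing decision independent of where the alert originated.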
Most on-call schedules assign two roles per shift:
The primary responder receives the initial alert and is expected to acknowledge and begin investigating within the defined response window (typically 5-15 minutes for critical incidents).
The secondary responder acts as a backup. If the primary does not acknowledge the alert within the escalation window, the alert automatically routes to the secondary. The secondary also provides support during complex incidents that require more than one person.
Having a secondary responder prevents single points of failure in your incident response. If the primary is in an area without cell service, dealing with a personal emergency, or already handling another incident, the secondary picks up the slack.
Hyperping's escalation policies let you configure primary and secondary responders with automatic escalation timelines.
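The primary-to-secondary handoff described above boils down to a small decision rule. The following sketch is tool-agnostic and does not represent Hyperping's configuration or API; the `Alert` class, the function name, and the 10-minute default escalation window are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class Alert:
    fired_at: datetime
    acknowledged_at: Optional[datetime] = None

def current_responder(alert, now, primary, secondary,
                      escalation_window=timedelta(minutes=10)):
    """Return who should be paged at `now`, or None once acknowledged."""
    if alert.acknowledged_at is not None and alert.acknowledged_at <= now:
        return None  # someone has the alert; stop paging
    if now - alert.fired_at >= escalation_window:
        return secondary  # primary missed the window, so escalate
    return primary

fired = datetime(2024, 1, 1, 3, 0)
alert = Alert(fired_at=fired)
print(current_responder(alert, fired + timedelta(minutes=4), "ana", "ben"))   # ana
print(current_responder(alert, fired + timedelta(minutes=12), "ana", "ben"))  # ben
```

A real system would typically keep paging the primary alongside the secondary and might add further escalation tiers; this sketch models only the single handoff.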
Schedules need flexibility. People take vacations, get sick, attend conferences, or occasionally need to swap shifts. A good on-call system supports overrides: temporary changes to the schedule for a specific time window.
When an engineer requests time off, they should be able to offer their shift to a teammate or have a manager reassign it. The override should be visible in the schedule so everyone knows who is actually responsible during that window.
Track overrides in the same tool that manages the schedule. Relying on informal Slack messages to communicate on-call swaps leads to gaps where nobody is covering.
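Conceptually, an override is a time window that takes precedence over the regular schedule. The sketch below shows one minimal way to model that; the `Override` class and `effective_on_call` function are hypothetical names, and the example assumes overrides do not overlap.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Override:
    start: datetime  # inclusive
    end: datetime    # exclusive
    engineer: str

def effective_on_call(scheduled, overrides, moment):
    """Return the override engineer if one covers `moment`,
    otherwise the regularly scheduled engineer."""
    for override in overrides:
        if override.start <= moment < override.end:
            return override.engineer
    return scheduled

# Hypothetical swap: ben covers ana's shift for two days.
overrides = [Override(datetime(2024, 1, 3), datetime(2024, 1, 5), "ben")]

print(effective_on_call("ana", overrides, datetime(2024, 1, 4, 2, 0)))  # ben
print(effective_on_call("ana", overrides, datetime(2024, 1, 5, 2, 0)))  # ana
```

Because the override lives next to the schedule, the question "who is actually on-call right now?" always has a single answer.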
On-call burnout is a real retention risk. Engineers who feel trapped in a relentless cycle of nighttime pages and weekend interruptions will leave. Several practices help:
Limit on-call frequency. No one should be on-call more than one week out of every four. If your team is too small to achieve this, that is a staffing signal, not an argument for overloading the existing team.
Reduce alert noise. Every false positive and low-priority alert that pages the on-call person during off-hours erodes goodwill. Invest in tuning thresholds and eliminating noisy monitors (see Chapter 3 and Chapter 4).
Provide compensatory time. If someone handles a 2 AM incident that takes two hours to resolve, they should not be expected to start their normal workday at 9 AM. Give explicit permission and encouragement to take comp time after disruptive on-call events.
Compensate on-call time. Whether through additional pay, extra PTO, or other benefits, acknowledging that on-call work is real work outside normal hours goes a long way toward maintaining team morale.
Conduct regular reviews. Review on-call load monthly. Look at how many pages each person received, how many were actionable, and how many were during off-hours. Use this data to identify improvements in monitoring configuration and staffing.
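The monthly review can start from a simple per-person tally of pages. The sketch below is one possible shape for that report, assuming you can export each page with the engineer paged, a timestamp in that engineer's local time, and whether it was actionable. The field names and the 9-to-18 weekday definition of business hours are assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Page:
    engineer: str
    at: datetime      # assumed to be the engineer's local time
    actionable: bool

def is_off_hours(at, start_hour=9, end_hour=18):
    """True on weekends or outside the assumed business hours."""
    return at.weekday() >= 5 or not (start_hour <= at.hour < end_hour)

def summarize(pages):
    """Count total, actionable, and off-hours pages per engineer."""
    summary = defaultdict(lambda: {"pages": 0, "actionable": 0, "off_hours": 0})
    for page in pages:
        row = summary[page.engineer]
        row["pages"] += 1
        row["actionable"] += int(page.actionable)
        row["off_hours"] += int(is_off_hours(page.at))
    return dict(summary)

pages = [  # hypothetical data
    Page("ana", datetime(2024, 1, 2, 3, 0), True),    # Tuesday, 3 AM
    Page("ana", datetime(2024, 1, 2, 14, 0), False),  # Tuesday afternoon
    Page("ben", datetime(2024, 1, 6, 12, 0), True),   # Saturday
]
report = summarize(pages)
print(report["ana"])  # {'pages': 2, 'actionable': 1, 'off_hours': 1}
```

A low ratio of actionable pages points at noisy monitors, while a large gap in page counts between engineers points at an uneven schedule.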
The next chapter covers what to do when an alert triggers a real incident: the incident response workflow.