Most engineers hear "on-call rotation" and imagine sleepless nights fixing someone else's broken code at 3 AM. That's not wrong, but it's incomplete.
Done right, on-call accelerates your engineering skills faster than any bootcamp or course. Done wrong, it burns out your best people and creates a culture of fear around production.
The difference comes down to structure, not luck.
TL;DR:
- On-call accelerates engineering skills when structured properly
- Clear ownership plus good alerts plus fair scheduling equals sustainability
- Compensation and psychological safety prevent burnout
- Track metrics to identify and fix problems systematically
- Frame as growth opportunity, not punishment duty
Why on-call exists in the first place
Your systems will break. That's not pessimism, it's physics. Complex distributed systems have emergent behaviors that no amount of testing catches beforehand.
The question becomes: who fixes it, and how fast?
Traditional operations teams created a buffer between developers and production. Developers wrote code, ops teams ran it. When things broke, Level 1 support tried basic fixes, escalated to Level 2, then Level 3, and eventually maybe someone who understood the code got involved.
This model optimized for cost, not speed. But cost optimization assumes downtime is cheap.
It's not. Not anymore.
The 5-minute rule
Five minutes of downtime used to be acceptable. Now it triggers customer complaints on social media within seconds, and modern users treat 99.99% uptime as the baseline.
Three things changed:
- Customer expectations shifted. Downtime that once went unnoticed now becomes a public incident within minutes.
- Systems became more interconnected. A small failure cascades into bigger failures faster.
- Revenue models changed. SaaS companies lose money every minute they're down, not just reputation. Use our downtime cost calculator to estimate your potential losses.
Critical statistic
According to ITIC's 2024 report, downtime costs exceed $300,000 per hour for 90% of enterprises. For some organizations, it reaches $5 million per hour.
The economics flipped. Paying senior engineers to be on-call costs less than losing customers to downtime. Modern on-call management practices evolved from this economic reality.
You build it, you run it
Origin story
Werner Vogels, Amazon CTO, coined this philosophy in 2006: "Giving developers operational responsibilities has greatly enhanced the quality of the services."
The logic behind "you build it, you run it" is straightforward: the person who wrote the code is the person best equipped to debug it quickly.
There's a secondary benefit most people miss: skin in the game changes how you write code.
When you know you'll get paged for sloppy error handling, you write better error handling. When you know you'll debug your own database queries at midnight, you optimize them properly the first time.
On-call creates a feedback loop between development practices and operational reality. That loop makes better engineers.
What makes on-call sustainable
The horror stories about on-call usually involve the same problems:
- No clear ownership of services
- Terrible alert hygiene
- No escalation paths
- Burnout from constant interruptions
Fix these systematically and on-call becomes manageable. Start with proven incident management best practices.
Clear service ownership
Every service needs an obvious owner. Not just a team name, but a specific group of people with names and contact information.
Use labels in Kubernetes, tags in AWS, or whatever system you have. The point is: when something breaks, you know exactly who to wake up. Consider implementing uptime monitoring for each service with clear ownership tags.
Update ownership information regularly. Teams change, people leave, services get handed off. Stale ownership information means paging the wrong person or, worse, paging nobody.
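Keeping ownership fresh is easier to automate than to remember. Here's a minimal sketch in Python, assuming a hypothetical ownership.json registry with service, owners, and last_reviewed fields; the 90-day review window is an arbitrary example, not a recommendation.

```python
import json
from datetime import datetime, timedelta, timezone

# Hypothetical registry format, one entry per service:
# [{"service": "billing-api", "owners": ["alice", "bob"],
#   "pager_contact": "#billing-oncall", "last_reviewed": "2025-01-15"}]
MAX_AGE = timedelta(days=90)  # arbitrary review window

def stale_entries(path="ownership.json"):
    """Return services whose ownership record has not been reviewed recently."""
    with open(path) as f:
        registry = json.load(f)
    now = datetime.now(timezone.utc)
    stale = []
    for entry in registry:
        reviewed = datetime.fromisoformat(entry["last_reviewed"]).replace(tzinfo=timezone.utc)
        if now - reviewed > MAX_AGE:
            stale.append(entry["service"])
    return stale

if __name__ == "__main__":
    for service in stale_entries():
        print(f"Ownership record for {service} needs review")
```

Run something like this in CI or a weekly cron so stale entries surface before an incident does.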
Alert hygiene
Most on-call problems stem from bad alerts. Too many alerts, too few actionable alerts, alerts that fire for problems that fix themselves. Proper DevOps alert management prevents these issues.
| Alert quality checklist | Why it matters | Red flags |
|---|---|---|
| Every alert requires immediate human action | Prevents alert fatigue | Auto-resolving issues, informational alerts |
| Every alert has a runbook | Speeds resolution | "Figure it out" instructions, missing documentation |
| Every alert connects to an SLO breach | Ensures business impact | No clear customer impact, arbitrary thresholds |
| Every alert gets reviewed after firing | Drives continuous improvement | Recurring false positives, no post-incident review |
If an alert fires and the on-call engineer can't do anything about it, delete the alert. If an alert fires for something that auto-resolves in five minutes, raise the threshold or change the evaluation period.
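That review loop can be partly automated. A rough sketch, assuming you can export recent firings from your alerting tool as (alert name, minutes until resolution, whether a human acted) records; the sample data and thresholds are made up for illustration.

```python
from collections import defaultdict

# Hypothetical export of recent alert firings:
# (alert_name, minutes_until_resolved, human_action_taken)
FIRINGS = [
    ("db-connpool-exhausted", 42, True),
    ("cpu-spike-worker-7", 3, False),
    ("cpu-spike-worker-7", 4, False),
    ("disk-80-percent", 12, False),
]

def audit(firings, auto_resolve_minutes=5, noise_ratio=0.5):
    """Flag alerts that mostly resolve on their own or never need a human."""
    stats = defaultdict(lambda: {"total": 0, "auto": 0, "acted": 0})
    for name, minutes, acted in firings:
        s = stats[name]
        s["total"] += 1
        s["auto"] += minutes <= auto_resolve_minutes
        s["acted"] += acted
    for name, s in stats.items():
        if s["acted"] == 0:
            print(f"{name}: no human action in {s['total']} firings -> delete or demote")
        elif s["auto"] / s["total"] >= noise_ratio:
            print(f"{name}: {s['auto']}/{s['total']} auto-resolved -> raise threshold")

audit(FIRINGS)
```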
Burnout warning
Research shows 53% of developers are considering quitting due to stress and burnout. Alert fatigue is a major contributor.
Alert fatigue kills good engineers. They start ignoring pages, or they leave for companies with better operational hygiene.
Realistic response times
Response time expectations should match the actual impact of the incident. Use our SLA calculator to set appropriate targets.
| Incident type | Target response | Examples |
|---|---|---|
| Revenue-impacting outage | 5 minutes | Payment processing down, login failures |
| Customer-facing degradation | 30 minutes | Slow page loads, minor feature issues |
| Internal tool issues | Next business day | Admin panel bugs, reporting delays |
Setting unrealistic response times for low-impact issues creates stress without creating value. Save the urgency for when it matters.
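One way to keep those expectations consistent is to encode the matrix directly in your paging logic. A minimal illustration; the severity names and targets simply mirror the table above and would differ per team.

```python
# Hypothetical mapping of incident type to response policy.
# "page" wakes someone up; "ticket" waits for business hours.
RESPONSE_POLICY = {
    "revenue_impacting": {"target_minutes": 5, "notify": "page"},
    "customer_degradation": {"target_minutes": 30, "notify": "page"},
    "internal_tooling": {"target_minutes": 24 * 60, "notify": "ticket"},
}

def route(incident_type: str) -> str:
    """Return the notification action and acknowledgement target for an incident."""
    policy = RESPONSE_POLICY[incident_type]
    return (f"{policy['notify']} on-call, acknowledge within "
            f"{policy['target_minutes']} minutes")

print(route("customer_degradation"))  # page on-call, acknowledge within 30 minutes
```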
Escalation that actually works
Primary on-call should be able to escalate quickly to secondary on-call, then to subject matter experts, then to management if needed. Implement clear escalation policies from the start.
Escalation best practice
Google's SRE book recommends: "Escalate early and often. It's better to wake someone unnecessarily than to let an incident spiral out of control."
But escalation policies often fail because they're not tested. Run escalation drills quarterly. Make sure phone numbers work, people respond to pages, and backup systems function. Our escalation policies guide provides detailed drill scenarios.
Document escalation criteria clearly. On-call engineers shouldn't have to guess whether something deserves escalation. Give them explicit guidelines.
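Explicit guidelines can also be encoded so the next responder is never a guess. A toy sketch of a timed escalation chain; the levels and acknowledgement windows are illustrative, not a recommendation.

```python
from dataclasses import dataclass

# Hypothetical escalation chain; names and timeouts are examples only.
@dataclass
class Level:
    name: str
    ack_timeout_minutes: int

ESCALATION_CHAIN = [
    Level("primary on-call", 5),
    Level("secondary on-call", 10),
    Level("service SME", 15),
    Level("engineering manager", 15),
]

def current_responder(minutes_unacknowledged: int) -> str:
    """Walk the chain until the elapsed time falls inside a level's window."""
    elapsed = 0
    for level in ESCALATION_CHAIN:
        elapsed += level.ack_timeout_minutes
        if minutes_unacknowledged < elapsed:
            return level.name
    return "incident commander (all levels exhausted)"

print(current_responder(12))  # secondary on-call
```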
Scheduling that preserves sanity
On-call scheduling seems simple until you try to make it fair across timezones, vacation schedules, personal preferences, and skill levels.
TL;DR:
- Minimum 5 people for 24/7 coverage, 8 for single-site
- Choose shift patterns based on team distribution
- Automate scheduling to ensure fairness
Minimum team sizes
Team size requirements
- 24/7 coverage: 5+ people per site
- Single-site coverage: 8+ people
The math: factor in roughly 20% unavailability from vacation, sick days, and attrition. Below these numbers, individuals get paged too frequently.
You need at least five people per site for sustainable 24/7 coverage, eight people for single-site coverage. Anything less and people burn out from frequency, not from difficulty.
This is math, not opinion. Factor in vacation time, sick days, parental leave, and normal attrition. Smaller teams create unsustainable schedules.
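The arithmetic is worth making explicit. With roughly 20% of the team unavailable at any given time, a small rotation concentrates the load quickly; the figures below just work through that assumption.

```python
# Illustration of why small rotations break down: with ~20% of the team
# unavailable (vacation, sick leave, attrition), the effective pool shrinks
# and each person's share of on-call weeks grows.
def weeks_on_call_per_year(team_size, unavailability=0.20, weeks=52):
    effective = team_size * (1 - unavailability)
    return weeks / effective

for size in (3, 5, 8):
    print(f"{size} people -> ~{weeks_on_call_per_year(size):.0f} on-call weeks "
          f"per person per year")
# 3 people -> ~22 weeks, 5 people -> ~13, 8 people -> ~8
```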
Shift patterns
Different patterns work for different teams:
| Pattern | Best for | Pros | Cons |
|---|---|---|---|
| Follow-the-sun | Global teams with even distribution | No overnight pages, natural handoffs | Requires coordination across regions |
| Weekly rotations | Single-region teams | Deep context retention, fewer handoffs | Full week commitment can be intense |
| Daily rotations | High-incident environments | Shorter commitment periods | More context switching, daily handoffs |
Choose based on your team's timezone distribution and incident patterns. If most of your issues happen during business hours, follow-the-sun makes sense. If issues are random, weekly rotations reduce handoff overhead.
Automation for scheduling
Manual scheduling breaks down as teams grow. Use on-call scheduling tools that handle constraints automatically: vacation days, personal preferences, fair distribution of weekend coverage.
The tool should optimize for fairness, not just coverage. Track metrics like weekend shifts per person, overnight pages per person, total on-call hours per quarter.
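A scheduling tool doesn't need to be sophisticated to beat a spreadsheet. A simplified sketch that assigns weekly shifts round-robin, skips weeks when someone is unavailable, and reports how evenly the load landed; the roster and availability data are invented for the example.

```python
from collections import Counter
from itertools import cycle

# Hypothetical inputs: team roster and the week numbers each person is away.
TEAM = ["alice", "bob", "carol", "dave", "erin"]
UNAVAILABLE = {"bob": {3, 4}, "erin": {1}}

def build_rotation(weeks=12):
    """Assign weekly shifts round-robin, skipping unavailable people,
    and track per-person load so unfairness is visible."""
    schedule, load = {}, Counter()
    order = cycle(TEAM)
    for week in range(1, weeks + 1):
        for _ in range(len(TEAM)):
            person = next(order)
            if week not in UNAVAILABLE.get(person, set()):
                schedule[week] = person
                load[person] += 1
                break
        else:
            schedule[week] = "UNCOVERED"  # surface gaps instead of hiding them
    return schedule, load

schedule, load = build_rotation()
print(schedule)
print("shifts per person:", dict(load))
```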
Training and preparation
| Training phase | Timeline | Activity | Outcomes |
|---|---|---|---|
| Shadow phase | Weeks 1-2 | Join incident calls, observe decisions, read runbooks | Understand incident flow and team dynamics |
| Assisted response | Week 3 | Handle incidents with mentor guidance | Build confidence with a safety net |
| Solo with backup | Week 4 | Primary responder with immediate escalation available | Validate readiness for rotation |
| Monthly drills | Ongoing | Participate in Wheel of Misfortune exercises | Maintain skills, learn from new scenarios |
Throwing someone into on-call without preparation is cruel and ineffective.
Shadow phases
New team members should shadow experienced on-call engineers for at least two weeks. They join incident calls, read along with runbooks, observe decision-making under pressure.
Shadowing works better than classroom training because it shows real decision-making under real constraints. Books can't teach you how to prioritize when three things are broken simultaneously.
Runbook standards
Every alert needs a runbook. Every runbook needs the same structure. Use incident communication templates to standardize responses.
| Runbook section | Purpose | Example content |
|---|---|---|
| What this alert means | Quick context | "Database connection pool exhausted, new queries failing" |
| Immediate steps | Stop the bleeding | "1. Scale connection pool 2. Identify heavy queries 3. Kill long-runners" |
| Verification | Confirm the fix worked | "Check metrics dashboard, verify query queue clearing" |
| Escalation criteria | When to get help | "Escalate if: still failing after 15 min, data corruption suspected" |
Keep runbooks close to the code. Store them in the same repository, update them during code reviews, version them with the application.
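Keeping runbooks in the repository also lets you enforce their existence in CI. A small sketch, assuming a hypothetical alerts.json listing alert names and a runbooks/ directory of Markdown files named after them.

```python
import json
from pathlib import Path

# Hypothetical CI check: every alert defined in alerts.json must have a
# matching runbook under runbooks/ in the same repository.
def missing_runbooks(alerts_file="alerts.json", runbook_dir="runbooks"):
    alerts = json.loads(Path(alerts_file).read_text())  # e.g. ["db-connpool-exhausted", ...]
    existing = {p.stem for p in Path(runbook_dir).glob("*.md")}
    return [name for name in alerts if name not in existing]

if __name__ == "__main__":
    missing = missing_runbooks()
    if missing:
        raise SystemExit(f"Alerts without runbooks: {', '.join(missing)}")
    print("Every alert has a runbook")
```

Failing the build on a missing runbook keeps documentation debt from accumulating silently.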
Practice scenarios
Run "Wheel of Misfortune" exercises monthly. Take past incidents, anonymize them, and have team members practice the response. Google's SRE guide details this practice extensively.
This builds confidence and reveals gaps in documentation or tools. It's better to discover missing access permissions during a drill than during a real incident. Review incident post-mortem best practices to maximize learning.
The human side of on-call
Technical solutions only work if people can execute them under stress.
Psychological safety
On-call engineers need to know they can escalate without judgment, ask for help without shame, and make mistakes without blame.
Create explicit policies about escalation. Make it clear that escalating too early is better than escalating too late. Reward people for asking for help when they need it.
Research from NIH shows that unpredictability of on-call work affects fatigue more than the actual workload. Clear processes reduce this unpredictability.
Compensation and recovery
On-call work outside business hours should be compensated. Either pay overtime or give compensatory time off.
More importantly, build recovery time into schedules. If someone gets paged overnight, they should be able to start late the next day or work from home.
Team dynamics
On-call rotations create natural tension between individual convenience and team coverage. Address this explicitly through team agreements about shift swapping, vacation scheduling, and workload distribution.
Regular retrospectives help surface problems before they become resentments. Ask what's working, what isn't, and what changes would help.
Metrics that matter
Track the health of your on-call system with data, not anecdotes.
| Metric category | Key indicator | Warning threshold | Action items |
|---|---|---|---|
| Volume | Pages per person per week | >5 pages/week | Review alert thresholds, add team members |
| Volume | Off-hours pages per month | >10 off-hours pages | Improve business-hours coverage, review criticality |
| Volume | Time to acknowledge | >15 minutes average | Check notification systems, review schedules |
| Quality | False positive rate | >20% false positives | Audit and tune alerts |
| Quality | Escalation rate | >30% need escalation | Improve runbooks, add training |
| Quality | MTTR by severity | SEV1 >30 min | Review incident response process |
| Fairness | Weekend coverage variance | >20% difference between people | Rebalance schedule |
| Fairness | Overnight page distribution | Uneven across team | Consider follow-the-sun model |
| Fairness | Quarterly on-call hours | >25% variance | Audit scheduling algorithm |
Use this data to identify problems: are alerts too noisy, is the schedule unfair, are people burning out?
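A lightweight way to act on this is to compare current numbers against the warning thresholds above on a schedule. The metric values below are placeholders; in practice they would come from your paging and incident tooling.

```python
# Hypothetical weekly health check against the warning thresholds above.
METRICS = {
    "pages_per_person_per_week": 6.2,
    "false_positive_rate": 0.27,
    "weekend_coverage_variance": 0.12,
}

THRESHOLDS = {
    "pages_per_person_per_week": 5,       # volume
    "false_positive_rate": 0.20,          # quality
    "weekend_coverage_variance": 0.20,    # fairness
}

def health_report(metrics, thresholds):
    """Print WARN for any metric that exceeds its threshold."""
    for name, limit in thresholds.items():
        value = metrics[name]
        status = "WARN" if value > limit else "ok"
        print(f"{status:4} {name}: {value} (limit {limit})")

health_report(METRICS, THRESHOLDS)
```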
Making on-call a growth opportunity
The best on-call programs turn operational burden into engineering advancement.
| Skill area | Junior engineer learns | Senior engineer develops |
|---|---|---|
| System understanding | How services connect | Cross-system dependencies |
| Debugging skills | Reading logs and metrics | Distributed tracing analysis |
| Communication | Status updates | Stakeholder management |
| Prioritization | Following runbooks | Making judgment calls |
| Automation mindset | Using existing tools | Building new solutions |
On-call engineers learn:
- How distributed systems actually behave under load
- How to debug across service boundaries
- How to communicate under pressure
- How to prioritize when everything seems urgent
These are senior engineer skills. Frame on-call as skill development, not punishment duty.
Create dedicated project time for on-call engineers to fix the root causes of recurring issues. Don't just treat symptoms, eliminate entire classes of problems through incident response automation.
Starting or fixing your on-call program
| Phase | New program steps | Fixing an existing program |
|---|---|---|
| Phase 1 | Start with clear service ownership | Audit alerts for noise and actionability |
| Phase 2 | Implement basic alerting and escalation | Survey engineers about pain points |
| Phase 3 | Create minimal runbooks | Measure current metrics for a baseline |
| Phase 4 | Begin with business-hours coverage only | Pick the biggest problem and fix it systematically |
| Phase 5 | Expand to 24/7 as systems mature | Iterate based on feedback and data |
On-call doesn't have to be miserable. It can be a structured way to build engineering skills while keeping systems reliable. The key is treating it as an engineering problem with engineering solutions, not an unavoidable burden.
Human beings can handle being occasionally woken up to fix important problems. They can't handle being constantly interrupted to fix unimportant ones.
Build systems that respect that distinction.
