Incident Response Workflow

A structured approach to incidents

When a production incident hits, the natural instinct is to scramble. Everyone jumps in, people duplicate effort, communication breaks down, and the resolution takes longer than it should. A defined incident response workflow prevents this by giving every team member a clear role and a predictable sequence of steps.

The workflow follows six phases: detection, acknowledgment, investigation, resolution, communication, and postmortem.

Phase 1: Detection

Detection is the starting point. Your monitoring system identifies an anomaly, confirms it from multiple regions, and triggers an alert through your escalation policy.

Automated detection is the goal, but not every incident gets caught by monitors. Customer reports, support tickets, and internal team observations also surface problems. Regardless of the source, every reported issue should follow the same workflow from this point forward.
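The idea that every detection source enters the same workflow can be sketched in a few lines. This is an illustrative sketch only; the `Incident` record, source names, and `open_incident` helper are assumptions, not any real tool's API:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Funnel every detection source (monitor alert, support ticket,
# internal report) into the same incident record so the rest of the
# workflow is identical regardless of origin.
VALID_SOURCES = {"monitor", "customer_report", "support_ticket", "internal"}

@dataclass
class Incident:
    source: str
    summary: str
    detected_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    status: str = "detected"

def open_incident(source: str, summary: str) -> Incident:
    """Create an incident record; every source enters the same workflow."""
    if source not in VALID_SOURCES:
        raise ValueError(f"unknown detection source: {source}")
    return Incident(source=source, summary=summary)

# An automated alert and a customer report both start at "detected":
auto = open_incident("monitor", "Checkout latency above 2s from 3 regions")
manual = open_incident("support_ticket", "Customer reports failed payments")
```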

Hyperping's incident manager provides a centralized place to track incidents from the moment they are detected.

Phase 2: Acknowledgment

The on-call responder receives the alert and acknowledges it. Acknowledgment tells the rest of the team that someone is actively looking at the problem and stops the alert from escalating further.

Acknowledgment should happen within your defined response window, typically 5-15 minutes for critical incidents. If the on-call responder does not acknowledge, the alert escalates to the secondary responder (see Chapter 5: On-Call Scheduling).
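The escalation rule above can be expressed as a small function. This is a minimal sketch, assuming a three-level chain and a 10-minute window; the responder names are made up for illustration:

```python
# If the primary on-call responder has not acknowledged within the
# window (5-15 minutes for critical incidents), escalate one step down
# the chain per elapsed window.
ESCALATION_CHAIN = ["primary-oncall", "secondary-oncall", "engineering-manager"]

def current_responder(minutes_unacknowledged: float,
                      ack_window_minutes: float = 10) -> str:
    """Walk one step down the chain per elapsed acknowledgment window."""
    step = int(minutes_unacknowledged // ack_window_minutes)
    return ESCALATION_CHAIN[min(step, len(ESCALATION_CHAIN) - 1)]

print(current_responder(4))    # primary-oncall: still inside the window
print(current_responder(12))   # secondary-oncall: window missed once
```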

At this stage, the responder makes an initial severity assessment:

  • Critical: complete outage or data integrity risk (e.g., database unreachable, payment processing down)
  • High: major feature broken, large user impact (e.g., login failing for 50%+ of users)
  • Medium: degraded performance, partial impact (e.g., elevated latency, one region affected)
  • Low: minor issue, limited user impact (e.g., non-critical background job failing)

Critical and high-severity incidents trigger the full response workflow. Medium and low incidents might be handled by the on-call engineer alone.
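The "full workflow only for critical and high" rule is easy to encode. A minimal sketch, with severity labels mirroring the list above; treat the ordering as a starting point:

```python
# Only critical and high severities assemble the full response team;
# medium and low can be handled by the on-call engineer alone.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]

def needs_full_response(severity: str) -> bool:
    """True when the severity warrants assembling the response team."""
    if severity not in SEVERITY_ORDER:
        raise ValueError(f"unknown severity: {severity}")
    return SEVERITY_ORDER.index(severity) >= SEVERITY_ORDER.index("high")

assert needs_full_response("critical")
assert not needs_full_response("medium")
```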

Phase 3: Investigation

For critical incidents, assemble the response team and assign roles:

Incident Commander (IC)

The IC owns the incident from start to finish. They coordinate the response, make decisions about next steps, and track the overall progress. The IC does not need to be the most senior engineer. They need to be organized, calm under pressure, and able to delegate effectively.

The IC's responsibilities:

  • Declare the incident severity and scope
  • Assign roles to the response team
  • Decide on action items and priorities
  • Authorize risky changes (rollbacks, failovers)
  • Call for additional help if needed

Communications Lead

The comms lead handles all external and internal communication. They update the status page, post to internal Slack channels, notify customer support, and draft customer-facing messages. Separating communication from investigation lets the technical responders focus on fixing the problem.

Technical Responders

The engineers doing the actual debugging and remediation. They investigate logs, metrics, and traces, test hypotheses, and implement fixes. They report their findings and progress to the IC.

Investigation process

Follow a structured troubleshooting approach:

  1. Identify what changed. Check recent deployments, configuration changes, infrastructure updates, and third-party service statuses. Most incidents correlate with a recent change.
  2. Narrow the blast radius. Determine which services, regions, or user segments are affected. This focuses the investigation and helps prioritize the response.
  3. Check your runbooks. If you have a runbook for this type of failure, follow it. Runbooks codify the team's collective knowledge and prevent re-learning the same troubleshooting steps under pressure.
  4. Collect evidence. Gather logs, metrics, error rates, and screenshots. This evidence is valuable both for resolving the current incident and for the postmortem.
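Step 1, "identify what changed", amounts to listing every change in the window before the incident started. An illustrative sketch, assuming a simple change-log format (the `at`/`what` fields and four-hour lookback are assumptions):

```python
from datetime import datetime, timedelta

def recent_changes(changes, incident_start, lookback_hours=4):
    """Changes within the lookback window before the incident began,
    most recent first (the likeliest suspects)."""
    cutoff = incident_start - timedelta(hours=lookback_hours)
    hits = [c for c in changes if cutoff <= c["at"] <= incident_start]
    return sorted(hits, key=lambda c: c["at"], reverse=True)

changes = [
    {"at": datetime(2024, 5, 1, 9, 0),   "what": "deploy api v142"},
    {"at": datetime(2024, 5, 1, 13, 30), "what": "feature flag: new cache"},
    {"at": datetime(2024, 4, 30, 18, 0), "what": "infra: resize db"},
]
suspects = recent_changes(changes, datetime(2024, 5, 1, 14, 0))
print([c["what"] for c in suspects])  # ['feature flag: new cache']
```

In practice the change log would come from your deploy pipeline or audit trail; the point is to check recency first, since most incidents correlate with a recent change.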

Phase 4: Resolution

Resolution means the service is restored to normal operation. This might involve:

  • Rolling back a deployment
  • Applying a hotfix
  • Failing over to a backup system
  • Scaling infrastructure to handle load
  • Contacting a third-party provider to resolve an upstream issue

Once the fix is in place, verify recovery by checking your monitoring dashboards and confirming that alerts have cleared. Do not declare an incident resolved until your monitors confirm the service is healthy from all regions.
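The "healthy from all regions" rule can be made concrete by requiring several consecutive passing checks per region before declaring recovery. A sketch under assumed data shapes (region names and the check-result lists are illustrative):

```python
def fully_recovered(region_checks: dict[str, list[bool]],
                    consecutive_ok: int = 3) -> bool:
    """True only if every region's last N checks all passed."""
    return all(
        len(checks) >= consecutive_ok and all(checks[-consecutive_ok:])
        for checks in region_checks.values()
    )

checks = {
    "us-east":  [False, True, True, True],
    "eu-west":  [True, True, True, True],
    "ap-south": [True, True, False, True],  # still flapping
}
print(fully_recovered(checks))  # False: ap-south has not stabilized
```

Requiring consecutive passes guards against declaring victory on a single lucky check while the service is still flapping.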

If a temporary workaround restores service but the root cause is not yet fixed, document the workaround and create a follow-up ticket for the permanent fix.

Phase 5: Communication

Communication runs in parallel with every phase, but it deserves special attention as its own step. During the incident:

  • Update your status page with the current state and estimated resolution time
  • Notify affected customers through status page subscriptions
  • Post regular updates to internal channels, even if the update is "still investigating"
  • Inform customer support so they can handle incoming tickets

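One way to make "regular updates, even if still investigating" enforceable is a per-severity cadence. The intervals below are illustrative defaults, not a recommendation:

```python
# Post an update on a fixed interval per severity so internal channels
# and the status page never go quiet during an incident.
UPDATE_INTERVAL_MINUTES = {"critical": 15, "high": 30, "medium": 60}

def update_due(severity: str, minutes_since_last_update: float) -> bool:
    """True when the comms lead should post the next update."""
    interval = UPDATE_INTERVAL_MINUTES.get(severity, 60)
    return minutes_since_last_update >= interval

assert update_due("critical", 20)      # overdue at 15-minute cadence
assert not update_due("high", 10)      # next update not due yet
```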
See Chapter 7: Status Pages and Incident Communication for detailed guidance on what to communicate and how often.

Phase 6: Postmortem

Every critical and high-severity incident should have a postmortem. Schedule it within 48 hours of resolution while the details are fresh.

A good postmortem answers:

  • What happened? A factual timeline of events from detection to resolution.
  • Why did it happen? Root cause analysis, going at least two or three "whys" deep.
  • How did we respond? What went well, what could have gone better in the response process.
  • What will we do to prevent recurrence? Concrete action items with owners and deadlines.

Postmortems should be blameless. The goal is to improve systems and processes, not to assign fault. An engineer who deployed a buggy change is not the root cause. The lack of automated testing, canary deployments, or rollback procedures that allowed the bug to reach production is.

Track postmortem action items in your project management tool and review them regularly. An action item without a deadline and an owner is just a wish.
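The "an action item without a deadline and an owner is just a wish" rule lends itself to an automated check. A minimal sketch; the field names are illustrative:

```python
def wishes(action_items: list[dict]) -> list[str]:
    """Return titles of action items missing an owner or a deadline."""
    return [
        item["title"]
        for item in action_items
        if not item.get("owner") or not item.get("deadline")
    ]

items = [
    {"title": "Add canary deploys", "owner": "ana", "deadline": "2024-06-01"},
    {"title": "Improve test coverage", "owner": None, "deadline": None},
]
print(wishes(items))  # ['Improve test coverage']
```

A check like this could run as part of the regular action-item review to surface items that have drifted into wish territory.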

Runbooks

Runbooks are pre-written guides for handling specific types of incidents. A runbook for "database connection pool exhaustion" might include steps to check active connections, identify long-running queries, restart connection pools, and scale the database if needed.

Write runbooks for your most common incident types. Keep them in a location the on-call team can access quickly during an incident (a shared wiki, a docs repo, or directly linked from your monitoring alerts). Review and update runbooks after each postmortem.
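Linking runbooks directly from alerts can be as simple as keying them by alert type. A sketch only; the steps paraphrase the connection-pool example above and the fallback behavior is an assumption:

```python
# Key runbooks by alert type so a monitoring alert can link straight
# to the right steps during an incident.
RUNBOOKS = {
    "db_pool_exhaustion": [
        "Check active connections against the pool limit",
        "Identify and kill long-running queries",
        "Restart the application connection pools",
        "Scale the database if saturation persists",
    ],
}

def runbook_for(alert_type: str) -> list[str]:
    """Return the runbook steps, or a fallback prompting escalation."""
    return RUNBOOKS.get(alert_type, ["No runbook found: escalate to the IC"])

for step in runbook_for("db_pool_exhaustion"):
    print("-", step)
```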

The next chapter covers how to communicate with users during incidents through status pages.
