Root cause analysis (RCA) is a systematic process for identifying the underlying cause of an incident or problem, rather than just addressing its symptoms. The goal is to understand why the incident occurred so that preventive measures can be implemented to avoid recurrence.
Common RCA techniques include the "Five Whys" (repeatedly asking "why" until you reach the fundamental cause), fishbone diagrams (categorizing potential causes by type), timeline analysis (mapping events chronologically to identify the trigger), and fault tree analysis (modeling how combinations of failures led to the incident).
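Of these techniques, fault tree analysis maps most directly onto code, because a fault tree is a boolean expression over individual failures. The following is a minimal sketch rather than a standard implementation: the class names, the `occurs` function, and the example outage with its event names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class BasicEvent:
    """A leaf of the tree: one failure that either happened or did not."""
    name: str


@dataclass
class Gate:
    """An intermediate node that combines its children with AND or OR logic."""
    kind: str  # "AND": every child must occur; "OR": any one child is enough
    children: List[Union["Gate", BasicEvent]]


Node = Union[Gate, BasicEvent]


def occurs(node: Node, observed: Dict[str, bool]) -> bool:
    """Return True if this node fires, given which basic events were observed."""
    if isinstance(node, BasicEvent):
        # Events missing from the observations are treated as "did not happen".
        return observed.get(node.name, False)
    results = [occurs(child, observed) for child in node.children]
    if node.kind == "AND":
        return all(results)
    if node.kind == "OR":
        return any(results)
    raise ValueError(f"unknown gate kind: {node.kind!r}")


# Hypothetical top event: an outage requires a bad change (either kind)
# AND the absence of an automated rollback.
outage = Gate("AND", [
    Gate("OR", [BasicEvent("bad_deploy"), BasicEvent("config_error")]),
    BasicEvent("no_automated_rollback"),
])
```

Calling `occurs(outage, {"bad_deploy": True, "no_automated_rollback": True})` evaluates to `True`, while a bad deploy alone does not trigger the top event. Modeling the incident this way makes it explicit that removing either branch of the AND gate would have prevented the outage.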
In software engineering, root causes often fall into categories like code defects, configuration errors, capacity issues, dependency failures, human errors in operations, or process gaps. Modern SRE practice emphasizes that most incidents have multiple contributing factors rather than a single root cause, and the most impactful improvements often come from addressing systemic issues (like lack of monitoring coverage or insufficient testing) rather than the immediate trigger.
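One way to surface systemic issues is to record the contributing factors for each incident and count how often each one recurs, independent of the immediate trigger. The sketch below assumes a simple, hypothetical record format; the incident IDs, field names, and data are invented for illustration and do not come from any real postmortem.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Made-up incident records. Each has one immediate trigger and a list of
# contributing factors identified during the review.
incidents: List[Dict] = [
    {"id": "INC-1", "trigger": "code defect",
     "contributing": ["insufficient testing", "monitoring gap"]},
    {"id": "INC-2", "trigger": "configuration error",
     "contributing": ["monitoring gap", "process gap"]},
    {"id": "INC-3", "trigger": "capacity issue",
     "contributing": ["monitoring gap"]},
]


def rank_contributing_factors(records: List[Dict]) -> List[Tuple[str, int]]:
    """Count how many incidents each contributing factor appears in."""
    counts: Counter = Counter()
    for record in records:
        # set() ensures a factor is counted at most once per incident.
        counts.update(set(record["contributing"]))
    # Most frequent factors first.
    return counts.most_common()


ranking = rank_contributing_factors(incidents)
```

In this made-up data every incident has a different trigger, yet "monitoring gap" appears in all three, so it ranks first. That is the pattern the paragraph above describes: fixing each trigger separately would leave the shared systemic weakness in place.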