Failover is the process of automatically or manually switching operations from a failed primary system to a standby (secondary) system. The goal is to maintain service continuity with minimal disruption when the primary system experiences a failure.
Failover can be automatic (triggered by health checks detecting a failure) or manual (initiated by an operator). Automatic failover is faster but requires careful configuration: if it triggers on a single failed check, a transient network blip can cause an unnecessary switchover. The time it takes to detect a failure and complete the failover counts directly against your RTO (Recovery Time Objective), the maximum acceptable time to restore service.
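One common way to reduce false triggers is to require several consecutive failed health checks before failing over. The sketch below illustrates that idea and gives a rough estimate of how the check settings feed into failover time. The names `FailoverMonitor`, `failure_threshold`, and `estimated_failover_seconds` are hypothetical, chosen for illustration, and do not come from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Illustrative trigger: fail over only after N consecutive failed checks."""

    failure_threshold: int = 3      # assumed value; tune for your environment
    consecutive_failures: int = 0
    failed_over: bool = False

    def record(self, healthy: bool) -> bool:
        """Record one health-check result.

        Returns True only on the check that triggers the failover.
        """
        if self.failed_over:
            return False
        if healthy:
            # A single success resets the streak, so brief blips are ignored.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.failed_over = True
            return True
        return False


def estimated_failover_seconds(check_interval_s: int,
                               failure_threshold: int,
                               switchover_s: int) -> int:
    """Rough estimate: time to accumulate the failed checks plus the switch itself."""
    return check_interval_s * failure_threshold + switchover_s


monitor = FailoverMonitor(failure_threshold=3)
# Two failures, a recovery, then three failures in a row.
results = [monitor.record(h) for h in [False, False, True, False, False, False]]
```

With a 30-second check interval, a threshold of 3, and a 20-second switchover, the estimate comes to about 110 seconds, which would need to fit inside the RTO. A higher threshold lowers the risk of a false trigger but lengthens detection, so the two settings are a trade-off.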
Common failover architectures include active-passive (a standby system takes over when the primary fails), active-active (multiple systems share the load and absorb each other's traffic during failures), and DNS-based failover (DNS records are updated to point to backup servers). Monitoring plays a critical role in failover: tools like Hyperping can detect when a primary endpoint goes down and verify that the failover endpoint is serving correctly.
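As a minimal sketch of the active-passive pattern, the code below probes an ordered list of endpoints and returns the first one that responds successfully, which also serves as a basic check that the standby is actually serving. It uses only the Python standard library; `http_probe`, `choose_endpoint`, and the example URLs are hypothetical names for illustration, and the sketch is not how Hyperping or any specific tool implements its checks.

```python
import http.client
import urllib.error
import urllib.request
from typing import Callable, Optional, Sequence


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection errors, timeouts, HTTP error statuses, and malformed URLs
        # are all treated as "unhealthy".
        return False


def choose_endpoint(endpoints: Sequence[str],
                    probe: Callable[[str], bool] = http_probe) -> Optional[str]:
    """Return the first healthy endpoint in priority order (primary first).

    Returns None if no endpoint passes the probe.
    """
    for url in endpoints:
        if probe(url):
            return url
    return None


# Example with a stubbed probe: the primary is down, the standby is up.
endpoints = ["https://primary.example", "https://standby.example"]
active = choose_endpoint(endpoints, probe=lambda u: u != "https://primary.example")
```

Passing the probe in as a parameter keeps the selection logic testable without network access. In practice the probe would usually check an application-level health endpoint rather than just any 2xx response.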