The Cost of a Single Point of Failure
In enterprise networking, a single point of failure (SPOF) is any component whose failure causes the entire network — or a critical segment — to stop functioning. A single ISP connection, a single core switch, a single gateway. Most networks have more SPOFs than their IT teams realize, and they discover them at the worst possible moment: during a critical business operation, a client presentation, or a peak revenue period.
True high availability (HA) networking is not about buying expensive equipment. It is about designing redundancy into every critical path before deployment, then testing those redundancy mechanisms regularly.
Layer 1: WAN Redundancy
Dual-ISP Configuration with UniFi
The UniFi Dream Machine SE and Enterprise Fortress Gateway support dual WAN interfaces with automatic failover and load balancing. The configuration options include:
- Active/Passive Failover: Primary ISP carries all traffic. Secondary activates only when primary fails. Failover time: under 30 seconds with proper health-check configuration.
- Active/Active Load Balancing: Both ISPs carry traffic simultaneously, weighted by capacity. Failover is faster because the secondary link is already up and carrying traffic, although sessions that were using the failed link must still re-establish over the surviving one.
- Per-Source or Per-Destination Policy Routing: Advanced configurations pin specific traffic types (VoIP, critical SaaS) to a preferred ISP.
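The three modes above differ mainly in how the gateway picks a WAN link for each new connection. The sketch below models that choice under stated assumptions and is not UniFi's implementation. The `WanLink` and `pick_wan` names, the capacity weights, and the SHA-256 hash of the flow key are all hypothetical. Active/passive can be thought of as the special case where the secondary link gets no traffic until the primary is marked unhealthy.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class WanLink:
    name: str
    capacity_mbps: int
    healthy: bool = True  # in practice, driven by the gateway's health checks


def pick_wan(links, flow_key, preferred=None):
    """Pick the WAN link for a new flow (hypothetical model, not UniFi code).

    - Policy routing: if `preferred` names a healthy link, use it.
    - Active/active: otherwise spread flows over healthy links in
      proportion to capacity, using a stable hash of the flow key.
    - Failover: unhealthy links are simply excluded.
    """
    healthy = [link for link in links if link.healthy]
    if not healthy:
        raise RuntimeError("no healthy WAN link")
    for link in healthy:
        if link.name == preferred:
            return link
    total = sum(link.capacity_mbps for link in healthy)
    digest = hashlib.sha256(flow_key.encode()).digest()
    point = int.from_bytes(digest[:8], "big") % total
    for link in healthy:
        if point < link.capacity_mbps:
            return link
        point -= link.capacity_mbps


wan1 = WanLink("wan1", capacity_mbps=1000)
wan2 = WanLink("wan2", capacity_mbps=500)
links = [wan1, wan2]

flows = [f"10.0.0.{i % 200}:{50000 + i}->203.0.113.7:443" for i in range(3000)]
share_wan1 = sum(pick_wan(links, f) is wan1 for f in flows) / len(flows)
print(f"wan1 share with both links up: {share_wan1:.2f}")  # roughly 2/3

wan1.healthy = False  # simulate a primary ISP failure
print({pick_wan(links, f).name for f in flows})  # {'wan2'}
```

Hashing the flow key instead of choosing a link per packet keeps each connection on one ISP. This matters because NAT requires every packet of a connection to leave from the same public IP address.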
For truly critical environments, WAN connections should come from providers using physically separate cable routes — not just different ISPs sharing the same conduit to the building.
Layer 2: Switching Redundancy
Link Aggregation (LAG/LACP)
Link aggregation groups multiple physical switch ports into a single logical connection, providing both additional aggregate bandwidth and failover. A 2-port LAG of 1 Gbps links between the core switch and gateway delivers up to 2 Gbps of aggregate throughput and continues operating at 1 Gbps if one port or cable fails. Because the switch hashes each flow onto a single member link to preserve packet order, any individual flow is still limited to 1 Gbps. UniFi configures LAG under switch port profiles with LACP enabled.
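The 2 Gbps figure is an aggregate across many flows. A switch assigns each flow to one member link by hashing its header fields. The sketch below shows both the spread across two ports and the rehash onto the surviving port after a failure. `member_for_flow` is a hypothetical helper, and SHA-256 stands in for the switch's own hardware hash.

```python
import hashlib
from collections import Counter


def member_for_flow(src_ip, dst_ip, src_port, dst_port, members):
    """Pick the LAG member port for a flow by hashing its L3/L4 header fields.

    Switches use their own hardware hash; SHA-256 is only a stand-in here.
    The property that matters is the same: one flow always maps to one port.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    index = int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % len(members)
    return members[index]


flows = [("10.0.1.10", "10.0.2.20", 40000 + i, 443) for i in range(1000)]

# Both 1 Gbps members up: flows spread across the two ports.
both_up = Counter(member_for_flow(*f, ["port1", "port2"]) for f in flows)
print(both_up)

# port2 or its cable fails: it leaves the LAG and every flow rehashes onto port1.
one_up = Counter(member_for_flow(*f, ["port1"]) for f in flows)
print(one_up)  # Counter({'port1': 1000})
```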
Stacked and Redundant Core Switching
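For mission-critical environments, the core switching layer itself must be redundant. A second core switch, with each access switch uplinked to both, removes the core as a single point of failure. UniFi's enterprise switches also offer dual PSU models, so a single failed power supply does not take a switch offline. Dual uplinks create physical loops. Spanning Tree Protocol (STP) or Rapid Spanning Tree Protocol (RSTP) handles this by blocking the redundant paths to prevent broadcast storms and unblocking them when an active path fails. RSTP typically reconverges within a few seconds, compared with 30-50 seconds for classic STP.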
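The sketch below shows how STP keeps a dual-homed topology loop-free while preserving failover. It uses a deliberately simplified model: equal-cost point-to-point links, at most one link between any two switches, and integer bridge IDs where the lowest wins. It elects a root bridge, gives every other switch one root port toward it, and blocks every remaining link. Real 802.1D/802.1w also weighs link speed, port IDs, and timers. `spanning_tree` is a hypothetical helper.

```python
from collections import deque


def spanning_tree(bridges, links):
    """Simplified STP: return (root, forwarding_links, blocked_links).

    Assumptions: point-to-point links of equal cost, at most one link between
    any pair of switches, and integer bridge IDs where the lowest ID wins.
    Links are returned as frozensets of the two bridge IDs.
    """
    neighbors = {b: set() for b in bridges}
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)

    root = min(bridges)  # root bridge election: lowest bridge ID wins

    # Breadth-first search gives each switch its path cost (hop count) to root.
    cost = {root: 0}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for nxt in neighbors[current]:
            if nxt not in cost:
                cost[nxt] = cost[current] + 1
                queue.append(nxt)

    # Each non-root switch forwards on exactly one root port: the link to the
    # neighbor closest to the root, ties broken by the lowest bridge ID.
    forwarding = set()
    for b in bridges:
        if b == root or b not in cost:
            continue
        upstream = min(n for n in neighbors[b] if cost.get(n) == cost[b] - 1)
        forwarding.add(frozenset((b, upstream)))

    # Every other link would close a loop, so it stays blocked until needed.
    blocked = {frozenset(link) for link in links} - forwarding
    return root, forwarding, blocked


# Two cores (bridge IDs 1 and 2) and two access switches (10, 11), each
# access switch uplinked to both cores, plus a core-to-core link.
bridges = [1, 2, 10, 11]
links = [(1, 2), (1, 10), (1, 11), (2, 10), (2, 11)]
root, forwarding, blocked = spanning_tree(bridges, links)
print(root, sorted(tuple(sorted(link)) for link in blocked))  # 1 [(2, 10), (2, 11)]

# Core 1 fails: its links disappear and the previously blocked uplinks forward.
survivors = [b for b in bridges if b != 1]
live_links = [link for link in links if 1 not in link]
root2, forwarding2, blocked2 = spanning_tree(survivors, live_links)
print(root2, sorted(tuple(sorted(link)) for link in forwarding2))  # 2 [(2, 10), (2, 11)]
```

The lowest bridge ID becomes root, and a bridge ID combines a configurable priority with the switch MAC address. Designs therefore usually set a lower priority on the intended primary core switch instead of leaving the election to MAC addresses.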
Layer 3: Power Redundancy
Network availability depends entirely on power availability. An otherwise perfectly redundant network goes down the moment the UPS is overloaded or its batteries run flat. That is exactly what happens when the UPS is undersized or lacks the runtime to bridge an extended power event. Every professional high-availability deployment includes:
- UPS with sufficient capacity for all networking equipment plus 15-20% margin
- Runtime calculation based on actual load (not theoretical maximum)
- Integration with generator feed for extended power events
- Network-aware UPS systems that perform graceful shutdowns if runtime is exhausted
- UPS battery replacement schedule (typically every 3-4 years)
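The first two checklist items reduce to simple arithmetic. The sketch below sizes the UPS on watts, since UPS units are rated in both VA and W. It treats runtime as usable battery energy divided by load, which overstates real runtime. The device loads, the 1000 Wh battery pack, the 90% inverter efficiency, and both function names are hypothetical.

```python
def required_ups_watts(device_watts, margin=0.20):
    """UPS output rating needed: measured load plus a safety margin (default 20%)."""
    return sum(device_watts.values()) * (1 + margin)


def estimated_runtime_min(battery_wh, load_w, inverter_efficiency=0.90):
    """First-order runtime estimate: usable battery energy divided by load.

    Assumes a flat 90% inverter efficiency. Real batteries deliver less at high
    discharge rates and as they age, so treat this as an optimistic upper bound
    and cross-check it against the manufacturer's runtime chart.
    """
    return battery_wh * inverter_efficiency / load_w * 60


# Hypothetical measured loads in watts (from UPS or PDU metering, not nameplate).
loads = {"gateway": 40, "core_switch": 100, "poe_switch_1": 210, "poe_switch_2": 210}
actual_load = sum(loads.values())  # 560 W
print(f"UPS rating needed: {required_ups_watts(loads):.0f} W")  # 672 W
print(f"Runtime on a 1000 Wh pack: {estimated_runtime_min(1000, actual_load):.0f} min")  # 96 min
```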
Monitoring and Proactive Incident Response
High availability architecture is only valuable if someone is watching it. UniFi's monitoring stack — combined with external monitoring services — provides:
- Real-time alerts for WAN failover events
- ISP latency and packet loss monitoring
- Port flap detection on critical switch uplinks
- AP offline alerts with location context
- Gateway CPU and memory utilization thresholds
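Port flap detection usually means counting link up/down transitions inside a sliding time window. The `FlapDetector` class below is a hypothetical sketch of that logic. The threshold of 5 transitions in 60 seconds is an assumption, not a UniFi default, and the port names are illustrative.

```python
from collections import deque


class FlapDetector:
    """Flag a port as flapping when its link state changes too often.

    Hypothetical thresholds: 5 or more transitions within 60 seconds.
    """

    def __init__(self, max_transitions=5, window_s=60.0):
        self.max_transitions = max_transitions
        self.window_s = window_s
        self.history = {}  # port name -> deque of transition timestamps

    def record(self, port, timestamp):
        """Record one up/down transition; return True if the port is flapping."""
        events = self.history.setdefault(port, deque())
        events.append(timestamp)
        # Drop transitions that have aged out of the sliding window.
        while events and timestamp - events[0] > self.window_s:
            events.popleft()
        return len(events) >= self.max_transitions


detector = FlapDetector()
# An uplink bouncing every 10 seconds trips the alert on its fifth transition.
results = [detector.record("core-uplink-1", t) for t in (0, 10, 20, 30, 40)]
print(results)  # [False, False, False, False, True]
```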
In a managed services agreement, our NOC receives these alerts and begins remote diagnosis before the client's IT team is even aware of an issue.
Testing Your Redundancy: The Step Most Teams Skip
Redundancy that has never been tested is redundancy that may not work when needed. A proper HA implementation includes scheduled failover tests:
- Quarterly WAN failover tests: disconnect the primary ISP and verify that traffic moves to the secondary ISP within the SLA
- Annual core switch failover: simulate a switch failure and verify the network continues operating
- Semiannual UPS runtime test: run the network on UPS power and compare the measured runtime with the calculated figure
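During a WAN failover test, the number to compare with the SLA is the longest gap in connectivity. The sketch below assumes a probe log collected separately, for example one ping per second to an external host. `longest_outage`, the sample log, and the 30-second SLA target are hypothetical placeholders.

```python
import math


def longest_outage(probes):
    """Longest interval, in seconds, between two successful probes that had
    failures in between.

    `probes` is a time-ordered list of (timestamp_s, succeeded) pairs; the
    result is accurate to about one probe interval. Returns math.inf if the
    log ends while probes are still failing (failover never completed).
    Failures before the first success are not counted because the outage
    start is unknown.
    """
    longest = 0.0
    last_ok = None
    down = False
    for ts, ok in probes:
        if ok:
            if down and last_ok is not None:
                longest = max(longest, ts - last_ok)
            last_ok = ts
            down = False
        else:
            down = True
    if down:
        return math.inf
    return longest


# Hypothetical 1 Hz log: primary ISP unplugged at t=10, traffic back at t=32.
probes = [(t, not 10 <= t <= 31) for t in range(41)]
outage = longest_outage(probes)
sla_s = 30  # hypothetical SLA target
print(f"outage {outage:.0f} s, {'PASS' if outage <= sla_s else 'FAIL'}")  # outage 23 s, PASS
```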
Frequently Asked Questions
How quickly does UniFi switch to the backup ISP?
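With health checks every 5 seconds and a failure threshold of 3 consecutive missed checks, the gateway declares the primary ISP down and initiates failover roughly 15-20 seconds after the outage begins. Total failover time, including switching the default route to the backup WAN, is typically under 30 seconds. Sessions that were using the failed link must then re-establish over the backup.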
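The detection time follows from the probe interval and the failure threshold. The sketch below models it under stated assumptions: probes go out on a fixed 5-second grid, and each lost probe is counted only after a per-probe timeout. `detection_delay` is hypothetical and does not describe UniFi's internal timing. With no timeout, detection lands 10 to 15 seconds after the outage, depending on where the outage falls relative to the probe schedule. If each probe waits up to 5 seconds for a reply, the window shifts to the 15-20 seconds cited above.

```python
import math


def detection_delay(failure_s, interval_s=5.0, threshold=3, probe_timeout_s=0.0):
    """Seconds from an ISP failure until the link is declared down.

    Simplified model (an assumption, not documented UniFi behaviour): probes
    are sent at t = 0, interval_s, 2 * interval_s, ...; every probe sent at or
    after the failure is lost, and each loss is only known probe_timeout_s
    after that probe was sent. The link is declared down after `threshold`
    consecutive losses.
    """
    first_lost = math.ceil(failure_s / interval_s)  # index of first failed probe
    declared_at = (first_lost + threshold - 1) * interval_s + probe_timeout_s
    return declared_at - failure_s


for timeout in (0.0, 5.0):
    delays = [detection_delay(t / 10, probe_timeout_s=timeout) for t in range(1, 100)]
    print(f"timeout {timeout:.0f} s: {min(delays):.1f}-{max(delays):.1f} s")
```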
Do we need two physical gateway devices for true HA?
UniFi does not currently support active/active gateway clustering at the hardware level in the same way some enterprise vendors do. However, dual-WAN failover on a single gateway, combined with redundant switching and power, addresses the majority of failure scenarios for most enterprises. For environments requiring sub-second gateway failover, a dedicated hardware firewall cluster may be required.
What is the difference between high availability and disaster recovery?
High availability addresses continuous operations through component failure. Disaster recovery (DR) addresses catastrophic failures — building loss, data center destruction. Both are relevant for enterprise planning but require different architectural approaches.