On April 21, 2025, at 9:32 PM, external access to the network infrastructure was interrupted. Our external monitoring system detected the interruption, raised an alert, and immediately notified the on-call employee.
Timeline of the Incident
- 9:32 PM – Alert from external monitoring due to interrupted external connections.
- 9:35 PM – On-call staff begin to analyze the network connections.
- 9:40 PM – The cause is identified, and initial countermeasures are started.
- 9:50 PM – The network infrastructure is fully operational again.
Cause Analysis
The outage was caused by human error in conjunction with an automated system process. During an automated system update, an IPv4 link-local unicast address was removed from the router configuration. This address served as the link between the two routers and was used by the Keepalived service to switch the gateway dynamically in the event of a failover. Without this address, the keep-alive connection between the routers could not be established, which led to the outage.
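To make the dependency described above concrete, the following is a minimal Python sketch that checks whether a list of configured addresses still contains an IPv4 link-local address (the 169.254.0.0/16 range). The function name and the sample addresses are hypothetical; the actual router configuration format is not part of this report.

```python
import ipaddress

def link_local_ipv4(addresses):
    """Return the IPv4 link-local entries (169.254.0.0/16) from a list of CIDR strings."""
    found = []
    for entry in addresses:
        iface = ipaddress.ip_interface(entry)
        # is_link_local is True for 169.254.0.0/16 on IPv4 addresses
        if iface.version == 4 and iface.ip.is_link_local:
            found.append(entry)
    return found

# Hypothetical address lists before and after the automated update.
before_update = ["192.0.2.10/24", "169.254.10.1/30"]
after_update = ["192.0.2.10/24"]

print(link_local_ipv4(before_update))  # ['169.254.10.1/30']
print(link_local_ipv4(after_update))   # []
```

An empty result after an update would indicate that the address needed for the keep-alive connection is no longer configured.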
The situation was further complicated as the redundant router was unavailable due to planned maintenance at that time. As a result, there was no fallback option, leading to a total interruption of external access to the systems.
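The missing fallback can be expressed as a simple precondition: a change to one router should only proceed while another router is available. The sketch below illustrates that rule under stated assumptions; the router names, the status values, and the function `change_allowed` are hypothetical and do not describe our actual tooling.

```python
def change_allowed(target, router_status):
    """Allow a change on `target` only if at least one other router is 'available'."""
    return any(
        status == "available"
        for name, status in router_status.items()
        if name != target
    )

# Hypothetical state during the incident: the redundant router was in maintenance.
status_during_incident = {"router-a": "available", "router-b": "maintenance"}
print(change_allowed("router-a", status_during_incident))  # False

# Hypothetical normal state: both routers are available.
status_normal = {"router-a": "available", "router-b": "available"}
print(change_allowed("router-a", status_normal))  # True
```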
Remediation Measures
After the problem was identified, the removed address was restored to the configuration and the router was rebooted. These measures fully restored the network infrastructure within 18 minutes of the initial alert (9:32 PM to 9:50 PM).
Reflection and Preventive Measures
To prevent similar incidents in the future, the following steps have been decided:
- Increase Redundancy: Future maintenance will be planned so that a complete failover path remains active and no critical infrastructure is operated without a backup.
- Stabilization of System Updates: Automatic updates will be adjusted to prevent the accidental removal of essential configuration elements such as link-local addresses.
- Enhanced Monitoring and Configuration Management: Changes to network-critical components will be monitored more closely and documented transparently in the future.
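One possible way to implement the safeguard for automatic updates is a check that runs before a new configuration is applied and rejects it if a protected entry would disappear. The following Python sketch assumes that configurations can be represented as lists of address strings; the function names, the protected list, and the sample values are hypothetical.

```python
def missing_protected_entries(proposed_addresses, protected_addresses):
    """Return protected addresses that the proposed configuration no longer contains."""
    return sorted(set(protected_addresses) - set(proposed_addresses))

def check_update(proposed_addresses, protected_addresses):
    """Raise ValueError if applying the proposed configuration would drop a protected address."""
    missing = missing_protected_entries(proposed_addresses, protected_addresses)
    if missing:
        raise ValueError(f"update would remove protected addresses: {missing}")
    return True

# Hypothetical example: the link-local address is marked as protected.
protected = ["169.254.10.1/30"]
proposed = ["192.0.2.10/24"]

try:
    check_update(proposed, protected)
except ValueError as err:
    print(err)
```

In this sketch the update is rejected before it is applied, so the removal is caught at change time rather than by external monitoring after the outage has started.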
Summary
The network outage on April 21, 2025, was caused by the unintended removal of an IPv4 link-local address required by the Keepalived service during an automated system update. The incident was exacerbated because the redundant router was unavailable due to planned maintenance at the same time. Thanks to a rapid response, the issue was fully resolved within 18 minutes. Specific technical and organizational measures have been decided to prevent future outages.
