On April 27, 2025, our infrastructure experienced a brief outage between 11:12 PM and 11:17 PM. The cause was a memory overload on one of our hosts, which became unresponsive. Our monitoring system triggered an alarm within a minute of the outage and notified the responsible on-call staff member.
Timeline of the Incident:
11:12 PM – The host becomes unresponsive due to a memory overload.
11:13 PM – The monitoring system triggers an alert and notifies the responsible on-call staff member.
11:14 PM – The on-call staff member begins the investigation.
11:16 PM – The affected host is identified as the root cause. A live migration of affected virtual machines is initiated. Non-critical VMs are powered down to release resources.
11:17 PM – The infrastructure is back online, and the migration is ongoing.
11:19 PM – The host is fully back online. A live migration of the remaining VMs is carried out, which may cause brief performance dips in the following minutes.
Cause of the Incident:
The incident was caused by a memory overload on a host system. This resulted in the host being unable to respond to requests, temporarily affecting parts of the infrastructure.
Remedial Actions:
After the monitoring system raised the alert, the overloaded host was identified. Non-critical VMs were promptly shut down to free resources, and the load was redistributed across the remaining systems. Simultaneously, a live migration of the affected virtual machines to other hosts was initiated. This quickly relieved the pressure on the host and restored the infrastructure.
Reflection and Preventive Measures:
To prevent a recurrence of this kind of incident, we are implementing the following measures:
Adjusting Monitoring Thresholds:
We will lower the threshold for RAM consumption in the monitoring system so that an impending overload is detected earlier, allowing a timely response.
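To illustrate the direction (not our exact production setup), here is a minimal sketch of such an early-warning check, assuming a small Python agent using the psutil package; the threshold value and the send_alert hook are illustrative placeholders:

```python
import time

import psutil  # third-party package exposing host resource metrics

# Illustrative values: warn well before the host is actually saturated.
WARN_THRESHOLD_PERCENT = 80.0
CHECK_INTERVAL_SECONDS = 30


def send_alert(message: str) -> None:
    """Placeholder for the real alerting hook (pager, chat, ticket system)."""
    print(f"[ALERT] {message}")


def watch_memory() -> None:
    # Sample host RAM usage periodically and alert once the lowered
    # early-warning threshold is crossed.
    while True:
        used_percent = psutil.virtual_memory().percent
        if used_percent >= WARN_THRESHOLD_PERCENT:
            send_alert(f"RAM usage at {used_percent:.1f}% "
                       f"(threshold {WARN_THRESHOLD_PERCENT}%)")
        time.sleep(CHECK_INTERVAL_SECONDS)


if __name__ == "__main__":
    watch_memory()
```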
Automatic Load Distribution:
Moving forward, we will introduce automatic load distribution for our non-critical VMs so that hosts stop short of critical thresholds in the first place.
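Below is a simplified sketch of the planned behavior, assuming a hypothetical Host/VM inventory model and an illustrative high_water threshold; the actual live migration will go through our hypervisor's API, which is represented here only by reassigning VMs between host lists:

```python
from dataclasses import dataclass, field


@dataclass
class VM:
    name: str
    memory_gb: int
    critical: bool  # critical VMs are never migrated automatically


@dataclass
class Host:
    name: str
    capacity_gb: int
    vms: list[VM] = field(default_factory=list)

    @property
    def used_gb(self) -> int:
        return sum(vm.memory_gb for vm in self.vms)

    @property
    def load(self) -> float:
        return self.used_gb / self.capacity_gb


def rebalance(hosts: list[Host], high_water: float = 0.8) -> None:
    """Move non-critical VMs off hosts that exceed the high-water mark."""
    for host in hosts:
        # Migrate the smallest non-critical VMs first: they transfer fastest,
        # which keeps each individual migration short.
        candidates = sorted(
            (vm for vm in host.vms if not vm.critical),
            key=lambda vm: vm.memory_gb,
        )
        for vm in candidates:
            if host.load < high_water:
                break  # host is back under the threshold
            # Pick the least-loaded host with enough free capacity.
            target = min(
                (h for h in hosts
                 if h is not host and h.used_gb + vm.memory_gb <= h.capacity_gb),
                key=lambda h: h.load,
                default=None,
            )
            if target is None:
                continue  # no headroom anywhere; leave the VM in place
            host.vms.remove(vm)
            target.vms.append(vm)


if __name__ == "__main__":
    # Tiny demo: "batch" is moved off the overloaded host-a onto host-b.
    hosts = [
        Host("host-a", capacity_gb=64,
             vms=[VM("db", 32, critical=True), VM("batch", 24, critical=False)]),
        Host("host-b", capacity_gb=64,
             vms=[VM("web", 8, critical=False)]),
    ]
    rebalance(hosts)
    for h in hosts:
        print(h.name, [vm.name for vm in h.vms], f"{h.load:.0%}")
```

Run periodically, a check like this would have shed non-critical load before the host reached the point of becoming unresponsive.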

