
Post Mortem: Brief Outage on April 27, 2025 Due to Host Overload (Out of Memory)


On April 27, 2025, our infrastructure experienced a brief outage between 11:12 PM and 11:17 PM. The cause was memory exhaustion (out of memory) on one of our hosts, which became unresponsive. Our monitoring system triggered an alarm within a minute of the outage and notified the responsible on-call staff member.

Timeline of the Incident:

  • 11:12 PM – The host becomes unresponsive due to a memory overload.
  • 11:13 PM – The responsible on-call staff member begins the investigation.
  • 11:14 PM – The monitoring system triggers an alert.
  • 11:16 PM – The affected host is identified as the root cause. A live migration of the affected virtual machines is initiated, and non-critical VMs are powered down to free up memory.
  • 11:17 PM – The infrastructure is back online, and the migration is ongoing.
  • 11:19 PM – The host is fully back online. The remaining VMs are live-migrated, which may cause minor performance dips over the following minutes.

Cause of the Incident: The incident was caused by a memory overload on a host system. This resulted in the host being unable to respond to requests, temporarily affecting parts of the infrastructure.

Remedial Actions: After the monitoring system raised the alert, the overloaded host was identified. Non-critical VMs were promptly shut down, and the load was redistributed across the remaining systems. At the same time, a live migration of the affected virtual machines to other hosts was initiated. This quickly relieved the host and restored the infrastructure.
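
The ordering of these actions can be illustrated with a minimal sketch. It is not our production tooling: the `VM` record, the `plan_relief` function, and the 80% target are hypothetical names and values chosen for illustration. It assumes non-critical VMs are shut down first (largest first) and that critical VMs are only live-migrated if that is not enough.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VM:
    # Hypothetical record; field names are assumptions for this sketch.
    name: str
    memory_mb: int
    critical: bool


def plan_relief(vms, host_total_mb, target_fraction=0.8):
    """Return (action, vm_name) pairs that bring projected usage under target.

    Non-critical VMs are shut down first (largest first). Critical VMs are
    marked for live migration only if that is not enough.
    """
    target_mb = host_total_mb * target_fraction
    used_mb = sum(vm.memory_mb for vm in vms)
    actions = []
    # critical=False sorts before True; within each group, largest VM first.
    for vm in sorted(vms, key=lambda v: (v.critical, -v.memory_mb)):
        if used_mb <= target_mb:
            break
        actions.append(("migrate" if vm.critical else "shutdown", vm.name))
        used_mb -= vm.memory_mb
    return actions


if __name__ == "__main__":
    vms = [
        VM("db", 40_000, True),
        VM("batch", 30_000, False),
        VM("ci", 20_000, False),
        VM("web", 30_000, True),
    ]
    for action, name in plan_relief(vms, host_total_mb=100_000):
        print(action, name)
```

Handling the largest non-critical VMs first frees memory with the fewest interventions, while critical workloads keep running unless there is no other option.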

Reflection and Preventive Measures: To prevent a similar incident in the future, we are implementing the following measures:

  1. Adjusting Monitoring Thresholds: We will lower the RAM consumption threshold in the monitoring system so that an impending overload is detected earlier and can be addressed before the host becomes unresponsive.
  2. Automatic Load Distribution: We will introduce automatic load distribution for our non-critical VMs so that hosts stay below critical memory thresholds.
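
As a rough sketch of how the two measures could fit together, the following example classifies a host's RAM usage against a warning and a critical threshold, then plans moves of non-critical VMs away from hosts at or above the warning level. The function names, the data layout, and the 75% and 90% thresholds are assumptions made for illustration; they are not the values or tools used in our environment.

```python
def memory_state(used_mb, total_mb, warn=0.75, crit=0.90):
    """Classify a host's RAM usage as 'ok', 'warning', or 'critical'."""
    fraction = used_mb / total_mb
    if fraction >= crit:
        return "critical"
    if fraction >= warn:
        return "warning"
    return "ok"


def plan_rebalance(hosts, warn=0.75):
    """Plan moves of non-critical VMs away from hosts at or above `warn`.

    `hosts` maps host name -> {"total_mb": int,
                               "vms": {vm_name: (memory_mb, critical)}}.
    Returns (vm_name, source_host, target_host) tuples; input is not modified.
    """
    used = {h: sum(m for m, _ in spec["vms"].values()) for h, spec in hosts.items()}
    moves = []
    for src, spec in hosts.items():
        # Consider the largest non-critical VMs first.
        candidates = sorted(
            ((m, name) for name, (m, critical) in spec["vms"].items() if not critical),
            reverse=True,
        )
        for mem, name in candidates:
            if used[src] / spec["total_mb"] < warn:
                break
            # Only hosts that stay below `warn` after the move are eligible.
            fits = [
                h for h in hosts
                if h != src and (used[h] + mem) / hosts[h]["total_mb"] < warn
            ]
            if not fits:
                continue
            # Prefer the least-loaded eligible host.
            dst = min(fits, key=lambda h: used[h] / hosts[h]["total_mb"])
            used[src] -= mem
            used[dst] += mem
            moves.append((name, src, dst))
    return moves
```

A warning level below the critical level leaves time to act: the rebalancing plan is computed while the host is still responsive, rather than after it has run out of memory.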