Storage network degradation
Incident Report for M.D.G. IT
Postmortem

We understand the impact that outages such as this have on businesses, especially in the online shopping space during the ongoing COVID-19 situation, and would like to reassure clients that uptime is something M.D.G. IT takes extremely seriously.

The catalyst for the series of events that led to the service interruption and outages was power maintenance work by Equinix, the operator of the SY1 datacentre:

The A2 Triplen Transformers have been identified as requiring additional maintenance due to the detection of an end-of-life voltage in-rush time delay relay. Equinix engineers, along with on-site vendors, will perform the additional maintenance in order to minimize associated risk and disruption to the operating environment.

During the scheduled window, we are required to shut down the A2 block primary and redundant service Triplen Transformers in a staged schedule to permit the review and replacement of these in-rush voltage relays, as well as the parallel installation of an additional voltage sensing relay to introduce further redundancy to this component.

Equinix is the world's largest data centre and colocation infrastructure provider, and over more than a decade of hosting M.D.G. IT networks and equipment has provided exceptional uptime on power, HVAC and cross connects. On this occasion, however, the maintenance was scheduled during a period when M.D.G. IT engineers could not be on site due to COVID travel restrictions, and they were similarly unable to attend the data centre prior to the scheduled maintenance to perform the complex replugging required to test for an hours-long interruption to power. M.D.G. IT management urged Equinix to delay the scheduled maintenance so that engineers could be on site, but were informed that the maintenance was critical and could not be postponed. This allowed only for the replugging of critical single-corded equipment by Equinix technicians unfamiliar with the equipment they were maintaining.

Equinix SY1 has 4 x 2.2 MVA diesel generators and 9 x 500 kVA Uninterruptible Power Supplies, supporting a 99.999%+ power availability Service Level Agreement on redundant power and a 99.99%+ SLA on non-redundant power. This translates to permissible outages of roughly 26 seconds and 4.3 minutes per month respectively. On this occasion, power to the primary PDU was interrupted for just under 3 hours.
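For reference, the permissible downtime figures follow directly from the SLA percentages. The short sketch below shows the arithmetic; it assumes a 30-day month, which is our simplification rather than Equinix's contractual definition of the measurement period.

    # Permissible monthly downtime implied by an availability SLA.
    # Assumes a 30-day month; the contractual measurement period may differ.
    SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds

    def permissible_downtime(availability_pct: float) -> float:
        """Allowed downtime in seconds per month for a given availability SLA."""
        return SECONDS_PER_MONTH * (1 - availability_pct / 100)

    print(permissible_downtime(99.999))  # ~25.9 s  (redundant power)
    print(permissible_downtime(99.99))   # ~259.2 s (non-redundant power)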

No M.D.G. IT equipment lost power completely when the primary feed was interrupted. However, CPU speed on multiple 56-core blade servers was throttled, either due to the extended loss of the primary power feed to the blade chassis or as a result of thermal throttling; an investigation is still underway to determine the exact cause of the processor speed reduction. The throttling caused total CPU usage on the dedicated hypervisor servers to increase from an average of 30% to 95%, and guest VM processes began to queue. This, combined with the design of Magento's cron system, produced a large backlog of waiting IO operations. When power was restored and CPU speed scaled back up, that backlog placed extremely high load on the Storage Area Networks and significantly increased iowait across the virtual machine pool, which in turn caused further process queuing, consuming more RAM and eventually swap and placing yet more load on the SANs.
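Purely as an illustration of the kind of check involved in confirming these symptoms (the report does not specify the tooling M.D.G. IT engineers used), a minimal sketch for sampling reported core frequency and iowait on a Linux hypervisor might look like this:

    # Illustrative sketch only: sample reported core frequency and iowait on a
    # Linux host. Not the monitoring tooling actually used during the incident.
    import time

    def avg_cpu_mhz():
        """Average reported core frequency from /proc/cpuinfo, in MHz."""
        mhz = [float(line.split(':')[1]) for line in open('/proc/cpuinfo')
               if line.startswith('cpu MHz')]
        return sum(mhz) / len(mhz)

    def iowait_fraction(interval: float = 1.0) -> float:
        """Fraction of CPU time spent in iowait over a short sampling interval."""
        def snapshot():
            values = [int(v) for v in open('/proc/stat').readline().split()[1:]]
            return sum(values), values[4]  # total jiffies, iowait jiffies
        total1, iowait1 = snapshot()
        time.sleep(interval)
        total2, iowait2 = snapshot()
        return (iowait2 - iowait1) / max(total2 - total1, 1)

    print(f"avg core freq: {avg_cpu_mhz():.0f} MHz, iowait: {iowait_fraction():.1%}")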

In response, M.D.G. IT staff shut down non-essential internal services that add load to the storage network, including in-progress backups, and implemented guest VM IO throttling to limit IO contention at the network and disk level. Virtual machines that crashed under extreme iowait-induced load averages were progressively restarted as the IO queue depth gradually reduced and SAN loading normalised.
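The report does not name the hypervisor platform or management stack in use, so the following is a hedged sketch only: assuming a libvirt/KVM host, per-guest IO throttling of the kind described could be applied to a running guest as shown below. The guest name, disk target and limit values are hypothetical.

    # Hypothetical sketch of per-guest IO throttling on a libvirt/KVM host.
    # The platform, guest name, disk target and limit values are assumptions,
    # not details taken from the incident report.
    import libvirt

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('guest-vm-01')  # hypothetical guest name

    # Cap the guest's virtual disk at 300 IOPS and ~50 MB/s to limit contention
    # on the shared storage network while the IO backlog drains.
    dom.setBlockIoTune('vda', {
        'total_iops_sec': 300,
        'total_bytes_sec': 50 * 1024 * 1024,
    }, libvirt.VIR_DOMAIN_AFFECT_LIVE)

    conn.close()

Limits of this kind would typically be relaxed once queue depth and SAN load return to normal.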

It is worth noting that this is the first power interruption in this data centre in over ten years. As described above, a thorough investigation into the processor throttling is being conducted, ahead of testing of the impact of power failover on processor speed. In addition, storage rebalancing has commenced to spread the IO of servers in individual cabinets across a larger number of SANs and avoid storage bottlenecks.

Posted Jul 29, 2021 - 20:32 AEST

Resolved
This incident has been resolved.
Posted Jul 28, 2021 - 15:29 AEST
Update
We are continuing to monitor for any further issues.
Posted Jul 28, 2021 - 11:05 AEST
Update
Some VMs are still experiencing higher than normal storage latency as the backlog of storage traffic is processed.
Posted Jul 28, 2021 - 11:04 AEST
Update
We are continuing to monitor for any further issues.
Posted Jul 28, 2021 - 09:47 AEST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jul 28, 2021 - 09:47 AEST
Investigating
M.D.G. IT engineers are investigating an issue where customers may experience increased storage area network latency on some virtual machines.
Posted Jul 28, 2021 - 08:56 AEST
This incident affected: VPS Hosting, Sydney.