Network issues identified in Equinix SY1—switch degradation
Incident Report for M.D.G. IT
Postmortem

INCIDENT REPORT 20220808: SWITCH FAILURE, SWITCHING FABRIC RECOVERY FAULTS, BORDER ROUTER SESSION FAULTS

AFFECTED SYSTEMS: Virtual machines in Equinix SY1; M.D.G. IT ticket support system

DESCRIPTION: At approximately 10:15 am AEST on Monday 8 August, a core network switch carrying IP and iSCSI storage traffic failed, and traffic automatically failed over to the redundant switch in the pair. The storage network interruption during failover caused some virtual machines to pause.

While internal switching failed over instantly, unexpected behaviour was observed on the internal switching fabric and on external router sessions. This affected a subset of services connected to those devices, and the impact appeared to be localised to specific ISPs.

A secondary effect of the internal switching change was that several bare-metal hypervisor servers began flapping in availability to the virtualisation pool. Although they remained up and able to run virtual machines, the virtualisation manager intermittently marked them up and down every few minutes. This flapping slowed stop / start operations on other machines to the point of timeout. Even though these servers were connected to the redundant switch and passing traffic to virtual machines without packet loss, the flapping required them to be removed from the pool. Their removal took 96 CPU threads and 400 GB of memory out of Apollo-KVM, causing memory contention until virtual machines were migrated or rebalanced across the remaining hosts in the pool. This contention caused further issues on a number of servers overnight Monday into Tuesday, as backups and swap usage led to high iowait.
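
For illustration, the sketch below shows the kind of capacity check that can flag this sort of contention after hosts leave a pool. It assumes libvirt-managed KVM hosts queried over SSH, which is an assumption made for illustration only; the host URIs and the 90% threshold are placeholders, not details of the Apollo-KVM pool.

    # Illustrative sketch only: compare memory committed to running guests
    # against total host memory across a KVM pool, to flag likely contention.
    # Host URIs and the threshold are placeholders, not production values.
    import libvirt

    HOSTS = [
        "qemu+ssh://hv01.example.net/system",
        "qemu+ssh://hv02.example.net/system",
    ]

    total_host_mib = 0
    committed_mib = 0
    for uri in HOSTS:
        conn = libvirt.openReadOnly(uri)
        total_host_mib += conn.getInfo()[1]  # index 1 = host memory in MiB
        for dom in conn.listAllDomains(libvirt.VIR_CONNECT_LIST_DOMAINS_ACTIVE):
            committed_mib += dom.maxMemory() // 1024  # maxMemory() returns KiB
        conn.close()

    ratio = committed_mib / total_host_mib
    print(f"Committed {committed_mib} MiB of {total_host_mib} MiB ({ratio:.0%})")
    if ratio > 0.9:  # placeholder threshold
        print("WARNING: pool memory is heavily committed; contention likely")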

At approximately noon AEST on Thursday 11 August, the routing cluster automatically reassigned its master node after an interface alerting threshold was reached on the existing master. This is an automated live fail-over process that typically takes several seconds; in this case, however, the BGP sessions were lost and network sessions were dropped until BGP re-converged.
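
For reference, one way such re-convergence can be monitored is sketched below. It assumes an FRR-based routing stack with vtysh available, which is an assumption made for illustration and not a statement about the production routers; JSON field names vary between versions.

    # Illustrative sketch only: poll BGP neighbour state until all sessions
    # report Established. Assumes FRR's vtysh; field names vary by version.
    import json
    import subprocess
    import time

    def bgp_peers():
        out = subprocess.run(
            ["vtysh", "-c", "show ip bgp summary json"],
            capture_output=True, text=True, check=True,
        ).stdout
        return json.loads(out).get("ipv4Unicast", {}).get("peers", {})

    while True:
        peers = bgp_peers()
        down = [ip for ip, p in peers.items() if p.get("state") != "Established"]
        if peers and not down:
            print("All BGP sessions have re-converged")
            break
        print(f"Waiting on {len(down)} of {len(peers)} sessions: {down}")
        time.sleep(10)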

CAUSES: Switch failure; memory contention on the KVM pool after server removal; BGP sessions dropped during automatic router reconfiguration.

RECTIFICATION: NOC staff immediately started recovering paused servers by powering down and rebooting the affected machines.
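
A minimal sketch of that recovery step is below, assuming libvirt-managed, persistent guests; the connection URI is a placeholder, and the forced power-cycle mirrors the power down and reboot described above.

    # Illustrative sketch only: find guests paused by the storage interruption
    # and power-cycle them. Assumes persistent, libvirt-managed domains.
    import libvirt

    conn = libvirt.open("qemu:///system")  # placeholder connection URI
    for dom in conn.listAllDomains(libvirt.VIR_CONNECT_LIST_DOMAINS_ACTIVE):
        state, _reason = dom.state()
        if state == libvirt.VIR_DOMAIN_PAUSED:
            print(f"Power-cycling paused guest {dom.name()}")
            dom.destroy()  # forced power-off
            dom.create()   # boot the (persistent) domain again
    conn.close()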

Network engineers localised the primary networking issue to border router session initiation. When the session faults could not be resolved on the running cluster, the BGP routing cluster was rebooted.

After the affected servers were removed, the KVM pools were rebalanced to restore resource availability.

FUTURE MITIGATION:

Affected portions of the switching fabric are being replaced with new hardware and tested for failover. In addition, the network architecture is being updated and tested to remove possible interdependencies between network layers. Critical works are expected to be completed by 19 August.

Posted Aug 12, 2022 - 07:08 AEST

Resolved
All services are now operational. Please contact support if you're still seeing ongoing issues on any virtual machines.

An incident report will be made available at https://status.mdg-it.com.au/incidents/2tcw034x498l once a full post mortem has been completed.
Posted Aug 08, 2022 - 17:24 AEST
Monitoring
A fix has been implemented, and services should be restored within 30 minutes.
Posted Aug 08, 2022 - 15:32 AEST
Update
We are continuing to work on a fix for this issue.
Posted Aug 08, 2022 - 13:42 AEST
Identified
On-site network engineers have localised the issue, and an expected time to fix will be posted as soon as it is available.
Posted Aug 08, 2022 - 13:42 AEST
Investigating
We are currently investigating this issue.
Posted Aug 08, 2022 - 10:32 AEST
This incident affected: VPS Hosting, Sydney.