Resolved | Feb 22, 2022 | 13:03 GMT+01:00
The issue related to message delivery is resolved for all clients.
Monitoring | Feb 22, 2022 | 13:00 GMT+01:00
On 19 February 2022 from 11:15 to 16:45 UTC, and again on 21 February 2022 from 08:15 to 12:45 UTC, message delivery was disrupted for a limited set of WhatsApp numbers hosted by Turn. During these periods, affected numbers were unresponsive or slow to receive and process messages.
What caused it?
We will use a simple comparison to explain the incident for our non-technical users, and then elaborate with all the technical details.
Imagine a bus whose passengers are the WhatsApp numbers hosted on Turn, each travelling with a lot of luggage. The luggage represents the work each service has to do at startup before it can send and receive messages. The bus broke down and a replacement bus was called in immediately, but the passengers had so much luggage that transferring them took some time. Service was interrupted for as long as the transfer took.
Now let's explain it in detail.
Turn hosts WhatsApp numbers on a collection of clustered virtual machines. These virtual machines need to be replaced as part of general maintenance or upgraded from time to time. The process of replacing virtual machines involves creating new virtual machines followed by moving workloads from the existing machines to the new machines. Moving workloads involves first stopping a workload on an existing virtual machine, then starting it again on a new machine.
In the general case this process of moving a workload from one virtual machine to another should not result in noticeable service disruption as stopping and starting of any given workload should complete within 30 seconds. However, during startup some but not all workloads can cause a spike in CPU usage. This can lead to CPU contention and a delay in process readiness if a large number of workloads exhibiting this behavior are started at the same time.
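To make the effect of simultaneous startups concrete, here is a rough back-of-the-envelope model. It is only a sketch under stated assumptions, not a description of Turn's actual systems: the function name, the fair-sharing assumption, the core count and the 20 CPU-seconds of startup work per workload are all hypothetical figures chosen for illustration.

```python
def startup_seconds(cpu_seconds: float, simultaneous: int, cores: int) -> float:
    """Estimate time until a workload is ready when `simultaneous` workloads
    start together on a machine with `cores` CPU cores.

    Assumption (hypothetical): each startup is single-threaded and the
    machine shares CPU time fairly between all starting workloads.
    """
    if simultaneous <= 0 or cores <= 0:
        raise ValueError("simultaneous and cores must be positive")
    # With more starting workloads than cores, each one is slowed down
    # by the ratio of workloads to cores; otherwise there is no slowdown.
    slowdown = max(1.0, simultaneous / cores)
    return cpu_seconds * slowdown


# Illustrative numbers only: 20 CPU-seconds of startup work, 8 cores.
for n in (4, 8, 40, 80):
    print(f"{n} simultaneous startups: ~{startup_seconds(20.0, n, cores=8):.0f}s each")
```

In this simplified model, startup stays within the 30-second target as long as the number of simultaneous startups does not exceed the core count, and grows linearly beyond that point.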
The incident on 19 February was triggered by an automated process replacing a number of virtual machines in Turn's hosting environment. In this case, however, workloads were unevenly distributed: the majority of the workloads on the virtual machines being replaced were ones that exhibit a large spike in CPU usage during startup. As the virtual machines were replaced, a large number of these expensive processes restarted simultaneously. The resulting CPU contention significantly delayed workloads becoming ready to process messages.
The incident on 21 February was similarly caused by CPU contention, but in this case it was triggered by the rollout of configuration changes to the Turn hosting environment meant to address some of the problems encountered on the 19th. Specifically, workloads were redistributed more evenly across all virtual machines in the hosting environment. However, even with more balanced workloads, CPU contention was still observed on some virtual machines during process startup, again resulting in significant delays in workloads being ready to process messages.
What are we doing about it?
To address this incident, the following fixes have been released:
- Paused the automatic GCP upgrade schedule until we reduce the process startup times. Automatic GCP maintenance can still happen and cause disruption, so our key focus is on reducing process startup times.
- Rebalanced workloads across virtual machines to reduce the potential for CPU contention during workload startup.
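As an illustration of what rebalancing by startup cost can look like, the sketch below places the most expensive workloads first, each on the virtual machine with the lowest total startup cost so far. This is a generic greedy heuristic and not necessarily how Turn's environment schedules workloads; the function name, workload names, machine names and cost figures are hypothetical.

```python
import heapq
from typing import Dict, List, Sequence


def spread_by_startup_cost(
    startup_cost: Dict[str, float], vm_names: Sequence[str]
) -> Dict[str, List[str]]:
    """Assign workloads to VMs so that startup-heavy workloads are spread out.

    Greedy heuristic: take workloads from most to least expensive and put
    each one on the VM whose accumulated startup cost is currently lowest.
    """
    if not vm_names:
        raise ValueError("at least one VM is required")
    # Min-heap of (accumulated startup cost, VM name).
    heap = [(0.0, vm) for vm in vm_names]
    heapq.heapify(heap)
    placement: Dict[str, List[str]] = {vm: [] for vm in vm_names}
    # Sort by descending cost; break ties by name for a stable result.
    for name, cost in sorted(startup_cost.items(), key=lambda kv: (-kv[1], kv[0])):
        total, vm = heapq.heappop(heap)
        placement[vm].append(name)
        heapq.heappush(heap, (total + cost, vm))
    return placement


# Hypothetical example: two heavy and two light workloads on two VMs.
print(spread_by_startup_cost({"a": 10, "b": 10, "c": 1, "d": 1}, ["vm1", "vm2"]))
```

With these inputs the two heavy workloads end up on different machines, so a restart of either machine only has to absorb one expensive startup.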
Work is still ongoing on the following fixes:
- Optimize workloads to eliminate CPU usage spikes during startup and thus the potential for CPU contention.
- Reconfigure the Turn hosting environment to be more conservative about allowing workload disruption. When virtual machines are replaced, only a small number of workloads will be moved at a time, so that even if CPU spikes occur during startup they will not lead to CPU contention.
- Clearly schedule maintenance windows on Turn’s status page to surface potential for disruption.
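The idea of moving only a small number of workloads at a time can be sketched as follows. This is a minimal illustration, not Turn's actual tooling: `plan_batches`, `move_workloads`, and the `restart` and `wait_until_ready` callbacks are hypothetical names standing in for whatever the hosting environment provides.

```python
from typing import Callable, Iterator, List, Sequence


def plan_batches(workloads: Sequence[str], max_at_once: int) -> Iterator[List[str]]:
    """Split workloads into consecutive batches of at most `max_at_once`."""
    if max_at_once < 1:
        raise ValueError("max_at_once must be at least 1")
    for i in range(0, len(workloads), max_at_once):
        yield list(workloads[i:i + max_at_once])


def move_workloads(
    workloads: Sequence[str],
    max_at_once: int,
    restart: Callable[[str], None],
    wait_until_ready: Callable[[str], None],
) -> None:
    """Move workloads batch by batch.

    The next batch is only started once every workload in the current
    batch is ready, which caps how many startups compete for CPU.
    """
    for batch in plan_batches(workloads, max_at_once):
        for workload in batch:
            restart(workload)  # hypothetical: stop on old VM, start on new VM
        for workload in batch:
            wait_until_ready(workload)  # hypothetical: block until it processes messages


# Usage with stand-in callbacks that only print what would happen.
move_workloads(
    ["number-1", "number-2", "number-3"],
    max_at_once=2,
    restart=lambda w: print("restarting", w),
    wait_until_ready=lambda w: print("ready", w),
)
```

A smaller `max_at_once` makes a full replacement take longer overall, but limits how many numbers can be affected at any one moment.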