Yesterday, 19th January 2021, at 10:30, we declared an emergency and alerted you to an issue on our network, which later caused a complete loss of service for a number of clients connected to our hosted phone system.
At around 10:25, our automated network checks alerted us to a fluctuating link between the data centres that house our voice and database servers. We took the necessary steps to reroute traffic over other links to avoid any disturbance to ongoing calls, and confirmed the automated actions our platform takes in scenarios like this. To begin with, this was successful and had no impact on service. Between 10:30 and 10:33, the connections from our voice servers to our database servers that had been left pending by the fluctuating link tried to re-establish in order to correct the state of the affected calls and finish them correctly. The resulting load and sheer number of connections overloaded our database servers, preventing any further calls.
To reduce the connections to the database, all secondary services such as the hosted web portal, dialler and BLF were disabled.
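To illustrate why a surge of simultaneous reconnections can overload a database, and one common way such surges are smoothed out, here is a minimal sketch in Python. It is not our platform's code and not the fix we will implement, as our investigation is still ongoing. The client count, the delay settings and the function names are all hypothetical, and the technique shown (exponential backoff with random jitter) is a general industry practice rather than a description of our system.

```python
import random


def reconnect_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Seconds to wait before reconnect attempt `attempt` (0-based).

    Exponential backoff with "full jitter": the ceiling doubles with each
    attempt up to `cap`, and the actual delay is drawn uniformly from
    [0, ceiling]. All values here are hypothetical.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def peak_per_second(delays):
    """Largest number of reconnects that land in any one-second window."""
    buckets = {}
    for delay in delays:
        second = int(delay)
        buckets[second] = buckets.get(second, 0) + 1
    return max(buckets.values())


rng = random.Random(2021)   # fixed seed so the sketch is repeatable
clients = 1000              # hypothetical number of pending connections

# Every pending connection retries the moment the link returns.
immediate = [0.0] * clients

# Each connection waits a random delay of up to 8 seconds (attempt 4).
jittered = [reconnect_delay(attempt=4, rng=rng) for _ in range(clients)]

print("peak reconnects per second, immediate:", peak_per_second(immediate))
print("peak reconnects per second, jittered: ", peak_per_second(jittered))
```

With immediate retries, all 1000 hypothetical connections arrive in the same second. With jitter, they are spread across roughly eight seconds, so the database sees a much lower peak.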
We then proceeded to investigate and mitigate the impact this outage had on our customers, and were able to restore service at around 11:10. During the 40 minutes of disruption, a few thousand outbound calls were still made successfully, but full service could not be resumed until connections to our database servers had been reduced and the voice servers cleared.
During this incident, inbound calls were diverted to our disaster recovery platform, which was purpose-built for a scenario like this. This was only the second time it has had to be enabled since it entered service, and it handled an amazing 85% of our inbound call volume, diverting those calls to backup destinations set up on our platform in advance.
We have already begun internal investigations into why our database servers were overloaded by these connections, and we will implement solutions to prevent this from happening in future.
We know how important our service reliability is to you and therefore strive to reach 100% uptime.
We are truly sorry that this issue affected your service, and we will take any necessary steps to prevent it from happening again.