On 14 and 23 December 2011, two incidents disrupted our services for a majority of our customers.
On 14 December at about 18:30, we detected roughly 3% packet loss on some of our servers (http4 and http6). This packet loss had immediate repercussions for the accounts concerned: communication between these servers and our other internal servers (SQL, SSH, FTP) slowed significantly. In concrete terms, the load time of a web page making many database queries rose from 0.5 seconds to more than 5 seconds.
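To see how a few percent of packet loss can turn into seconds of delay, here is a rough back-of-the-envelope sketch. It rests on assumptions that are not taken from our measurements: the page issues its database queries one after another, each query needs one request packet and one response packet, and each lost packet costs a single retransmission timeout of 200 ms (repeated losses and backoff are ignored). The function name and all parameters are hypothetical.

```python
def expected_page_time(base_s, queries, loss_rate, rto_s=0.2, packets_per_query=2):
    """Rough expected page time when each lost packet costs one retransmission timeout.

    All defaults are illustrative assumptions, not measured values.
    """
    # First-order estimate of the number of lost packets over the whole page;
    # losses of the retransmitted packets themselves are ignored.
    expected_losses = queries * packets_per_query * loss_rate
    return base_s + expected_losses * rto_s


if __name__ == "__main__":
    # Compare a few page sizes at 3% packet loss, starting from a 0.5 s baseline.
    for n in (50, 200, 400):
        print(n, "queries:", round(expected_page_time(0.5, n, 0.03), 2), "s")
```

Under these assumptions, a page making around 400 sequential queries would take a little over 5 seconds, while a page making only a handful of queries would barely be affected.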
We immediately reported the packet loss to our provider so that it could fix it. Less than 30 minutes later, the problem was gone.
The packet loss reappeared later in the evening, then disappeared again. The following day, it came back at random intervals. Given our provider's difficulty in isolating and resolving the source of the problem (its sporadic nature did not help), we decided to deploy, as an emergency measure, a feature that the provider offers: the VLAN. It lets our servers communicate with each other over a dedicated route, isolated from the public network.
After quickly running several tests and assuring ourselves that traffic going through the VLAN no longer suffered from the packet loss, we began deploying it on the impacted servers. This deployment required a kernel update, and therefore a restart of several servers. By the end of the night, the problem was resolved, except for FTP, which was still partially slow.
Note that we had been planning to adopt the VLAN, a feature our provider launched several months ago, in the first half of 2012. Why not earlier? Because deploying it correctly takes time; the current deployment remains relatively shaky and temporary. In addition, this feature is not free from problems, and we prefer not to be among the first to run into them.
The underlying problem, the random packet loss, persisted, although it no longer impacted us. On 23 December at 22:05, access to our two servers http4 and http6 became severely disrupted: more than 50% packet loss. This time the problem affected not only internal traffic but also external traffic (from the Internet to our servers). As a result, access to all sites became extremely difficult (none of the other services were impacted).
We immediately reported the problem to our provider and, in parallel, decided to work around it by redirecting HTTP traffic through other, unaffected servers, which acted as proxies to the disrupted servers (communicating with them via the VLAN). By midnight (we had to wait for DNS propagation), the problem was for the most part resolved, at least in theory. By 1:30, our provider had identified and solved the problem. We were then able to switch traffic back to the original HTTP servers.
Several conclusions can be drawn from these disturbances:
- Our provider's handling of this failure was inadequate. They will hear from us, and we will make sure that intermittent problems are dealt with more effectively. In addition, we will certainly be more insistent if this happens again.
- Our workarounds, especially redirecting HTTP traffic to unaffected servers, were satisfactory. However, we noted some points that can be improved and that would let us respond more quickly if the situation recurs.
- Our monitoring was inadequate when it came to packet loss. This is not a surprise; the redesign of our monitoring is scheduled for the second quarter of 2012.
- The implementation of the VLAN should allow us to improve the stability of our services. Failures 10, 11 and 15 would certainly have been avoided, for example.
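A packet loss check of the kind mentioned above could be sketched as follows. This is only an illustration, not our actual monitoring: it assumes the Linux iputils `ping` command and its usual summary line, and the function names, the 1% threshold and the probe count are all hypothetical.

```python
import re
import subprocess

# Matches the loss figure in the summary line printed by the Linux iputils ping,
# e.g. "10 packets transmitted, 9 received, 10% packet loss, time 9012ms".
_LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def parse_packet_loss(ping_output):
    """Return the packet loss percentage found in ping output, or None if absent."""
    match = _LOSS_RE.search(ping_output)
    return float(match.group(1)) if match else None


def should_alert(ping_output, threshold_percent=1.0):
    """True when loss reaches the threshold, or when no summary could be parsed."""
    loss = parse_packet_loss(ping_output)
    return loss is None or loss >= threshold_percent


def probe(host, count=20):
    """Ping a host (assumes iputils ping: -c sets the count, -q prints only the summary)."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-q", host],
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Calling `should_alert(probe("http4"))` at regular intervals from each internal server would flag both partial loss and a complete failure to get a summary; the host name here is only an example.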
We apologize to all affected customers, especially given the Christmas season. See you soon for better news :)