On December 14th and 23rd, 2011, two incidents disrupted our services for a majority of our clients.

What happened

December 14th

On December 14th at about 18:30, we detected packet loss of about 3% between some of our servers (http4 and http6). This packet loss had immediate repercussions for the accounts concerned: communication between these servers and our other internal servers (SQL, SSH, FTP) slowed significantly. In concrete terms, the display time of a Web page making many database accesses increased from 0.5 to more than 5 seconds.
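
As a rough illustration of why a few percent of packet loss can multiply page display time, here is a minimal sketch. It assumes that a page issues its database queries one after another, that each query is a single request/response exchange, and that a lost packet costs one fixed retransmission delay. The query count, per-query time, and retransmission delay are assumptions chosen for illustration, not measurements from the incident.

```python
def expected_page_time(queries, round_trip_s, loss_rate, retransmit_delay_s):
    """Expected page time when queries run sequentially.

    Each query is modelled as one request packet and one response packet.
    If either packet is lost, the query is delayed once by retransmit_delay_s
    (repeated losses are ignored to keep the model simple).
    """
    p_exchange_hit = 1 - (1 - loss_rate) ** 2
    per_query = round_trip_s + p_exchange_hit * retransmit_delay_s
    return queries * per_query


# Assumed figures: 100 sequential queries at 5 ms each (0.5 s in total),
# 3% packet loss, and a 1 s retransmission delay.
healthy = expected_page_time(100, 0.005, 0.0, 1.0)
degraded = expected_page_time(100, 0.005, 0.03, 1.0)
print(f"healthy: {healthy:.2f} s, degraded: {degraded:.2f} s")
```

Under these assumptions the page goes from 0.5 seconds to a little over 6 seconds, which is the same order of magnitude as the slowdown described above. The real cause of the delay may of course have differed.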

We immediately reported the packet loss to our provider so that it could fix it. Less than 30 minutes later, the problem was gone.

The packet loss reappeared later in the evening, then disappeared again. The following day, the problem kept reappearing at random. Given our provider's difficulty in isolating and resolving the source of the problem (its sporadic nature did not help), we decided to urgently deploy a feature that it offers: the VLAN. This lets our servers communicate with each other over a dedicated route, isolated from the public network.

Having quickly run several tests and assured ourselves that going through the VLAN resolved the initial packet loss problem, we started deploying it on our impacted servers. This deployment required a kernel update, and therefore a restart of several servers. By the end of the night, the problem was resolved, except for FTP, which was still partially slow.

Note that we had been planning to start using the VLAN, a feature our provider launched several months ago, in the first half of 2012. Why not beforehand? Because deploying it correctly takes time; the current deployment remains relatively shaky and temporary. In addition, this feature is not free from problems, and we would rather not be among the first to run into them.

December 23rd

The initial problem, the random packet loss, persisted, although it no longer impacted us. On December 23rd at 22:05, access to our two servers http4 and http6 became severely disrupted, with more than 50% packet loss. This time the problem affected not only internal traffic but also external traffic (from the Internet to our servers). As a result, access to all sites became extremely difficult (no other services were impacted).

We immediately reported the problem to our provider and, in parallel, decided to work around it by redirecting HTTP traffic through other, unaffected servers, which acted as proxies to the disrupted servers (communicating with them via the VLAN). By midnight (waiting for DNS propagation took time), the problem was for the most part resolved, at least in theory. By 1:30, our provider had identified and solved the problem. We were then able to switch traffic back to the original HTTP servers.
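
The workaround described above, an unaffected server relaying HTTP requests to a disrupted server over the private network, can be sketched as follows. This is only an illustration using Python's standard library, not the setup that was actually deployed: the site names, the private addresses, and the `backend_url` and `run` helpers are all hypothetical, and only GET requests are handled.

```python
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical mapping: public site name -> private VLAN address of its server.
VLAN_BACKENDS = {
    "site-a.example": "10.0.0.4",  # assumed private address, for illustration
    "site-b.example": "10.0.0.6",  # assumed private address, for illustration
}


def backend_url(host_header, path, backends=VLAN_BACKENDS):
    """Return the private URL to fetch for a public request, or None if unknown."""
    hostname = host_header.split(":")[0].lower()
    address = backends.get(hostname)
    if address is None:
        return None
    return f"http://{address}{path}"


class ProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        host = self.headers.get("Host", "")
        url = backend_url(host, self.path)
        if url is None:
            self.send_error(502, "No backend for this host")
            return
        # Keep the original Host header so the backend serves the right site.
        request = urllib.request.Request(url, headers={"Host": host})
        try:
            with urllib.request.urlopen(request, timeout=10) as upstream:
                body = upstream.read()
                content_type = upstream.headers.get(
                    "Content-Type", "application/octet-stream"
                )
                self.send_response(upstream.status)
                self.send_header("Content-Type", content_type)
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)
        except urllib.error.HTTPError as error:
            # The backend answered with an error status; relay the status code.
            self.send_error(error.code)
        except OSError:
            # Connection failures and timeouts over the private network.
            self.send_error(502, "Backend unreachable")


def run(port=8080):
    """Start the relay; the port is an arbitrary choice for experimentation."""
    ThreadingHTTPServer(("", port), ProxyHandler).serve_forever()
```

Calling `run()` starts the relay. In such a setup, the public DNS records of the affected sites would then need to point at the relaying server, which is why DNS propagation delays the effect of this kind of workaround.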

Conclusions

Several conclusions can be drawn from these disruptions:

  • Our provider's handling of this failure was insufficient. They will hear from us, and we will make sure that intermittent problems are dealt with more effectively. We will also be more insistent if this happens again.
  • Our workarounds, especially redirecting HTTP to unaffected servers, were satisfactory. However, we noted some points that can be improved and that would let us respond more quickly if the situation were to recur.
  • Our monitoring was insufficient for detecting packet loss. This is no surprise; the redesign of our monitoring is scheduled for the second quarter of 2012.
  • The implementation of the VLAN should allow us to improve the stability of our services. Failures 10, 11 and 15 would certainly have been avoided, for example.
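
On the monitoring point, a packet loss check can be quite small. The sketch below assumes the summary line format commonly printed by the Linux `ping` utility and a 1% alert threshold. Both are assumptions for illustration, as are the function names and the sample line, and none of this reflects the planned monitoring redesign.

```python
import re

# Matches the loss figure in a summary line such as
# "100 packets transmitted, 97 received, 3% packet loss, time 99123ms".
LOSS_PATTERN = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def parse_packet_loss(ping_output):
    """Return the packet loss percentage from ping's summary, or None if absent."""
    match = LOSS_PATTERN.search(ping_output)
    return float(match.group(1)) if match else None


def should_alert(loss_percent, threshold_percent=1.0):
    """Alert when loss is unknown or at or above the threshold."""
    return loss_percent is None or loss_percent >= threshold_percent


# Illustrative sample line, not taken from the incident.
sample = "100 packets transmitted, 97 received, 3% packet loss, time 99123ms"
loss = parse_packet_loss(sample)
print(loss, should_alert(loss))
```

Treating an unparseable result as an alert is a deliberate choice here: during a severe disruption the probe itself may fail, and silence should not be read as good health.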

We offer our apologies to all impacted customers, especially during this Christmas season. See you soon with better news 🙂