
On Sunday 21st October, the http4 server was unavailable between 01.02 and 10.17 (UTC+2). This article looks back at this abnormally long outage: what happened in detail, what we are going to do to ensure that it doesn’t happen again, and a clarification on the issue.

What happened

On Saturday, at 16.51, the http4 server completely froze. We restarted it, then monitored the situation, but the logs gave no indication of the cause. A freeze can be caused by a problem with either hardware or software; at that moment it was impossible to be sure which.

At 22.01, the problem recurred. This time the logs confirmed that the cause was hardware-related (a dying motherboard). We therefore arranged with our provider to replace the board at 01.00.

At 01.02, the serv­er was tak­en out of ser­vice to replace the moth­er­board. This type of oper­a­tion would nor­mal­ly take less than 30 minutes.

After the motherboard was changed, the server wouldn’t start. All of the RAM was replaced, which solved this problem. A second anomaly then appeared: the network card (Intel 10Gb) was displaying in a loop:

The rest of the system was functioning well and the kernel was behaving normally. The technician therefore tried to identify the cause of the problem, notably by:

  • check­ing the cables
  • check­ing the switch configuration
  • chang­ing the net­work card for an iden­ti­cal model
  • rein­stalling the old motherboard
  • rein­stalling the new motherboard
  • updat­ing the BIOS of the motherboard
  • updat­ing the firmware of the net­work card

None of these steps succeeded. The technician then decided to do an emergency set-up of a new server, with an identical configuration, and to move our server’s hard drives into it. This did not resolve the problem.

We must add one important detail: only our own Linux kernel (the one which we compile and maintain, with our own configuration) showed signs of the bug. Our provider’s standard kernel was working correctly, which seemed to rule out a hardware or cabling problem. Nevertheless, our kernel had been working perfectly well on this server up until then, which makes the issue even more mysterious.

Finally, a little before 10.00, a senior technician was able to obtain a slightly different model of network card (still an Intel 10Gb), and this resolved the problem.

Reducing risks in the future

The original problem, the network card suddenly refusing to work, remains unexplained. It could have been a particularly unpredictable bug in the kernel driver, for example. It is impossible to protect against this type of anomaly, which is fortunately exceptionally rare.

It is nevertheless the first time in 6 years that we have faced such a long server downtime. We can draw several conclusions from it.

Favour the most widely used servers

Our provider (OVH) offers several ranges of servers. The http4 server, dating from the beginning of 2010, is part of the top range, which has more protection than the more traditional servers (for example, dual power supplies and 10 Gb/s network connections).

From expe­ri­ence (we have a wide range of servers), we can state that:

  • the the­o­ret­i­cal­ly supe­ri­or reli­a­bil­i­ty of the high-end servers is not sig­nif­i­cant­ly dif­fer­ent from that of oth­er ranges
  • the stock of high-end servers of our provider is much more lim­it­ed than that of oth­er servers. Consequently, and accord­ing to our sub­jec­tive and lim­it­ed experience: 
    • these servers seem to be more sus­cep­ti­ble to cer­tain types of “rare” bugs (hard­ware, routers): the more there are users of equip­ment X, the more quick­ly the poten­tial bugs will be detect­ed and corrected
    • the level 1 technicians, being less frequently exposed to these servers, have less experience with them overall, and that experience is invaluable when a problem is urgent
    • the chances of having a replacement server, ready to use, are much lower

To summarise, we can state that the reliability of high-end servers is, paradoxically, lower in practice than that of other ranges. We are therefore progressively phasing out the use of these servers (in shared hosting, this concerns http4 and http6).

If http4 had been on a standard server, it is possible that the mysterious bug on the network card would have been discovered and resolved before it had any impact on us. In addition, spare servers would probably have been available more quickly.

Ensure that our servers can start up with our provider’s kernel

Our provider allows us to use their own precompiled Linux kernels. We don’t use them for several reasons, notably the impossibility of choosing the version, applying patches or modifying the configuration.

Although there is no question of running our provider’s standard kernels in normal operation (because of the limitations mentioned), it should at least be possible to boot our servers with them in an emergency, if only to help with debugging.

In prac­ti­cal terms, the changes to be made to the con­fig­u­ra­tion of our servers are minimal.

If we had been able to start up http4 on our provider’s kernel, the server would have been accessible again a lot more quickly, even if that meant disabling some non-essential functions.

Allow rapid access to data in rescue mode

In case of problems, it is possible to start up the servers in rescue mode (boot via TFTP/NFS). The hard disks are not used at all by the system, which theoretically guarantees that rescue mode is accessible even in the case of a configuration problem.

From now on, we are going to keep our own spare server. In the event of a serious hardware or software problem which cannot be resolved quickly, we will have the option of starting the affected server in rescue mode, then exporting its data to the spare server. The latter will then take over from the broken server, which can be debugged calmly afterwards. We have simulated this switch over the last few days, with success.

Level 2 technicians, 24/7

At the present time, level 2 technicians are not systematically present every night. Specifically, the issue happened on a Sunday, when the level 2 technicians did not arrive before 10.00. Our provider told us that they are working on this point and will recruit extensively to reach the target of having level 2 technicians available 24/7.

What about redundancy?

Up until recently, our servers were replicated in Ireland using DRBD. This real-time redundancy, in place since the beginning of alwaysdata, aimed to overcome serious outages (hardware, network) by switching to the secondary datacenter. Why, therefore, did we not switch?

We’ve had this redun­dan­cy for sev­er­al years, but we pro­gres­sive­ly start­ed to dis­able it on some servers, for the fol­low­ing reasons:

  • stability was insufficient. DRBD was too often the source of freezes and crashes on the primary server, in spite of frequent updates. A look at the changelog is enough to see that each new version fixes fairly serious bugs. We use DRBD in a WAN mode, which isn’t the most common setup, and this could explain the relative instability
  • in 6 years, we had never had a downtime exceptional enough to justify switching to the emergency datacenter (the http4 outage is the very first). This is good news: it shows that the stability of our main provider is very good
  • redun­dan­cy has a cost: 
    • (quite low) on the IO per­for­mance of servers
    • (not insignif­i­cant) on the com­plex­i­ty of our architecture
    • (low) finan­cial­ly
  • the less often we actually switch to the secondary datacenter, the greater the risk that something goes wrong on the day we have to do it, in spite of simulations

In conclusion: redundancy has never been of use to us, but has caused downtime on several occasions. It’s undeniably a paradox: redundancy has lowered our average uptime.

We can, however, ask ourselves why redundancy was not activated during shorter downtimes (>= 30 minutes):

  • these shorter hardware or network downtimes (the only ones likely to be eliminated by switching to the secondary datacenter) are, first of all, very rare. Of the 64 breakdowns recorded since we began our status page, only 3 lasted more than 30 minutes and could have been eligible for switching to the secondary datacenter
  • switching isn’t instantaneous: the secondary server has to take over and, above all, the DNS has to be updated. In practice, with HTTP, nearly 30 minutes are needed for the majority of new connections to be established on the new IP
  • as a consequence of the previous point, switching can be counter-productive. If we decide to switch after 30 minutes of downtime, but the issue is resolved 5 minutes later, we would have to switch back to the primary datacenter. In the end, it would have been better to do nothing. Moreover, the majority of downtimes which last more than 30 minutes last less than 60 minutes
  • switching carries a risk of split-brain if we can’t be sure that the primary machine is out of service. This makes switching rather risky, notably when there is a network outage
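
The trade-off described in these points can be illustrated with a rough sketch in Python. This is a toy model, not our actual tooling: the function names are hypothetical, the two 30-minute values are taken from the figures above, and we assume that once a switch has started the primary stays out of service (to avoid split-brain), so an early repair no longer shortens the outage. The cost of switching back afterwards is ignored.

```python
# Hypothetical toy model of the switching trade-off (illustrative only).
# Assumptions: the decision to switch is taken after DECIDE_AFTER_MIN minutes
# of downtime; most clients need PROPAGATION_MIN more minutes to reach the
# new IP; once the switch has started, the primary is kept out of service.

DECIDE_AFTER_MIN = 30   # downtime threshold before a switch is considered
PROPAGATION_MIN = 30    # "nearly 30 minutes" for new connections to follow


def downtime_without_switch(outage_min: float) -> float:
    """Perceived downtime if we simply wait for the primary to be repaired."""
    return outage_min


def downtime_with_switch(outage_min: float,
                         decide_after: float = DECIDE_AFTER_MIN,
                         propagation: float = PROPAGATION_MIN) -> float:
    """Perceived downtime if we switch as soon as the threshold is reached."""
    if outage_min <= decide_after:
        # Repaired before the threshold: the switch is never triggered.
        return outage_min
    # Service resumes only once clients have followed the DNS change.
    return decide_after + propagation


def switching_helps(outage_min: float) -> bool:
    """True only when switching gives a strictly shorter perceived downtime."""
    return downtime_with_switch(outage_min) < downtime_without_switch(outage_min)


if __name__ == "__main__":
    # 555 minutes corresponds to the http4 outage (01.02 to 10.17).
    for outage in (20, 35, 59, 555):
        print(outage, downtime_with_switch(outage), switching_helps(outage))
```

Under these assumptions, switching only pays off for outages longer than 60 minutes, which is consistent with the observation that most downtimes exceeding 30 minutes are over within the hour.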

In the end, even though the http4 downtime would have justified switching (long downtime, server out of service), it does not call into question our decision to stop the redundancy.

We made an error though: we should have publicly announced that this redundancy was being stopped, even partially (some servers still have redundancy at the present time). This has now been done.

Other forms of redundancy in the future?

Although we are stopping redundancy via DRBD today, it is possible that in the future we will move towards other forms of redundancy, for example database replication or a systematic synchronisation of filesystems. The aim would not, however, be just to improve availability but also to offer other advantages, for example in terms of performance.

To be clear, removing real-time redundancy has no bearing whatsoever on backups, which are still made daily and kept for 30 days.

In conclusion

We sincerely apologise to our customers for this exceptional issue. Those who were affected can open a ticket requesting a full refund for the month of October, as stated in our terms of use.

We remain dedicated to offering the best possible uptime, even in a shared environment. It’s a daily challenge, bearing in mind the great flexibility that we offer (e.g. opening an account without payment, being able to run any process on our servers).

To promise that there will be no more serious downtimes would be unrealistic: nobody can guarantee that. What we do promise is total transparency, during the incident and afterwards, with concrete steps taken to avoid a recurrence, and answers to your questions, if you have any.