On Wednesday, July 17, 2013 at 12:02 am (UTC+2), the mysql1 server experienced a file system corruption. All of the databases hosted on this server were unavailable overnight, and a small portion of them suffered data corruption.

This post addresses the incident in detail: what happened, and the lessons we learned from it. We've tried to use terms that everyone can understand, while keeping as many technical details as possible.

You can also read the direct report from the incident.

What happened

Use of bcache

Let's begin by establishing that the hardware of the mysql1 server was changed on July 7, in order to equip it with both solid-state and mechanical drives. The server had previously been fitted only with SSDs, but rising disk space requirements justified this evolution.

bcache is a technology recently integrated into Linux that allows an SSD and a mechanical drive to be combined, providing large disk space with performance close to that of an SSD. We come back in detail to the choice of this technology in the second part of this post.
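To make this more concrete, here is a minimal sketch of how such a pseudo-drive is assembled, assuming bcache-tools is installed; the device paths (a RAID1 array of mechanical disks as the backing device, an SSD as the cache) are purely illustrative and the exact invocation can vary between versions:

#!/usr/bin/env python3
"""Minimal sketch of how a bcache pseudo-drive is assembled from a
mechanical backing device and an SSD cache. Device paths are purely
illustrative and bcache-tools must be installed."""
import subprocess

BACKING_DEVICE = "/dev/md0"   # hypothetical: RAID1 array of mechanical disks
CACHE_DEVICE = "/dev/sdc"     # hypothetical: the SSD used as cache

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# make-bcache writes the bcache signature ("superblock") on each device.
# Recent versions attach the cache to the backing device automatically
# when -B and -C are given together; older ones need a manual attach via
# /sys/block/bcache0/bcache/attach.
run(["make-bcache", "-B", BACKING_DEVICE, "-C", CACHE_DEVICE])

# The resulting pseudo-drive shows up as /dev/bcache0 and can be
# formatted (XFS in our case) and mounted like any other block device.
run(["mkfs.xfs", "/dev/bcache0"])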

Disk failure

On July 16, at 3:29 am, one of the disks on the server encountered a hardware problem. The disk was automatically ejected from the RAID1 array, and the server continued to function normally on the second disk; that's the whole point of RAID. This sort of problem isn't rare. There are two possibilities:

  • either the disk is definitively out of order, in which case it has to be replaced
  • or the disk encountered a temporary problem, and resetting it is enough for it to function normally again

Our team became aware of the incident that morning; no one on call was alerted, since the problem didn't affect the proper functioning of the server. We decided to schedule the maintenance for that evening (it's an operation that requires shutting down the server for several minutes, so we prefer to do it during off-peak hours).
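For illustration, a degraded array is easy to spot in /proc/mdstat: the member-status field shows an underscore in place of the missing disk, for example [U_] instead of [UU]. Here is a rough sketch of the kind of check that can feed an alerting system (the parsing is deliberately simplistic and the alert itself is left as a placeholder):

#!/usr/bin/env python3
"""Rough sketch: detect degraded md RAID arrays by parsing /proc/mdstat."""
import re

def degraded_arrays(mdstat_path="/proc/mdstat"):
    """Return the names of md arrays whose status line shows a missing
    member, e.g. '[U_]' instead of '[UU]'."""
    degraded = []
    current = None
    with open(mdstat_path) as f:
        for line in f:
            m = re.match(r"^(md\d+)\s*:", line)
            if m:
                current = m.group(1)
            # The status line looks like: '... blocks [2/1] [U_]'
            elif current and re.search(r"\[U*_+U*\]", line):
                degraded.append(current)
                current = None
    return degraded

if __name__ == "__main__":
    for array in degraded_arrays():
        # Placeholder: plug your paging/alerting system in here.
        print(f"ALERT: {array} is running degraded")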

Restarting the server

At 11:52 pm, we restarted the mysql1 server. We noted that the disk ejected from the RAID was working normally again, but we immediately noticed corruption of the file system (XFS) on which the databases are stored. After a quick investigation, we understood what had just happened.

The databases are stored on a bcache “pseudo-drive”, a virtual combination of our mechanical and solid-state drives. When Linux boots, bcache scans the physical disks looking for a specific signature. When this signature is detected, it means the disk is part of a bcache pseudo-drive. bcache then combines the discovered drives and creates the pseudo-drive. These signatures are written once and for all when the system is installed.
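As a hedged illustration, the composition of a pseudo-drive can be checked after boot through the generic /sys/block/<device>/slaves directory, which lists the underlying devices of any stacked block device; the device names below are illustrative:

#!/usr/bin/env python3
"""Sketch of a sanity check: list which physical devices the kernel
actually assembled into the bcache pseudo-drive, via the generic
/sys/block/<device>/slaves directory. Device names are illustrative."""
import os

def components(device="bcache0"):
    return sorted(os.listdir(f"/sys/block/{device}/slaves"))

if __name__ == "__main__":
    # On mysql1 we would expect to see the RAID array and the SSD here;
    # seeing a raw RAID member instead means bcache picked the wrong disk.
    print("bcache0 is built from:", components())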

Our two mechanical disks were in RAID1, which is to say that the two disks were strictly identical and synchronized. That is, of course, the purpose of RAID1: if one disk breaks down, the other takes over, since it's an exact copy. In the case of mysql1, at 11:52 pm, we therefore had:

  • the failed disk, which had stayed in its state from 3:29 am (with “expired” data)
  • the second disk, which contained the “true” data, that is to say the most recent data.

But from bcache's point of view, the two disks had identical signatures. When the server restarted, bcache therefore assembled the SSD with the failed disk (the first one found) rather than with the second, up-to-date disk. Yet the SSD contained recent data. We thus ended up with a pseudo-drive that combined data from 11:52 pm (on the SSD) with data from 3:29 am (on the failed disk).
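To illustrate the ambiguity, one can dump the bcache superblock of each RAID1 member with bcache-super-show (from bcache-tools) and compare the identifiers: both mirrors carry the same ones, which is exactly what misled the boot-time scan. The device paths and the exact output field names here are assumptions:

#!/usr/bin/env python3
"""Illustrative sketch: compare the bcache superblock identifiers of the
two RAID1 members. Paths and output field names are assumptions."""
import subprocess

RAID_MEMBERS = ["/dev/sda", "/dev/sdb"]   # hypothetical member disks

for disk in RAID_MEMBERS:
    out = subprocess.run(["bcache-super-show", disk],
                         capture_output=True, text=True, check=True).stdout
    # Keep only the identifying lines (e.g. dev.uuid / cset.uuid).
    identifiers = [line for line in out.splitlines() if "uuid" in line]
    print(disk, identifiers)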

Obviously, this pseudo-drive didn't contain coherent data, which caused file corruption. We immediately unmounted the file system in order to recombine the SSD with the second disk, the one with the up-to-date data. But it was too late: the simple act of having previously loaded the pseudo-drive with the failed disk had caused writes to the SSD. The data on the SSD was no longer compatible with the second disk.

Data restoration

Despite the file system corruption, a very large amount of data remained readable. We therefore made the decision to restore as much of the up-to-date data as possible, rather than restoring the previous night's backups for everyone.

The first step was to copy all of the data (300 GB) onto a temporary server. As a precaution, we copied the data a second time onto a second temporary server. The data on the first server was used to run MySQL (accessible internally only) and to try to recover as much data as possible, which meant modifying the files. The data on the second server was only there to allow us to restart the process if necessary.

The data restoration (which essentially consisted of executing “mysqldump” on every database) took time: the corruption caused MySQL to crash a number of times, slowing the operation. Once we had an SQL dump for the large majority of the databases (96%), we were able to start the final restoration onto mysql1 (reformatted in the meantime). The 4% of corrupted databases were then restored from the previous night's backups.
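For the curious, here is a very simplified sketch of that dump phase, assuming local access to the recovery instance; in reality MySQL also had to be restarted after each crash, and the paths and options below are placeholders:

#!/usr/bin/env python3
"""Very simplified sketch of the dump phase: run mysqldump on every
database and keep going when one of them fails, so the failures can be
restored from backups instead. Credentials and paths are placeholders,
and restarting MySQL after a crash is not handled here."""
import subprocess

def list_databases():
    out = subprocess.run(
        ["mysql", "--batch", "--skip-column-names", "-e", "SHOW DATABASES"],
        capture_output=True, text=True, check=True).stdout
    system = {"information_schema", "performance_schema", "mysql"}
    return [db for db in out.split() if db not in system]

recovered, failed = [], []
for db in list_databases():
    # Hypothetical destination directory for the SQL dumps.
    with open(f"/data/dumps/{db}.sql", "w") as dump_file:
        result = subprocess.run(["mysqldump", "--single-transaction", db],
                                stdout=dump_file, stderr=subprocess.DEVNULL)
    (recovered if result.returncode == 0 else failed).append(db)

print(f"dumped {len(recovered)} databases; {len(failed)} left for the backups")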

In certain rare cases, recovered data that we had considered intact turned out to be partly corrupted. The affected clients were able to contact us so that we could restore a backup.

Diminishing risk in the future

Let's start right off by saying that data loss is a risk that is impossible to reduce to zero. The causes can be varied: bugs, human error, hardware problems, major accidents, and so on. That is why backups exist. That's not to say, however, that we can't reduce the risks further.

Use of new technology

This incident is due to a bug in bcache. It is therefore legitimate to ask whether using bcache, a relatively young technology, was wise.

As a general rule, the younger a technology, the greater the risk of malfunctions (minor or major). It's therefore necessary to judge whether the advantages the technology brings are worth that risk. Naturally, this evaluation evolves over time: the more a technology matures, the more the risks diminish.

As far as bcache is concerned, the advantages are significant: it allows a considerable improvement in performance without sacrificing disk space. Remember that, unlike the majority of hosting providers, we impose no limit on the size of databases.

We've been following the evolution of hybrid SSD/mechanical-disk technologies for quite some time. There are many of them besides bcache: dm-cache, EnhanceIO, FlashCache, CacheCade, and btier, to name just a few. At the beginning of the year, we compared and “lab tested” several of these solutions. At the end of these tests, we decided to use bcache on a pseudo-production server, http11, onto which we moved a handful of volunteer clients. For 4 months, the server ran with bcache without showing the slightest problem, all while delivering the promised performance gains.

We should make clear that bcache, developed by an engineer at Google, has existed since 2010 and has been considered stable by its creator since 2011, the date of the last discovered bug capable of corrupting data. Used in production by numerous individuals and organizations since then, it was integrated into Linux in 2013 at the end of a long and rigorous code review by several senior kernel developers.

We fully stand by our decision to use bcache in production and we don't consider it a mistake. The bug, which we have since reported to the developers, was corrected less than 2 hours later, on a Sunday evening. Such responsiveness is one more point in bcache's favor.

Improving the reliability of our hardware

Two hardware elements indirectly led to the bug:

  • the fact that the disk encountered a temporary problem, prompting its ejection from the RAID
  • the fact that our chassis aren't equipped for hot swapping, requiring us to restart the server in order to reset the disk (or to unplug the server to replace a disk)

We are going to replace our entire server fleet over the coming months. Among other advantages, the new servers will be even more reliable than the current ones. We will have the opportunity to come back to the details of this move in a few weeks.

Add an option of redundancy

Many clients would like us to offer a master/slave redundancy option, which would mitigate certain types of failure, such as the incident we just experienced. Such an option necessarily carries a non-negligible cost.
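To give an idea of what such a setup involves operationally, here is a hedged sketch of the kind of health check a master/slave pair requires, polling the replica with SHOW SLAVE STATUS and reading Seconds_Behind_Master; the host name and connection details are placeholders:

#!/usr/bin/env python3
"""Hedged sketch: check that a MySQL slave is replicating and how far it
lags behind the master. Connection parameters are placeholders."""
import subprocess

def slave_status(host="mysql1-slave.example.net"):
    out = subprocess.run(
        ["mysql", "-h", host, "--batch", "-e", "SHOW SLAVE STATUS\\G"],
        capture_output=True, text=True, check=True).stdout
    status = {}
    for line in out.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            status[key.strip()] = value.strip()
    return status

if __name__ == "__main__":
    s = slave_status()
    print("IO thread:", s.get("Slave_IO_Running"),
          "| SQL thread:", s.get("Slave_SQL_Running"),
          "| lag (s):", s.get("Seconds_Behind_Master"))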

Let us know in the comments if you would be interested in such an option, which would raise the price of a plan by about 50%.

In conclusion

About 300 MySQL databases were corrupted and had to be restored from the previous night's backups, which represents less than 4% of the databases stored on mysql1, and less than 1.5% of the total number of MySQL databases. Each affected client was contacted by email, and we apologize once again.

We are preparing major internal changes intended to further improve our already high level of reliability. You will know more about these changes very soon.

Don't hesitate to send us your remarks or suggestions in the comments; we're always open to them. We also want to thank everyone who offered support during this incident; it's always appreciated.

Finally, we want to wish a great time to any of you about to go on vacation, and good luck to those coming back! 🙂