On Wednesday, July 17, 2013, at 12:02 am (UTC+2), the mysql1 server suffered a file system corruption. All of the databases hosted on this server were unavailable overnight, and a small portion of them suffered data corruption.
This post covers the incident in detail: what happened, and the lessons we learned from it. We’ve tried to use terms everyone can understand, while keeping as many technical details as possible.
You can also read over the direct report from the incident.
Let’s begin by noting that the mysql1 server’s hardware was changed on July 7 in order to equip it with both solid-state and mechanical drives. The server had previously been fitted only with SSDs, but rising disk space requirements justified this change.
bcache is a technology recently integrated into Linux that combines SSDs and mechanical drives to get large disk capacity with performance close to that of SSDs. We come back to the choice of this technology in detail in the second part of this post.
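For reference, setting up such a pseudo-drive with the bcache-tools userspace utilities looks roughly like this (a sketch only; the device names are placeholders, not those of mysql1):

```shell
# Format the backing device (here a RAID1 array) and the caching SSD.
# This writes the bcache signature ("superblock") that the kernel
# scans for at boot. Device names are hypothetical.
make-bcache -B /dev/md1
make-bcache -C /dev/sda3

# Register both devices with the kernel, then attach the cache set
# to the backing device (the UUID comes from make-bcache's output).
echo /dev/md1  > /sys/fs/bcache/register
echo /dev/sda3 > /sys/fs/bcache/register
echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach

# The combined pseudo-drive appears as a single block device.
mkfs.xfs /dev/bcache0
```

On recent kernels the registration step happens automatically at boot, precisely by scanning every disk for the signature mentioned later in this post.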
On July 16, at 3:29 am, one of the server’s disks encountered a hardware problem. The disk was automatically ejected from the RAID1 array, and the server continued to function normally on the second disk (that’s the whole point of RAID). This sort of problem isn’t rare. There are two possibilities:
- either the disk is definitively dead, in which case it must be replaced
- or the disk hit a temporary problem, and resetting it is enough for it to function normally again
Our team became aware of the incident that morning; no one on call was alerted, since the failure didn’t impact the proper functioning of the server. We decided to schedule maintenance for that evening (the operation requires shutting down the server for several minutes, so we prefer to do it during off-peak hours).
At 11:52 pm, we restarted the mysql1 server and noted that the disk ejected from the RAID was working normally again. But we immediately noticed corruption of the file system (XFS) where the databases are stored. After a quick investigation, we understood what had just happened.
The databases are stored on a bcache “pseudo-drive”, a virtual combination of our mechanical and solid-state drives. When Linux boots, bcache scans the physical disks looking for a specific signature. When detected, this signature indicates that the disk is part of a bcache pseudo-drive. bcache then combines the discovered drives and creates the pseudo-drive. These signatures are written once and for all when the system is installed.
Our two mechanical disks were in RAID1, meaning the two disks were strictly identical and kept synchronized. That is, of course, the purpose of RAID1: if one disk breaks down, the other takes over, since it’s an exact copy. In the case of mysql1, at 11:52 pm, we then had:
- the failing disk, frozen in its state from 3:29 am (with stale data)
- the second disk, which contained the “true” data, that is to say the most recent data.
But from bcache’s point of view, the two disks had identical signatures. When the server restarted, bcache assembled the SSD with the failed disk (the first one it found) rather than with the up-to-date second disk. The SSD, however, contained recent data. We therefore had a pseudo-drive combining data from 11:52 pm (on the SSD) with data from 3:29 am (on the failed disk).
Obviously, this pseudo-drive held incoherent data, which caused file corruption. We immediately unmounted the file system in order to recombine the SSD with the second disk and its up-to-date data. But it was too late: simply having loaded the pseudo-drive with the failed disk had already written to the SSD. The SSD’s data was no longer compatible with the second disk.
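The failure mode can be sketched with a toy model (purely illustrative Python, not bcache’s actual code; the signature and timestamps are made up):

```python
# Toy model of the assembly bug: both RAID1 members carry the identical
# bcache signature written at install time, so a scan that stops at the
# first match can pick the stale disk and pair it with the newer SSD cache.

def assemble(disks, signature):
    """Return the first disk whose signature matches, like a naive scan."""
    for disk in disks:
        if disk["signature"] == signature:
            return disk
    return None

stale_disk = {"signature": "bcache-uuid-1234", "data_as_of": "03:29"}  # ejected at 3:29 am
fresh_disk = {"signature": "bcache-uuid-1234", "data_as_of": "23:52"}  # kept running all day
ssd_cache = {"data_as_of": "23:52"}                                    # cache holds recent data

# The scan finds the stale disk first and pairs it with the SSD.
backing = assemble([stale_disk, fresh_disk], "bcache-uuid-1234")
coherent = backing["data_as_of"] == ssd_cache["data_as_of"]
print(coherent)  # False: the SSD holds 23:52 data, the backing disk 03:29 data
```

The signatures being indistinguishable, nothing in this scan can tell the two mirror halves apart; only the ordering of the scan decides which disk wins.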
Despite the file system corruption, a very large amount of data remained readable. We therefore decided to recover as much of the up-to-date data as possible, rather than restoring the previous night’s backups for everyone.
The first step was to copy all of the data (300 GB) onto a temporary server. As a precaution, we copied the data a second time onto a second temporary server. The data on the first server was used to run MySQL (accessible only internally) and to try to recover the maximum possible amount of data, which meant modifying the files. The data on the second server was only there to allow us to restart the process if necessary.
The data recovery (which essentially consisted of running “mysqldump” on every database) took time: the corruption caused MySQL to crash a number of times, slowing the operation. Once we had an SQL dump for the large majority of the databases (96%), we were able to start the final restoration on mysql1 (reformatted in the meantime). The 4% of corrupted databases were then restored from the previous night’s backups.
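Dumping each database separately, so that one crash only costs a single database rather than the whole run, can be sketched like this (a hypothetical outline; paths, credentials, and the exclusion list are placeholders, not our actual tooling):

```shell
# Dump every user database to its own file and log failures, so the
# loop can simply be resumed after a MySQL crash.
for db in $(mysql -N -B -e 'SHOW DATABASES' \
            | grep -vE '^(information_schema|performance_schema|mysql)$'); do
    mysqldump --single-transaction "$db" > "/recovery/$db.sql" \
        || echo "$db" >> /recovery/failed.txt
done
```

Databases listed in the failure log are the candidates for restoration from the night’s backups instead.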
In certain rare cases, recovered data that we had considered intact turned out to be partly corrupted. Affected clients were able to contact us so that we could restore a backup for them.
Let’s say right off that the risk of data loss can never be reduced to zero. The causes can be varied: bugs, human error, hardware problems, major accidents, etc. That is why backups exist. That’s not to say, however, that we can’t reduce the risks further.
This incident was due to a bug in bcache. It is therefore legitimate to ask whether using bcache, a relatively young technology, was wise.
As a general rule, the younger a technology, the greater the risk of malfunctions (minor or major) it presents. So we must judge whether the advantages the technology brings are worth the risk. Naturally, this evaluation evolves over time: the more a technology matures, the more the risks diminish.
As for bcache, the advantages are significant: it considerably improves performance without sacrificing disk space. Remember that, unlike the majority of hosting providers, we impose no limits on database size.
We’ve been following the evolution of hybrid SSD/mechanical disk technologies for quite some time. There are many of them apart from bcache: dm-cache, EnhanceIO, FlashCache, CacheCade, btier, to name just a few. At the beginning of the year, we compared and “lab tested” several of these solutions. At the end of these tests, we decided to use bcache on a pseudo-production server, http11, onto which we moved a handful of volunteer clients. For 4 months, the server ran with bcache without showing the slightest problem, while delivering the promised performance gains.
We should make clear that bcache, developed by an engineer at Google, has existed since 2010 and has been considered stable by its author since 2011, the date of the last known bug capable of corrupting data. Used in production by numerous individuals and organizations since then, it was merged into Linux in 2013 after a long and rigorous code review by several senior kernel developers.
We fully stand by our decision to use bcache in production and don’t consider it a mistake. The bug, which we have since reported to the developers, was fixed less than 2 hours later, on a Sunday evening. Such responsiveness is one more point in bcache’s favor.
Two hardware elements indirectly led to the bug:
- the fact that the disk encountered a temporary problem prompting its ejection from the RAID
- the fact that our chassis don’t support hot swapping, requiring us to restart the server to reset the disk (or to power it off to replace a disk)
We are going to replace our fleet of servers over the coming months. Among other benefits, the new machines will be even more reliable than the current ones. We will come back to the details of this migration in a few weeks.
Many clients wish we offered a master/slave redundancy option, which would mitigate certain types of failure, such as this incident. Such an option necessarily carries a non-negligible cost.
Let us know in the comments if you would be interested in such an option, which would raise the price of a plan by about 50%.
About 300 MySQL databases were corrupted and had to be restored from the previous night’s backups, which represents less than 4% of the databases stored on mysql1, and less than 1.5% of the total number of MySQL databases. Each affected client was contacted by email, and we apologize once again.
We are preparing large internal changes intended to further improve our already high level of reliability. You will know more about these changes very soon.
Don’t hesitate to send us your replies or suggestions in the comments. We’re always open to them. We also want to thank all of those who offered support during this incident; it’s always appreciated.
Finally, we want to wish a great time to any of you about to go on vacation, and good luck to those coming back! :)