This Monday, February 7, between 3:48 p.m. and 9:00 p.m., our email, FTP and WebDAV services, as well as our own sites, were largely unavailable. All other services were unaffected, including our clients' websites and databases, the DNS servers and SSH access. No data was lost.

The precise sequence of this exceptional outage, the longest in our history, is detailed below. We also go over the measures we will take to prevent it from happening again.

What happened

At about 3:45 p.m., I began upgrading the PostgreSQL client tools (including pg_dump) to version 9.0 on the server hosting the *.alwaysdata.com sites (our own sites, not those of our customers) and our production database.

To do this, I first had to remove the version 8.4 tools; but removing those tools also removes the PostgreSQL server itself. Despite the confirmation prompt, I confirmed, mistakenly thinking the packages concerned were shared files and not part of the PostgreSQL server. This terrible human error led to the five-hour outage.

Alas, on Debian, removing PostgreSQL deletes the data itself, not just the software (which I think is a very bad idea). Despite an almost immediate Ctrl+C, some of the files of our production database were deleted. To maximize the chances of recovery, I stopped everything running on the server and unmounted the partition.

At that moment, 3:48 p.m., our production database became inaccessible. Besides being used by our sites and our administration interface, this database also handles SMTP, IMAP/POP, FTP and WebDAV authentication. It therefore became impossible to connect to a mailbox to send or receive mail, or to connect via FTP or WebDAV. Users who were already connected (by IMAP or FTP, for example) were not affected.

I quickly determined that the content of our production database was intact, but that the PostgreSQL system files were affected. Unfortunately, in this situation there is no systematic procedure for regaining access to the data; you have to get your hands dirty. It was impossible to estimate when things would return to normal: it could take 15 minutes or much longer.

After nearly two hours without success, and seeing that the situation was likely to drag on, we decided to bring a backup of the database online, in read-only mode, to be used for FTP, WebDAV and POP/IMAP authentication. At 5:38 p.m., these services were running again, except for accounts created recently (and therefore absent from the last backup). SMTP was left out to avoid rejecting (and losing) incoming email for those new accounts.

It would take us almost another 3 hours to fully restore the production database, without any data loss.

Preventing this from happening again

With every failure we encounter, from the smallest to the most serious, we systematically look for measures to put in place to eliminate, or at least greatly reduce, the risk of it happening again.

Let's start with the origin of the problem: human error. It is possible, in some cases, to reduce the chances of making a mistake. Last year we introduced several changes: distinguishing development and production machines by coloring the shell prompt, and installing a tool that prevents deleting critical data with rm. In this case, despite the confirmation prompt, my confusion led me to a disastrous decision. At this level, I see no miracle solution to prevent it from happening again.

That leaves the technical solution. Our production database, which is crucial to the operation of many of our services, is an obvious SPOF (Single Point of Failure). The solution, equally obvious, is to replicate it. Sad irony: this improvement was already planned for the very near future…

PostgreSQL 9.0 introduced a built-in mechanism for replicating data in real time (streaming replication). We will use it to maintain a read-only mirror of the database on a second, independent server located in a different datacenter. If the database is accidentally deleted, or any other problem hits the primary server, the secondary server will take over immediately and no critical service will be disrupted.
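
For those curious about what this looks like in practice, here is a minimal sketch of a PostgreSQL 9.0 hot-standby setup with streaming replication. The addresses, replication user and paths are placeholders for illustration, not our actual configuration:

    # On the primary: postgresql.conf
    wal_level = hot_standby          # write enough WAL for a hot standby
    max_wal_senders = 3              # allow replication connections
    wal_keep_segments = 64           # keep WAL around for the standby to catch up

    # On the primary: pg_hba.conf (let the standby connect for replication;
    # user and address are placeholders)
    host  replication  replicator  10.0.0.2/32  md5

    # On the standby: postgresql.conf
    hot_standby = on                 # accept read-only queries while in recovery

    # On the standby: recovery.conf (after copying the data directory with
    # pg_start_backup() / rsync / pg_stop_backup())
    standby_mode = 'on'
    primary_conninfo = 'host=10.0.0.1 port=5432 user=replicator'
    trigger_file = '/var/lib/postgresql/9.0/main/failover.trigger'

Promoting the standby is then just a matter of creating the trigger file once the primary is confirmed to be down.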

This real-time replication does not solve another potential problem: data deleted through SQL (with a DELETE or a DROP). Any change made on the primary server is instantly propagated to the secondary server, so the data would be lost on both, permanently.

To guard against this, we will back up the database more frequently: probably every hour, compared with once a day until now. Our secondary server will be hosted on EC2, whose snapshot system makes automating these backups very easy.
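
As a purely illustrative sketch (not our actual production setup), scheduling an hourly logical dump alongside the snapshots could be as simple as a cron entry; the database name, path and 24-hour rotation are assumptions:

    # /etc/cron.d/pg-hourly-dump -- hypothetical example
    # Every hour, dump the 'production' database in compressed custom format,
    # overwriting the file for that hour of the day (24 rotating dumps).
    0 * * * *  postgres  pg_dump -Fc production > /var/backups/pg/production-$(date +\%H).dump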

In conclusion

Naturally, we offer our most sincere apologies to all our customers for this exceptional outage. Please be assured that your data was never at risk.

For several weeks now we have been working exclusively on strengthening the overall stability of all our services, which is why there are few new features at the moment. The deployment of our new architecture, the main cause of downtime in 2010, is now well behind us, and 2011 looks very promising in every respect.

I look forward to my next post bringing more pleasant news 🙂