This Monday, February 7, between 3:48 p.m. and 9:00 p.m., the email, FTP and WebDAV services and our own sites were largely unavailable. All other services were unaffected, including our clients' websites and databases, the DNS servers and SSH access. No data was lost.
The precise sequence of events behind this exceptional failure, the longest in our history, is detailed later in this post. We will also come back to the measures we are taking to prevent it from happening again.
At about 3:45 p.m., I began to update the PostgreSQL tools (including pg_dump) to version 9.0 on the server hosting the *.alwaysdata.com sites (our own sites, not the sites of our customers) and our production database.
To do this, I first had to remove the version 8.4 tools, but removing them also removes PostgreSQL itself (the server). Despite the confirmation prompt, I confirmed the removal, mistakenly thinking that the files in question were common files and not part of the PostgreSQL server. This terrible human error led to the 5-hour failure.
Alas, on Debian, removing PostgreSQL also deleted the data itself, not just the software (which I think is a very bad idea). Despite an almost immediate CTRL+C, some of the files in our production database were deleted. To maximize the chances of recovery, I stopped everything that was running on the server and unmounted the partition.
At that moment, 3:48 p.m., our production database became inaccessible. Yet this database, in addition to being used by our sites and our administration interface, is also used for SMTP, IMAP/POP, FTP and WebDAV authentication. It was therefore impossible to connect to a mailbox to send or receive mail, or to connect via FTP or WebDAV. Users who were already connected (by IMAP or FTP, for example) were not affected.
I quickly realized that the content of our production database was intact, but that the files of the PostgreSQL database system itself were affected. Unfortunately, in this case there is no systematic method for regaining access to the data; you need to get your hands dirty. It was impossible to estimate when things would return to normal: it could take 15 minutes or much longer.
After nearly two hours with no success, and seeing that the situation was likely to persist, we decided to bring a backup online, with read-only access, to handle authentication for FTP, WebDAV and POP/IMAP. So at 5:38 p.m., these services were running again, except for accounts that had been created recently (not present in the last backup). SMTP was excluded to avoid rejecting (and losing) incoming emails for new accounts.
It would take almost 3 more hours to fully restore the production database, without any data loss.
With each failure we encounter, from the smallest to the most serious, we systematically look for ways to eliminate, or at least greatly reduce, the risk of it happening again.
Let’s start with the origin of the problem: human error. It is possible, in some cases, to reduce the chances of making a mistake. We introduced several changes last year: distinguishing development machines from production machines by adding color to the prompt, and installing a tool that prevents critical data from being deleted with rm. In this case, despite the confirmation prompt, my confusion led me to make a disastrous decision. At this level, I see no miracle way to prevent this from happening again.
That leaves a technical solution. Our production database, which is crucial to the functioning of many of our services, is an obvious SPOF (Single Point of Failure). The solution, equally obvious, is to duplicate it. Sad irony: we had already planned to make this improvement very soon…
PostgreSQL 9.0 includes a mechanism for replicating data in real time. That’s what we will use to maintain a read-only mirror database on a second, independent server (located in a different datacenter). Thus, if the database is accidentally deleted, or if any other problem hits the primary server, the secondary server will immediately take over and no critical service will be disrupted.
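To make the takeover idea concrete, here is a minimal sketch of an authentication lookup that tries the primary server first and falls back to the read-only mirror. It only illustrates the principle: the names `lookup_with_failover`, `primary` and `mirror` are hypothetical, and the two servers are simulated with plain functions rather than real database connections.

```python
from typing import Callable, Optional, Sequence

# Hypothetical type: a backend takes a login and returns the stored password
# hash, or None if the account does not exist. It raises ConnectionError when
# its database cannot be reached.
Backend = Callable[[str], Optional[str]]


def lookup_with_failover(login: str, backends: Sequence[Backend]) -> Optional[str]:
    """Ask each backend in order; return the first answer that comes back."""
    last_error: Optional[Exception] = None
    for backend in backends:
        try:
            return backend(login)
        except ConnectionError as exc:
            last_error = exc  # this server is down, try the next one
    raise ConnectionError("no database server reachable") from last_error


# Stand-ins for the two servers (assumptions, not real connections).
def primary(login: str) -> Optional[str]:
    raise ConnectionError("primary server is down")


def mirror(login: str) -> Optional[str]:
    accounts = {"alice": "hash-of-alice-password"}
    return accounts.get(login)


print(lookup_with_failover("alice", [primary, mirror]))
```

Because the mirror is read-only, a fallback of this kind suits lookups such as authentication; anything that writes to the database would still have to wait for the primary.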
This real-time synchronization does not solve another potential problem: data being deleted through SQL (by DELETE or DROP). Any change to the database on the primary server is instantly applied on the secondary server, so the data would be lost permanently.
To guard against this potential problem, we will increase the frequency of our database backups, probably to every hour, compared with every day until now. Our secondary server will be hosted on EC2, where the snapshot system makes automatic backups very easy to manage.
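For illustration only, an hourly backup could also be taken with pg_dump rather than with EC2 snapshots. The sketch below builds the command for one timestamped dump per hour; the function name `hourly_dump_command`, the database name `production` and the `/backups` directory are assumptions made for the example, and the command is only built here, not executed.

```python
from datetime import datetime
from typing import List


def hourly_dump_command(dbname: str, backup_dir: str, now: datetime) -> List[str]:
    """Build a pg_dump command that writes one dump file per hour."""
    # Truncating the timestamp to the hour gives one file per hourly run.
    stamp = now.strftime("%Y%m%d-%H00")
    return [
        "pg_dump",
        "--format=custom",  # compressed archive, restorable with pg_restore
        f"--file={backup_dir}/{dbname}-{stamp}.dump",
        dbname,
    ]


# Hypothetical database name and backup directory.
command = hourly_dump_command("production", "/backups", datetime(2011, 2, 7, 15, 48))
print(" ".join(command))
```

Run from an hourly scheduler, this limits the data exposed to an accidental DELETE or DROP to at most one hour of changes, instead of one day.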
Obviously, we offer our most sincere apologies to all our customers for this exceptional failure. Please be assured that your data was at no time at risk.
We have been working exclusively for several weeks now on strengthening the overall stability of all our services, which is why there have been few new features lately. The deployment of our new architecture, the main cause of downtime in 2010, is now well behind us and 2011 looks very promising from all points of view.
I look forward to posting more pleasant news :)