Okay, so here's basically what happened. On Friday we were moving all our data to some new disks. Friday night, the move failed somehow, and the server crashed and didn't restart properly. I'm never available Friday night or Saturday. Saturday night, I tried to get it to work, but it wouldn't boot fully, so I couldn't access it remotely. I needed to wait for the on-site guy, GrnEyedDvl (GED). Unfortunately, he was on a three-day vacation.
So Monday night he got back, and late yesterday afternoon (after the site had been down for almost four days) he went down to the server room. We got the machine to boot, and I started trying to figure out what went wrong and piece things back together.
Unfortunately, it seems like the new disks we were switching to were bad. The non-root filesystems suffered repeated massive corruption. Thankfully, by luck, the root filesystem was not moved to the new disks, so it's intact, and I'm able to administer the system remotely.
All data on the system itself is lost, but we made sure to pull one of the old disks with all the data on it as a backup. We did this several hours before the crash, so we'll only lose several hours of data. Currently the backup disk is sitting at GED's house attached to a test machine, and I'm having it copy everything to another disk so we have an extra backup. Once that's done, probably sometime this afternoon, GED will take the old disk back to the data center and we'll resync to that and restart the site. The site should then work, although it will be slow, as it was for the week or so before the crash (since one of the VelociRaptors failed and we moved everything essential to the 7200 RPM disks).
Meanwhile, I'm working on figuring out which VelociRaptors are bad, by running SMART tests and also read-write badblocks tests. We know one is bad, but one of the other two seems to be as well, and we need to confirm that before we proceed. Once we figure out which Raptors are usable, we can consider moving everything to them again. Or maybe we'll set up the new server and just copy everything over, so we never have only one copy of anything.
Lessons learned:
- Have redundant servers. This would have been a total nonissue if we'd gotten the second server set up before this happened. We could have just transferred all the traffic there, maybe even automatically. (Most of the parts have arrived, but not all, and it's not actually installed at the data center yet. Thanks to everyone who donated!)
- Always keep good backups. If not for the disk we pulled, we'd have to restore a weeks-old backup. This is bad, but not as bad as it could be. It pays to be paranoid when it comes to data integrity.
- Check new disks for errors before using them. I'm definitely going to be doing extended SMART self-tests and destructive read-write badblocks tests on every new disk we get before we use it.
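For anyone who wants to do the same kind of check on their own new disks, here's a rough sketch of what I mean by the last point. It only prints the commands rather than running them, because the badblocks write test wipes the disk. The device name `/dev/sdX` and the function name `burn_in_plan` are placeholders I made up for illustration, and this isn't the exact set of commands run on our server, just the general shape using standard `smartctl` and `badblocks` options.

```python
def burn_in_plan(device):
    """Return (description, command) pairs for checking a new disk.

    `device` is a placeholder like "/dev/sdX". Run the commands by hand
    as root, in order, and only on a disk that holds no data you need.
    """
    return [
        # Starts the extended self-test. It runs inside the drive in the
        # background and can take hours, so wait before reading the log.
        ("start extended SMART self-test",
         ["smartctl", "-t", "long", device]),
        # Shows the self-test log, including the result of the long test.
        ("read SMART self-test log (after the test finishes)",
         ["smartctl", "-l", "selftest", device]),
        # Prints the drive's overall health assessment.
        ("check overall SMART health",
         ["smartctl", "-H", device]),
        # DESTRUCTIVE: -w overwrites the whole disk with test patterns
        # and reads them back; -s shows progress, -v is verbose.
        ("destructive read-write surface test",
         ["badblocks", "-w", "-s", "-v", device]),
    ]


if __name__ == "__main__":
    # Dry run only: print what would be executed for a placeholder device.
    for description, command in burn_in_plan("/dev/sdX"):
        print(f"# {description}")
        print(" ".join(command))
```

Printing instead of executing is deliberate: pointing a write-mode badblocks run at the wrong device would destroy whatever is on it, so it's safer to read the command over before running it yourself.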
Thanks to the Guild for letting TWC members use this forum to talk about the downtime. (I used to actually post here quite a lot, as you can see from my post count. Then I became a mod and eventually admin at TWC, so . . .) You can also go to the TWC IRC chat at http://java.surrealchat.net/chat/twcenter while the site is down.