TWCenter site down - team post here

**Paedric** · 07-07-2010, 16:53

A few people complained, so I put it in the scroll list at the bottom. It's cool; everything is working.

Here is an update of what happened with TWC.

Originally Posted by Simetrical

Okay, so here's basically what happened. On Friday we were moving all our data to some new disks. Friday night, this failed somehow, and the server crashed and didn't restart properly. I'm never available Friday night or Saturday. Saturday night, I tried to get it to work, but it wouldn't boot fully, so I couldn't access it remotely. I needed to wait for the on-site guy, GrnEyedDvl (GED). Unfortunately, he was on a three-day vacation.

So Monday night he got back, and late yesterday afternoon (after the site had been down for almost four days) he went down to the server room. We got the machine to boot, and I started trying to figure out what went wrong and piece things back together.

Unfortunately, it seems like the new disks we were switching to were bad. The non-root filesystems suffered repeated massive corruption. Thankfully, by luck, the root filesystem was not moved to the new disks, so it's intact, and I'm able to administer the system remotely.

All data on the system itself is lost, but we made sure to pull one of the old disks with all the data on it as a backup. We did this several hours before the crash, so we'll only lose several hours of data. Currently the backup disk is sitting at GED's house attached to a test machine, and I'm having it copy everything to another disk so we have an extra backup. Once that's done, probably sometime this afternoon, GED will take the old disk back to the data center and we'll resync to that and restart the site. The site should then work, although it will be slow, as it's been the last week or so before the crash (since one of the VelociRaptors failed and we moved everything essential to the 7200 RPM disks).

Meanwhile, I'm working on figuring out which VelociRaptors are bad, by running SMART tests and also read-write badblocks tests. We know one is bad, but one of the other two seems to be as well, and we need to confirm that before we proceed. Once we figure out what Raptors are usable, we can consider moving everything to them again. Or maybe setting up the new server and just copying everything, so we don't ever have only one copy of anything.

Lessons learned:

- Have redundant servers. This would have been a total nonissue if we'd gotten the second server set up before this happened. We could have just transferred all the traffic there, maybe even automatically. (Most of the parts have arrived, but not all, and it's not actually installed at the data center yet. Thanks to everyone who donated!)
- Always keep good backups. If not for the disk we pulled, we'd have to restore a weeks-old backup. This is bad, but not as bad as it could be. It pays to be paranoid when it comes to data integrity.
- Check new disks for errors before using them. I'm definitely going to be doing extended SMART self-tests and destructive read-write badblocks tests on every new disk we get before we use it.
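For anyone wanting to try the "check new disks first" step, here is a rough sketch of what such a burn-in could look like. It is not the script actually used on the TWC server; the helper names `burn_in_commands` and `render` and the device path `/dev/sdX` are made up for illustration. It only builds the `smartctl` and `badblocks` command lines and prints them, so nothing touches a disk unless you run the output yourself. Note that `badblocks -w` overwrites the whole device, which is why it is off by default here.

```python
import shlex


def burn_in_commands(device, destructive=False):
    """Build (but do not run) the commands for checking a new disk.

    `device` is a placeholder path such as /dev/sdX. The write-mode
    badblocks pass is only included when `destructive` is True,
    because it erases everything on the device.
    """
    if not device.startswith("/dev/"):
        raise ValueError("expected a device path under /dev/")

    cmds = [
        # Start an extended (long) SMART self-test; it runs in the background.
        ["smartctl", "-t", "long", device],
        # Show the self-test log, to be checked once the test has finished.
        ["smartctl", "-l", "selftest", device],
        # Report the drive's overall SMART health assessment.
        ["smartctl", "-H", device],
    ]
    if destructive:
        # Write-mode test: -w writes patterns and reads them back,
        # -s shows progress, -v is verbose. This DESTROYS all data.
        cmds.append(["badblocks", "-w", "-s", "-v", device])
    return cmds


def render(cmds):
    """Turn argv lists into shell-safe command strings for review."""
    return [shlex.join(cmd) for cmd in cmds]


if __name__ == "__main__":
    for line in render(burn_in_commands("/dev/sdX", destructive=True)):
        print(line)
```

Printing the commands rather than executing them is deliberate: the long self-test takes hours, and the destructive pass is something you want to confirm by hand against the right device name.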

Thanks to the Guild for letting TWC members use this forum to talk about the downtime. (I used to actually post here quite a lot, as you can see from my post count. Then I became a mod and eventually admin at TWC, so . . .) You can also go to the TWC IRC chat at http://java.surrealchat.net/chat/twcenter while the site is down.

Originally Posted by GrnEyedDvl

I am honestly not impressed with these. Thor (main TWC server) originally had 2 Raptors. About two weeks ago one of them (/dev/sdc) threw an error and we had to rebuild the RAID array. It did rebuild and pass the SMART testing, but it still concerned us and I planned on replacing it. When I ordered the drives for the second server that we are building, I ordered five. Four for the new machine and one to replace sdc in thor. The day before they arrived, /dev/sdd in thor completely failed. I can't get into it at all.

We replaced that one (/dev/sdd) with the drive I bought to replace /dev/sdc, which meant that I was now a drive short for the new server since /dev/sdc still needs to be replaced. I can get it replaced under warranty and run the new server on 3 drives, which was the plan, but it still threw a kink in our plans.

So since we were kind of hosed anyways we decided to go ahead and swap out the slower 500 gig drives with the faster Raptor drives. We were doing this one at a time so the array could rebuild. That is what we were doing Friday night. I pulled one of the 500 gig drives and replaced it with the Raptor and we were moving files when it locked up. This is now the third Raptor to have issues, and it's a brand spanking new one. The other two were brand new in January.

The read/write speed is ridiculously faster on the Raptors, and we do a TON of database writes.
