Wednesday, Jun 23, 2010

Another (bigger) Crash

This week has been a test of endurance and tranquility. 

It started out innocently enough, as do most horrific weeks. Monday morning was spent going through my checks, making sure everything was OK. 

The first call was at about noon on Monday. A client's user was having trouble opening a file in their documents folder... there were indications that seemed to point to the file being corrupted. This is obviously a concern, but not earth-shattering. I'll just finish up what I'm doing and... 

The second call wasn't far behind it. This was another user. They were having trouble opening any files in their documents folder. Since the workstations on this network are configured to redirect their documents folder to a shared location, this now tells me that we have corruption on the server, in the user shared directories. Oh, crap. Now it's getting serious, and it means that everyone is affected. It's time to rearrange the schedule and... 

By the third through fifth calls, I was already on my way. Fast. Fast enough to be noticed by the Highway Patrol. Great.

On my way again with my pink copy of the ticket. Short stop, maybe 5 minutes. He was just doing his job, and I only have myself to blame. Still, it's natural to snarl at my luck, isn't it?

So the first place you look for salvation is in the backups. The backups have been running great! Oh, except for the user shared directories. Those haven't been backed up for some time - volume corruption on the drive has been keeping those backup jobs from completing. Thanks to the corruption, the backup program (Zmanda, for those that keep track of such things) has seen the volume as being much larger than the space available on the backup target. So, no backup of user directories for you. Failure notifications were sent out, but they were blocked by a hyperactive spam filter. 

Repairing the volume was easy enough... but several thousand files were either damaged or completely removed in the process. And no backups for at least 6 weeks. Fantastic.

And yet, just about all of the user data (save for 2 files, that we can tell) was recovered on Tuesday. How? Thanks to a backup solution that you never EVER want to rely on: Offline Folders. Offline Folders exist for the case where you lose your connection to the server: you can still work on your files, edit them, save them, and wait for the server to come back. We had enabled Offline Folders when we set up the network to act as a buffer against possible network instability, so users could keep working while we fixed any network issues. The cached files are stored in your Windows directory in a (typically) hidden folder named CSC. They are stored in binary (unreadable) form, and you will need a special program (csccmd) to extract them. Here's the process that I used:

 

  • DO NOT LOG OFF OR TURN OFF THE CLIENT COMPUTER
  • Download CSCCMD
  • Unplug the computer from the network
  • Create a folder: C:\Restore
  • Extract CSCCMD.EXE into the C:\Restore directory
  • Open a command prompt
  • Change directory into your restore directory (assuming your command prompt starts you on the C:\ drive) : cd \restore
  • Run the following command from the c:\restore directory: csccmd /extract /target:C:\Restore /recurse

 

This will create a folder structure into which the (previously) synchronized files are placed. Once the redirected folders are repaired, you can repopulate them with these restored files. 
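
For anyone who has to repeat this on several workstations, the last step above can be wrapped in a few lines of Python. This is only a sketch under assumptions: it assumes Python is available on the client, that CSCCMD.EXE has already been extracted into C:\Restore, and that the switches are exactly the ones quoted in the steps above (nothing else about csccmd is assumed). The function names are made up for illustration.

```python
from __future__ import annotations

import subprocess
from pathlib import PureWindowsPath

# Folder created in the steps above; change it if you restored elsewhere.
RESTORE_DIR = r"C:\Restore"


def build_extract_command(restore_dir: str = RESTORE_DIR) -> list[str]:
    """Build the csccmd command line exactly as quoted in the steps above."""
    exe = PureWindowsPath(restore_dir) / "CSCCMD.EXE"
    return [str(exe), "/extract", f"/target:{restore_dir}", "/recurse"]


def run_extract(restore_dir: str = RESTORE_DIR) -> int:
    """Run the extraction on the affected Windows client.

    Only call this on the client itself, while the user is still logged on
    and the machine is unplugged from the network, as described above.
    """
    completed = subprocess.run(build_extract_command(restore_dir), cwd=restore_dir)
    return completed.returncode
```

Calling build_extract_command() first and reading the result is a cheap way to confirm the command before run_extract() touches anything.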

This is *not* a disaster recovery plan. This is scrambling back up the cliff face you fell off of, hoping the dental floss you're climbing can hold your weight. 

As is the case with most catastrophes, this crash was the sum total of the right blend of smaller failures. In our case, this is what happened:

 

  • Some of the RAM in the server failed
  • Bad data began to be written to the server
  • Volume became corrupted
  • Backups were unable to back up the volume
  • Notifications were blocked

 

 Here are the corrective actions that we are taking:

 

  • Replace the RAM
  • CHKDSK the volumes
  • Verify that backups are working by restoring a sampling of files. Do this frequently (in our case, it will be weekly). The importance of this cannot be overstated
  • Synchronize backups with an offsite server - we will place a server in our Internet Service Provider's data center (also known as co-locating or co-lo) and all backups will exist in both locations at nearly the same time. Alternatively, synchronization with Amazon S3 storage can be accomplished quite simply. For our purposes, though, the amount of data that would need to be sync'ed is over 1.5 terabytes, so it is cheaper to buy our own server and co-lo it
  • Notifications are tested every time a restore test is run - in our case, weekly
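
The weekly restore test in the list above can be partly automated. Here is a minimal sketch, assuming a sample of files has already been restored from backup into a scratch folder that mirrors the layout of the live share. It picks random files from the live share and compares their SHA-256 hashes against the restored copies. The function names, the sample size, and the folder layout are assumptions for illustration, not part of our actual setup or of Zmanda.

```python
import hashlib
import random
from pathlib import Path


def sha256_of(path):
    """Hash a file in 1 MB chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore_sample(source_root, restored_root, sample_size=25, seed=None):
    """Compare a random sample of live files against their restored copies.

    Returns a list of (relative_path, problem) tuples; empty means the
    sample matched.
    """
    source_root = Path(source_root)
    restored_root = Path(restored_root)
    files = sorted(p for p in source_root.rglob("*") if p.is_file())
    rng = random.Random(seed)
    sample = rng.sample(files, min(sample_size, len(files)))

    problems = []
    for original in sample:
        relative = original.relative_to(source_root)
        restored = restored_root / relative
        if not restored.is_file():
            problems.append((str(relative), "missing from restore"))
        elif sha256_of(original) != sha256_of(restored):
            problems.append((str(relative), "content differs"))
    return problems
```

A file that a user edited after the backup ran will legitimately show up as "content differs", so the output is a list to look over by hand rather than a pass/fail verdict.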

 

So, yes, we dodged a bullet. It was a very tangible reminder that we need to review the backup processes frequently and make sure they are running correctly. Not only for this client, but all clients' backup processes. This will make an already busy week even busier, but it certainly beats the alternative.
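
Since the blocked notifications were what let six weeks slip by, one way to make that review less dependent on email is a check that looks at the backups directly. The sketch below assumes each client's backups land as files under a directory and flags any client whose newest backup file is older than a threshold. The directory layout, the 7-day threshold, and the function names are hypothetical; adjust them to however your backup software actually stores its output.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86400.0


def newest_backup_age_days(backup_dir, now=None):
    """Return the age in days of the newest file under backup_dir.

    Returns None when the directory holds no files at all.
    """
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in Path(backup_dir).rglob("*") if p.is_file()]
    if not mtimes:
        return None
    return (now - max(mtimes)) / SECONDS_PER_DAY


def stale_backups(backup_dirs, max_age_days=7, now=None):
    """Return the names of clients whose backups are missing or too old.

    backup_dirs maps a client name to that client's backup directory.
    """
    stale = []
    for name, directory in backup_dirs.items():
        age = newest_backup_age_days(directory, now)
        if age is None or age > max_age_days:
            stale.append(name)
    return stale
```

Run from a scheduled task, a non-empty result could be written somewhere you look every morning, so a silent spam filter cannot hide it.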

 
