I spent part of my day heading down to our data centre, yanking a drive out of our newly racked server, and watching to see what would happen. The answer was: not much. The system kept merrily computing along, but it also failed to warn us that it had lost a drive. Some configuration tweaks later, I repeated the process, and my mailbox quickly filled with alerts. Success.

I did this for two reasons. First, I wanted assurance that our RAID controller had been properly configured, and that things would continue operating normally if we lost a drive. Second, I wanted to ensure that we'd be notified if we did lose a drive: redundancy is worth little if nobody acts to correct the problems that cause it to be lost. Today confirmed the first and revealed problems with the second. Huge Success.

If you're going to perform the same task, a few pieces of advice:

  • Do it before there's critical production software running on the server, at least the first time.
  • Your RAID controller will need to rebuild the array once the pulled drive is back in. This will take a while, and you lose your redundancy (or a portion thereof) while it's happening. If you need to repeat the test, pull the same drive each time to avoid real problems: pulling a different one before the rebuild has finished could remove your only good copy.
  • Work to ensure that problem notifications leave the affected system as quickly as possible. If the only copy of an alert sits on the box that's having issues, you may lose it as well.
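
To make that last point concrete, here is a minimal sketch of a check that spots a degraded array and mails the alert through a relay on a different machine. It is not what we run: our server uses a hardware RAID controller with its own vendor tooling, so this assumes Linux software RAID (md) and its /proc/mdstat status file as a stand-in. The function names, the addresses, and the relay host are all hypothetical.

```python
import re
import smtplib
from email.message import EmailMessage

# Assumption: Linux software RAID (md), where /proc/mdstat looks like:
#   md0 : active raid1 sdb1[1] sda1[0]
#         104320 blocks [2/2] [UU]
# A hardware controller would need its vendor's tooling instead.
ARRAY_RE = re.compile(r"^(md\w+)\s*:")
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return the names of arrays that are missing at least one member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        array = ARRAY_RE.match(line)
        if array:
            current = array.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            expected = int(status.group(1))
            active = int(status.group(2))
            # "_" marks a missing member in the [UU] style map.
            if active < expected or "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded


def build_alert(hostname, arrays, sender, recipient):
    """Build the alert mail; sender and recipient are hypothetical addresses."""
    msg = EmailMessage()
    msg["Subject"] = "RAID degraded on " + hostname
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("Degraded arrays: " + ", ".join(arrays))
    return msg


def send_alert(msg, relay_host):
    """Hand the mail straight to a relay on another machine."""
    with smtplib.SMTP(relay_host, timeout=10) as smtp:
        smtp.send_message(msg)
```

Run from cron, the script would read /proc/mdstat, pass the text to degraded_arrays, and call send_alert only when the list is non-empty. Talking to a remote relay directly, rather than queueing mail locally, means the alert does not wait on the disks of the box that is in trouble.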



Hi, I’m Paul Reinheimer, a developer working on the web.

I co-founded WonderProxy, which provides access to over 200 proxies around the world to enable testing of GeoIP-sensitive applications. We've since expanded to offer more granular tooling through Where's it Up.

My hobbies are cycling, photography, travel, and engaging Allison Moore in intelligent discourse. I frequently write about PHP and other related technologies.
