A few weeks ago I launched the WonderProxy website. One of the first decisions I made while working on it was to include the network status bar shown on the left side. It gives a nearly live view of the state of all the servers in the network (as of when the page loads).
What I Did
The box running Nagios runs real tests against all the proxies in the network on a regular basis. I do mean real tests: it uses curl to make a request through the proxy to a script on a different web server, which checks that the correct headers are being used. This confirms that the proxy is running, is listening on the correct port, is able to authenticate users, and is able to retrieve remote documents and return them. Every minute the Nagios box collates the data for all the servers and creates a .json document. That document is then pushed to the web server, ready for consumption by obliging clients.
Note: I've said "what I did", but in reality the JavaScript ninjary was done by Christine, and the curl ninjary was done by my obliging sys-admin Will.
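To make the collation step concrete, here is a minimal sketch in Python. It is not the actual WonderProxy code: the function names, the server names, and the layout of the .json document are all assumptions for illustration. It takes per-server pass/fail results (however the checks produced them) and writes a status document atomically, so the web server never picks up a half-written file.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def build_status(results, checked_at=None):
    """Collate check results into one document.

    results: dict mapping a server name to True (check passed) or False.
    The document layout here is hypothetical, not WonderProxy's real format.
    """
    checked_at = checked_at or datetime.now(timezone.utc)
    servers = {name: ("up" if ok else "down") for name, ok in sorted(results.items())}
    return {
        "generated_at": checked_at.isoformat(),
        "servers": servers,
        "all_up": all(results.values()),
    }


def write_status(document, path):
    """Write to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(document, handle, indent=2)
    os.replace(tmp_path, path)  # atomic on the same filesystem


if __name__ == "__main__":
    # Hypothetical server names and results, for illustration only.
    status = build_status({"proxy-a": True, "proxy-b": False})
    write_status(status, "status.json")
```

Run once a minute (from cron, say), something like this would leave a fresh status.json ready to be pushed to the web server.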
Why I Did It
Presently, I deal with upwards of twenty hosts between professional and personal projects. That's a lot. I've probably dealt with another ten that I've terminated relationships with over my professional career. All in all, I'm sick of the hoops that they make me jump through to find out what's going wrong.
Here's a typical outage for me:
0 minutes - Nagios notifies me that something is wrong
+2 - I've logged into Nagios and attempted to log into the affected system for more information, confirming the outage
+9 - I'm on the website for the host of the affected system, finding absolutely no helpful information
+10 - I've started trying to figure out what my login credentials are for the affected system's hosting provider
+11 - I've logged in and found no more information
+12 - I've filed a ticket with the host
...
+35 - Nagios informs me that the service is back up
+90 - Host replies with some generic "something went wrong on our end, we fixed it, we may credit your account with $0.12 later" message.
I argue this is patently absurd. Some companies put on a front about being better about this sort of thing, but then an outage occurs and the one person with access to the Twitter account is on vacation, or asleep, or at the movies, so nothing happens.
When my box goes down it's either my fault, in which case I need to start working my rear end off to bring it back up, or the host is experiencing a real outage, in which case I should just sit on my hands and wait. In the typical timeline above, I don't know which situation I'm in until 90 minutes in, though I can start poking at logs at 35 minutes.
What I Hope This Means
If you're a WonderProxy customer running tests and something breaks, you can log onto the WonderProxy website and know immediately whether the proxy is broken or whether you should be looking harder at your own infrastructure.
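The same decision could be automated by a test script. The sketch below is an illustration only: it assumes a status document with a "servers" object mapping each proxy name to "up" or "down", which is a made-up layout rather than a documented WonderProxy format, and the `triage` function and proxy names are hypothetical.

```python
import json


def triage(status_json, proxy_name):
    """Decide where to look first when a test run through a proxy fails.

    status_json: text of a status document in the assumed layout
                 {"servers": {"<name>": "up" | "down", ...}}.
    """
    servers = json.loads(status_json).get("servers", {})
    state = servers.get(proxy_name)
    if state is None:
        return "unknown proxy: check the name you are testing against"
    if state == "up":
        return "proxy is up: look harder at your own infrastructure"
    return "proxy is down: sit on your hands and wait"


if __name__ == "__main__":
    # Hypothetical document and proxy names, for illustration only.
    sample = '{"servers": {"proxy-a": "up", "proxy-b": "down"}}'
    print(triage(sample, "proxy-b"))
```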
Thoughts? Does your host do better?