<?paul

A Customer found a Bug - Redux

Wednesday, January 20. 2010

A little over a week ago I posted about a customer discovering a bug on the WonderProxy website, and how I handled the situation. I wanted to speak to that once more, in part as a follow-up based on feedback I received, and in part to explain why I didn’t panic when a customer found a rather critical problem in my website.

First, the follow-up.

The overwhelming opinion is that I did okay in handling the issue, but I could have done better. Namely, I should have emailed the customer as soon as I confirmed the issue. That way the customer would be in the loop throughout the entire process. The fact that resolving the issue only took a few minutes was a mitigating factor, but really I failed to keep the customer in the loop, and that earlier notification could have allowed them to better plan use of my service.

In the future, I’ll behave differently. I have no intention of setting up auto-responders, but I can, and will, do better in terms of keeping customers in the loop.

Second, not Panicking.

Customer service is a big topic, and there’s a tonne of research out there on customer satisfaction, how customers interact with customer service, and so on. The body of that research shows that customers who have problems that are resolved to their satisfaction are, on the whole, happier than customers who had no problems at all. Keeping that in mind, I didn’t panic. This was an opportunity to have a very happy customer, rather than a crisis.

The research I read was admittedly based on hotels (with an interesting anecdote about discovering a customer who, over something like 207 nights of stay, had not yet paid for one, because every time they complained about something the night was comp’d). But the data resonates with me. We’ve all got a plethora of customer service horror stories, and precious few great ones. As a result, those great customer service experiences stay with us, and are often shared with friends and colleagues.

While I don’t intend to go as far as to introduce bugs into the system simply so I can fix them in a fantastic manner, I will embrace every opportunity I get to show a customer great service.

X-Host-IP

Tuesday, January 12. 2010

When you discover a problem on your website, and you run multiple web servers, one of the first questions you ask is "Which server am I on?" (often followed immediately by "Does it happen on all of them?").

This presented an unfortunate problem for WonderProxy customers, as when they connect to a server through their proxy, they are no longer able to determine which webserver provided a resource. That information is hidden because their connections take place through the proxy.

This is why we're happy to announce the introduction of a new HTTP header being served through all 18 of our proxies: X-Host-IP. This header is injected by our proxy software when handing data back to the client. This header includes the IP address of the machine that provided the resource, allowing clients to obtain this information through a variety of methods like Live HTTP Headers, Firebug, or Dragonfly.
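For anyone who would rather script the check than open a browser tool, here is a minimal Python sketch of reading the header. The function names, the proxy URL format, and the addresses in the usage note are assumptions made for illustration, not part of the WonderProxy service. The sketch only covers plain HTTP requests, since it relies on the proxy being able to add a header to the response it hands back.

```python
import urllib.request
from typing import Mapping, Optional

HEADER_NAME = "X-Host-IP"


def extract_host_ip(headers: Mapping[str, str]) -> Optional[str]:
    """Return the X-Host-IP value from a header mapping, or None if absent.

    HTTP header names are case-insensitive, so compare in lower case.
    """
    for name, value in headers.items():
        if name.lower() == HEADER_NAME.lower():
            return value.strip()
    return None


def fetch_host_ip(url: str, proxy_url: str) -> Optional[str]:
    """Request `url` through `proxy_url` and report which server answered.

    `proxy_url` is a hypothetical value such as
    "http://user:password@proxy.example.com:8080".
    """
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy_url})
    )
    with opener.open(url, timeout=10) as response:
        return extract_host_ip(dict(response.headers.items()))
```

Calling `fetch_host_ip("http://www.example.com/", proxy_url)` a few times in a row would then show whether successive requests are landing on different web servers.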

This is the first of two steps we’re taking to better inform and empower our clients when it comes to servers. Next, we’re planning to roll out a new feature that lets customers inject a hosts file[1] onto the proxy server of their choice through a web interface, so they can select which server answers a given hostname. This should be completed in the next few weeks.

[1] A hosts file (/etc/hosts on most Linux machines, and \system32\drivers\etc\hosts on Windows boxes) allows users to hard code hostname and IP combinations, and is often used for testing and development.
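To make the footnote concrete, here is a small Python sketch of what a hosts file contains and how its entries map hostnames to addresses. The sample entries, the addresses, and the `parse_hosts` helper are all made up for illustration; this is not the format or code the planned WonderProxy feature will use.

```python
from typing import Dict

# Hypothetical hosts file content: one IP address per line, followed by
# one or more hostnames that should resolve to it.
SAMPLE_HOSTS = """\
# Pin the site to one specific web server while testing
192.0.2.10   www.example.com example.com
192.0.2.20   staging.example.com   # trailing comments are allowed
"""


def parse_hosts(text: str) -> Dict[str, str]:
    """Return a hostname -> IP mapping from hosts-file text."""
    mapping: Dict[str, str] = {}
    for line in text.splitlines():
        # Drop comments and surrounding whitespace, skip empty lines.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            # Keep the first entry seen for a hostname.
            mapping.setdefault(name.lower(), ip)
    return mapping
```

With the sample above, `parse_hosts(SAMPLE_HOSTS)["www.example.com"]` gives the pinned address, which is the kind of override a tester would use to force requests to one particular server.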

A Customer found a Bug

Tuesday, January 12. 2010

So this morning I woke up to an email (well, several, but only one is relevant) from a WonderProxy customer. They had discovered a bug in the website. A rather serious bug mind you, as it stopped them from adding servers to their account, and thus prevented them from actually using the service.

Here’s how I approached the problem:

Reproduce the issue
I logged into the website as the customer and attempted to reproduce the issue, which I managed without any trouble. I tried again on the development version of the website and reproduced it there as well.

Estimate bug fix
I took a look at the code and estimated that I could resolve the bug in under ten minutes. The code base for the WonderProxy website is rather small, and bugs there can’t affect customer usage (unless they break the DB) so I wasn’t too concerned about accidentally taking something down. The short time frame for release also let me fix the bug before contacting the customer.

Fix the bug
Found the bug, fixed it. Tested it out in development, checked other accounts to look for unrelated breaks. Finally I rolled it out from development to production and checked again.

Contact Customer
Replying to the customer, I indicated that the problem was indeed on our end (the customer wasn’t using the site wrong or anything), and that it had been fixed. Additionally, I had updated the billing period to reflect the fact that the account hadn’t been usable up until the fix was applied. As a token of appreciation I upgraded the account slightly.

Overall, I found the process rather pain free. When I look at it from a customer’s perspective, I also feel like it was a reasonable problem resolution.

I’d like to delve deeper into how this problem made it onto my production website, but that’s a separate post. For now I’m really just looking at the customer service side: there was a total of 67 minutes between the initial contact (on the part of the customer) and my reply with the fix in place.

Thoughts? How did I do?

WonderProxy - Server Status

Monday, January 4. 2010

A few weeks ago I launched the WonderProxy website. One of the first decisions I made when working on it was to include the network status bar shown on the left side. That's a nearly live view of the state of all the servers in the network (as of when the page loads).

What I Did

The box with Nagios running on it runs real tests against all the proxies in the network on a regular basis. I do mean real tests: it uses curl to make a request against a script on a different web server, and that script inspects the request to ensure that the correct headers are being used. A passing test confirms that the proxy is running, is listening on the correct port, is able to authenticate users, and is able to retrieve remote documents and return them. Every minute the Nagios box collates the data for all the servers and creates a .json document. That document is then pushed to the web serving machine, ready for consumption by obliging clients.
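The real checks are done with curl under Nagios, but the shape of the idea can be sketched in Python. Everything named here is an assumption for illustration: the `check_proxy` and `build_status_document` functions, the idea that the remote test script answers with the body "OK" when it is satisfied, and the layout of the JSON document are not taken from the actual WonderProxy setup.

```python
import http.client
import json
import time
import urllib.request
from typing import Dict, Optional


def check_proxy(proxy_url: str, test_url: str, expected_body: str = "OK") -> bool:
    """Fetch `test_url` through the proxy and report whether the check passed.

    Success implies the proxy is up, listening, accepted our credentials
    (embedded in `proxy_url`), and relayed the remote document back.
    """
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy_url})
    )
    try:
        with opener.open(test_url, timeout=10) as response:
            body = response.read().decode("utf-8", "replace").strip()
            return response.status == 200 and body == expected_body
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, auth failure, malformed reply, etc.
        return False


def build_status_document(results: Dict[str, bool],
                          checked_at: Optional[float] = None) -> str:
    """Collate per-server results into one JSON document for the website."""
    if checked_at is None:
        checked_at = time.time()
    document = {
        "checked_at": int(checked_at),
        "servers": {
            name: ("up" if ok else "down")
            for name, ok in sorted(results.items())
        },
    }
    return json.dumps(document, indent=2)
```

A once-a-minute job would call `check_proxy` for each server, pass the results to `build_status_document`, and copy the output to the web server, where the status bar's JavaScript can fetch it.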

Note: I've said "what I did", but in reality the JavaScript ninjary was done by Christine, and the curl ninjary was done by my obliging sys-admin Will.

Why I Did It

Presently, I deal with upwards of twenty hosts between professional and personal projects. That's a lot. I've probably dealt with another ten with whom I’ve terminated relationships over my professional career. All in all, I’m sick of the hoops that they make me jump through to find out what's going wrong.
Here's a typical outage for me:
0 minutes - Nagios notifies me that something is wrong
+2 - I've logged into Nagios, and attempted to log into the affected system for more information, confirming the outage
+9 - I'm on the website for the host of the affected system, finding absolutely no helpful information.
+10 - I've started trying to figure out what my login credentials are for the affected system's hosting provider.
+11 - I've logged in, found no more information
+12 - File a ticket with the host
...
+35 - Nagios informs me that the service is back up
+90 - Host replies with some generic "something went wrong on our end, we fixed it, we may credit your account with $0.12 later" message.

I argue this is patently absurd. Some companies put on a front about being better about this sort of thing, but then an outage occurs and the one girl/guy with access to the twitter account is on vacation, or asleep, or at the movies so nothing happens.

When my box goes down it's either my fault, in which case I need to start working my rear end off to bring it back up, or the host is experiencing a real outage, in which case I should just sit on my hands and wait. In the typical outage listed above I don’t know which situation I’m in until 90 minutes on, though I can start poking at logs 35 minutes in.

What I Hope This Means

If you're a WonderProxy customer running tests and something breaks, you can log onto the WonderProxy website and know immediately if the proxy is broken, or if you should be looking harder at your own infrastructure.

Thoughts? Does your host do better?