The tool received some great press last week, with SwissMiss tweeting about us and Lifehacker picking it up. Accordingly, things broke. This was rather unfortunate, as I'd sort of planned for traffic spikes and this kind of thing shouldn't have happened. Two key things went wrong:

  • The pecl_http extension segfaulted on certain requests. supervisord promptly restarted the worker, which then picked up a similar job from the queue and segfaulted again. This happened enough times that supervisord gave up on the worker and left it shut down. Gearman detected that the job was never completed and re-queued it, ready to crash another worker when it came up. The issue was ultimately caused by incomplete HTTP response headers: the status line lacked the reason phrase (the "OK" in "200 OK"). A number of systems seem to omit it, and each such response crashed the worker.

    Ilia was able to patch pecl_http, and we've updated to the more recent release to obtain that fix.

  • Gearman re-submitted crashed jobs forever. This ensured that all workers eventually died, even when only a very small number of requests caused workers to crash. This is silly for a few reasons. First, it allowed this outage to happen. Second, these requests are time-sensitive: there's no point in trying something thirty times, as by then the web client has given up on the request and the data will never be used.

    We've re-configured Gearman to retry a job only once. This should allow random issues to be retried, but prevent pervasive problems from crashing the system. If you're running Gearman, I'd strongly suggest supplying some sort of maximum retry value using --job-retries=N.
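To make the first failure concrete, here is a minimal Python sketch of a status-line parser that tolerates a missing reason phrase. It is not the pecl_http code (which is a C extension) and not the patch that was applied; the names `StatusLine` and `parse_status_line` are made up for illustration, and it only handles HTTP/1.x-style status lines.

```python
import re
from typing import NamedTuple


class StatusLine(NamedTuple):
    version: str
    code: int
    reason: str  # empty string when the server omitted the reason phrase


# "HTTP/x.y", a space, a three-digit code, then an OPTIONAL space and reason.
_STATUS_RE = re.compile(r"^HTTP/(\d\.\d) (\d{3})(?: (.*))?$")


def parse_status_line(line: str) -> StatusLine:
    """Parse a status line, treating a missing reason phrase as empty."""
    match = _STATUS_RE.match(line.rstrip("\r\n"))
    if match is None:
        # Malformed input becomes an ordinary, catchable error, not a crash.
        raise ValueError(f"malformed status line: {line!r}")
    version, code, reason = match.groups()
    return StatusLine(version, int(code), reason or "")


print(parse_status_line("HTTP/1.1 200 OK\r\n"))
# StatusLine(version='1.1', code=200, reason='OK')
print(parse_status_line("HTTP/1.1 200\r\n"))
# StatusLine(version='1.1', code=200, reason='')
```

The point of the sketch is that a short or odd status line ends up as either a parsed value or an exception the worker can catch and report, so one strange upstream response fails one job rather than the process.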

We've come through the outage and fixed both elements of the problem (though fixing either one alone would have been enough to keep an identical issue from taking the system down). We're also looking at better ways of monitoring this, so that we hear about problems sooner.
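The claim that either fix alone would have been enough can be checked with a toy simulation in Python. Everything here is an assumption made for illustration: the `simulate` function and its parameters are hypothetical, workers take jobs round-robin, a failed job goes to the back of the queue, and `job_retries` counts retries after the first attempt (gearmand's own counting for --job-retries may differ).

```python
from collections import deque


def simulate(jobs, workers, max_restarts, job_retries, parser_patched):
    """Return (workers still running, jobs completed, jobs dropped).

    jobs: list of booleans, True meaning the job triggers the parser crash.
    max_restarts: restarts the supervisor allows each worker before giving up.
    job_retries: retries allowed after the first attempt; None means unlimited.
    """
    queue = deque((is_poison, 0) for is_poison in jobs)  # (job, failures so far)
    restarts_left = [max_restarts] * workers
    alive = set(range(workers))
    completed = dropped = 0
    turn = 0

    while queue and alive:
        worker = sorted(alive)[turn % len(alive)]  # round-robin over live workers
        turn += 1
        is_poison, failures = queue.popleft()

        if is_poison and not parser_patched:
            # The worker crashes; the supervisor restarts it or gives up on it.
            if restarts_left[worker] == 0:
                alive.discard(worker)
            else:
                restarts_left[worker] -= 1
            failures += 1
            if job_retries is None or failures <= job_retries:
                queue.append((is_poison, failures))  # job server re-queues it
            else:
                dropped += 1  # retry budget spent, job is removed
        else:
            completed += 1

    return len(alive), completed, dropped


jobs = [True] + [False] * 5  # one bad job among five good ones

print(simulate(jobs, 3, 3, None, False))  # the outage:       (0, 5, 0)
print(simulate(jobs, 3, 3, 1, False))     # retry cap only:   (3, 5, 1)
print(simulate(jobs, 3, 3, None, True))   # parser fix only:  (3, 6, 0)
```

With unlimited retries and an unpatched parser, a single bad job eventually exhausts every worker's restart budget. Capping retries costs one dropped job but leaves all workers running, and fixing the parser removes the crash entirely.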

Our apologies for the outage.

Hi, I’m Paul Reinheimer, a developer working on the web.
