The tool got some great press last week, with SwissMiss tweeting about us and Lifehacker picking it up. Accordingly, things broke. This was rather embarrassing, as I'd planned for traffic spikes and this sort of thing shouldn't have happened. Two key things went wrong:
The pecl_http extension segfaulted on certain requests. supervisord promptly restarted the worker, which then picked up a similar job from the queue and segfaulted again. This happened enough times that supervisord gave up on the worker and left it shut down. Gearman detected that the job was never completed and re-queued it, ready to crash another worker when one came up. The root cause was HTTP responses whose status lines omitted the reason phrase; a number of servers leave it out, and those responses crashed the worker.
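For illustration, the offending responses began with a status line like `HTTP/1.1 200` rather than the usual `HTTP/1.1 200 OK`. A tolerant parser treats the reason phrase as optional. Here's a minimal sketch in Python (the actual fix was in pecl_http's C code; this just shows the parsing behavior a client needs):

```python
def parse_status_line(line):
    """Parse an HTTP status line, tolerating a missing reason phrase."""
    parts = line.rstrip("\r\n").split(" ", 2)
    if len(parts) < 2:
        raise ValueError("malformed status line: %r" % line)
    version, status = parts[0], int(parts[1])
    # Some servers send "HTTP/1.1 200" with no reason phrase at all;
    # treat the phrase as optional rather than failing (or crashing).
    reason = parts[2] if len(parts) > 2 else ""
    return version, status, reason
```

The key choice is simply not to assume a third field exists: a client that indexes into the split result unconditionally is exactly the kind of code that falls over on these responses.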
Ilia was able to patch pecl_http, and we've updated to the more recent release to obtain that fix.
Gearman re-submitted crashed jobs forever. Even a very small number of crash-inducing requests could therefore eventually take down every worker. This is silly for a few reasons. First, a single bad job shouldn't be able to cascade like that. Second, these requests are time-sensitive: there's no point retrying one thirty times, as by then the web client has given up on the request and the data will never be used.
We've re-configured the gearman system to retry a job only once. This still allows transient failures to be retried, but prevents a pervasive problem from crashing the whole system. If you're running gearman, I'd strongly suggest setting a maximum retry count with --job-retries=N.
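For reference, the retry cap is set when launching gearmand. A minimal invocation might look like this (the --job-retries flag is gearmand's own; the daemon flag is just an illustrative addition):

```shell
# Retry a failed job at most once before discarding it
gearmand --job-retries=1 -d
```

Note that this is a server-side setting, so it applies to every function registered with that gearmand instance, not just the crash-prone ones.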
We've come through the outage and fixed both elements of the problem (though fixing either one alone would have prevented an identical issue from recurring). We're also looking at better monitoring so we learn of problems sooner.
Our apologies for the outage.