We've made significant updates to the infrastructure supporting Where's it up this year. Many of these changes were necessitated by some quick growth from a few thousand tests per day to several million. Frankly without the first two, I'm not sure we could have remained stable much past a million tests per day.
I blogged about our switch to MongoDB for Where’s it Up a while back, and we’ve been pretty happy with it. When designing our schema, I embraced the “schema free world” and stored all of the results inside a single document. This is very effective when the whole document is created at once, but it can be problematic when the document will require frequent updates. Our document process was something like this:
- User asks Where’s it up something like: Test google.com using HTTP, Trace, DNS, from Toronto, Cairo, London, Berlin.
- A bare bones Mongo document is created identifying the user, the URI, tests, and cities.
- One gearman job is submitted for each City-Test pair against the given URI. In this instance, that’s 12 separate jobs.
- Gearman workers pick up the work, perform the work, and update the document created earlier using $set
This is very efficient for reads, but that doesn’t match our normal usage: users submit work, and poll for responses until the work is complete. Once all the data is available, they stop. We’ve optimized for reads in a write heavy system.
The situation for writes is far from optimal. When Mongo creates a document, under exact fit allocation, it considers the size of the data being provided, and applies a padding factor. The padding factor maxes out at 1.99x. Because our original document is very small, it’s essentially guaranteed to grow by more than 2x, probably with the first Traceroute result. As our workers finish, each attempting to add additional data into the document, it will need to be moved, and moved again. MongoDB stores records contiguously on disk, so every time it needs to grow the document it must read it off disk, and drop it somewhere elsewhere. Clearly a waste of I/O operations.
It’s likely that Power of 2 sized allocations would better suit our storage needs, but they wouldn’t erase the need for moving entirely, just reduce the number of times it’s done.Solution:
Normalize, slightly. Our new structure has us storing the document framework in the results collection, and each individual work result from the gearman worker is stored in results_details in its own document. This is much worse for us on read: we need to pull in the parent document, then the child documents. On write, we’re saved the horrible moves we were facing previously. Again, the usage we’re actually seeing is: Submit some work, poll until done, read complete dataset, never read again. So this is much better overall.
We had some slight work to manage to make our front end handle both versions, but this change has been very smooth. We continue to enjoy how MongoDB handles replica sets, and automatic failover.
We’ve recently acquired a new client which has drastically increased the number of tasks we need to complete throughout the day: checking a series of URLs using several tests, from every location we have. This forced us to re-examine how we’re managing our gearman workers to improve performance. Our old system:
- Supervisor runs a series of workers, no more than ~204. (Supervisor’s use of python’s select() limits it to 1024 file descriptors, which is allows for ~204 workers)
- Each worker, written in PHP, connects to gearman stating it’s capable of completing all of our work types
- When work arrives, the worker runs a shell command that executes the task on a remote server using a persistent ssh tunnel. Waits for the results, then shoves them into MongoDB.
This gave us a few problems:
- Fewer workers than we’d like, no ability to expand
- High memory overhead for each worker
- The PHP process spends ~99.9% of its time waiting, either for a new job to come in, or for the shell command it executed to complete.
- High load with 1 PHP process per executing job (that actually does work)
We examined a series of options to replace this, writing a job manager as a threaded Java application was seriously considered. It was eventually shot down due to the complexities of maintaining another set of packages, and due to the reduced number of employees who could help maintain it. Brian L Moon’s Gearman Manager was another option, but it left us running a lot of PHP we weren’t using. We could strip down PHP to make it smaller, but it wouldn’t solve all our problems.
Minimizing the size of PHP is pretty easy. Your average PHP install probably includes many extensions you’re not using on every request, doubly so if you’re looking at what’s required by your gearman workers. Look at ./configure --disable-all as a start.Solution:
Our solution involves some lesser-used functions in PHP: proc_open, stream_select, and occasionally posix_kill.
We set Gearman to work in a non-blocking fashion. This allows us to poll and check for available work without blocking until it becomes available.
Our main loop is pretty basic:
- Check to see if new work is available, and if so start it up.
- Poll currently executing work to see if there’s data available, if so record it.
- Check for defunct processes that have been executing for too long, kill those.
- Kill the worker if it's been running too long.
Check for available work
Available work is fired off using proc_open() (our work tends to look something like this: sudo -u lilypad /etc/wheresitup/shell/whereisitup-trace.sh seattle wonderproxy.com). We save the read pipe for later access, and set it to non-blocking. We only allow each non-blocking worker to execute 40 jobs concurrently, though we plan to increase the limit after more testing.
Poll currently executing work for new data
Check for available data using stream_select() using a very low poll time. Some of our tests, like dig or host, tend to return the data in one large chunk; others, like traceroute, return data slowly over time. Longer running processes (traceroute seems to average 30 seconds) will have their data accumulated over time and appended together. If the read pipe is done (checked with feof()) close them off, and close the process (proc_close)
Check for defunct processes
We track the last successful read from each process, if it was more than 30 seconds ago we close all the pipes then recursively kill the process tree.
Kill the worker
Finally, the system keeps track of how much work it’s done. Once it hits a preset limit, it stops executing new jobs. Once its current work is done, it dies. Supervisor automatically starts a new worker.
Our original site wasn’t actually hugely problematic. We had used a large framework, which often left me feeling frustrated. I’ve lost days fighting with it, on everything from better managing 404s to handling other authentication methods. It was the former that lead to me rage-quitting the framework and installing Bullet. That said, we’re not afraid to pull in a library to solve specific problems, we’ve been happily using Zend_Mail for a long time.Solution:
We’ve re-written our site to use the Bullet framework. Apart from appreciating micro frameworks in generally, they’re particularly apt for delivering APIs. We’ve had great success writing tests under Bullet, and being able to navigate code in minutes rather than hours is an everlasting bonus. This leaves the code base heavily weighted to business logic, rather than a monolithic framework.