<?paul

Auth across 101 servers in 53 countries

Wednesday, July 3. 2013

Authentication has been an interesting problem at WonderProxy: we currently have 101 active public servers, and hundreds of active users who each have access to a particular subset of those servers. Every user has the ability to add new servers to their account at will, and expects newly-added servers to work quickly. On our end, when a user’s account expires, their credentials need to be removed promptly.

Centralized Auth

When we started, we created a centralized authentication scheme: each proxy instance called an authentication URL when a user attempted to connect, and successful responses were cached for a time. This was easy to write, and allowed us to maintain a single canonical copy of our authentication: the database.

It did however give us two big problems:

Massive single point of failure
High latency for distant locations

The single point of failure was a looming problem that thankfully only raised its head twice. Our central server sat in a rather nice data centre, with a top-end provider, but it was still a huge risk anytime work was being done on the server or its software. As the network grew, this clearly needed to change.

It was actually the latency issue that prompted us to move to a new solution. Users of our Tokyo proxy reported problems where requests were taking too long to execute, or simply timing out. We eventually isolated the cause as being timeouts on authentication, exacerbated by some packet loss on the ocean crossing.

Distributed Auth

Our solution involved creating two new columns in our servers table: users_actual, and users_desired. These integers represent the actual version of the authentication file on that server, and the desired version. When a user adds a server to their account, we update that server’s row, setting users_desired = users_actual + 1. When a contract expires, we update the servers that contract had access to in a similar manner.

In addition, we have a cron job running every minute, identifying servers where users_desired > users_actual. The cron job finds users with access to the server in question, pushes a new authentication file to those servers, and updates their users_actual to match users_desired when the operation returns. This is managed within a MySQL transaction to avoid race conditions.

On the administration side, we have a button on each contract management page that allows us to update the users_desired for all servers accessible to that contract’s users. This extra push is rarely used, but helpful in some weird cases.

By managing things with auth versions (rather than simply pushing credentials to all the servers as needed), we’re able to handle servers being down during the push. When you’re managing a network with 70 different suppliers, they can’t all be winners, so this happens with some frequency. If we simply pushed on demand we’d need a secondary mechanism to handle re-pushing later to the recovering server. By using auth versions, we have one mechanism that handles authentication. By setting users_desired = users_actual + 1, we also avoid updating a server repeatedly after it comes online because multiple updates were queued.

Conclusion

This distributed mechanism has worked quite well since rollout, and it’s becoming easier to manage with more granular options now available in our administration tools. I haven’t felt even remotely tempted to change this, which I feel is often a great sign of stability for a system. During our recent migration of properties to a single co-located server here in Toronto, having distributed auth was a great relief; even if things went poorly, our paying customers would still be able to access the network.

Comments »

No Trackbacks

Wouldn't it be easier to create a bunch of MySQL read-only slaves and set up failover to master if slave fails?

#1 Pies (Homepage) on 2013-07-04 07:18 (Reply)

Hi Pies,

To be honest, I'm not sure that MySQL is at a point where I'd call setting up multiple read-only slaves and have auto-fail over managed reliably and automatically across data centers. We're also really only interested in our User data, while predictably the database in question has a lot of other tables.

Managing fail over for the clients (where they're unable to reach the first database server they attempt) also sounds possible, but like a decent amount of work. Placing the data in the machine that needs the answer pretty effectively minimizes latency, a number we've got a pretty good handle on thanks to our ping information: https://wondernetwork.com/pings/

paul

#2 Paul Reinheimer on 2013-07-08 14:24 (Reply)

Name
Email
Homepage
In reply to
Comment	Enclosing asterisks marks text as bold (word), underscore are made via _word_. Standard emoticons like :-) and ;-) are converted to images.

Auth across 101 servers in 53 countries

Centralized Auth

Distributed Auth

Conclusion

Comments »

Hi, I’m Paul Reinheimer, a developer working on the web.

Top Posts

Search