Authentication has been an interesting problem at WonderProxy: we currently have 101 active public servers, and hundreds of active users who each have access to a particular subset of those servers. Every user has the ability to add new servers to their account at will, and expects newly-added servers to work quickly. On our end, when a user’s account expires, their credentials need to be removed promptly.
When we started, we created a centralized authentication scheme: each proxy instance called an authentication URL when a user attempted to connect, and successful responses were cached for a time. This was easy to write, and allowed us to maintain a single canonical copy of our authentication: the database.
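The shape of that first scheme can be sketched roughly as follows. This is an illustration in Python, not the code the proxies ran: `CachedAuthenticator`, the `check_remote` callable standing in for the call to the authentication URL, and the five-minute cache lifetime are all assumptions made for the example.

```python
import hashlib
import time
from typing import Callable, Dict


class CachedAuthenticator:
    """Caches successful central-auth lookups for a fixed time (hypothetical sketch)."""

    def __init__(
        self,
        check_remote: Callable[[str, str], bool],  # stand-in for the auth URL call
        ttl_seconds: float = 300.0,                # assumed cache lifetime
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._check_remote = check_remote
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: Dict[str, float] = {}

    @staticmethod
    def _key(username: str, password: str) -> str:
        # Store a digest rather than the raw credentials.
        return hashlib.sha256(f"{username}\0{password}".encode()).hexdigest()

    def is_allowed(self, username: str, password: str) -> bool:
        key = self._key(username, password)
        expires_at = self._expires_at.get(key)
        if expires_at is not None and self._clock() < expires_at:
            return True  # cached success, no round trip
        allowed = self._check_remote(username, password)
        if allowed:
            self._expires_at[key] = self._clock() + self._ttl
        else:
            self._expires_at.pop(key, None)  # only successes are cached
        return allowed
```

Every cache miss in a design like this is a round trip to the central server, which is where both of the problems below come from.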
This scheme did, however, give us two big problems:
- Massive single point of failure
- High latency for distant locations
The single point of failure was a looming problem that thankfully only reared its head twice. Our central server sat in a rather nice data centre, with a top-end provider, but it was still a huge risk any time work was being done on the server or its software. As the network grew, this clearly needed to change.
It was actually the latency issue that prompted us to move to a new solution. Users of our Tokyo proxy reported problems where requests were taking too long to execute, or simply timing out. We eventually isolated the cause as being timeouts on authentication, exacerbated by some packet loss on the ocean crossing.
Our solution involved creating two new columns in our servers table: users_actual and users_desired. These integers represent the version of the authentication file currently on that server and the version it should have. When a user adds a server to their account, we update that server’s row, setting users_desired = users_actual + 1. When a contract expires, we update the servers that contract had access to in the same way.
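As a minimal sketch of the two columns and the update described above: the example uses Python's built-in sqlite3 module as a stand-in for our MySQL database, and everything other than the servers table and the users_actual and users_desired columns (the `id` and `name` columns, the `mark_servers_stale` helper) is invented for illustration.

```python
import sqlite3
from typing import Iterable

# Assumed minimal shape of the servers table; the real one has more columns.
SCHEMA = """
CREATE TABLE servers (
    id            INTEGER PRIMARY KEY,
    name          TEXT    NOT NULL,
    users_actual  INTEGER NOT NULL DEFAULT 0,  -- version of the auth file on the server
    users_desired INTEGER NOT NULL DEFAULT 0   -- version the server should have
)
"""


def mark_servers_stale(conn: sqlite3.Connection, server_ids: Iterable[int]) -> None:
    """Flag servers whose auth file must be rebuilt (account change or expiry)."""
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "UPDATE servers SET users_desired = users_actual + 1 WHERE id = ?",
            [(server_id,) for server_id in server_ids],
        )
```

Calling `mark_servers_stale(conn, [1])` twice in a row leaves server 1 exactly one version behind, not two, because the new value is derived from users_actual rather than incremented.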
In addition, we have a cron job that runs every minute and identifies servers where users_desired > users_actual. For each such server, the job finds the users with access to it, pushes a new authentication file to it, and sets users_actual to match users_desired when the operation returns. This is managed within a MySQL transaction to avoid race conditions.
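One pass of such a job might look roughly like the sketch below. It is not our production code: it again uses sqlite3 in place of MySQL, assumes a hypothetical `server_users(server_id, username)` table for looking up who has access, and takes the actual file transfer as a caller-supplied `push` function, since the transport is not described here.

```python
import sqlite3
from typing import Callable, List


def sync_stale_servers(
    conn: sqlite3.Connection,
    push: Callable[[str, List[str]], None],  # hypothetical: sends the auth file
) -> List[str]:
    """Push fresh auth files to every stale server; return the names synced.

    `conn` is assumed to be in autocommit mode (isolation_level=None) so the
    explicit BEGIN/COMMIT below controls each transaction.
    """
    synced: List[str] = []
    stale = conn.execute(
        "SELECT id, name FROM servers "
        "WHERE users_desired > users_actual ORDER BY id"
    ).fetchall()
    for server_id, name in stale:
        # Take the write lock first so no account change slips in between
        # building the user list and recording the new version.
        conn.execute("BEGIN IMMEDIATE")
        try:
            users = [
                row[0]
                for row in conn.execute(
                    "SELECT username FROM server_users "
                    "WHERE server_id = ? ORDER BY username",
                    (server_id,),
                )
            ]
            push(name, users)  # expected to raise if the server is unreachable
            conn.execute(
                "UPDATE servers SET users_actual = users_desired WHERE id = ?",
                (server_id,),
            )
            conn.execute("COMMIT")
            synced.append(name)
        except Exception:
            conn.execute("ROLLBACK")  # server stays stale; the next run retries
    return synced
```

A failed push leaves users_actual untouched, so the server is simply picked up again on the next run.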
On the administration side, we have a button on each contract management page that allows us to update the users_desired for all servers accessible to that contract’s users. This extra push is rarely used, but helpful in some weird cases.
By managing things with auth versions (rather than simply pushing credentials to all the servers as needed), we’re able to handle servers being down during the push. When you’re managing a network with 70 different suppliers, they can’t all be winners, so this happens with some frequency. If we simply pushed on demand, we’d need a secondary mechanism to handle re-pushing later to the recovering server. By using auth versions, we have one mechanism that handles authentication. By setting users_desired = users_actual + 1, we also avoid updating a server repeatedly when it comes back online: however many changes accumulated while it was down, the two versions differ by only one, and a single push catches the server up.
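A toy simulation makes the last point concrete. The `Server` class here is purely illustrative (no database, no network), with a `pushes` counter added only to show how many authentication files the server ends up receiving.

```python
from dataclasses import dataclass


@dataclass
class Server:
    users_actual: int = 0
    users_desired: int = 0
    pushes: int = 0  # how many auth files this server has received

    def mark_stale(self) -> None:
        # Same rule as above: derive the target from users_actual.
        self.users_desired = self.users_actual + 1

    def sync(self, online: bool) -> None:
        # One cron pass for this server; does nothing while it is unreachable.
        if online and self.users_desired > self.users_actual:
            self.pushes += 1
            self.users_actual = self.users_desired


server = Server()
for _ in range(3):         # three account changes while the server is down
    server.mark_stale()
    server.sync(online=False)
server.sync(online=True)   # server recovers: a single push
server.sync(online=True)   # versions match, nothing left to do
print(server.pushes, server.users_actual, server.users_desired)  # 1 1 1
```

A design that queued one push per change would have replayed three pushes here; the version pair collapses them into one.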
This distributed mechanism has worked quite well since rollout, and it’s becoming easier to manage with more granular options now available in our administration tools. I haven’t felt even remotely tempted to change this, which I feel is often a great sign of stability for a system. During our recent migration of properties to a single co-located server here in Toronto, having distributed auth was a great relief; even if things went poorly, our paying customers would still be able to access the network.