If you need to quickly exchange updates between multiple servers, one option is to use a shared database. As always, the best technology choice depends on the specifics of your problem. And sometimes, you find a great solution that you only wish you’d discovered earlier. In this article, we’ll explain how we designed a system for executing the large amounts of code that candidates and interviewers write on our cloud platform. We’ll also discuss how we recently moved the logic for this use case to Redis after experiencing performance issues with Meteor observers on MongoDB.
Quick note: our platform still allows end users to create and solve MongoDB tasks, useful for testing NoSQL skills!
Background: Running code in the cloud
One of the functions that CodeSignal’s backend performs a lot is, perhaps unsurprisingly, running code. We’re an interview tool, we do certifications, and we offer interview practice. All of that involves asking people to write code in our web IDE and then run it to see the output or check against test cases. Everything works in your browser, and we run the code with special microservices that we call coderunners.
How do we send run requests to a coderunner and then show the output of the executed code in our IDE? Originally, our approach was to treat the coderunners sort of like the cash registers in a grocery store: a server with a request, like a shopper with a grocery cart, would wait until a coderunner opened up and then try to snag it before anyone else did.
There’s been some research done on this free-for-all approach, and the consensus is that while it’s good at keeping the average time down, it tends to create extremes—some requests get processed right away, while others are unlucky and end up waiting for a long time. In pursuit of more fairness, better guarantees, and better use of CPU and memory resources, we recently transitioned to a new design.
Common queue architecture
Our new approach relies on what’s called a “common queue.” Instead of fending for themselves, the servers that receive requests from the IDE (we’ll call these “A” servers) form an orderly line that’s managed by a central “B” server. We refer to the B server as the “matchmaker” because it takes care of a few things:
- Routing: We have different coderunners for different types of requests, and each machine can only process one request at a time. The B server looks at the type of request and which coderunners are currently available, and uses this to determine where the request should be routed.
- Prioritization/Ordering: The B server also establishes the order for processing the requests. In general we want to ensure that requests are addressed in first-in, first-out order, but there’s some added complexity because certain request types have different priorities (e.g. client requests vs. arcade mode submits). And sometimes all the coderunners that can handle a request will be occupied, and we’ll have no choice but to wait.
At this point, you might be wondering, why do we even have the A servers at all? Why not just send all the run requests directly to the B server? The primary reason is that these requests can be pretty big (e.g. 5 MB). Large requests can occur when the code needs to execute against a test case containing a large array. Another example is our database tasks, which let you load and run queries against large SQL data.
In order to do the routing quickly, the B server processes all requests in memory—and we can only have one B server, acting as the single source of truth. The result is that we can’t afford to overwhelm the B server and slow it down with large requests. So along with various CPU optimizations, we use the A servers for load balancing. They only communicate a small part of the full request to the B server: the part that’s needed for prioritization and request routing.
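To make the idea concrete, here’s a minimal sketch of what “sending only a small part of the request” could look like. The field names and shapes here are our own invention for illustration, not CodeSignal’s actual wire format:

```javascript
// Hypothetical sketch: extract only the routing-relevant fields from a
// potentially large run request before forwarding them to the B server.
// Field names are illustrative, not the real schema.
function toRoutingEnvelope(runRequest, ownerId) {
  return {
    ownerId,                           // which A server holds the full payload
    type: runRequest.type,             // used to pick a compatible coderunner
    priority: runRequest.priority,     // e.g. client request vs. arcade submit
    receivedAt: runRequest.receivedAt, // timestamp, used for FIFO ordering
  };
}

// A multi-megabyte request body stays on the A server; only a tiny
// envelope travels to B.
const envelope = toRoutingEnvelope(
  { type: 'sql', priority: 1, receivedAt: 1700000000000, body: 'x'.repeat(5e6) },
  'A1'
);
```

The large `body` never leaves the A server; B sees just enough to route and prioritize.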
To sum up the flow:
- An A server, let’s call it A1, gets a run request.
- It passes a small amount of information—basically, the request type and when the request was received—to the central B server.
- Based on the priority and the available coderunners for this type of request, B either matches A1’s request right away, or it has to wait.
- Eventually, there’s a match (B checks the timestamp on A1’s request to make sure that no requests are left waiting around for too long). Then B lets A1 know which coderunner it should use to execute the request.
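A toy version of the matching step might look like the following. The data shapes and priority values are purely illustrative (lower number = higher priority here), not the real implementation:

```javascript
// Toy sketch of the B server's matching step: among requests whose type has a
// free coderunner, pick the highest-priority one, breaking ties by arrival time
// so that no request waits around longer than necessary.
function pickNextRequest(pendingRequests, freeRunnersByType) {
  const candidates = pendingRequests.filter(
    (r) => (freeRunnersByType[r.type] || []).length > 0
  );
  if (candidates.length === 0) return null; // all compatible coderunners busy: wait
  candidates.sort((a, b) =>
    a.priority !== b.priority ? a.priority - b.priority : a.receivedAt - b.receivedAt
  );
  const next = candidates[0];
  const runner = freeRunnersByType[next.type].shift(); // claim a coderunner
  return { request: next, runner };
}

const match = pickNextRequest(
  [
    { id: 'r1', type: 'sql', priority: 2, receivedAt: 100 },
    { id: 'r2', type: 'js', priority: 1, receivedAt: 200 },
  ],
  { js: ['runner-7'], sql: [] }
);
// r1 arrived earlier, but all sql coderunners are busy, so r2 is matched.
```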
When we look at the data, we find that this approach ensures that our request latency distribution is much less variable. And we actually get the same average latency as before—it’s a win-win.
MongoDB / Meteor observers -> Redis Pub/Sub API
To make this architecture possible, we needed to pass information back and forth between the A servers and the B server with as little latency as possible. It made sense to use a shared database to achieve this. For several months, we tested out MongoDB + Meteor observers, but we recently migrated to Redis where we’re seeing much better performance.
Issues we encountered with MongoDB / Meteor observers
MongoDB is a nonrelational database that uses JSON-like documents to store data. We use MongoDB across our backend and decided to try it as the shared database for the A and B servers. We used Meteor for those servers, and Meteor’s API for MongoDB provides observers, which are designed for keeping up with changes to the database. However, as we discovered, they aren’t ideal for real-time communication.
How do Meteor observers work? When you execute a query, it gives you the current state for your selection; if you want to know when the state changes in the future, you can put an observer on your query. The observer will give you callbacks when there are updates, like if a new document was added to or removed from your query.
In our case, we used observers in a few ways. The B server would run a query + observer to get all the requests in the database, ensuring that when a new request document appeared, the callback would be triggered and B could start to handle the request. And on the other side, the A servers would run a query + observer for their request document, to know when B had found a match to a coderunner.
// Mongo observers approach
// In A servers
app.post('/process-request', (newRequest, result) => {
  db.requests.insert({ ...newRequest, status: 'pending-match', ownerId: OWN_ID });
});

const pendingExecutionRequestsCursor = db.requests.find({ status: 'pending-execution', ownerId: OWN_ID });
pendingExecutionRequestsCursor.observe({
  added(matchedRequest) {
    // start processing the matched request
  },
});
// In B servers
const pendingMatchRequestsCursor = db.requests.find({ status: 'pending-match' }, { fields: { requestInfo: true } });
pendingMatchRequestsCursor.observe({
  added(request) {
    // start matching the request; once a coderunner is chosen, hand it back
    db.requests.update(request._id, { $set: { status: 'pending-execution', matchingInfo } });
  },
});
Early on with this implementation, we started to notice performance issues. The observers could become slow in unpredictable ways. For example, after a request was added to the database, the B server might not get a callback for 5 or 10 seconds(!). We couldn’t find any latency guarantees from MongoDB or Meteor for observers, and restarting the DB or the observers didn’t fix the issue.
Finding success with Redis
To address these performance problems, we decided to move the shared DB to Redis. Redis, which is open source (and pronounced “red-iss,” in case you were wondering), is an in-memory database—it stores all the data in RAM. Because of that, manipulating the data is faster than using MongoDB or MySQL, and there are more guarantees on performance.
The equivalent of observers for Redis is the Pub/Sub API. You’ve probably heard of pub/sub logic in other contexts, like RSS feeds for example. It’s the same idea: You can create a channel and publish to it, and other clients can subscribe to the channel and get notified each time something is published there. Using the Redis Pub/Sub API, we actually don’t need to insert into a database or use queries at all.
The way it works is that the A servers publish their request information into one main channel, and the B server subscribes to get all their updates. Each A server is also subscribed to its own dedicated side channel. When the B server finds a match for A1’s request, for example, it publishes this info to A1’s channel.
// Redis pub/sub approach
// In A servers
const pendingMatchRequestsPubSub = new PendingMatchRequestsPubSub();
const pendingExecutionRequestsPubSub = new PendingExecutionRequestsPubSub();

app.post('/process-request', (newRequest, result) => {
  const { requestInfo } = newRequest;
  pendingMatchRequestsPubSub.publish(requestInfo);
});

pendingExecutionRequestsPubSub.subscribe(({ matchingResult }) => {
  // start processing the matched request
});
// In B servers
const pendingMatchRequestsPubSub = new PendingMatchRequestsPubSub();
const pendingExecutionRequestsPubSub = new PendingExecutionRequestsPubSub();

pendingMatchRequestsPubSub.subscribe(({ requestInfo, ownerId }) => {
  // match the request to a coderunner, producing matchingResult
  pendingExecutionRequestsPubSub.publish(matchingResult, ownerId);
});
We’ve been using Redis for a few months and the benchmarks are really good. We had a stress test recently when a customer held a global recruiting event, and the system passed with flying colors. In addition to the performance benefits, we found that implementation-wise, Redis was much simpler than using Meteor observers on MongoDB. With observers, you have to insert request documents into the DB and then update and delete them as they’re processed. All of that (and more) is handled implicitly by the Redis Pub/Sub logic, so you don’t have to worry about it.
Limitations of Redis
The Redis Pub/Sub API can be a great choice if you’re having performance issues with Meteor observers in MongoDB or just want a more lightweight solution for a similar problem to ours. But it’s important to be aware of the limitations of Redis as well.
Because Redis serves all data from memory and uses disk only for persistence, it really only works for small objects (a hard requirement is that your dataset can’t be larger than memory). The data that A and B share about requests is pretty small, on the order of 1-2 KB, so Redis was a good fit for our use case.
You could always use a hybrid approach if you need to store a lot of data long term. According to the Redis FAQ, “a common design pattern involves taking very write-heavy small data in Redis (and data you need the Redis data structures to model your problem in an efficient way), and big blobs of data into an SQL or eventually consistent on-disk database.” Since we’re really just using Redis for messaging, this isn’t an issue for us. As long as the messages are received, we’re good, and we never need to look at that data again.
Redlock algorithm
One final aspect of the architecture that we haven’t talked about yet is how we handle the B server failing. If an A server goes down, there are a lot of other servers ready to pick up the requests and we have some mechanisms for this. But if the B server fails, that’s a bigger problem, since our request architecture can’t function without the matchmaker.
If the B server gets unhealthy or crashes, what we want to do is quickly promote one of the A servers to become the new B server. But which A server? And how will they know that B has failed? We solve this by implementing something similar to the “single instance” version of the Redlock algorithm described in the Redis documentation.
Essentially, we allocate a resource that represents the idea of “being the B server.” The B server SETs this resource to a random value that acts as its key. This works as a lock, because the servers all have logic saying they can only re-SET the resource if the stored value matches their own key. Meanwhile, we specify an expiration time on the SET call: if the resource isn’t SET again within X milliseconds, the key is removed by Redis.
So B is calling SET on the resource “being the B server” over and over with its key, and if at some point it dies or slows down substantially, it won’t be able to call SET within the time limit and the key will be removed. Meanwhile, all of the A servers are periodically checking the contents of the “being the B server” resource. If an A server sees that the key has been removed, it tries to SET the object with its own unique key. Whichever A server manages to do this first now holds the “being the B server” lock and gets a promotion. This ensures that our system stays healthy even if the original matchmaker fails.
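A condensed sketch of this single-instance lock is below. An in-memory stand-in replaces the real Redis commands (`SET key value NX PX ttl`, i.e. set only if absent, with a millisecond expiry) so the example is self-contained; with real Redis these would be commands issued over a client connection, and time would come from the server rather than a parameter:

```javascript
// In-memory stand-in for the two Redis operations the lock needs.
const store = new Map(); // key -> { value, expiresAt }

// SET key value NX PX ttl: succeeds only if the key is absent or expired.
function setNxPx(key, value, ttlMs, now) {
  const entry = store.get(key);
  if (entry && entry.expiresAt > now) return false; // lock still held by someone
  store.set(key, { value, expiresAt: now + ttlMs });
  return true;
}

// The holder refreshes the lock only if the stored value matches its own key.
function refresh(key, value, ttlMs, now) {
  const entry = store.get(key);
  if (!entry || entry.value !== value || entry.expiresAt <= now) return false;
  store.set(key, { value, expiresAt: now + ttlMs });
  return true;
}

// B takes the "being the B server" lock and keeps refreshing it...
setNxPx('being-the-B-server', 'token-B', 1000, 0);
refresh('being-the-B-server', 'token-B', 1000, 500); // now expires at 1500
// ...then B dies. While the lock is alive, an A server's takeover attempt fails:
const tooEarly = setNxPx('being-the-B-server', 'token-A2', 1000, 1200); // false
// Past the expiry, the first A server to SET wins the promotion:
const promoted = setNxPx('being-the-B-server', 'token-A1', 1000, 2000); // true
```

The random per-server token is what prevents a slow, not-quite-dead B from refreshing a lock that another server has since taken over.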
Conclusion
If you’re ever in a similar situation with Meteor observers and MongoDB, we hope this post is helpful! We look forward to continuing to share more details about the technologies we use at CodeSignal. If you found this post interesting, we’re always looking for passionate SWEs to join the team and help us build the world’s best interview tool—check out our open roles.
Aram Drambyan is a Software Engineer at CodeSignal, mainly focused on infrastructure-heavy projects. He’s like the Dark Knight doing the underground work that saves the world 😉