I maintain Octave Online, a web UI for the GNU Octave computational engine. The site is enjoyed by tens of thousands of users from all around the world, and well over a million commands have been executed since the beginning of this year. What kind of architecture and infrastructure supports a web site of this scale?
Over the past two years, Octave Online has gone through several different iterations of back-end architecture, with each update improving the performance and decreasing the bounce rate. In this blog post, I'm going to give an overview of the current architecture that makes Octave Online robust and scalable to meet the needs of a growing audience.
Note: The code for Octave Online is in the long-term process of being prepared for open sourcing. When the code is released, I will publish more material on this blog about it.
The very first alpha version of Octave Online had the simplest architecture imaginable: a single cloud instance that would spawn GNU Octave processes on itself and read/write to their standard input/output streams. This is an extreme example that helps illustrate some of the problems that a good back-end architecture needs to solve.
- Scalability: The architecture should be able to scale to support any number of simultaneous users.
- Performance: The architecture should ensure that the user is able to get fast I/O throughput, from any part of the world.
- Reliability: If any piece of the architecture goes offline, the application should pick up where it left off when the piece comes back online again.
Objective 1 means that as traffic increases on the site, I should be able to dynamically increase the number of servers to support the increased load. Objective 2 means that the system should support interconnected deployments across multiple data centers. Objective 3 means that each component needs to maintain state as best it can during unexpected outages, and also that it can return to its previous state efficiently.
Okay, so with the objectives in mind, let's cut to the chase. The following diagram illustrates the architecture of Octave Online, which I will explain in more detail below.
When an end user connects to Octave Online, they are actually creating a web socket connection to a front server, which is a relatively simple Node.JS application. Upon connection, the front server assigns the client an ID, and adds that ID to a queue in a Redis cache. Within a short amount of time, one of multiple back servers will pop the ID from the queue and locally spawn an Octave process. If the connection is from a returning user, the front server looks them up in a Mongo database (not shown) and the back server will clone an instance of the user's repository from a Git server (not shown). We're now ready to start running commands.
When the end user issues a command and sends it over the web socket, the front server will publish that command into a Redis channel identified by the user's session ID. The back server that adopted the corresponding ID will have subscribed to that channel, so it receives the message, and sends the command down to the Octave process. When the Octave process produces output, the back server publishes the output to another Redis channel to which the front server is subscribed, and finally the front server sends the output back to the end user over the web socket connection. The total round-trip time through the stack never stops impressing me.
When the end user disconnects or reloads their browser window, the front server sends a message over Redis to inform the back server that it can shut down the corresponding Octave process. The back server then kills the Octave process, and if necessary, commits and pushes any changes to the user's repository back to the Git server.
Why the Redis cache?
One of the most common questions I get is, why the Redis cache? Why not have the front servers connect directly to the back servers with some sort of socket connection? The answer to this question is twofold, and arises from all three objectives.
Reason 1: Load Balancing
High performance and scalability mean that we need to load-balance the work across multiple back server instances. How do you perform the load balancing? One could use a traditional approach, like round-robin, for assigning the jobs. But what if the back servers could decide on their own when they were ready to accept new jobs?
This is one of my favorite parts of the Octave Online architecture. Each back server is treated as its own agent and controls its own destiny. If the back server is fresh and can handle more connections, it goes and pulls from the queue. If the back server is busy with many different jobs, it ignores the queue and lets other servers pull from it instead. A back server may also elect to go offline entirely and perform maintenance chores, like cleaning up local storage and killing orphaned processes. In order to ensure that there is at least one back server online and accepting connections at all times, the back servers talk to one another, and when one wants to go offline, it needs to get approval from at least half of the other back servers.
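The "majority approval" rule for going offline can be sketched as a simple vote count. In reality the votes travel as messages over Redis; here the transport is reduced to a function call, and the peer objects and `acceptingConnections` flag are illustrative.

```javascript
// Sketch of the offline-approval rule: a back server may go offline for
// maintenance only if at least half of its peers approve. A peer approves
// only while it is itself accepting connections, so the cluster never
// votes away its last available back server.
function canGoOffline(peers) {
  const approvals = peers.filter((p) => p.acceptingConnections).length;
  return approvals >= Math.ceil(peers.length / 2);
}

const busyCluster = [
  { name: 'back-2', acceptingConnections: true },
  { name: 'back-3', acceptingConnections: true },
];
console.log(canGoOffline(busyCluster)); // → true: both peers approve

const strainedCluster = [
  { name: 'back-2', acceptingConnections: false },
  { name: 'back-3', acceptingConnections: false },
];
console.log(canGoOffline(strainedCluster)); // → false: request denied
```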
Redis is an important piece of this design, providing constructs like a distributed priority queue and pub-sub channels. Redis is the framework through which the back servers talk with each other and with the front server that holds the connection to the end user.
Reason 2: State Recovery
If a user briefly loses their internet connection, or if the front server goes offline for a bit, we would like for the user to be able to reconnect to a front server, which may be different from the one to which they were previously connected, and reattach to their same Octave process if it still exists.
Redis makes this possible through the pub-sub framework. If a user re-connects and provides a pre-existing ID, the front server simply sets up a Redis client listening on that ID for output from the Octave process. It is completely oblivious to the identity of the back server that is actually running the process.
Note: Each front server continuously touches a key in the cache for each of its IDs. If the front server goes offline, those keys have a 15-second expiration. If they expire, the back server gets a notification and destroys the corresponding Octave processes. However, if the user reconnects to a different front server, or if the front server comes back online, the keys will be touched again, and the Octave process remains alive.
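The keep-alive scheme can be sketched with timestamps. A real deployment would lean on Redis key expiration (e.g. EXPIRE and keyspace notifications); this in-memory version only tracks last-touch times. The 15-second TTL matches the note above; the function names are illustrative.

```javascript
// Sketch of the keep-alive keys: the front server repeatedly "touches" a
// key per session; the back server treats an expired key as a dead
// session and kills the Octave process. Times are passed in explicitly
// to keep the sketch deterministic.
const TTL_MS = 15 * 1000;
const keys = new Map(); // sessionId -> last touch time (ms)

function touch(sessionId, nowMs) {
  keys.set(sessionId, nowMs); // front server refreshes the key
}

function isAlive(sessionId, nowMs) {
  const last = keys.get(sessionId);
  return last !== undefined && nowMs - last < TTL_MS;
}

// Front server touches the key once, then goes silent for 20 seconds.
touch('abc123', 0);
console.log(isAlive('abc123', 5000));  // → true: within the 15 s window
console.log(isAlive('abc123', 20000)); // → false: back server cleans up
```

A reconnect through any front server just resumes touching the same key, which is why the Octave process survives brief outages.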
The newest version of Octave Online brings support for real-time collaborative editing and sharing of the same Octave session. This was relatively easy to implement in the current architecture. Although the details of how this works are beyond the scope of this blog post, the high-level idea is that each end user is subscribed to the same session ID, so they receive the same notifications when the Octave process produces output, for example. I'm hoping to write another blog post detailing some of the difficulties of implementing collaborative editing and how I overcame them with my code.
In principle, each component of the architecture could be running on different machines, the same machine, or some mixture of the two.
Right now, I have Octave Online hosted in the Rackspace cloud. I have four cloud VMs. One of them holds a front server instance and is also home to the Redis cache, Mongo database, and Git server. The other three VMs are each running an instance of the back server and are designed to handle the demanding job of dynamically creating and destroying thousands of Octave sessions each day, in a fast and secure manner (another topic worthy of its own blog post). These four servers have proven sufficient for handling the typical load I currently get on Octave Online. The beauty is that if I ever need to add more front servers or back servers, it's as easy as going into the Rackspace control panel and cloning more instances.
I hope that you found this blog post interesting. Feel free to leave comments/questions below!