Testing the limits of cloud scalability: Real-world results

July 12, 2012, 12:41 PM PDT

Takeaway: Thoran Rodrigues describes a recent experience that involved spinning up 70 cloud servers for an intensive data-processing project. Here is what he learned from this real-world experiment.

Scalability is probably the greatest promise of cloud computing. At the infrastructure level, it translates into being able to quickly deploy new virtual servers from existing machines and then drop those servers when they aren't needed anymore. It should also be simple to scale each individual server up and down as needed, adding or removing processors, RAM, and storage space. If we look at the whole cloud stack, scalability at this level is more than a promise: it is a necessity.

Without scalability at the infrastructure level, there can be no auto-scaling cloud platforms that transparently increase available resources to accommodate application needs, nor can applications serve a highly variable number of users without provisioning for peak load.

Over the course of the past month, I had the opportunity to test the limits of infrastructure-as-a-service scalability by running a computing- and network-intensive process on several servers. I'd like to share the key points of this experience and the lessons learned throughout.

This experiment wasn't really a test, but rather a process I was running for a client. This actually makes it more interesting, because it's a real production environment, rather than a simple or controlled test. That means it was under all the traditional pressures and requirements of a production environment, such as availability, redundancy, and so on. The process consisted of running several web searches (on both search engines and regular websites), followed by heavy HTML, XML, and JSON processing, string matching, file format adjustments, and so on.
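To make the workload concrete, here is a minimal sketch of the kind of per-item work described above: fetch a page, pull text out of the response, and run string matching. The URL, the pattern, and the content-type handling are my own illustrative assumptions; the article doesn't specify the actual processing logic.

```python
import json
import re

import requests


def process_item(url, pattern):
    """Fetch one URL and return all substrings matching pattern."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    if "json" in resp.headers.get("Content-Type", ""):
        # Flatten JSON into one searchable string.
        text = json.dumps(resp.json())
    else:
        # HTML/XML responses are searched as raw text here;
        # real processing would use a proper HTML/XML parser.
        text = resp.text
    return re.findall(pattern, text)


# Hypothetical usage; neither the URL nor the pattern is from the article.
matches = process_item("https://example.com/search?q=widgets", r"widget-\d+")
```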

I estimated that running this process on a single 1 CPU, 1GB RAM server would take more than a year, but the client wanted the results in less than a month. The only way to deliver was to break the process down into smaller blocks that could be run on separate servers at the same time: enter parallelization and cloud scalability. By saving a basic machine image and replicating it dozens of times, then processing each small block on a separate server, I'd be able to finish everything up much faster.
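As a rough illustration of that block-splitting step, the sketch below divides one large list of work items into a fixed number of roughly equal chunks, one per server. The names (split_into_blocks, work_items) are mine, not from the article; only the 70-block figure comes from the text.

```python
def split_into_blocks(work_items, num_blocks):
    """Divide work_items into num_blocks roughly equal chunks."""
    block_size, remainder = divmod(len(work_items), num_blocks)
    blocks, start = [], 0
    for i in range(num_blocks):
        # Spread any remainder across the first few blocks.
        end = start + block_size + (1 if i < remainder else 0)
        blocks.append(work_items[start:end])
        start = end
    return blocks


# 70 blocks -> one block of input per cloud server.
blocks = split_into_blocks(list(range(100_000)), 70)
```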

In the end, my input data was broken down into 70 different blocks, so I set out to deploy 70 cloud servers using a standard cloud provider (Rackspace, in my case) as fast as I could. I opted to deploy the cloud servers through the control panel instead of via the API, just to see what would happen. The first thing I did was create the simplest possible Windows server (1 processor, 1GB RAM, 40GB disk), then prepare the image and save it, so that I could later quickly create new servers from it.
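For comparison with the control-panel route, here is a hedged sketch of what scripted deployment against a cloud API could look like. The endpoint, payload fields, and token handling are hypothetical placeholders, not Rackspace's actual 2012 API; the point is simply that one saved image plus a loop replaces 70 rounds of manual clicking.

```python
import requests

# Hypothetical endpoint and credentials, for illustration only.
API_BASE = "https://api.example-cloud.com/v1"
TOKEN = "auth-token-here"  # obtained out of band


def create_server(name, image_id, flavor_id):
    """Request one new server built from a saved machine image."""
    resp = requests.post(
        f"{API_BASE}/servers",
        headers={"X-Auth-Token": TOKEN},
        json={"server": {"name": name,
                         "imageRef": image_id,
                         "flavorRef": flavor_id}},
    )
    resp.raise_for_status()
    return resp.json()


# Spin up one server per input block, all from the same saved image.
for i in range(70):
    create_server(f"worker-{i:02d}", image_id="saved-image-id",
                  flavor_id="1gb-flavor")
```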

For those who are new to this virtual server thing: creating the image correctly can save you a lot of time. If your servers are all going to have the same directory structures with the same installed programs and so on, preparing the first image properly means that you dont have to worry about it with any of the others. And if, like me, you are going to run a process on several files but have excess disk space, copy everything to the first server. Since the images are full disk images, all files get copied, and you can actually save that setup time.

So I had my machine image created and started deploying new servers. The first 37 went up without a hitch, in less than an hour. That's more than one new server every two minutes, an impressive rate. Upon trying to spin up server number 38, however, I got an interesting surprise: the Rackspace console started failing to create the server, returning the following message: "Account has exceeded update limit. Try again at [YYYY-MM-DD HH:MM]. Please call [X-XXX-XXX-XXXX] if you have any questions." I got in touch with their excellent customer service, who quickly replied that all accounts come with a built-in limit of 50GB of RAM usage. While it was easy enough to increase that limit (just open up a support ticket), this limitation should be more visible. In fact, Rackspace support informed me that the only way to see the current limit was through their API, which makes no sense.
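Since support pointed at the API as the only place to read the limit, a query along these lines is roughly what that looks like against an OpenStack-style compute API, which exposes account quotas through a limits call. The base URL and token below are placeholders, and the maxTotalRAMSize field follows the OpenStack Compute limits format (reported in megabytes); other providers may name it differently.

```python
import requests

# Placeholder endpoint and token; swap in real provider values.
API_BASE = "https://api.example-cloud.com/v1"
TOKEN = "auth-token-here"

resp = requests.get(f"{API_BASE}/limits", headers={"X-Auth-Token": TOKEN})
resp.raise_for_status()

# The "absolute" section holds account-wide quotas such as total RAM.
absolute = resp.json()["limits"]["absolute"]
print("RAM quota (MB):", absolute.get("maxTotalRAMSize"))
```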
