The server hosting this blog had some troubles recently, caused (probably) by spikes in e-mail and web traffic happening at the same time. It’s under control now, but that got me wondering: it’s unfortunate that we have a single, not particularly scalable box that we’re depending on to carry that load. Wouldn’t it be nice if we could move to a setup where resources could expand (or contract) as necessary? In particular, cloud computing is all the rage these days (my own employer has a grid, and there’s Amazon with its S3 storage and EC2 compute; as will doubtless become clear, I know essentially nothing about any of those!); what are the barriers between us and that world?
What are the basic requirements? At a minimum, we can’t be tied to a single machine (or assume that we are running on a single machine at any given point in time, since we want to be able to scale up), so we need to decouple the notions of compute and storage. My guess is that we’ll want multiple incarnations of at least one of those concepts; let’s run through some examples and see.
One test case: can we come up with a server that feels pretty much like the one we’re running now, with the exceptions that there could potentially be multiple live instances (with some shared storage) and that we shouldn’t count on long uptimes? (Hmm, how long should the uptimes be? It’d be nice if instances disappeared entirely when nobody was logged in: maybe provide a mechanism where an ssh connection triggers an instance of the server appearing.)
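Here’s a rough sketch of that ssh-triggers-an-instance idea, just to make it concrete; everything in it (the `launch_instance()` call, the port numbers, the placeholder address) is made up, since I have no idea what the cloud’s actual provisioning API would look like.

```python
# A minimal sketch, assuming a hypothetical provisioning API: accept an ssh
# connection on a front-end port, spin up a compute instance, then shuttle
# bytes between the client and the instance's real sshd.
import socket
import threading

def launch_instance() -> str:
    """Stand-in for the cloud's provisioning call: boot an instance from the
    base image and return its address. Entirely hypothetical."""
    return "198.51.100.7"   # placeholder address, for illustration only

def pump(src: socket.socket, dst: socket.socket) -> None:
    # Copy bytes one way until the source side closes.
    while data := src.recv(4096):
        dst.sendall(data)
    dst.close()

def serve(listen_port: int = 2222) -> None:
    listener = socket.socket()
    listener.bind(("", listen_port))
    listener.listen()
    while True:
        client, _ = listener.accept()
        backend = socket.create_connection((launch_instance(), 22))
        threading.Thread(target=pump, args=(client, backend), daemon=True).start()
        threading.Thread(target=pump, args=(backend, client), daemon=True).start()

if __name__ == "__main__":
    serve()
```

(The other half, tearing the instance back down once the last connection goes away, is the part I’d actually have to think about.)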
I guess the compute here would look like an OS instance running on a ram disk. For persistent storage, you’d want to provide some sort of NFS-like view of a subset of the cloud’s storage pool. I guess it would be okay if that persistent storage was somewhat slow; you could have a fast ram disk for situations where you needed to have temporary local data. (Hmm, what about situations like compiles? That could get a bit sticky.)
Those abstractions would provide the basic generic server infrastructure plus home directories and shared storage areas. One question is: what parts of the generic server infrastructure change on a somewhat frequent basis? At least frequent enough that you don’t want to spin a whole new ram disk image just for those changes: e.g. when I change my password, I’d rather not have to create a new image.
Or rather, I don’t want to have to create a new image if that’s a heavyweight operation, so let’s try to make it lightweight instead. To do that, I think you’d want to separate off configuration information from other sorts of information. So you’d want a mechanism where you can create a base image, e.g. by picking a set of packages from your favorite Linux distro, and then apply a diff to that base by adding/editing configuration files. If there are easy ways to manage that diff and to add/remove packages, it should be easy to tailor the image that you’re using.
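To pin that down, here’s a minimal sketch of the base-image-plus-diff idea; the directory layout and names are invented, and a real system would presumably use the cloud’s image format rather than a directory tree.

```python
# A minimal sketch: copy a prebuilt base image, then lay a small, version-
# controllable configuration diff on top of it. Paths are assumptions.
import shutil
from pathlib import Path

def build_instance_root(base_root: Path, overlay_root: Path, target: Path) -> None:
    """Heavyweight part: the base image built from a package set. Lightweight
    part: the overlay of edited configuration files (passwords, vhosts, ...)."""
    shutil.copytree(base_root, target)
    for src in overlay_root.rglob("*"):
        if src.is_file():
            dest = target / src.relative_to(overlay_root)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)

# e.g. build_instance_root(Path("base-image"), Path("config-diff"), Path("instance-root"))
```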
One question: is this case of mimicking shell access to a single Unix box actually useful? It doesn’t solve big compute problems; and if you have a small compute problem, it’s cheap enough to get compute power that you can run at home. So maybe there’s no real need for traditional shell access in a cloud-based world. I’m not convinced, though: e.g. if you want to run custom software in the cloud, you need to be able to compile it (assuming it’s in a compiled language…), and I don’t want to require my home Linux distro (or, for that matter, my home machine’s architecture) to match whatever’s running on the cloud.
Enough about traditional shell access (which is, after all, the most boring use case); what about other services? Recently, we’ve had mail problems; what does a mail server look like in the cloud?
It mostly has a separate pool of storage: I guess it might be accessible from the shell view (you might want to mount your mail spool in your home directory?), but there’s not a lot of overlap there. There also might be some amount of per-user customization (.procmailrc files or their moral equivalent) that you’d want to view from your home directory; not a lot, though. You’d want the potential to farm out incoming SMTP connections to one or several compute servers, growing on demand. And you’d want to divorce receiving mail from reading mail: the SMTP server and the IMAP server have to access the same pool of storage, but there’s no reason why you’d want them to run in the same compute environment. And then there’s sending mail and managing mailing lists.
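A toy sketch of that separation, just to pin the idea down; the mount point and the Maildir-ish layout are assumptions.

```python
# SMTP-facing delivery workers and IMAP-facing readers only ever meet at the
# shared storage pool; any number of either can come and go.
import os
import time
from pathlib import Path

SHARED_SPOOL = Path("/cloudfs/mail")   # NFS-like view of the storage pool (assumed)

def deliver(user: str, raw_message: bytes) -> Path:
    """Called by whichever SMTP compute instance happens to accept the connection."""
    new_dir = SHARED_SPOOL / user / "new"
    new_dir.mkdir(parents=True, exist_ok=True)
    name = f"{time.time():.6f}.{os.getpid()}.cloud"   # Maildir-style unique name
    path = new_dir / name
    path.write_bytes(raw_message)
    return path

def list_messages(user: str) -> list:
    """Called by the IMAP compute instances; they never talk to the SMTP side."""
    return sorted((SHARED_SPOOL / user / "new").glob("*"))
```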
Hmm, mail is pretty boring. Honestly, I’m not sure why we run our own mail server any more: it seems like all the customization issues are solved well enough, so what I really want is somebody to accept mail, filter it for spam, and store it until I retrieve it, and lots of people can do that fine for me. The only hard part is spam filtering, and that’s a hard enough problem that doing it as a hobbyist effort is doomed to failure.
So let’s move on to the web server, which is the most interesting problem. Different people want to install different packages on their web server: different programming languages, different publishing platforms, different platforms for viewing data that may not have originated from that web server, and people are writing new code in all these areas all the time. So it’s nothing like the mail server situation from that point of view.
How does the configuration look? And to what extent can/should compute instances be shared? Right now, we have one web server with all sorts of stuff installed on it, and a large amount of configuration inside the Apache configuration dir. Some of that configuration information is global, some of it is on a per-vhost basis; there’s also configuration information in individual users’ directories (in the form of .htaccess files). I tend to think that, in a cloud view, that’s the wrong way to slice-and-dice: on a package level, the fact that I want, say, access to Ruby doesn’t mean that everybody does. So probably each vhost gets its own compute configuration? (Of which there could be one or many or even zero instances running at any given time, depending on the traffic load.)
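Something like a small per-vhost manifest, in other words; the field names here are invented, but they capture the idea that each vhost declares its own packages and its own scaling behavior (including scaling to zero).

```python
# A sketch of per-vhost compute configuration; purely illustrative field names.
from dataclasses import dataclass, field

@dataclass
class VhostConfig:
    hostname: str
    packages: list = field(default_factory=list)  # only what this vhost needs
    min_instances: int = 0       # zero instances running when there's no traffic
    max_instances: int = 4       # scale up under load
    storage_path: str = ""       # subset of the storage pool this vhost serves

blog = VhostConfig(
    hostname="blog.example.org",
    packages=["apache2", "php", "mysql-client"],
    storage_path="/cloudfs/web/blog",
)
docs = VhostConfig(
    hostname="docs.example.org",
    packages=["apache2"],        # flat files only; no Ruby just because I want it
    max_instances=8,
    storage_path="/cloudfs/web/docs",
)
```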
Besides the configuration information, you obviously need storage, for the files that you’re serving up. Right now, that storage is sitting inside my home directory, and in general it would make sense for the storage that’s used by the web server to also be mounted from the shell account view of the cloud. Though, these days, thinking about editing files in your home directory is perhaps a bit passé. In fact, what I typically do is edit files on my home computer and then push them to the server via rsync. (Well, what I typically do is write blog posts, which are stored in a completely different location, about which more later.)
Maybe we should take that idea and run with it: there’s a repository that the web server dishes up to the outside world, but I want to be able to edit it from a remote computer. Using rsync is okay for that, but we have better tools now for managing remotely modifiable repositories: what we really want is a version control system, so I can see a diff before committing, so I can commit from multiple locations, so I can roll back my mistakes. That seems like a useful abstraction that a cloud-based environment should provide: its storage layer should have version control primitives that can be implemented efficiently and are strong enough to, say, let you write a subversion filesystem that works off of it. (I’m under the impression that Google has a subversion filesystem that works with their own internal storage cloud.) That would also be useful for the configuration part of my earlier view that the compute environment consists of a set of packages plus a diff giving configuration information.
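For the web content itself, even plain Subversion gets most of the way there today: commit from anywhere, and have a post-commit hook roll the served working copy forward. A sketch (the path is made up; Subversion really does pass the repository path and revision to hooks/post-commit):

```python
#!/usr/bin/env python
# Illustrative post-commit hook: update the working copy the web server serves
# from to the revision that was just committed.
import subprocess
import sys

SERVE_ROOT = "/cloudfs/web/blog"   # assumed location of the served working copy

def post_commit(repo_path: str, revision: str) -> None:
    subprocess.run(["svn", "update", "-r", revision, SERVE_ROOT], check=True)

if __name__ == "__main__":
    post_commit(sys.argv[1], sys.argv[2])   # Subversion passes REPOS and REV
```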
Also, to the extent that we’re serving up flat files, we’d like to have as much of that as possible be handled directly by the cloud’s storage abstraction, instead of having it done by the compute layer that’s running the web server. Ultimately, I think more and more of the web server’s job will be to act as a proxy. If it figures out that it wants to serve up a file that’s sitting in the storage cloud at some address, it shouldn’t look up the contents of that file through the NFS view of the storage cloud that I hallucinated above; it should simply forward that web request to a web front end of the storage cloud, and let the bits flow. Ideally, the bits wouldn’t even be flowing through the web server at all, once the web server has identified the correct bits; cf. Van Jacobson’s Google talk.
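As a sketch of what that hand-off might look like (the storage front end’s URL scheme is invented, and a real setup would presumably do this inside Apache rather than in a toy HTTP server):

```python
# Once the web server knows which object it wants served, it redirects the
# client to the storage cloud's web front end instead of pumping the bytes.
from http.server import BaseHTTPRequestHandler, HTTPServer

STORAGE_FRONTEND = "http://storage.example.net/web-bucket"   # assumed front end

class RedirectingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Map the request path to an object address in the storage cloud,
        # then redirect rather than reading through the NFS-like view.
        self.send_response(302)
        self.send_header("Location", STORAGE_FRONTEND + self.path)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), RedirectingHandler).serve_forever()
```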
Of course, these days a lot of the content on a web server (e.g. this immortal prose) isn’t sitting in some directory hierarchy mirrored by the web server: it’s sitting in a database. So any web server platform needs to provide that; that’s an important enough concept (and one that’s different enough from a traditional filesystem) to deserve its own separate abstraction. Not sure what to say here, other than that it would be nice for query results to be RESTful enough that, whenever possible, we can forward them directly from the database cloud to the user, with the web server cloud doing as little work as possible. (Which requires restructuring your web pages; see Tim Bray’s “The Real AJAX Upside”.)
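To give a flavor of that: the web server could expose query results as plain JSON and let the browser assemble the page, so the compute layer does almost no massaging. A sketch, with an invented sqlite schema standing in for whatever the database cloud actually provides:

```python
# Serve recent posts as JSON; the client-side code (not shown) renders them.
import json
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer

DB_PATH = "blog.db"   # stand-in for the database abstraction

class PostsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/posts.json":
            self.send_error(404)
            return
        conn = sqlite3.connect(DB_PATH)
        rows = conn.execute(
            "SELECT id, title, body FROM posts ORDER BY id DESC LIMIT 10"
        ).fetchall()
        conn.close()
        payload = json.dumps(
            [{"id": r[0], "title": r[1], "body": r[2]} for r in rows]
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    HTTPServer(("", 8081), PostsHandler).serve_forever()
```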
Hmm, so a web server needs:
- A set of packages that you can select to provide the basic functionality.
- Configuration that’s easy to modify. (Ideally with a VCS view.)
- Storage that’s easy to modify. (Ideally with a VCS view, ideally massaged as little as possible by the web server.)
- A database abstraction. (Ideally with a RESTful view, massaged as little as possible by the web server.)
What am I leaving out? Logging, I guess: you want to be able to figure out who is accessing your data. Which might need a different twist on the storage abstraction: you want fast appends, possibly with indexing to make data analysis easier. Hmm, there’s probably something there that could be shared with the mail server concepts.
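Something append-only and cheaply indexable, maybe along these lines (the layout under /cloudfs/logs is, again, an assumption):

```python
# Fast appends now, one file per vhost per day so later analysis has a coarse
# index to work from instead of one giant undifferentiated log.
import json
import time
from pathlib import Path

LOG_ROOT = Path("/cloudfs/logs/web")   # assumed location in the storage pool

def append_access(vhost: str, record: dict) -> None:
    record["ts"] = time.time()
    day = time.strftime("%Y-%m-%d", time.gmtime(record["ts"]))
    path = LOG_ROOT / vhost / f"{day}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a") as f:
        f.write(json.dumps(record) + "\n")

# append_access("blog", {"path": "/archives/123", "status": 200, "client": "203.0.113.5"})
```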
Enough about the web server: any other big concepts? We host subversion repositories at red-bean; probably if we can get the above working right, we can get that working right, too. And then there are big compute projects; I’m sure the cloud has something useful for HPC-style applications, I’m sure the above abstractions aren’t good enough for them, and I’m sure my employer has lots of productive ideas in that vein; I just don’t have enough first-hand experience to say anything in particular.
I wonder what any of this has to do with existing cloud abstractions? I’ve read a bit about S3 (it was mentioned as an example in the REST book), but I don’t know squat about EC2, and I don’t even remember the name of Amazon’s database abstraction. I don’t see anything in the picture that I sketched above that should be too difficult to bring to fruition: you can start by providing a traditional view of cloud resources, and then do a few targeted replacements to make it significantly more efficient. (E.g. teaching Apache how to interact directly with the storage server, instead of routing requests through a file system layer.) And then you’ll want to start rethinking your designs, increasing the range of components that can be addressed directly as resources (and hence provided in an optimized way by the cloud core abstractions), stitching them together on the client whenever possible.
Should be fun. I suspect that the next red-bean upgrade will be to another physical machine, but for the one after that, who knows?
Well, you’ve pretty much characterized EC2 correctly, except that even the ten-cents-an-hour compute instance is fairly decent: it’s a (virtual) single-core 1 GHz 32-bit system with 2 GB of memory and 160 GB of local disk, which you lose when the instance goes down. Fortunately, you can trivially get remote access to S3 (no charge) for persistent storage, though you don’t want to keep your database there because writes to S3 are slow. (Instead, you back up your database there using whatever mechanism your database engine provides for backup). And of course you can run tens or (if you ask nicely) hundreds of these things.
If your job is memory- or compute-intensive, there’s also the forty-cents-an-hour model, with two 2-GHz 64-bit cores, 7.5 GB memory, and 850 GB of local disk; the eighty-cents model has four cores and twice the memory and disk. So provided you don’t have a monstrously large server, and provided you can adapt to semi-offline storage for most of your storage needs, you can just move everything there and apply the $585/mo for the big system against whatever it’s costing you to own your system now.
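(That monthly figure is just the eighty-cent rate run around the clock; quick arithmetic:)

```python
# Quick arithmetic behind the monthly figure quoted above (2008 pricing).
hourly_rate = 0.80                  # dollars per instance-hour, big model
hours_per_month = 24 * 365 / 12     # roughly 730 hours
print(round(hourly_rate * hours_per_month))   # -> about 584 dollars/month
```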
A friend of mine makes a living doing made-to-order Web crawls for customers who are looking for specific kinds of things on the Web. He uses S3 and EC2 exclusively, so he had no capital cost to start up his company at all. (The crawler is a command-line application, so ssh access is what he needs and gets.) He typically experiences instance uptimes in the general range of 30-60 days.
It’s true that EC2 doesn’t provide the kinds of things you spend most of the post talking about, but then it doesn’t need to. As this model becomes more popular, people will begin to provide userland software that is intended to run on tens of instances rather than on a single server, and that can be streamlined appropriately.
2/3/2008 @ 4:19 pm
You are making it too hard, I think. Just move half the domains to a second machine. Or a few of the big-usage ones. Or move mail to the new machine, if the load is split evenly between web and mail. Or put the few files that are most of the usage on S3, if it is just people getting the SVN book.
2/4/2008 @ 6:14 pm
John: thanks for the info. Good point about the local disk; that would solve the compile issue nicely. I’d be leery of using that local disk for a database, though, and clearly S3 isn’t the solution there, either. How does Amazon’s database abstraction work? I’d heard some iffy things about it.
Brian: yeah, getting a second machine might make sense. I hadn’t thought of putting svnbook on S3, but that could be a good idea, too. Hmm.
2/4/2008 @ 9:05 pm
Well, what do you do with your database today? Do you have a full two-phase-commit distributed database for reliability, or do you take frequent backups and risk the loss of a certain number of transactions?
If the latter, then backing up the database to S3 is perfectly plausible, using its native backup tools.
I don’t know much about Amazon’s database, except that it’s generally similar to Google’s BigTable: a three-dimensional spreadsheet using arbitrary row and column names and the update timestamp as the third dimension. In BigTable, at least, we can further say that the row is the unit of atomicity, that iterations are over rows whose names share a prefix, that the column (or group of columns) is the unit of access control, and that older versions of data can be garbage collected.
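A toy model of that shape, just to make it concrete (purely illustrative, not any real API):

```python
# A toy sketch of a BigTable-style store: cells addressed by
# (row, column, timestamp), newest versions first, prefix scans over rows.
import time
from collections import defaultdict
from typing import Optional

class ToyTable:
    def __init__(self):
        # row name -> column name -> list of (timestamp, value), newest first
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row: str, column: str, value: bytes) -> None:
        # The row is the unit of atomicity; a real store would lock per-row here.
        self.rows[row][column].insert(0, (time.time(), value))

    def get(self, row: str, column: str, at: Optional[float] = None):
        # Return the newest version, or the newest version no later than `at`.
        for ts, value in self.rows[row][column]:
            if at is None or ts <= at:
                return ts, value
        return None

    def scan(self, prefix: str):
        # Iteration is over rows whose names share a prefix.
        for name in sorted(self.rows):
            if name.startswith(prefix):
                yield name, self.rows[name]
```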
2/5/2008 @ 5:44 am
Well, today I’m just running a blog, so it doesn’t matter so much. I don’t know how often contents get synced to disk, but that should be, what, no more than every 30 seconds? And we do offsite backups nightly. Good enough for a blog, given that the server only goes down unexpectedly once every few months; if I were doing something more serious, I’d want to RAID the disks, and perhaps back up more frequently.
Then again, maybe I could get close enough to that with the Amazon solution: dump the db contents to S3 every 10 minutes or so? I guess it depends on how often the compute instances go down, and what the costs would be for frequently backing up to S3. Still feels wrong, but maybe that’s partly because I’m not used to thinking in those terms.
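Concretely, I guess the experiment would look something like this, with mysqldump and boto3 standing in for whatever database and S3 client I’d actually end up using (the bucket name and the schedule are made up):

```python
# Dump the database and push the dump to S3 every ten minutes.
import subprocess
import time
import boto3

BUCKET = "blog-db-backups"          # hypothetical bucket name
s3 = boto3.client("s3")

def dump_and_upload() -> None:
    dump = subprocess.run(["mysqldump", "--all-databases"],
                          check=True, capture_output=True).stdout
    key = time.strftime("dump-%Y%m%d-%H%M.sql")
    s3.put_object(Bucket=BUCKET, Key=key, Body=dump)

if __name__ == "__main__":
    while True:
        dump_and_upload()
        time.sleep(600)             # every ten minutes
```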
2/5/2008 @ 9:52 pm