In a video embedded in this post on Werner Vogels’ blog, Joshua Baer of OtherInbox talks with Vogels (the CTO of Amazon.com) about how OtherInbox has been able to use Amazon Web Services (AWS) in imaginative ways to stave off the day when they’ll have to figure out how to run their database on more than one box.
This video was shot (I recognize the room) right after the excellent “Scaling Rails Applications in the Cloud” talk that OtherInbox co-founder and chief hacker Mike Subelsky gave at SxSW Interactive. I was at the talk and can attest that it was very informative about the advantages of AWS (though it didn’t really focus all that much on Rails per se).
One of the most interesting things about their scaling strategy (mentioned in the Josh Baer video) is that they were able to take the most data-intensive pieces of their DB and push them into Amazon S3 as files, taking that load off the database. OtherInbox provides email services, so the biggest pieces of data are the bodies of the email messages themselves. Baer points out that the business logic of a web-based email client like this dictates that you keep several copies of each message: the original one (with all the headers, ASCII entities, etc.), an HTML version, and a text version. So that’s three files per message, each of which needs to be accessed in different ways. You can imagine that would be a hell of a lot of reads and writes to the same DB table if you were housing all of that data in an RDBMS. Simply pushing those messages into the Amazon storage cloud as files seems like a pretty elegant solution. The DB still gets hit for the pointers to those messages, but the amount of data you’re moving in and out of the DB is drastically reduced.
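To make the "bodies in S3, pointers in the DB" idea concrete, here is a minimal sketch in Python. It is not OtherInbox's code (they were on Rails). The table layout, the function names, and the content-hash keys are all my own assumptions, and a plain dict stands in for S3 so the example runs without an AWS account.

```python
import hashlib
import sqlite3

# Stand-in for S3 (assumption): a dict mapping object keys to bytes.
# A real deployment would call an S3 client here instead.
blob_store = {}

def put_blob(data: bytes) -> str:
    """Store bytes under a content-derived key and return that key."""
    key = hashlib.sha256(data).hexdigest()
    blob_store[key] = data
    return key

# The relational side holds only small metadata and pointers (hypothetical schema).
db = sqlite3.connect(":memory:")
db.execute(
    """CREATE TABLE messages (
           id INTEGER PRIMARY KEY,
           subject TEXT NOT NULL,
           raw_key TEXT NOT NULL,
           html_key TEXT NOT NULL,
           text_key TEXT NOT NULL
       )"""
)

def save_message(subject: str, raw: bytes, html: bytes, text: bytes) -> int:
    """Write the three bodies to the blob store; keep only pointers in the DB."""
    keys = (put_blob(raw), put_blob(html), put_blob(text))
    cur = db.execute(
        "INSERT INTO messages (subject, raw_key, html_key, text_key) "
        "VALUES (?, ?, ?, ?)",
        (subject, *keys),
    )
    db.commit()
    return cur.lastrowid

def load_body(message_id: int, version: str) -> bytes:
    """Look up the pointer in the DB, then fetch the body from the blob store."""
    column = {"raw": "raw_key", "html": "html_key", "text": "text_key"}[version]
    row = db.execute(
        f"SELECT {column} FROM messages WHERE id = ?", (message_id,)
    ).fetchone()
    if row is None:
        raise KeyError(message_id)
    return blob_store[row[0]]
```

The DB row stays a few hundred bytes no matter how large the email is. The heavy reads and writes go to the blob store, which is the part that the cloud scales for you.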
Conceptually, this reminds me of some of the new, object-and-document-oriented DB-ish things that are coming along, like CouchDB, or Amazon’s own SimpleDB. The idea being that if you have something that can be stored as a set of key/value pairs, you simply jam those pairs into an object DB that doesn’t really care about their structure beyond that. To retrieve them, you run map/reduce-style queries on the dataset and rely on other pieces of your architecture to do the more refined sorting/processing that you might normally let Oracle or MySQL or whatever handle. Or else you just use that map/reduce query to get your list of pointers from SimpleDB, as the InfoWorld article points out:
SimpleDB is meant to be used with Amazon’s Simple Storage Service (S3), because each of the values in the pairs is limited to 1,024 bytes. That’s enough for many strings, but it’s not enough for many content engines. So you store a pointer to the data in S3.
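The pointer trick in that quote can be sketched in a few lines of Python. This is only an illustration: the two dicts are stand-ins for SimpleDB and S3, the function names are hypothetical, and the only number taken from the article is the 1,024-byte limit.

```python
MAX_VALUE_BYTES = 1024  # the per-value limit quoted above

kv_store = {}    # stand-in for SimpleDB (assumption): item -> {attribute: value}
blob_store = {}  # stand-in for S3 (assumption): object key -> bytes

def put_attribute(item: str, attr: str, value: str) -> None:
    """Store small values inline; spill large ones to the blob store."""
    encoded = value.encode("utf-8")
    if len(encoded) <= MAX_VALUE_BYTES:
        stored = {"inline": value}
    else:
        key = f"{item}/{attr}"  # hypothetical key scheme
        blob_store[key] = encoded
        stored = {"pointer": key}
    kv_store.setdefault(item, {})[attr] = stored

def get_attribute(item: str, attr: str) -> str:
    """Return the value, following the pointer if it was spilled."""
    stored = kv_store[item][attr]
    if "inline" in stored:
        return stored["inline"]
    return blob_store[stored["pointer"]].decode("utf-8")
```

A short subject line stays in the key/value store, while a 5,000-character body ends up in the blob store with only its key left behind.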
So it seems this is basically what OtherInbox is doing, only it sounded from the talk like they were using Rails’ Active Record with their own DB server located in the cloud instead of using SimpleDB/S3 straight-up. The general approach is the same though, and it strikes me as a very interesting way to solve a lot of problems that people encounter in scaling web applications. One of the best takeaways I had from SxSW 2008’s scaling talk was when the chief engineer of Digg.com said something like “look, you’re always going to have one or two tables that constitute the vast bulk of your DB load. None of the others will even come close.” Which of course is true, but it’s small consolation if those one or two problem-child tables force you into some kind of complex DB sharding or replication strategy early on. If OtherInbox becomes popular enough, they’ll have to do that too – Subelsky admitted that much in his talk. But structuring the data the way they have, taking advantage of the leverage provided by the cloud, means they can concentrate on other things for a good long while.







Copyright © 2010 Catapult Creative - info(at)catapult(hyphen)creative(dot)com - Powered by