Posts Tagged ‘cloud’

Cool (cloud) ways to architect data storage in web apps

In a video embedded in this post on Werner Vogels’ blog, Joshua Baer of OtherInbox talks with Vogels (the CTO of Amazon.com) about how OtherInbox has been able to use Amazon Web Services (AWS) in imaginative ways to stave off the day when they’ll have to figure out how to run their database on more than one box.

This video was shot (I recognize the room) right after the excellent “Scaling Rails Applications in the Cloud” talk that OtherInbox co-founder and chief hacker Mike Subelsky gave at SxSW Interactive. I was at the talk and can attest that it was very informative about the virtues and advantages of AWS (though it didn’t really focus all that much on Rails per se).

One of the most interesting things about their scaling strategy (mentioned in the Josh Baer video) is that they were able to take the most data-intensive pieces of their DB and push them into Amazon S3 as files, taking that load off the database. OtherInbox provides email services, so the biggest pieces of data are the bodies of the email messages themselves. Baer points out that the business logic of a web-based email client like this dictates that you keep several copies of each message: the original one (with all the headers, ASCII entities, etc.), an HTML version, and a text version. So that’s three files per message, each of which needs to be accessed in different ways. You can imagine that this would be a hell of a lot of reads and writes to the same DB table if you were housing all of that data in an RDBMS. The idea of simply pushing those messages into the Amazon storage cloud as files seems like a pretty elegant solution. The DB still gets hit for the pointers to those messages, but the amount of data you’re moving in and out of the DB is drastically reduced.
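To make the pointer idea concrete, here is a minimal Python sketch. It is not OtherInbox’s code: the BlobStore class is a hypothetical in-memory stand-in for S3, the messages table and its column names are my own invention, and SQLite stands in for whatever RDBMS they actually run. The point is just that the database row holds three short keys while the bulky bodies live elsewhere.

```python
import sqlite3

# The three copies of each message described above.
VERSIONS = ("raw", "html", "text")


class BlobStore:
    """Hypothetical in-memory stand-in for an object store such as S3."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]


def make_db():
    """Create an in-memory database whose rows hold pointers, not bodies."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE messages ("
        "id INTEGER PRIMARY KEY, subject TEXT, "
        "raw_key TEXT, html_key TEXT, text_key TEXT)"
    )
    return db


def save_message(db, store, subject, bodies):
    """Write each body to the blob store and record only its key in the DB.

    `bodies` maps each name in VERSIONS to the bytes of that version.
    """
    cur = db.execute("INSERT INTO messages (subject) VALUES (?)", (subject,))
    msg_id = cur.lastrowid
    keys = {}
    for version in VERSIONS:
        key = f"messages/{msg_id}/{version}"  # assumed key layout
        store.put(key, bodies[version])
        keys[version] = key
    db.execute(
        "UPDATE messages SET raw_key = ?, html_key = ?, text_key = ? WHERE id = ?",
        (keys["raw"], keys["html"], keys["text"], msg_id),
    )
    db.commit()
    return msg_id


def load_body(db, store, msg_id, version):
    """Look up the pointer in the DB, then fetch the body from the blob store."""
    if version not in VERSIONS:
        raise ValueError(f"unknown version: {version}")
    # Safe to interpolate: `version` was checked against VERSIONS above.
    row = db.execute(
        f"SELECT {version}_key FROM messages WHERE id = ?", (msg_id,)
    ).fetchone()
    if row is None:
        raise KeyError(msg_id)
    return store.get(row[0])
```

Every read still costs one small DB query, but the query moves a few dozen bytes of key instead of the whole message. In a real deployment the put and get calls would go to S3 over the network, which this sketch does not attempt.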

Conceptually, this reminds me of some of the new, object-and-document-oriented DB-ish things that are coming along, like CouchDB, or Amazon’s own SimpleDB. The idea is that if you have something that can be stored as a set of key/value pairs, you simply jam those pairs into an object DB that doesn’t really care about their structure beyond that. To retrieve them, you run map/reduce-style queries on the dataset and rely on other pieces of your architecture to do the more refined sorting/processing that you might normally let Oracle or MySQL or whatever handle. Or else you just use that map/reduce query to get your list of pointers from SimpleDB, as the InfoWorld article points out:

SimpleDB is meant to be used with Amazon’s Simple Storage Service (S3), because each of the values in the pairs is limited to 1,024 bytes. That’s enough for many strings, but it’s not enough for many content engines. So you store a pointer to the data in S3.
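Here is a rough Python sketch of the pattern that quote describes. Nothing in it is the real SimpleDB or S3 API: PointerStore and its two dictionaries are hypothetical stand-ins, and the only number taken from the article is the 1,024-byte limit. Small values are kept inline, and anything over the limit is written to the blob side with a pointer left behind.

```python
MAX_VALUE_BYTES = 1024  # per-value limit cited in the quoted article


class PointerStore:
    """Hypothetical key/value store that spills oversized values to a blob store."""

    def __init__(self):
        self.attributes = {}  # stand-in for SimpleDB: item -> {name: stored value}
        self.blobs = {}       # stand-in for S3: key -> bytes

    def put(self, item, name, value):
        data = value.encode("utf-8")
        if len(data) <= MAX_VALUE_BYTES:
            stored = {"inline": value}
        else:
            key = f"{item}/{name}"  # assumed key layout
            self.blobs[key] = data
            stored = {"pointer": key}
        self.attributes.setdefault(item, {})[name] = stored

    def get(self, item, name):
        stored = self.attributes[item][name]
        if "pointer" in stored:
            # Follow the pointer to the blob store for large values.
            return self.blobs[stored["pointer"]].decode("utf-8")
        return stored["inline"]
```

Callers never need to know which side a value landed on; get follows the pointer when there is one. The size check is done on the encoded bytes rather than the character count, since the limit in the quote is stated in bytes.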

So it seems this is basically what OtherInbox is doing, only it sounded from the talk like they were using Rails’ Active Record with their own DB server located in the cloud instead of using SimpleDB/S3 straight-up. The general approach is the same though, and it strikes me as a very interesting way to solve a lot of problems that people encounter in scaling web applications.

One of the best takeaways I had from SxSW 2008’s scaling talk was when the chief engineer of Digg.com said something like “look, you’re always going to have one or two tables that constitute the vast bulk of your DB load. None of the others will even come close.” Which of course is true, but it’s small consolation if those one or two problem-child tables force you to go to some kind of complex DB sharding or replication strategy early on. If OtherInbox becomes popular enough, they’ll have to do that too; Subelsky admitted that much in his talk. But structuring the data the way that they have, taking advantage of the leverage provided by the cloud, means they can concentrate on other things for a good long while.


Amazon does AWS CDN — OMG!

A year or so ago I had a client who needed to serve reasonably high-demand video from their website. They had a multitude of videos advertised on the front page of their website and they were driving traffic to the site with ads on primetime television, so we estimated that we could be looking at bandwidth spikes of something in the range of 10-14 Mbps: nothing groundbreaking, but certainly enough to hose our dedicated box and annoy our hosting company. So we decided to go with the CDN solution our hosting company already offered and ended up paying through the nose. It ended up costing something like 300% of the monthly cost of the dedicated box just to leave it running, and even though we passed that on to our client, the charges felt pretty steep.

If Amazon CloudFront had existed back then, we would’ve been in business at an insanely lower cost, probably less than 1% of what it cost us to go through our large, nationally established hosting company’s CDN. Amazon has been bumping up their Amazon Web Services offerings with bigger and badder options, to the point where they now offer commodity-priced, enterprise-scale paygo computing, storage, and DB options for ludicrously tiny amounts of cash. Recently, the NYTimes converted 4TB of TIFF files (scans of archived articles) into PDFs in 24 hours using a cluster of 100 Amazon EC2 machines. That was every NYTimes article from 1851-1922, and it cost them less than $100. CloudFront will bring a similar scale of holyshit computing power to content delivery:

Amazon CloudFront delivers your content using a global network of edge locations. Requests for your objects are automatically routed to the nearest edge location, so content is delivered with the best possible performance. Amazon CloudFront works seamlessly with Amazon Simple Storage Service (Amazon S3) which durably stores the original, definitive versions of your files. Like other Amazon Web Services, there are no contracts or monthly commitments for using Amazon CloudFront – you pay only for as much or as little content as you actually deliver through the service.

Pricing starts at $0.17/GB for edge transfers in the US, and at $0.21/GB and $0.22/GB in Hong Kong and Japan respectively. That’s a tiny amount of money to pay to see one of the last barriers-to-entry of large-scale projects fall, and it’s pretty amazing to see how low that barrier is now. If I worked for Akamai, I’d probably be getting a bit nervous.
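For a feel of what those rates mean in practice, here is a back-of-the-envelope Python sketch. The three per-GB rates are the starting prices quoted above; the function names are mine, and the sketch ignores any volume tiers, per-request fees, or S3 origin charges, so treat the result as a rough floor rather than a bill.

```python
# Starting edge-transfer rates in dollars per GB, as quoted above.
RATES_PER_GB = {"US": 0.17, "Hong Kong": 0.21, "Japan": 0.22}


def gb_transferred(mbps, hours):
    """Convert a sustained rate in megabits per second into decimal gigabytes."""
    bytes_per_second = mbps * 1_000_000 / 8
    return bytes_per_second * hours * 3600 / 1_000_000_000


def transfer_cost(gb, region="US"):
    """Cost in dollars at the flat starting rate for the given region."""
    return gb * RATES_PER_GB[region]
```

As an illustration, assume (and this is only an assumption) that a 14 Mbps spike like the one described earlier lasted two hours a day for 30 days. gb_transferred(14, 60) works out to 378 GB, and transfer_cost of that at the US rate comes to about $64.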
