Archive for March, 2009

Cool (cloud) ways to architect data storage in web apps

In a video embedded in this post on Werner Vogels’ blog, Joshua Baer of OtherInbox talks with Vogels (the CTO of Amazon.com) about how OtherInbox has been able to use Amazon Web Services (AWS) in imaginative ways to stave off the day when they’ll have to figure out how to run their database on more than one box.

This video was shot (I recognize the room) right after the excellent “Scaling Rails Applications in the Cloud” talk that OtherInbox co-founder and chief hacker Mike Subelsky gave at SxSW interactive. I was at the talk and can attest that it was very informative about the virtues and advantages of AWS (though it didn’t really focus all that much on Rails per se). One of the most interesting things (mentioned in the Josh Baer video) about their scaling strategy is that they were able to take the most data-intensive pieces of their DB and push them into Amazon S3 as files, taking that load off the database. OtherInbox provides email services, so the biggest pieces of data are the bodies of the email messages themselves. Baer points out that the business logic of making a web-based email client like this dictates that you have several copies of each message: the original one (with all the headers, ASCII entities, etc), an HTML version, and a text version. So three files per message, each of which needs to be accessed in different ways. You can imagine that that would be a hell of a lot of reads and writes to the same DB table if you were housing all of that data in an RDBMS. The idea of simply pushing those messages into the Amazon storage cloud as files seems like a pretty elegant solution. The DB still gets hit for the pointers to those messages, but the amount of data you’re moving in and out of the DB becomes drastically reduced.
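To make the pointer idea concrete, here is a minimal sketch in Ruby. It is not OtherInbox's code: FakeBlobStore is a hypothetical in-memory stand-in for an S3 bucket, MessageIndex is a hypothetical stand-in for the messages table, and the key layout is my own assumption. The point is only that the "database" row keeps three short keys while the bodies live elsewhere.

```ruby
# Hypothetical stand-in for an S3 bucket: maps object keys to file contents.
# A real app would call an S3 client here instead of a Hash.
class FakeBlobStore
  def initialize
    @objects = {}
  end

  def put(key, body)
    @objects[key] = body
  end

  def get(key)
    @objects.fetch(key)
  end
end

# Hypothetical stand-in for the messages table: rows hold metadata and
# pointers only, never the message bodies themselves.
class MessageIndex
  VERSIONS = [:raw, :html, :text].freeze

  def initialize(store)
    @store = store
    @rows = {}
  end

  # Writes the three versions to the blob store and keeps only their keys.
  def save(message_id, subject, bodies)
    pointers = {}
    VERSIONS.each do |version|
      key = "messages/#{message_id}/#{version}" # assumed key layout
      @store.put(key, bodies.fetch(version))
      pointers[version] = key
    end
    @rows[message_id] = { subject: subject, pointers: pointers }
  end

  # Looks up the pointer in the "database", then reads the body from the store.
  def body(message_id, version)
    @store.get(@rows.fetch(message_id)[:pointers].fetch(version))
  end

  def row(message_id)
    @rows.fetch(message_id)
  end
end
```

Calling index.save(1, "Hello", raw: "...", html: "...", text: "...") writes three objects to the store, and index.row(1) then contains nothing bigger than the subject line and three short keys.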

Conceptually, this reminds me of some of the new, object-and-document-oriented DB-ish things that are coming along, like CouchDB, or Amazon’s own SimpleDB. The idea being that if you have something that can be stored as a set of key/value pairs, you simply jam those pairs into an object DB that doesn’t really care about their structure beyond that. To retrieve them, you run map/reduce-style queries on the dataset and rely on other pieces of your architecture to do the more refined sorting/processing that you might normally let Oracle or MySQL or whatever handle. Or else you just use that map/reduce query to get your list of pointers from SimpleDB, as the InfoWorld article points out:

SimpleDB is meant to be used with Amazon’s Simple Storage Service (S3), because each of the values in the pairs is limited to 1,024 bytes. That’s enough for many strings, but it’s not enough for many content engines. So you store a pointer to the data in S3.

So it seems this is basically what OtherInbox is doing, only it sounded from the talk like they were using Rails’ Active Record with their own DB server located in the cloud instead of using SimpleDB/S3 straight-up. The general approach is the same though, and it strikes me as a very interesting way to solve a lot of problems that people encounter in scaling web applications. One of the best takeaways I had from SxSW 2008’s scaling talk was when the chief engineer of Digg.com said something like “look, you’re always going to have one or two tables that constitute the vast bulk of your DB load. None of the others will even come close.” Which of course is true, but of small consolation if those one or two problem child tables force you to go to some kind of complex DB sharding or replication strategy early on. If OtherInbox becomes popular enough, they’ll have to do that too – Subelsky admitted that much in his talk. But structuring the data the way that they have, taking advantage of the leverage provided by the cloud, means they can concentrate on other things for a good long while.


IE 8: a little cooler than before; not as cool as it could be

Microsoft released Internet Explorer 8 today, and I’m a lot more pleased to be writing this than I would’ve thought two years ago. IE 7 was a significant improvement over IE 6’s miserable standards support, but as Ars Technica puts it, it was still a catch-up release, and IE 8 is a genuine attempt to compete with the new crop of browsers that have come up since IE 7 was released:

what really needs to be emphasized here is that IE8 puts Microsoft back in the game. IE7 was a catch-up release, there’s no question about that. However, with IE8, which is bigger leap from IE7 than IE7 was from IE6, Microsoft is pulling out the big guns and offering features which other browsers have yet to adopt. It’s good to see Microsoft fight back with a vengeance, but the company has more competition than ever before, from the likes of Firefox, Safari, Chrome, and Opera.

At SxSW interactive last week, the Microsoft panelist in the talk on CSS 3 mentioned that IE 8 is more compliant with Acid2 than any other browser out there. Apparently, Acid3 support is still pretty crappy, which is kind of annoying since Acid3 support isn’t that great yet in any browser with significant market share. One would assume that support for Acid3 in IE 8 would drive more rapid adoption of the cool stuff you can do with the standards tested by Acid3, but it’s apparently not part of their plans for the browser. As far as I’m concerned, this is yet more BrowserFail from MS – W3C standards should be followed by any browser. If Mozilla, Opera, and Apple (and therefore Google’s Chrome) can see this, why the hell can’t Microsoft?

Still, as someone who has to get cross-browser support going for IE, I’m happy to see the improvements that version 8 brings. And if I were inclined to ever use a Windows PC, I’d probably be pretty excited about this feature:

A Web Slice grabs specific information from a website (like the top stories from Digg or the weather forecast) and puts it in a drop-down menu, eliminating the need to browse to the actual website. “It’s about making it as easy for sites to extend and blur into the browser,” Hachamovitch told Ars. This is a brilliant feature but it is completely lost if developers ignore it.

This certainly sounds more interesting than the hideous RSS reader thing in Safari, and of a similar but more flexible functionality. Good to see MS doing this kind of stuff, but I agree with the assessment from Ars that unless devs support it, it’ll be useless. And given devs’ well-known affinity (sarcasm) for doing IE-specific work, I’m not seeing this being as big as it should be, given the quality of the innovation.

Designing a rudimentary XML Service with Ruby (Part 2)

A few days back, I posted on my journey of understanding into the world of Ruby-based XML clients. This post is a continuation of that account.

Re-Arranging the WebEx Class

I figured it wouldn’t be long before I was back at the drawing board on my main architecture, and I was right. The earlier one I described turned out to be over-abstracted and hard to test.

It all started with my feeling that this looks really elegant:

events = WebEx.request(Event.list)

But in practice it turns out to be a little strange. “Attendee” and “Event” were two classes with no attributes of their own, and none of my code ever instantiated objects of these classes. These are two pretty obvious signs of an over-abstracted implementation. I’d been thinking that I’d write logic later which would (for example) instantiate objects of the Event class inside Event.list’s processor, but as I got more and more into the implementation, it just didn’t seem like I was going to need to mess with WebEx’s return values as discretely defined objects. After all, I already had Events and Attendees represented as hashes in an array, which was working just fine for this first use case and the ones I could see on the horizon. Having separate classes for Event and Attendee would give me maximum extensibility, but at the cost of having pieces of overlong, over-organized code with no (current) purpose.

So I moved the Event class’s code into the WebEx class. Same with Attendee — now the WebEx class’s code looks like this:

(Gist of the WebEx class)

As you can see, everything is now an instance method of the WebEx object. This means that the syntax for getting a list of Events is now:

w = WebEx.new
events = w.request(w.event_list)

This still looks a little weird to me. I had been thinking that I should make WebEx#request into a class method, so as to have:

events = WebEx.request(w.event_list)

But that would mean having WebEx.request instantiate and return a new object of the WebEx class. There’s nothing wrong with that, but given that another WebEx object already needs to be created in order to call one of its instance methods (event_list), it felt like a case of two objects of the same class being used in completely different ways. Because of that awkwardness, I decided to live with the clunky-but-serviceable all-instance-method approach. After all – there’s a good chance that I’ll refactor it yet again as I go… :-p

Testing

I’m embarrassed to not have spotted this earlier: the class as it had been written before was very hard to test for a couple of major reasons:

  • There was no way to override the XML attribute of one of the WebExmlObjects being returned by class methods Event.list and Attendee.list_for_meeting
  • The HTTP request happened within the WebEx.request method, making it difficult to stub the HTTP request’s response, which had to happen in order to ensure that calling that method during testing didn’t involve net calls.

I solved each of these easily enough: I abstracted the HTTP request into its own method and I added a “payload” argument to each method that returned a WebExmlObject so that I could override its request XML.
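The shape of that refactoring can be sketched roughly as follows. This is not the real class: WebExSketch, SketchXmlObject, the placeholder XML, and the regex-based processor are all hypothetical names and simplifications of mine. What it shows is the two seams: request_post is the only method that would touch the network, and the payload argument lets a spec swap in fixture XML.

```ruby
# Hypothetical stand-in for WebExmlObject: request XML plus a response processor.
SketchXmlObject = Struct.new(:xml, :processor)

class WebExSketch
  DEFAULT_EVENT_XML = "<bodyContent/>" # placeholder, not real WebEx request XML

  # The payload argument lets a spec override the generated request XML.
  def event_list(payload = nil)
    xml = payload || DEFAULT_EVENT_XML
    # Simplified processor: pulls session keys out of a response string.
    processor = proc do |response|
      response.scan(%r{<sessionKey>(\d+)</sessionKey>}).flatten
    end
    SketchXmlObject.new(xml, processor)
  end

  # The only method that would touch the network, so it is the one to stub.
  def request_post(xml)
    raise NotImplementedError, "would POST #{xml.bytesize} bytes over HTTP"
  end

  # Sends the object's XML and hands the raw response to its processor.
  def request(xml_object)
    xml_object.processor.call(request_post(xml_object.xml))
  end
end
```

With request_post isolated, a test can replace it on a single instance (in plain Ruby, by defining a singleton method on that instance) and request never goes near the network.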

After that, it was time to set up some fixtures. I created directories for “request” and “response” in my fixtures dir and added files containing the well-formed XML samples I got from the WebEx docs. Then I wrote methods for opening/reading each of them in my WebExSpecHelper module (this testing is all in RSpec). Below is a test that ensures that WebEx#request is calling WebEx#request_post:

it "should call request_post" do
  @w.should_receive(:request_post).and_return(lst_summary_event_response)
  @w.request(@w.event_list)
end

@w is the instance variable that is created before every spec, and lst_summary_event_response is the name of the spec helper method that returns the fixture of that XML response. There’s no particular reason I called this one as opposed to an attendee-related method – I just needed to assert that the call would happen and then stipulate the response it would give, so any of my helper methods would do.

Here’s that helper method doing what it’s meant to:

it "should return all events if passed a nil time limit" do
  # arguments are positional: payload first, then time_limit
  @events = @w.event_list(lst_summary_event_request, nil).processor.call(@w.doc)
  @events.length.should be(3)
end

There are three events in the fixture, so the length should be three when nothing is passed to time limit. All the fixture data from WebEx was in the past, so I altered the dates in there to have one event in the past, one in the future, and one in the future at a more distant date. Here’s what happens when you pass a time limit past that first (earlier) future date:

it "should return only events happening after the time limit" do
  middle_future_date = "04/02/2012 01:06:49"
  limit = @w.time_from_string(middle_future_date)
  # arguments are positional: payload first, then time_limit
  @events = @w.event_list(lst_summary_event_request, limit).processor.call(@w.doc)
  @events.length.should be(1)
end

Only one result gets returned, because the fixture only has one event listing which has a start date after the date given.
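For readers without the Gist handy, the filtering these two specs exercise might look something like the sketch below. It is a guess at the logic, not the actual processor: events_after is a hypothetical helper, the "startDate" key is borrowed from the view code, and the month/day/year date format is an assumption based on the fixture string above.

```ruby
require "time"

# Assumed format of the date strings: month/day/year, 24-hour clock.
DATE_FORMAT = "%m/%d/%Y %H:%M:%S"

# Parses a date string into a Time object.
def time_from_string(str)
  Time.strptime(str, DATE_FORMAT)
end

# Hypothetical helper: returns every event when time_limit is nil,
# otherwise only the events that start after the limit.
def events_after(events, time_limit = nil)
  return events if time_limit.nil?
  events.select { |event| time_from_string(event["startDate"]) > time_limit }
end
```

Given one past event and two future ones, a limit that falls between the two future dates leaves only the later event, which is the behavior the second spec asserts.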

Next Steps

So far, my unit tests have covered very little – basically just the processor portion of a WebExmlObject. For full coverage, I’ll need to test the :xml attribute, which means validating the generated XML against the XML Schema Definitions (XSDs) WebEx provides with their API docs. Ruby doesn’t have any all-native tools for validating XML against a given XSD, but the libxml library (which is distributed as a gem and gets its power from C bindings it compiles at install time) will let you pass in a schema as a string and then validate against it.


Designing a Rudimentary XML Service with Ruby

events = WebEx.request(Event.list)

That felt the most natural to me. The idea is that you have a WebEx class that’s basically responsible for initiating the connection and handling the boilerplate security stuff that’s going to be at the top of any XML API request. The question then became how to structure the Event class so that it could do two things at once: pass in the XML for the request and process the XML from the response.

WebExmlObject and its Subclasses

I decided to define a WebExmlObject class with “xml” and “processor” attributes:

class WebExmlObject
  attr_accessor :xml, :processor
end

Then I could define subclasses for each of the major types of objects I’d be pulling from WebEx (Attendee and Event, to start). For right now, it doesn’t matter that the subclasses themselves don’t define any new attributes or instance methods – it’s worth it to me to do it this way because I believe that they probably will have to in the future. And in any case, I’m a sucker for aesthetics and simplicity, and I wanted the readability and elegance you get from the class method approach I outlined above. Here’s the Event class and its single class method:

(Event class Gist)

Notice that this method is basically in two halves: the first half creates the XML and the second creates the processor as a Proc object. After creating the XML (using the super-handy Builder library), I create a new Event object and load the XML into it. Then I create a Proc and set it to the event object’s :processor attribute. The processor assumes one argument will be passed to it – the XML response as a REXML document – and it returns an array of hashes, each one representing an event. Only the stuff I care about is in each event hash, an easily extendable list represented by the event_keys variable.
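In case the Gist doesn't load, here is a rough sketch of that two-halves structure. It is a simplification under stated assumptions rather than the real method: the request XML is a placeholder string instead of Builder output, and the un-namespaced element names (event, sessionKey, sessionName, startDate) are hypothetical stand-ins for whatever the WebEx response actually contains.

```ruby
require "rexml/document"

class WebExmlObject
  attr_accessor :xml, :processor
end

class Event < WebExmlObject
  def self.list
    # First half: the request XML (a placeholder string in this sketch).
    event = new
    event.xml = "<bodyContent><listControl/></bodyContent>"

    # Second half: a Proc that turns a REXML document into an array of hashes.
    event_keys = %w[sessionKey sessionName startDate] # extend as needed
    event.processor = proc do |doc|
      doc.get_elements("//event").map do |node|
        event_keys.each_with_object({}) do |key, hash|
          child = node.elements[key]
          hash[key] = child && child.text # nil when the element is absent
        end
      end
    end

    event
  end
end
```

Calling Event.list.processor.call(doc) on a parsed response then gives one hash per event element, containing only the keys listed in event_keys.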

The WebEx Class

Next up is the implementation of the request part. The WebEx class has two methods: request (a class method) and response (an instance method). Request takes a WebExmlObject, adds its XML in as the body of the request, makes the request, parses the response, and returns an object of class WebEx. Response holds the response header. Here’s the full implementation of the class:

(WebEx class Gist)

Putting it Together

I’m still working on the integration part of this, but it’s happening in Sinatra, so I can easily show the short piece that renders the form:

get '/online_demo' do
  events = WebEx.request(Event.list)
  if events.response['result'] == "SUCCESS"
    @events = events.body
    erb :online_demo
  else
    LOGGER.warn('No meetings appear to be available at this time')
  end
end

It’s obviously still a work in progress, but you can see how the XML-to-object abstraction feels pretty well hidden, and the class structure provides the level of elegance and terseness that I was hoping for. If you’re unfamiliar with the way Sinatra works, the above is called when an HTTP GET request is made to the URL “<WEBROOT>/online_demo”. Once the request is made and parsed into the events variable, I create the @events instance variable to hold the actual events array. The last thing I do is call the ERb template “online_demo”. Instance variables created in this block are available to the template in much the same way as the controller/view relationship works in Rails. The view is responsible for iterating over the array and inserting the proper attributes into the proper places:

<label for="demo-startdate">Choose a demo date:</label>
  <select name="sessionKey" id="demo-startdate">
    <option value="">Please select</option>
    <% @events.each do |event| %>
      <option value="<%= event['sessionKey'] %>"><%= event['startDate'] %></option>
    <% end %>
  </select>
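If you want to poke at a template like this outside of Sinatra, Ruby's standard ERB library is enough. The sketch below is only an approximation of what the framework does for you: DemoView is a hypothetical class of mine, and the template is a trimmed copy of the one above. It shows why the instance variable is visible to the template: the template is evaluated against the binding of the object that holds @events.

```ruby
require "erb"

# Trimmed copy of the select markup from the view above.
TEMPLATE = <<~HTML
  <select name="sessionKey" id="demo-startdate">
    <option value="">Please select</option>
    <% @events.each do |event| %>
      <option value="<%= event['sessionKey'] %>"><%= event['startDate'] %></option>
    <% end %>
  </select>
HTML

# Hypothetical stand-in for the route block: owns @events and renders the template.
class DemoView
  def initialize(events)
    @events = events
  end

  # Passing this object's binding is what exposes @events to the template.
  def render
    ERB.new(TEMPLATE).result(binding)
  end
end
```

DemoView.new(events).render returns the markup as a string, with one option element per hash in the events array.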

I’ll post more about this as I move forward, just to document the process of learning how to structure this stuff. I’m still not entirely satisfied with the semantics of the WebEx class’s methods – especially since calling request will set up your object with the data you need and response only gives you access to the header of the response. I pacify myself by reasoning that since one is a class method and one is an instance method, there’s no philosophical/structural problem, but something about it still bugs me.

Big thanks to bona fide code wizard Collin VanDyck for suggestions in this process – as always, I’m proud to be able to call CCV my homie.


Adding a source/repo to RubyGems

Just because I can never remember this, here’s how to get a new repo source registered in RubyGems – you use the “sources” command — as in this piece for registering GitHub as a source for gems:

gem sources -a http://gems.github.com

There. Now I won’t forget the damn command anymore. Or if I do, it’s here in my other brain…
