A Good Day for Hadoop

Yesterday was a very good day for the Hadoop project.

Yahoo! announced they used a roughly 3800 node cluster to sort thru a Petabyte of data in a little over 16 hours. It’s an amazing feat for any project but especially one with so much potential as Hadoop.

The other good news was the release of mrtoolkit, a map-reduce library written in Ruby. It utilizes Hadoop Streaming and will make it easy to run jobs and crunch data. It comes out of the New York Times dev group and I applaud them.

I’ll have to figure out what the difference is between mrtoolkit and Wukong is so hopefully some sort of merging of the two can happen.

Hadoop’ing at My Desk

photo.jpg

Last week, I started scrounging around the office for some unused PC’s. Unfortunately, they were more than just a few because of all the things going on at the Times. I grabbed three, put them on my desk and spent the rest of a day installing Ubuntu on them. Everything went really smoothly and I was very pleasantly surprised that our IT department didn’t give me a hard time for wanting a switch in the office.

I used this post to help setup a Hadoop cluster. It went really smoothly and before I knew it, the future was sitting on my desk.

Why the future? The amount of data used by companies is increasing way beyond what it used to be and systems like Hadoop allow for that data to be dealt with in more humane ways than stuffing it into some sort of database and hoping your SQL-fu can slice and dice.

Of course a three node Hadoop cluster isn’t that impressive when you compare it to the 4000 node one used by Yahoo!. But that’s ok since this is just the beginning.

What am I doing with all this power you ask? Well, let me give you an example. I have 10 years worth of archives loaded into the cluster. As part of the Articles project, I turned each text file into an Atom representation which has allowed us to do various things with the metadata. At first, I put each individual file into the HDFS (Hadoop Distributed FileSystem) but then I would have needed to write some additional code for Hadoop to look at the files individually as opposed to the default of looking at the selection of lines in each file. Eventually I’ll do that but it would have been yak shaving at the beginning.

Instead, I collapsed files from each month into one, having each line but a story. This allowed Hadoop’s default splitter to go crazy. One of the first Map/Reduce jobs I wrote was to go through each story, find all of the A1 (front page) stories and see who wrote it. That would be the Map part of it while the Reduce piece added all of the instances together so you could easily see the leaderboard. I mentioned this to one of my colleagues and warned me that having that data fall into the wrong hands could destroy the newsroom. I think he was kidding but I’m not that sure.

Other tests have been seeing what the breakdown of sections (News, Sports, Business, etc) have been on the front page, what keywords have been used the most across all 10 years as well as on the front page and more recently, using the keywords to try and train a Naive Bayes classifier using Mahout. That one didn’t really work well but the idea still intrigues me.

In all the talk of the demise of the newspaper, one thing still bothers me. Newspapers are one of the few organizations that has real information about the past, information beyond just the facts. Doing things with this information can only help find the proper place for newspapers and the data they’ve created.

Hadoop isn’t some sort of cure all for the woes we face but I think it gives a glimpse of how a future news organization could use data to do incredible things and give users a much different relationship with the news, one they would renew every day.

Looking into HBase

HBase is the open source implementation of Google’s Bigtable. I’ve been keeping my eye on it in combination with Hadoop. I had some extra time today so I decided to see how easy it would be to hook it up with the aggregator we built for things like Topics.

One of the nice things about HBase is the REST interface that can read and write data. I hooked up the Ruby client so that whenever I saved posts from the feed to MySQL, it would also send data to HBase.

The writing to HBase is pretty straightforward and the REST client makes it really easy. However, getting the data out needs to be looked at a bit more closely.

HBase is NOT a relational database. If you approach like it is, you will get utterly confused and frustrated. Instead, it can be thought of as a collection of Maps. So, in order to get data out, you need to iterate over the Maps looking for particular columns.

When you use the REST API, you do this via the creation of a scanner and pop‘ing off the results like from a queue.

That’s some of what I found out, let’s see what else I can dig into today.