Wrigley on Ice

Tomorrow at 10AM Pacific, I’ll be glued to the set, watching the Blackhawks battle the hated Red Wings. I’m pretty excited to watch this game, though I did try everything possible to get back to Chicago for it. By not going, though, I get to watch with the kids, and that’ll be much, much better.

Bridges to the Future

Kevin Matheny has written a really excellent piece at BusinessWeek, extolling the virtues of agile software development. Getting there can be one of the toughest battles within a large organization, but if you win and are allowed to be flexible, the benefits easily outweigh any struggles you’ll have.

What this means for managing projects—including any project that relies on the Internet to deliver its value proposition—is simple: The longer your project timeline, the greater the risk that what you deliver will not be what you or your customers need when you deliver it. Not only are longer-term projects more likely to fail due to changes in requirements or conditions during the project, they’re more expensive. This increases the cost of failure. And because we can only do a few of them in a year, the impact of any one failure is huge.

Dynamic DNS

When I worked at CollabNet many moons ago, I used a laptop for my normal development but also used a desktop as a server for builds and testing things locally.

Since I worked from home, both were behind a router, but I didn’t want to pony up money for a static IP, so you never quite knew what the IP was if things were reconnected. Obviously this wasn’t a big deal when I was sitting at my desk at home, but it became an issue when I was up at HQ in San Francisco.

My solution was to have a cron job run every hour on the server and compare its IP with the one it had an hour ago. If it changed, it emailed me so I would know.
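For the curious, here is a minimal sketch of that kind of check in Python. It is not the script I actually ran. The names `check_ip`, `state_file`, `fetch_ip` and `notify` are all made up for illustration, and how you look up your public IP or send the email is left to whatever you pass in.

```python
from pathlib import Path
from typing import Callable


def check_ip(
    state_file: Path,
    fetch_ip: Callable[[], str],
    notify: Callable[[str, str], None],
) -> bool:
    """Compare the current IP with the one saved on the previous run.

    Returns True when the address changed and a notification was sent.
    fetch_ip and notify are hypothetical hooks supplied by the caller.
    """
    current = fetch_ip().strip()
    previous = state_file.read_text().strip() if state_file.exists() else ""

    if current == previous:
        return False  # nothing changed since the last run

    # Remember the new address for the next run.
    state_file.write_text(current + "\n")

    # On the very first run there is nothing to compare against, so stay quiet.
    if not previous:
        return False

    notify(previous, current)
    return True
```

Run something like this from cron once an hour, with `notify` sending an email (or a tweet), and you get the same effect. Passing the lookup and the notification in as functions also makes it easy to try out without touching the network.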

I mention all of this because Jeremy Zawodny has done the same thing though he has made it much more Web 2.0-compliant by using Twitter.

I imagine I would do exactly that as well if I needed to right now.

The end of Out of Town News

Dave brings word of the closing of Out of Town News in the heart of Harvard Square. While totally understandable, given the current shape newspapers are in, it still is pretty sad for me on a personal level.

When I lived in Cambridge, OoTN was a daily stop on my way to the T. I grabbed the Globe or the New York Times and headed into work. Once my daughter was born, we had a father-daughter walk each Sunday morning. She was in a stroller or a Baby Bjorn and we walked to get the Sunday New York Times. It allowed my wife to have a few minutes’ peace without a newborn in the house. I look back very, very fondly on that time, especially with my daughter now being almost a decade old.

Though some are trying to keep it going, it doesn’t look good for Out Of Town News. I’m sure others have memories of it but these are mine.

Getting Down to the Metal

Rails Metal looks pretty darn awesome. It allows you to specify specific URI paths which will bypass the normal Rails stack, shaving precious milliseconds off your responses and not making the Baby Jesus cry.

As a byproduct, with a simple config item, you can start using Rack::Cache, a very good HTTP cache that will normally give you enough of a benefit until your traffic really takes off.

Jesse Newland has an even better overview of Metal including an example of using Sinatra with it.

Scoble on Twitter vs. FriendFeed

Robert Scoble posts about why he thinks Twitter is for some people while FriendFeed is not. Wow! I don’t think I’ve ever read anything more arrogant or pandering. I’ve followed Scoble’s blog since his days before joining UserLand but I can’t recall anything like this.

Personally, I haven’t found the need for FriendFeed. I keep up with folks in my social graph pretty well right now with a mixture of Twitter, Facebook and RSS feeds. Of course, I’m not trying to follow the number of people Scoble is, so we are using things differently.

I don’t mind lists like this but it really doesn’t have to be so arrogant.

Hadoop’ing at My Desk

Last week, I started scrounging around the office for some unused PCs. Unfortunately, there were more than just a few because of all the things going on at the Times. I grabbed three, put them on my desk and spent the rest of a day installing Ubuntu on them. Everything went really smoothly and I was very pleasantly surprised that our IT department didn’t give me a hard time for wanting a switch in the office.

I used this post to help set up a Hadoop cluster. It went really smoothly and before I knew it, the future was sitting on my desk.

Why the future? The amount of data used by companies is increasing way beyond what it used to be and systems like Hadoop allow for that data to be dealt with in more humane ways than stuffing it into some sort of database and hoping your SQL-fu can slice and dice.

Of course, a three-node Hadoop cluster isn’t that impressive when you compare it to the 4,000-node one used by Yahoo!. But that’s OK since this is just the beginning.

What am I doing with all this power, you ask? Well, let me give you an example. I have 10 years’ worth of archives loaded into the cluster. As part of the Articles project, I turned each text file into an Atom representation, which has allowed us to do various things with the metadata. At first, I put each individual file into HDFS (the Hadoop Distributed File System), but then I would have needed to write some additional code for Hadoop to treat each file as a single record, as opposed to its default of reading each file line by line. Eventually I’ll do that, but it would have been yak shaving at the beginning.

Instead, I collapsed the files from each month into one, with each line being a story. This allowed Hadoop’s default splitter to go crazy. One of the first Map/Reduce jobs I wrote was to go through each story, find all of the A1 (front page) stories and see who wrote them. That would be the Map part of it, while the Reduce piece added all of the instances together so you could easily see the leaderboard. I mentioned this to one of my colleagues and he warned me that having that data fall into the wrong hands could destroy the newsroom. I think he was kidding but I’m not that sure.
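To make the Map and Reduce halves concrete, here is a rough sketch in plain Python. It runs in a single process rather than on Hadoop, and it is not the job I wrote. It also assumes a made-up line layout (page, author and story text separated by tabs) instead of the Atom representation, and the function names are invented for illustration.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def map_story(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (author, 1) for every front-page story.

    Assumes a hypothetical tab-separated layout: page, author, story text.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 2 and fields[0] == "A1":
        yield fields[1], 1


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Counter:
    """Reduce step: add up the counts emitted for each author."""
    totals: Counter = Counter()
    for author, count in pairs:
        totals[author] += count
    return totals


def leaderboard(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Run map then reduce over one-story-per-line input, busiest author first."""
    pairs = (pair for line in lines for pair in map_story(line))
    return reduce_counts(pairs).most_common()
```

On the cluster, the map step runs in parallel over chunks of the monthly files and the framework groups the pairs by author before the reduce step sees them. The shape of the two functions stays the same, though.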

Other tests have been seeing what the breakdown of sections (News, Sports, Business, etc.) has been on the front page, what keywords have been used the most across all 10 years as well as on the front page and, more recently, using the keywords to try to train a Naive Bayes classifier using Mahout. That one didn’t really work well but the idea still intrigues me.

In all the talk of the demise of the newspaper, one thing still bothers me. Newspapers are one of the few organizations that have real information about the past, information beyond just the facts. Doing things with this information can only help find the proper place for newspapers and the data they’ve created.

Hadoop isn’t some sort of cure-all for the woes we face, but I think it gives a glimpse of how a future news organization could use data to do incredible things and give users a much different relationship with the news, one they would renew every day.