- Using Solr’s AbstractSolrTestCase
This past week I worked on utilizing Solr’s AbstractSolrTestCase which extends JUnit’s TestCase. In theory, this makes it easier to create tests that hit an index and run thru the entire search pipeline if necessary.
Unfortunately, there aren’t a ton of docs to help out but there are plenty of examples within Solr’s source to help.
That being said, here are a few things I found out while working with it.
Because of the way the setUp method worked, I basically needed to duplicate much of its functionality instead of calling super.setUp(). By default, the setUp method creates the data directory for Solr under java.io.tmpdir (generally /tmp on Unix systems), in a directory named after the class plus a timestamp. This was a problem for us because it meant that each test would create its index in a new directory.
I realize the need for having atomic data for unit tests but I viewed these Solr tests more as integration tests than true unit tests. They were going thru the entire system as opposed to focusing on just one class or section.
To create the index, we were hooking pieces up to our current indexing pipeline, a very nice plug-in system we developed to go through various stations to either clean data or retrieve more of it. Thankfully only a few places actually interacted with Solr so I was able to mock that communication out and just use the data collected and give it to the adoc / update methods.
Because the pipeline wasn’t instantaneous, I wanted to reuse the indexes as much as possible. I figured a good middle ground for this would be for each test class to have its own index and allow it to give the indexing pipeline information about what data it wanted to index. That index would stay until the physical directory was deleted and then it would be recreated with updated data.
So I basically had to copy much of the existing setUp method and create the data directory with the test class name but no timestamp as well as make the tearDown method a no-op.
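Here is a rough sketch of the directory part of that idea. It deliberately doesn't touch any Solr classes: StableIndexDir and its methods are names I made up for illustration, and the timestamped variant is only my approximation of the default naming described above, not a copy of Solr's code.

```java
import java.io.File;

// Hypothetical helper, not part of Solr: it only computes and prepares directories.
class StableIndexDir {

    // Approximation of the default behavior: class name plus a timestamp,
    // so every call produces a different directory.
    static File timestampedDir(Class<?> testClass) {
        return new File(System.getProperty("java.io.tmpdir"),
                testClass.getName() + "-" + System.currentTimeMillis());
    }

    // The variant I wanted: class name only, so the directory is the same on every run.
    static File stableDir(Class<?> testClass) {
        return new File(System.getProperty("java.io.tmpdir"), testClass.getName());
    }

    // Returns true when the directory had to be created, meaning the index
    // must be built; returns false when an existing directory can be reused.
    static boolean prepare(File dataDir) {
        if (dataDir.isDirectory()) {
            return false;
        }
        if (!dataDir.mkdirs()) {
            throw new IllegalStateException("Could not create " + dataDir);
        }
        return true;
    }

    public static void main(String[] args) {
        File dir = stableDir(StableIndexDir.class);
        boolean rebuild = prepare(dir);
        System.out.println(dir + (rebuild ? " created, index needs building" : " reused"));
    }
}
```

A setUp method built this way would only run the indexing pipeline when prepare returns true, and deleting the directory by hand forces a rebuild on the next run.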
With all of this done, I now have a class which any developer can extend which hopefully will increase our test coverage.
- A Good Day for Hadoop
Yesterday was a very good day for the Hadoop project.
Yahoo! announced they used a roughly 3800 node cluster to sort thru a Petabyte of data in a little over 16 hours. It’s an amazing feat for any project but especially one with so much potential as Hadoop.
The other good news was the release of mrtoolkit, a map-reduce library written in Ruby. It utilizes Hadoop Streaming and will make it easy to run jobs and crunch data. It comes out of the New York Times dev group and I applaud them.
I’ll have to figure out what the difference between mrtoolkit and Wukong is so hopefully some sort of merging of the two can happen.
- Testing with Redis
Long time, no blog… But enough about that.
On the side, I’ve been working on a new aggregator, Aggir, which allows me to test various things. I started off using SQLite and Sequel for storage, put Solr behind the scenes for search and added a very simple Web UI using Sinatra and HAML. Yeah, I think I pretty much used all the necessary hot projects. It was fun to build and it works pretty well right now.
I have more to do on the Solr front since I’m just using the defaults for relevance searching. I’d like to dig more into the Solr internals for additional query parsing and classification at index time. It’s some of the stuff I’ve been doing at work but wanted to use a different type of data set.
Of course, now that I had things somewhat stable, I decided to blow it all up and try something new. That something new is Redis using Ezra’s client library.
I started down the path of updating everything, ripping out the database storage to use Redis instead. So far so good, I have the start of this on a branch. One issue I found though was testing my code. It was simple with Sequel since I could create a different database without any worry of overwriting real data. With Redis, I can easily delete keys in between tests, but the keys were the same ones a real update would use, so non-test data would be deleted.
I think I’ve come up with a solution that at least is working for me. I’ve made each key combine a prefix with other data. The prefixes are defined as class variables. I only set them in the library code if they haven’t already been defined elsewhere. In my test code, I set them with an additional test-specific prefix so that I can easily delete all of the testing keys by using the keys(’test_*’) method. This will allow me to walk thru all of the test keys created during a test and delete them before running the next test. This mirrors what is done with the database.
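To make the prefix idea concrete, here is a minimal sketch. My actual code is Ruby talking to Redis; this version is Java with a plain in-memory map standing in for Redis so it runs without a server. PrefixedKeys, postPrefix, postKey and deleteByPrefix are all made-up names for illustration, and the prefix scan only imitates what a keys(’test_*’) lookup followed by deletes would do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: an in-memory map stands in for Redis.
class PrefixedKeys {

    static final Map<String, String> store = new TreeMap<>();

    // Library default; test code overwrites this before writing anything.
    static String postPrefix = "post";

    // Every key is the current prefix combined with other data.
    static String postKey(String id) {
        return postPrefix + ":" + id;
    }

    // Imitates keys("test_*") plus a delete of each match.
    // Returns how many keys were removed.
    static int deleteByPrefix(String prefix) {
        List<String> matches = new ArrayList<>();
        for (String key : store.keySet()) {
            if (key.startsWith(prefix)) {
                matches.add(key);
            }
        }
        for (String key : matches) {
            store.remove(key);
        }
        return matches.size();
    }

    public static void main(String[] args) {
        store.put(postKey("1"), "real post");      // stored under post:1
        postPrefix = "test_post";                  // what the test setup would do
        store.put(postKey("1"), "fixture post");   // stored under test_post:1
        int removed = deleteByPrefix("test_");
        System.out.println(removed + " test key(s) removed, " + store.size() + " real key(s) left");
    }
}
```

Because the test prefix is the only thing that differs, the real post:1 entry survives the cleanup while test_post:1 is removed.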
I’m now able to test on the same instance that I’ve loaded with posts from various blogs. I have more to say about the mindset change from a relational db to key-value storage but I wanted to get this post out.
- EarthLink, Short Term Profit but Long Term?
I worked for EarthLink three different times so it always holds a special place in my heart. It is tough to read things like this though. With all the cuts they’ve done, they were profitable in 2008 but really what does the future hold?
I’ve talked to the few people still there and it really is just a skeleton operation technically and eventually that will need to be cut. It really is too bad since there was always so much promise but really not as much execution.
All of this leaves EarthLink without a clear growth strategy. Once dial-up dies off, the company has no wireless or fixed infrastructure of its own to offer competing services. And even though cost-cutting has helped the company return to profitability, it won’t help solve the company’s fundamental problem, which is a lack of future strategy.
- The Numerati
If asked for a list of books which give a basic overview of the things I do as a coder, I usually suggest Microserfs, Hackers and Painters and maybe something like The Cathedral and the Bazaar. Now though, I think I’ll add The Numerati to that list. It isn’t that my work makes me one of the Numerati but it does give a view of how the world is changing and what sorts of things computer systems will be handling in the future.
I enjoyed this book quite a bit. My only quibble was the lack of real meat in the discussion about the math and the systems but it’s understandable since this was a book for the mainstream not geeks like me.
The ability to take large amounts of data and analyze it would seem to be something only companies would be able to do but I think individuals can do their own now. You could use the combination of EC2, Hadoop and Mahout and become a Numerati yourself.
- The McGwire Brothers
Deadspin posted about Mark McGwire’s brother shopping around a book proposal, showing the truth about his use of steroids and how his brother, Jay, was the first to inject him.
This is pretty weird for me because I played football with Jay when he was a senior and I was a junior. There were always little bits of chatter about his strength and workout routine and some speculated he was getting help.
It’s a crazy, connected world sometimes.
- The New Marching Orders
Now, there are some who question the scale of our ambitions — who suggest that our system cannot tolerate too many big plans. Their memories are short. For they have forgotten what this country has already done; what free men and women can achieve when imagination is joined to common purpose, and necessity to courage.
- National Day of Service
Somehow I missed that Monday is being changed from just a holiday to a day of service. Unfortunately, I have to work but perhaps I can still figure out a way to get involved.
Seth Godin put together a great list of possible ways to make an impact in the world.
Obie and the crew at Hashrocket are going to build things for the Apps for America contest.
I don’t know what I’m going to do but I want it to be meaningful whether that is Monday or beyond.
- Flow Chart for Data Visualization
When you are wondering how best to show data, use this flow chart. [via]
- 2009: The Year of LinkedIn
In the current economic struggles, I think a site like LinkedIn is going to become more and more important. Sarah Lacy posts about this, taking the view that LinkedIn’s engagement is going to go up since more people will be looking for work and using their social contacts to find it.
