Using Solr’s AbstractSolrTestCase

This past week I worked with Solr’s AbstractSolrTestCase, which extends JUnit’s TestCase. In theory, this makes it easier to create tests that hit an index and run through the entire search pipeline if necessary.

Unfortunately, there isn’t much documentation to help out, but there are plenty of examples within Solr’s source.

That being said, here are a few things I found out while working with it.

Because of the way the setUp method works, I basically needed to duplicate much of its functionality instead of calling super.setUp(). By default, setUp creates Solr’s data directory under java.io.tmpdir (generally /tmp on Unix systems), named after the test class plus a timestamp. This was a problem for us because it meant a fresh index would be created in a new directory for each test.

I realize the need for having atomic data for unit tests, but I viewed these Solr tests more as integration tests than true unit tests. They were going through the entire system as opposed to focusing on just one class or section.

To create the index, we hooked pieces up to our current indexing pipeline, a very nice plug-in system we developed that passes data through various stations to either clean it or retrieve more of it. Thankfully, only a few places actually interacted with Solr, so I was able to mock that communication out and just pass the collected data to the adoc / update methods.

Because the pipeline wasn’t instantaneous, I wanted to reuse the indexes as much as possible. I figured a good middle ground would be for each test class to have its own index and to allow it to tell the indexing pipeline what data it wanted to index. That index would stay until its physical directory was deleted, and then it would be recreated with updated data.

So I basically had to copy much of the existing setUp method, create the data directory with the test class name but no timestamp, and make the tearDown method a no-op.
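To make the directory naming concrete, here is a minimal sketch of the difference between the two schemes. It is not Solr’s code and not our actual base class: TestIndexDirs and its methods are hypothetical names, the "-" separator before the timestamp is an assumption, and it uses only java.io.File so it runs without Solr on the classpath.

```java
import java.io.File;

// Hypothetical helper (not part of Solr) that picks the data directory for a test class.
final class TestIndexDirs {

    private TestIndexDirs() {}

    // Per-run layout as described above: java.io.tmpdir + class name + timestamp.
    // The "-" separator is an assumption made for this illustration.
    static File perRunDir(Class<?> testClass, long timestampMillis) {
        return new File(System.getProperty("java.io.tmpdir"),
                testClass.getName() + "-" + timestampMillis);
    }

    // Stable layout: the same path on every run, so an existing index can be reused.
    static File stableDir(Class<?> testClass) {
        return new File(System.getProperty("java.io.tmpdir"), testClass.getName());
    }

    // True when the directory is missing or empty, i.e. the index has to be (re)built.
    static boolean needsIndexing(File dataDir) {
        String[] entries = dataDir.list(); // null if the path is missing or not a directory
        return entries == null || entries.length == 0;
    }
}
```

A setUp built on this idea would call stableDir(getClass()), run the indexing pipeline only when needsIndexing returns true, and leave the directory in place in tearDown. Deleting the directory by hand then forces a rebuild on the next run.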

With all of this done, I now have a class that any developer can extend, which will hopefully increase our test coverage.

Google Flu Trends

Lots of people are linking to it, but Google’s Flu Trends is a pretty amazing site.

Having access to the incredible amount of data Google has can provide insights that were not previously possible. I really think the idea that the CDC was up to two weeks behind in noticing the outbreaks says the most.

You can also download the raw data and display it in other ways if you’d like.

Creating a Search Engine

Rich Skrenta knows a thing or two about search engines and crawlers. Here’s his easy two-step process for building your own.

Step 1 is to copy the internet onto your cluster. Step 2 is to analyze it.

Search is like 7 hard problems wrapped into a stack. Distributed systems, html analytics, text analytics/semantics, anti-spam, AI/ML, frontend/UI. And scale… Apart from the sexy high end algos there are also the boring 10-year old system libraries and off-the-shelf tools that crack under stress and sometimes need a look. You open the hood and wonder how the thing ever worked in the first place…