<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>lucasjosh.com &#187; Data</title>
	<atom:link href="http://lucasjosh.com/blog/category/data/feed/" rel="self" type="application/rss+xml" />
	<link>http://lucasjosh.com/blog</link>
	<description></description>
	<lastBuildDate>Mon, 01 Mar 2010 14:15:08 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Testing with Redis</title>
		<link>http://lucasjosh.com/blog/2009/03/18/testing-with-redis/</link>
		<comments>http://lucasjosh.com/blog/2009/03/18/testing-with-redis/#comments</comments>
		<pubDate>Wed, 18 Mar 2009 19:16:04 +0000</pubDate>
		<dc:creator>josh</dc:creator>
				<category><![CDATA[Code]]></category>
		<category><![CDATA[Data]]></category>
		<category><![CDATA[Databases]]></category>
		<category><![CDATA[Development]]></category>
		<category><![CDATA[Ruby]]></category>

		<guid isPermaLink="false">http://lucasjosh.com/blog/?p=265</guid>
		<description><![CDATA[Long time, no blog&#8230;  But enough about that.
On the side, I&#8217;ve been working on a new aggregator, Aggir, which allows me to test various things.  I started off using SQLite and Sequel for storage, put Solr behind the scenes for search and added a very simple Web UI using Sinatra and HAML.  [...]]]></description>
			<content:encoded><![CDATA[<p>Long time, no blog&#8230;  But enough about that.</p>
<p>On the side, I&#8217;ve been working on a new aggregator, <a href="http://github.com/lucasjosh/aggir/tree/master">Aggir</a>, which allows me to test various things.  I started off using SQLite and Sequel for storage, put Solr behind the scenes for search and added a very simple Web UI using Sinatra and HAML.  Yeah, I think I pretty much used all the necessary <i>hot</i> projects.  It was fun to build and it works pretty well right now.    </p>
<p>I have more to do on the Solr front since I&#8217;m just using the defaults for relevance searching.  I&#8217;d like to dig more into the Solr internals for additional query parsing and classification at index time.  It&#8217;s some of the stuff I&#8217;ve been doing at work but wanted to use a different type of data set.</p>
<p>Of course, now that I had things somewhat stable, I decided to blow it all up and try something new.  That something new is <a href="http://code.google.com/p/redis/">Redis</a> using <a href="http://github.com/ezmobius/redis-rb/tree/master">Ezra&#8217;s client library</a>.  </p>
<p>I started down the path of updating everything, ripping out the database storage to use Redis instead.  So far so good; I have <a href="http://github.com/lucasjosh/aggir/tree/redis-storage">the start of this on a branch</a>.  One issue I found, though, was testing my code.  It was <i>simple</i> with Sequel since I could create a different database without any worry of overwriting real data.  With Redis, I can easily delete keys in between tests, but the keys were the same ones that a real update would use, so non-test data would be deleted. </p>
<p>I think I&#8217;ve come up with a solution that at least is working for me.  I&#8217;ve made each key combine a prefix with other data.  The prefixes are defined as class variables.  I only set them in the library code if they haven&#8217;t already been defined elsewhere.  In my test code, I set them with an additional test-specific prefix so that I can easily delete all of the testing keys by using the keys(&#8216;test_*&#8217;) method.  This will allow me to walk thru all of the test keys created during a test and delete them before running the next test.  This mirrors what is done with the database.</p>
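<p>The prefix idea above can be sketched roughly as follows.  This is a minimal illustration in Python rather than the Ruby used in Aggir, and it makes several assumptions: <code>FakeStore</code> is a hypothetical in-memory stand-in for a Redis client, and <code>Post</code>, <code>key_prefix</code> and <code>clear_test_keys</code> are made-up names, not the actual Aggir code.</p>

```python
import fnmatch


class FakeStore:
    """Hypothetical in-memory stand-in for a Redis client (illustration only)."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def keys(self, pattern="*"):
        # Glob-style matching, similar in spirit to the Redis KEYS command.
        return [k for k in self._data if fnmatch.fnmatchcase(k, pattern)]

    def delete(self, *keys):
        for key in keys:
            self._data.pop(key, None)


class Post:
    # Library default; test code overrides this before touching the store.
    key_prefix = "post:"

    @classmethod
    def key_for(cls, post_id):
        # Every key combines the class-level prefix with other data.
        return f"{cls.key_prefix}{post_id}"

    @classmethod
    def save(cls, store, post_id, title):
        store.set(cls.key_for(post_id), title)


def clear_test_keys(store):
    """Delete every key written under the test prefix, leaving real data alone."""
    test_keys = store.keys("test_*")
    if test_keys:
        store.delete(*test_keys)


store = FakeStore()
Post.save(store, 1, "real post")      # stored under "post:1"
Post.key_prefix = "test_post:"        # what a test setup step would do
Post.save(store, 1, "fixture post")   # stored under "test_post:1"
clear_test_keys(store)                # run between tests
remaining = store.keys("*")
```

<p>Because the test prefix is set before any test writes happen, cleanup only ever touches keys matching <code>test_*</code>, and the real key written earlier is left in place.</p>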
<p>I&#8217;m now able to test on the same instance that I&#8217;ve loaded with posts from various blogs.  I have more to say about the mindset change from a relational db to key-value storage but I wanted to get this post out.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucasjosh.com/blog/2009/03/18/testing-with-redis/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Flow Chart for Data Visualization</title>
		<link>http://lucasjosh.com/blog/2009/01/15/flow-chart-for-data-visualization/</link>
		<comments>http://lucasjosh.com/blog/2009/01/15/flow-chart-for-data-visualization/#comments</comments>
		<pubDate>Thu, 15 Jan 2009 15:21:17 +0000</pubDate>
		<dc:creator>josh</dc:creator>
				<category><![CDATA[Data]]></category>

		<guid isPermaLink="false">http://lucasjosh.com/blog/?p=245</guid>
		<description><![CDATA[When you are wondering how best to show data, use this flow chart.  [via]
]]></description>
			<content:encoded><![CDATA[<p>When you are wondering how best to show data, <a href="http://www.flickr.com/photos/amit-agarwal/3196386402/sizes/l/">use this flow chart</a>.  [<a href="http://flowingdata.com/2009/01/15/flow-chart-shows-you-what-chart-to-use/">via</a>]</p>
]]></content:encoded>
			<wfw:commentRss>http://lucasjosh.com/blog/2009/01/15/flow-chart-for-data-visualization/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Hadoop&#8217;ing at My Desk</title>
		<link>http://lucasjosh.com/blog/2008/12/01/hadooping-at-my-desk/</link>
		<comments>http://lucasjosh.com/blog/2008/12/01/hadooping-at-my-desk/#comments</comments>
		<pubDate>Mon, 01 Dec 2008 21:53:16 +0000</pubDate>
		<dc:creator>josh</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Newspapers]]></category>

		<guid isPermaLink="false">http://lucasjosh.com/blog/?p=206</guid>
		<description><![CDATA[
Last week, I started scrounging around the office for some unused PCs.  Unfortunately, there were more than just a few because of all the things going on at the Times.  I grabbed three, put them on my desk and spent the rest of a day installing Ubuntu on them.  Everything went really [...]]]></description>
			<content:encoded><![CDATA[<div style="text-align:center;"><img src="http://lucasjosh.com/blog/wp-content/uploads/2008/11/photo.jpg" alt="photo.jpg" border="0" width="400"  /></div>
<p>Last week, I started scrounging around the office for some unused PCs.  Unfortunately, there were more than just a few because of all the things going on at the Times.  I grabbed three, put them on my desk and spent the rest of a day installing <a href="http://ubuntu.com/">Ubuntu</a> on them.  Everything went really smoothly and I was very pleasantly surprised that our IT department didn&#8217;t give me a hard time for wanting a switch in the office.</p>
<p>I used <a href="http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)">this post</a> to help set up a <a href="http://hadoop.apache.org">Hadoop</a> cluster.  It went really smoothly and before I knew it, the future was sitting on my desk.</p>
<p>Why the future?  <a href="http://www.newscientist.com/article/dn16162-what-the-data-miners-are-digging-up-about-you.html">The amount of data used by companies</a> is increasing way beyond what it used to be and systems like Hadoop allow for that data to be dealt with in more humane ways than stuffing it into some sort of database and hoping your SQL-fu can slice and dice.</p>
<p>Of course a three node Hadoop cluster isn&#8217;t that impressive when you compare it to the <a href="http://developer.yahoo.net/blogs/hadoop/2008/09/scaling_hadoop_to_4000_nodes_a.html">4000 node one used by Yahoo!</a>.  But that&#8217;s ok since this is just the beginning.</p>
<p>What am I doing with all this power you ask?  Well, let me give you an example.  I have 10 years worth of archives loaded into the cluster.  As part of the <a href="http://articles.latimes.com">Articles project</a>, I turned each text file into an Atom representation which has allowed us to do various things with the metadata.  At first, I put each individual file into the HDFS (Hadoop Distributed FileSystem) but then I would have needed to write some additional code for Hadoop to look at the files individually as opposed to the default of looking at the selection of lines in each file.  Eventually I&#8217;ll do that but it would have been yak shaving at the beginning.</p>
<p>Instead, I collapsed files from each month into one, having each line be a story.  This allowed Hadoop&#8217;s default splitter to go crazy.  One of the first Map/Reduce jobs I wrote was to go through each story, find all of the A1 (front page) stories and see who wrote them.  That would be the Map part of it while the Reduce piece added all of the instances together so you could easily see the leaderboard.  I mentioned this to one of <a href="http://www.palewire.com/">my colleagues</a> and he warned me that having that data fall into the wrong hands could destroy the newsroom.  I think he was kidding but I&#8217;m not that sure.</p>
<p>Other tests have been seeing what the breakdown of sections (News, Sports, Business, etc) has been on the front page, what keywords have been used the most across all 10 years as well as on the front page and, more recently, using the keywords to try and train a Naive Bayes classifier using <a href="http://cwiki.apache.org/MAHOUT/index.html">Mahout</a>.  That one didn&#8217;t really work well but the idea still intrigues me.</p>
<p>In all the talk of the demise of the newspaper, one thing still bothers me.  Newspapers are one of the few organizations that have real information about the past, information beyond just the facts.  Doing things with this information can only help find the proper place for newspapers and the data they&#8217;ve created.  </p>
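<p>The shape of that byline job can be sketched in plain Python.  This is only an illustration and not the job that ran on the cluster: the tab-separated line layout (page, byline, headline) and the sample stories are assumptions made up for the example, since the real input was an Atom representation of each story, and the function names are hypothetical.</p>

```python
from collections import defaultdict


def map_story(line):
    """Map step: emit (byline, 1) for each story that ran on page A1.

    Assumes a hypothetical one-story-per-line layout: page, byline, headline,
    separated by tabs.
    """
    page, byline, _headline = line.rstrip("\n").split("\t", 2)
    if page == "A1":
        yield byline, 1


def reduce_counts(pairs):
    """Reduce step: add up the emitted counts for each byline."""
    totals = defaultdict(int)
    for byline, count in pairs:
        totals[byline] += count
    return dict(totals)


# Made-up sample input, one story per line.
stories = [
    "A1\tWriter One\tHeadline a",
    "B3\tWriter Two\tHeadline b",
    "A1\tWriter One\tHeadline c",
    "A1\tWriter Two\tHeadline d",
]

pairs = [pair for line in stories for pair in map_story(line)]

# Sort by count (descending), then by name, to get the leaderboard.
leaderboard = sorted(reduce_counts(pairs).items(), key=lambda kv: (-kv[1], kv[0]))
```

<p>Keeping one story per line is what lets a line-oriented splitter hand each mapper complete stories without any custom input handling.</p>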
<p>Hadoop isn&#8217;t some sort of cure all for the woes we face but I think it gives a glimpse of how a future news organization could use data to do incredible things and give users a much different relationship with the news, one they would renew every day.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucasjosh.com/blog/2008/12/01/hadooping-at-my-desk/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Google Flu Trends</title>
		<link>http://lucasjosh.com/blog/2008/11/11/google-flu-trends/</link>
		<comments>http://lucasjosh.com/blog/2008/11/11/google-flu-trends/#comments</comments>
		<pubDate>Wed, 12 Nov 2008 07:15:07 +0000</pubDate>
		<dc:creator>josh</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Search]]></category>

		<guid isPermaLink="false">http://lucasjosh.com/blog/?p=199</guid>
		<description><![CDATA[Lots of people are linking to it but Google&#8217;s Flu Trends is a pretty amazing site.  
The things you can figure out when you have the incredible amount of data Google has access to can provide insights into things previously not possible.  I really think the idea that the CDC was up to [...]]]></description>
			<content:encoded><![CDATA[<p>Lots of people are linking to it but <a href="http://www.google.org/flutrends/">Google&#8217;s Flu Trends</a> is a pretty amazing site.  </p>
<p>The things you can figure out when you have the incredible amount of data Google has access to can provide insights into things previously not possible.  I really think the idea that the CDC was up to two weeks behind in noticing the outbreaks says the most.</p>
<p>You can also <a href="http://www.google.org/about/flutrends/download.html">download the raw data</a> and display it in other ways if you&#8217;d like.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucasjosh.com/blog/2008/11/11/google-flu-trends/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Living Behind the Pay Wall</title>
		<link>http://lucasjosh.com/blog/2008/10/15/living-behind-the-pay-wall/</link>
		<comments>http://lucasjosh.com/blog/2008/10/15/living-behind-the-pay-wall/#comments</comments>
		<pubDate>Wed, 15 Oct 2008 21:57:19 +0000</pubDate>
		<dc:creator>josh</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Newspapers]]></category>

		<guid isPermaLink="false">http://lucasjosh.com/blog/?p=180</guid>
		<description><![CDATA[Techdirt has two really good posts today about making information hard-to-find when customers are looking for it.
The first deals directly with it by looking at newspapers holding their archives hostage by putting up a pay wall in front of them after a certain amount of time has passed.  This is silly and yes, I [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.techdirt.com">Techdirt</a> has two really good posts today about making information hard-to-find when customers are looking for it.</p>
<p>The first deals directly with it by looking at <a href="http://techdirt.com/articles/20081010/1704352520.shtml">newspapers holding their archives hostage</a> by putting up a pay wall in front of them after a certain amount of time has passed.  This is silly and yes, I know <a href="http://www.latimes.com">we</a> do it officially but I&#8217;ve been fighting that since I started here.  It&#8217;s one of the main reasons why a few of us <a href="http://articles.latimes.com">created this</a>.  It seems pretty easy to me to see the benefit of doing this.  We haven&#8217;t promoted anything about the article server at all yet people find us thru search engines.  It&#8217;s really quite simple.</p>
<p>The second post is about <a href="http://techdirt.com/articles/20081014/0146592539.shtml">Howard Stern and his shrinking influence</a> since his move off of free radio and onto satellite.  It&#8217;s based on <a href="http://www.latimes.com/entertainment/news/la-et-stern13-2008oct13,0,7473563.story">one of our articles</a>.</p>
<p>Overall, you either make your information easily found by users or they will route around you, looking elsewhere and more than likely ignoring you forever.  It&#8217;s your choice.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucasjosh.com/blog/2008/10/15/living-behind-the-pay-wall/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Campaign Finance API from the NY Times</title>
		<link>http://lucasjosh.com/blog/2008/10/14/campaign-finance-api-from-the-ny-times/</link>
		<comments>http://lucasjosh.com/blog/2008/10/14/campaign-finance-api-from-the-ny-times/#comments</comments>
		<pubDate>Wed, 15 Oct 2008 04:12:56 +0000</pubDate>
		<dc:creator>josh</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Development]]></category>
		<category><![CDATA[Newspapers]]></category>

		<guid isPermaLink="false">http://lucasjosh.com/blog/?p=172</guid>
		<description><![CDATA[The New York Times opened up an API to get data about campaign financing.  Yes, I&#8217;m jealous that they did this before we did.
]]></description>
			<content:encoded><![CDATA[<p>The New York Times opened up <a href="http://open.blogs.nytimes.com/2008/10/14/announcing-the-new-york-times-campaign-finance-api/">an API to get data about campaign financing</a>.  Yes, I&#8217;m jealous that they did this before we did.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucasjosh.com/blog/2008/10/14/campaign-finance-api-from-the-ny-times/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Molten Data and NPR</title>
		<link>http://lucasjosh.com/blog/2008/07/16/molten-data-and-npr/</link>
		<comments>http://lucasjosh.com/blog/2008/07/16/molten-data-and-npr/#comments</comments>
		<pubDate>Thu, 17 Jul 2008 06:04:45 +0000</pubDate>
		<dc:creator>josh</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[LATimes]]></category>
		<category><![CDATA[Newspapers]]></category>

		<guid isPermaLink="false">http://lucasjosh.com/blog/?p=85</guid>
		<description><![CDATA[Update: Jeff Jarvis asks a great question about what people could do with this data.  It&#8217;ll be fun to find out.
I was re-reading Matt Waite&#8217;s post on molten data and then read about NPR releasing an API for parts of their content.  The two seem linked.
I know the East Coast Times is working [...]]]></description>
			<content:encoded><![CDATA[<p><em>Update</em>: Jeff Jarvis <a href="http://www.buzzmachine.com/2008/07/17/the-api-times/">asks a great question</a> about what people could do with this data.  It&#8217;ll be fun to find out.</p>
<p>I was re-reading <a href="http://mattwaite.com/2008/jan/11/molten-content-data-ghettos-and-why-your-CMS-problems-are-an-excuse-not-a-reason/">Matt Waite&#8217;s post on molten data</a> and then <a href="http://www.techcrunch.com/2008/07/16/npr-launches-api-that-serves-up-13-years-of-content/">read about</a> <a href="http://www.npr.org/api/index">NPR releasing an API</a> for parts of their content.  The two seem linked.</p>
<p>I know the <a href="http://www.nytimes.com">East Coast Times</a> is working on some sort of API but I&#8217;ve been thinking about how we could open things up and allow folks access to so much of our good stuff.  Why not start with just articles, using dates, keywords or writers as the inputs?  Moving on from there, you could add photos, video and then more of our data apps.  That seems pretty straightforward to me.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucasjosh.com/blog/2008/07/16/molten-data-and-npr/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Fun with brands</title>
		<link>http://lucasjosh.com/blog/2008/06/10/fun-with-brands/</link>
		<comments>http://lucasjosh.com/blog/2008/06/10/fun-with-brands/#comments</comments>
		<pubDate>Wed, 11 Jun 2008 05:15:00 +0000</pubDate>
		<dc:creator>josh</dc:creator>
				<category><![CDATA[Data]]></category>

		<guid isPermaLink="false">http://lucasjosh.com/blog/?p=83</guid>
		<description><![CDATA[Interesting timeline looking at the brands we use during the day.  Could be a fun project and make for very interesting trends.
]]></description>
			<content:encoded><![CDATA[<p><a href="http://dearjanesample.wordpress.com/2008/05/19/fun-with-brands/">Interesting timeline</a> looking at the brands we use during the day.  Could be a fun project and make for very interesting trends.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucasjosh.com/blog/2008/06/10/fun-with-brands/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Tracking Manny&#8217;s Homers</title>
		<link>http://lucasjosh.com/blog/2008/05/16/tracking-mannys-homers/</link>
		<comments>http://lucasjosh.com/blog/2008/05/16/tracking-mannys-homers/#comments</comments>
		<pubDate>Fri, 16 May 2008 16:04:32 +0000</pubDate>
		<dc:creator>josh</dc:creator>
				<category><![CDATA[Baseball]]></category>
		<category><![CDATA[Data]]></category>

		<guid isPermaLink="false">http://lucasjosh.com/blog/2008/05/16/tracking-mannys-homers/</guid>
		<description><![CDATA[The Boston Globe has put together a very cool mini-app that tracks all of Manny Ramirez&#8217;s home runs as he approaches number 500.
You can break it down in all sorts of data goodness from ballpark to pitch count.  Ah data, the things you can do with it.
]]></description>
			<content:encoded><![CDATA[<p>The Boston Globe has put together a <a href="http://www.boston.com/sports/baseball/redsox/extras/manny_500_homeruns/">very cool mini-app</a> that tracks all of Manny Ramirez&#8217;s home runs as he approaches number 500.</p>
<p>You can break it down in all sorts of data goodness from ballpark to pitch count.  Ah data, the things you can do with it.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucasjosh.com/blog/2008/05/16/tracking-mannys-homers/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
