Big Data (Again) and NoSQL

October 8, 2009

If you still haven’t read Adam Jacobs’s paper “The Pathologies of Big Data” that I linked to previously, go read it.

Also interesting is Dare Obasanjo’s recent post on denormalization, in which he makes some similar arguments, complete with examples.

I also had not heard before of the supposed “NoSQL Movement” that Dare mentions. Very interesting.


Big Data

August 6, 2009

There is a nice article from Adam Jacobs in the most recent CACM magazine entitled “The Pathologies of Big Data”. Jacobs discusses the fact that it is often easier to put data into a database than it is to get data out, as well as strategies for improving how we work with large datasets. It’s an interesting read.


A Survey Of Programming Video Cards For Other Purposes

August 3, 2009

The August 2009 issue of ;login: from USENIX has a nice article on programming video cards by Tim Kaldewey entitled “Programming Video Cards For Database Applications”. Sadly, the article is only available to USENIX members until August of 2010.

Kaldewey surveys the past and present of programming video cards for non-graphics purposes – from the early days of using the graphics APIs to fool the GPU into thinking it is rendering graphics when it is really performing a general-purpose calculation, to the present era of general-purpose APIs such as CUDA.

He also shows a back-of-the-envelope calculation for building out a 100-teraflop data center using 100 GPUs versus 1400 CPUs, including the difference in power consumption.
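To get a feel for what those counts imply, here is a rough sketch of the arithmetic in Python. The only inputs taken from the article summary above are the 100-teraflop target and the two device counts. The function names are mine, and the per-device wattage is left as a parameter because I am not reproducing Kaldewey's power figures here.

```python
# Figures from the post: a 100-teraflop target met by 100 GPUs or 1400 CPUs.
TARGET_TFLOPS = 100.0
GPU_COUNT = 100
CPU_COUNT = 1400


def per_device_gflops(target_tflops, device_count):
    """Throughput each device must sustain, in gigaflops, to hit the target."""
    return target_tflops * 1000.0 / device_count


def total_power_kw(device_count, watts_per_device):
    """Total draw in kilowatts. watts_per_device is a value you supply
    (hypothetical input; the article's own figures are not repeated here)."""
    return device_count * watts_per_device / 1000.0


gpu_gflops = per_device_gflops(TARGET_TFLOPS, GPU_COUNT)
cpu_gflops = per_device_gflops(TARGET_TFLOPS, CPU_COUNT)
ratio = CPU_COUNT / GPU_COUNT

print(f"Implied per-GPU throughput: {gpu_gflops:.0f} GFLOPS")
print(f"Implied per-CPU throughput: {cpu_gflops:.1f} GFLOPS")
print(f"CPUs needed per GPU: {ratio:.0f}")
```

In other words, the comparison assumes roughly one teraflop per GPU and about 71 gigaflops per CPU, a 14-to-1 ratio. Plug your own wattage numbers into `total_power_kw` to compare the two build-outs.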

If you are a USENIX member, the article is a good read. Sadly, it won’t be current when it finally becomes freely available.

[The same issue of ;login: also has a nice article by Leo Meyerovich: “Rethinking Browser Performance”.]


Farewell To Popfly

July 17, 2009

The official Popfly blog states that the service is being shut down.

I’m a little sad to see a neat mashup tool get shut down. The integration with Silverlight and the ability to use the Popfly widgets on the Windows desktop were unique features.

I wish the best of luck to the Popfly team in their next endeavors.


Netflix Prize May Have A Winner

June 29, 2009

The Netflix Prize has entered the 30-day notification period as a team has announced that they have achieved a 10.05% improvement over the original Cinematch algorithm.

Some further background on the contest can be found in a nice writeup in Wired from last year.


Datamining “Where’s George?”

May 5, 2009

The New York Times had an interesting article on Sunday regarding computer modeling of the swine flu epidemic.

The article highlights two university teams that are building computer models of the spread of the virus. Some of the data being used is obvious: air traffic and commuter traffic data. One data source is not so obvious: the Where’s George? site that lets people track US dollar bills via serial number. I think this is interesting, and it apparently provided useful data.

A couple of thoughts:

Read the rest of this entry »


Tracking Used Video Game Prices Over Time

April 21, 2009

I wasn’t previously aware of VideoGamePriceCharts.com, but I learned about them recently through Kotaku. The site tracks the prices of used video games. Of particular interest is their recent article tracking prices of series games when a new installment is released.

The article presents historical data for series such as Resident Evil, Pokemon and Call of Duty, showing that prices for used copies of older installments spike around the release of a newer installment. This is not entirely surprising, but I’ve never seen real data laid out to support the idea. They also have a posting from last year that shows the release of GTA IV causing a spike in prices for the earlier GTA games.

This has a few interesting implications:

Read the rest of this entry »


Datamining Everquest

March 28, 2009

A group of academic researchers has obtained the complete server logs for the Everquest 2 MMORPG. It’s four years of data for over 400,000 players, and the resulting dataset is nearly 60TB. That’s right, terabytes. Combined with some demographic surveys, there is interesting datamining potential here.

This is also interesting because apparently the standard tools don’t quite scale to the task of analyzing this data:

Regardless of format, many one-pass, exhaustive algorithms simply choke on a dataset this large, which is forcing his group to use some incremental analysis methods or to work with subsets of the data.
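As an illustration of what “incremental” can mean here (this is my own sketch, not the researchers’ method), the Python below keeps a running mean and variance using Welford’s update, reading the data one chunk at a time so the full dataset never has to sit in memory. The `RunningStats` class, the `in_chunks` helper and the sample values are all made up for the example.

```python
from typing import Iterable, Iterator, List


class RunningStats:
    """Running mean and population variance via Welford's update."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        return self._m2 / self.count if self.count else 0.0


def in_chunks(values: Iterable[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield successive lists of at most chunk_size values."""
    chunk: List[float] = []
    for value in values:
        chunk.append(value)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


# Hypothetical stand-in for a huge log: in practice this would be a file
# or database cursor that yields one record at a time.
session_hours = [1.0, 2.0, 3.0, 4.0, 5.0]

stats = RunningStats()
for chunk in in_chunks(session_hours, chunk_size=2):
    for hours in chunk:
        stats.update(hours)

print(stats.count, stats.mean, stats.variance)
```

Only three numbers are carried from chunk to chunk, so memory use stays flat no matter how many records stream through. Statistics that need the whole dataset at once do not decompose this neatly, which is presumably where working with subsets comes in.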

Some items in the results that I found interesting:

Read the rest of this entry »