Chernoff Face Tutorial on Flowing Data

September 20, 2010

If you’ve read Blindsight, then you have come across Chernoff faces.

I recently spotted a tutorial on Flowing Data for using Chernoff faces with R.


Realtime Datamining Of Location Data

March 16, 2010

All the recent coverage of PleaseRobMe.com has focused on one particular idea of what can be done with realtime datamining of location data. (And how long until we see mashups combining PleaseRobMe.com with social cataloging sites such as LibraryThing?)

Where My Ladies At? (link goes to John Bollozos’ descriptive writeup) is another such site – it mines data from apps such as FourSquare and Gowalla and compares it against a database of female names in order to identify places where many women are present at a given moment. I’m really at a loss to say more about it.

See also Jim Bumgardner’s “Mayor Of The North Pole” for a perspective on forging location data.

[Tip o' the hat to Dan G for the Where My Ladies At? pointer.]


Another Profile Of Demand Media

March 16, 2010

Five months after the Wired article, Time Magazine has a recent profile of Demand Media.

The article is short and doesn’t have much additional data compared to the Wired article. However, the author submitted approximately 20 articles to Demand and experimented along the way, for example by including factual errors to see if they would be caught.


Datamining Facebook

February 9, 2010

Just a quick post of two links about datamining Facebook:

Pete’s divisions of the US are interesting to consider.

(I haven’t read all the commentary yet, but it’s clear that Mr. Warden needs some pop culture data to help him understand why Ashley is a popular name in the South and why Twilight is popular in Utah. I think the answers are fairly self-evident.)


Datamining The Government

February 3, 2010

The article is a bit old now (June 2009), but Wired had an interesting interview with Vivek Kundra about Data.gov. This is the usual Web 2.0 pitch about making data transparent and available and hoping that crowdsourcing will magically create useful things.

I will however admit to having been intrigued by the concept of one of the apps mentioned in the article:

In DC, someone combined several of the data sets released by local government—maps, liquor license info, crime statistics—into an app called Stumble Safely, which shows users the safest way to walk home when drunk.

Now someone just needs to mash it up with some augmented reality software or turn-by-turn GPS directions (“Turn left at the next corner. Stagger 2 blocks east. Try not to walk into that telephone pole.”).

Also of interest is DataMasher, which I spotted on LifeHacker. It appears to be a site for mashing up various government data sources. You can also save your own mashup and make it available to others. Looks interesting. So far, the Highest Rated and Most Discussed mashups seem to focus on health, mortality, guns, alcohol, obesity, and reproduction.

Finally, if you’ve read Freakonomics, check out this article on “bad boy” baby names as a predictor of behavior.


Datamining Games

February 3, 2010

The rise of the online always-on videogame opens a new world of stat tracking. The recent changes in this area go well beyond simple high score boards or achievements/trophies. For example, consider the article “You Are Being Watched” from a recent issue of the Official Xbox Magazine. The article details the datamining that Bungie is doing for Halo 3 and Halo 3: ODST, that Criterion is doing for Burnout Paradise, and that Valve is doing for Team Fortress 2 and Left 4 Dead.

All of these companies are gathering data that shows them how their games are really being played. One use for this data is to guide improvements and bug fixes. In the case of Bungie, players can actually log onto bungie.net and see their own stats and personal heat maps for the matches they have played. Valve shares some of the overall data, and has recently started adding personalized data (for Steam players only).

For the personalized data, it would be interesting to see some numbers for how many players actually review their stats and whether it has an impact on their playing.
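For the curious, a heat map of this kind boils down to counting events per map grid cell. Here is a minimal Python sketch of that idea. It is not how Bungie or Valve actually do it; the function name, the coordinates, and the cell size are all made up for illustration.

```python
from collections import Counter


def build_heat_map(positions, cell_size=10.0):
    """Count events per square grid cell.

    positions: iterable of (x, y) map coordinates (hypothetical units).
    cell_size: edge length of one grid cell (an assumed value).
    """
    counts = Counter()
    for x, y in positions:
        # Floor division assigns each coordinate to its grid cell.
        cell = (int(x // cell_size), int(y // cell_size))
        counts[cell] += 1
    return counts


# Made-up death locations from a single match.
deaths = [(3.0, 4.0), (5.5, 9.9), (12.0, 4.0), (27.0, 31.0), (8.0, 1.0)]
heat = build_heat_map(deaths)

# The cell with the most events is the "hottest" spot on the map.
hottest_cell, hottest_count = heat.most_common(1)[0]
```

A real system would then color each cell by its count and overlay the result on the map image.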

See also:

While I’m clearing out the videogame datamining links…


Search Box Candor

January 30, 2010

It is becoming increasingly clear that we don’t lie to search engines. As the AOL search data scandal revealed, you can give away your identity simply through egosurfing.

People ask search engines all sorts of questions, and the autocomplete features recently added to the search boxes at Google and Bing are quite revealing about what people are searching for. Two recent articles at Slate make the point. The first has a number of interesting examples of what people are typing into the Google search box, and calls for submissions from readers. The second is the more interesting of the two: Google’s suggestions change with your phrasing, and the contrast between the suggestions for “is it wrong to” and those for “is it ethical to” is quite interesting.

Which brings us to the outing of anonymous blogger Belle de Jour. It is not especially surprising that her identity was figured out from her online writings. What is interesting is that someone figured out her identity, kept it secret, and used a Googlewhack in order to spot when others began to suspect her identity around six years later.

Update Feb 23, ’10: See also AutoComplete Me for more Google examples.


FlowingData’s Best Visualizations Of 2009

January 30, 2010

The FlowingData site has recently posted their Best Visualizations Of 2009. Of particular interest is Ben Fry’s work with Charles Darwin’s text.


Graphing The Beatles

January 20, 2010

Spotted this week (on waxy.org?), an interesting project that is creating a bunch of infographics using data from The Beatles. I especially like the lyric self-reference graphic.


“an extra dollar for fact-checking”

November 6, 2009

The latest Wired magazine has an interesting article on Demand Media. If you’ve ever used sites such as eHow then you may have encountered Demand Media without even realizing it.

Demand Media generates web content – a lot of it. It appears that they have an algorithm that analyzes popular web search terms, advertisement rates and their competition – and spits out ideas for content. The example output shown in the article is “how to make butterflies for cake decorating”. That’s after two proofreaders have munged the set of terms from the original output into a sentence. (I don’t know if this example is contrived or real, but it does lead to a real article.)
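The article doesn’t disclose the actual formula, but the general shape of such an algorithm is easy to imagine. Here is a rough Python sketch, purely as a thought experiment: the scoring formula, the field names, and every number below are my own assumptions, not anything from Demand Media.

```python
from dataclasses import dataclass


@dataclass
class TopicCandidate:
    phrase: str
    monthly_searches: int   # how often people search for the terms
    ad_rate: float          # assumed ad revenue per visit, in dollars
    competing_pages: int    # existing pages already targeting the terms


def score(candidate):
    """Hypothetical score: demand times ad value, discounted by competition."""
    demand_value = candidate.monthly_searches * candidate.ad_rate
    return demand_value / (1 + candidate.competing_pages)


def rank_topics(candidates):
    """Return candidates sorted from most to least promising."""
    return sorted(candidates, key=score, reverse=True)


# Made-up inputs for illustration only.
candidates = [
    TopicCandidate("celebrity gossip", 50000, 0.05, 4999),
    TopicCandidate("butterflies cake decorating", 4000, 0.50, 9),
    TopicCandidate("fix leaky faucet", 20000, 0.30, 499),
]
ranked = rank_topics(candidates)
best = ranked[0]
```

With these invented numbers, a niche phrase with decent ad rates and little competition beats a hugely popular but crowded one, which would be consistent with the oddly specific titles the article describes.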

Once they have a topic, they use freelancers to create articles and/or video tutorials. They pay as much as $20 per clip to the filmmakers, whereas the title proofers get 8 cents per headline.

These folks are pumping out enormous amounts of content. The article says that by next summer they will be publishing 1 million items a month. They already have 170,000 videos on YouTube.

Anyway, the article is an interesting read. I’ll close with a quote:

“We’re not talking about $1,000 videos, so a couple dollars here or there can make a serious difference. For instance, pay an extra dollar for fact-checking.”