Realtime Datamining Of Location Data

March 16, 2010

All the recent coverage of PleaseRobMe.com has focused on one particular idea of what can be done with realtime datamining of location data. (And how long until we see mashups combining PleaseRobMe.com with social cataloging sites such as LibraryThing?)

Where My Ladies At? (link goes to John Bollozos’ descriptive writeup) is another such site. It mines data from apps such as FourSquare and Gowalla and compares it against a database of female names in order to identify places where many women are present at a given moment. I’m really at a loss to say more about it.

See also Jim Bumgardner’s “Mayor Of The North Pole” for a perspective on forging location data.

[Tip o' the hat to Dan G for the Where My Ladies At? pointer.]


Another Profile Of Demand Media

March 16, 2010

Five months after the Wired article, Time Magazine has published a profile of Demand Media.

The article is short and doesn’t add much data beyond the Wired article. However, the author submitted approximately 20 articles to Demand and ran some experiments, such as including factual errors to see whether they would be caught.


Datamining Facebook

February 9, 2010

Just a quick post of two links about datamining Facebook:

Pete’s divisions of the US are interesting to consider.

(I haven’t read all the commentary yet, but it’s clear that Mr. Warden needs some pop culture data to help him understand why Ashley is a popular name in the South and why Twilight is popular in Utah. I think the answers are fairly self-evident.)


Datamining The Government

February 3, 2010

The article is a bit old now (June 2009), but Wired had an interesting interview with Vivek Kundra about Data.gov. This is the usual Web 2.0 pitch about making data transparent and available and hoping that crowdsourcing will magically create useful things.

I will however admit to having been intrigued by the concept of one of the apps mentioned in the article:

In DC, someone combined several of the data sets released by local government—maps, liquor license info, crime statistics—into an app called Stumble Safely, which shows users the safest way to walk home when drunk.

Now someone just needs to mash it up with some augmented reality software or turn-by-turn GPS directions (“Turn left at the next corner. Stagger 2 blocks east. Try not to walk into that telephone pole.”).

Also of interest is DataMasher, which I spotted on LifeHacker. It appears to be a site for mashing up various government data sources. You can also save your own mashup and make it available to others. Looks interesting. So far, the Highest Rated and Most Discussed mashups seem to focus on health, mortality, guns, alcohol, obesity, and reproduction.

Finally, if you’ve read Freakonomics, check out this article on “bad boy” baby names as a predictor of behavior.


Datamining Games

February 3, 2010

The rise of the online always-on videogame opens a new world of stat tracking. The recent changes in this area go well beyond simple high score boards or achievements/trophies. For example, consider the article “You Are Being Watched” from a recent issue of the Official Xbox Magazine. The article details the datamining that Bungie is doing for Halo 3 and Halo 3: ODST, that Criterion is doing for Burnout Paradise, and that Valve is doing for Team Fortress 2 and Left 4 Dead.

All of these companies are gathering data that shows them how their games are really being played. One use for this data is to guide improvements and bug fixes. In the case of Bungie, players can actually log onto bungie.net and see their own stats and personal heat maps for the matches they have played. Valve shares some of the overall data, and has recently started adding personalized data (for Steam players only).

For the personalized data, it would be interesting to see some numbers for how many players actually review their stats and whether it has an impact on their playing.

See also:

While I’m clearing out the videogame datamining links…


Search Box Candor

January 30, 2010

It is becoming increasingly clear that we don’t lie to search engines. As the AOL search data scandal revealed, you can give away your identity simply through egosurfing.

People ask search engines all sorts of questions, and the autocomplete features recently added to the search boxes at Google and Bing reveal a good deal about what people are searching for. Two recent articles at Slate make the point. The first has a number of interesting examples of what people are typing into the Google search box, and calls for submissions from readers. The second is the more interesting of the two – consider how Google’s suggestions change with your wording: the suggestions for “is it wrong to” differ notably from those for “is it ethical to”.

Which brings us to the outing of anonymous blogger Belle de Jour. It is not especially surprising that her identity was figured out from her online writings. What is interesting is that someone figured out her identity, kept it secret, and used a Googlewhack in order to spot when others began to suspect her identity around six years later.

Update Feb 23, ’10: See also AutoComplete Me for more Google examples.


Plotting Social Networks (Of Fictional Characters) Over Time

January 30, 2010

From the non-academic world, some infographics charting social network interactions over time. In this case, the source is the xkcd comic strip – and the social networks are Star Wars, The Lord of the Rings and three other films.

The orcs in the Lord of the Rings graphic are particularly reminiscent of Minard’s Napoleon map.


FlowingData’s Best Visualizations Of 2009

January 30, 2010

The FlowingData site has recently posted their Best Visualizations Of 2009. Of particular interest is Ben Fry’s work with Charles Darwin’s text.


A Survey Of Streaming SQL

January 29, 2010

The latest issue of CACM has an article entitled “Data In Flight” by Julian Hyde (chief architect of SQLstream). The article is a survey of streaming SQL technology and how it may apply to ever-increasing data streams.

I will also highlight two small items out of the article. The first is an assertion that web application authors are generalists:

The technologies for powering Web applications must be fairly straightforward for two reasons: first, because it must be possible to evolve a Web application rapidly and then to deploy it at scale with a minimum of hassle; second, because the people writing Web applications are generalists and are not prepared to learn the kind of complex, hard-to-tune technologies used by systems programmers.

And second, about 2/3 of the way through the article he finally makes the logical connection to CEP, and throws in an aside about an ongoing religious war. Is this the CEP/Rete debate that I am aware of, or some other debate?

CEP has been used within the industry as a blanket term to describe the entire field of streaming query systems. This is regrettable because it has resulted in a religious war between SQL-based and non-SQL-based vendors and, in overly focusing on financial services applications, has caused other application areas to be neglected.
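
For readers new to the term, a streaming query computes its result continuously as records arrive, rather than running once over stored data. Below is a minimal Python sketch of one common case, a trailing time-window average. It is only an illustration of the general idea: the function name, the event layout, and the window rule are my own assumptions, not anything taken from the article or from SQLstream’s product.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple

# Hypothetical event layout: (timestamp in seconds, numeric value).
Event = Tuple[float, float]


def sliding_average(
    events: Iterable[Event], window_seconds: float
) -> Iterator[Tuple[float, float]]:
    """Yield (timestamp, average over the trailing window) for each event.

    Assumes events arrive in non-decreasing timestamp order and that
    window_seconds is positive. The window covers the half-open interval
    (timestamp - window_seconds, timestamp].
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")

    window: Deque[Event] = deque()
    total = 0.0
    for ts, value in events:
        window.append((ts, value))
        total += value
        # Evict events that have aged out of the trailing window.
        while window[0][0] <= ts - window_seconds:
            _, old_value = window.popleft()
            total -= old_value
        # The newest event is always inside the window, so len(window) >= 1.
        yield ts, total / len(window)


if __name__ == "__main__":
    readings = [(0, 10.0), (1, 20.0), (2, 30.0), (5, 40.0)]
    for ts, avg in sliding_average(readings, window_seconds=3):
        print(ts, avg)
```

The point of the sketch is that the result is emitted incrementally and only a bounded window of recent events is kept in memory, which is what makes this style of query workable on a stream that never ends.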


Graphing The Beatles

January 20, 2010

Spotted this week (on waxy.org?), an interesting project that is creating a bunch of infographics using data from The Beatles. I especially like the lyric self-reference graphic.