Tuesday, May 17, 2011

Tracking Wikipedia page-hopping


I have realized that like most users of Wikipedia, I do a lot of wikipedia page-hopping [1]. Wikipedia is sort of addictive that way. You start reading about a piece of Flamenco music and after twenty minutes find yourself staring at the page about ETA [2], a Basque nationalist organization. So I decided to figure out how I exactly do I get lost in the huge list of interconnected articles. I use Chromium, and it stores its history in a SQLite3 database file. I wrote a small Ruby script that parses the history, splits them to chunks of articles accessed per day, and filter only wikipedia links from this.

This is basically what I had to do:
* Query the db for the last visit time and URLs.
* Chromium (and Google Chrome) stores timestamps of page visits in a not so obvious format. They basically store time stamps as the number of micro seconds expired since Jan 01, 1601 [3]
* Splitting the URLs into chunks accessed per day involved calculating the number of micro-seconds in a day and splitting the URLs based on this. Ruby's Array#group_by is really handy here.
* Analysis of the URLs involves filtering only the URLs that contain "wikipedia"
* There is a caveat here, as redirects to Wikipedia from both Google and Facebook contain the string "wikipedia" in their URLs. These need to be filtered out.

The analysis of my Wikipedia history showed me some interesting things. For example, when I was reading Michael J. Arlen's [4] Passage to Ararat, I spent a lot of time on Wikipedia, hopping between pages about Armenian history and culture. This is what the list of Wikipedia pages on that day look like:

http://en.wikipedia.org/wiki/TRS-80
http://en.wikipedia.org/wiki/Aunt_Sally
http://en.wikipedia.org/wiki/Pai
http://en.wikipedia.org/wiki/Pai_(surname)
http://en.wikipedia.org/wiki/Gowd_Saraswat_Brahmins
http://en.wikipedia.org/wiki/Girish_Karnad
http://en.wikipedia.org/wiki/Konkani_people
http://en.wikipedia.org/wiki/Roots_(book)
http://en.wikipedia.org/wiki/Mountains_of_Ararat
http://en.wikipedia.org/wiki/Armenian_Highland
http://en.wikipedia.org/wiki/Searches_for_Noah%27s_Ark
http://en.wikipedia.org/wiki/Tiberian_Hebrew
http://en.wikipedia.org/wiki/Mount_Judi
http://en.wikipedia.org/wiki/Islamic_view_of_Noah
http://en.wikipedia.org/wiki/Greater_Armenia_(political_concept)
http://en.wikipedia.org/wiki/Coat_of_Arms_of_Armenia
http://en.wikipedia.org/wiki/Turkish_War_of_Independence
http://en.wikipedia.org/wiki/Kuva-yi_Milliye
http://en.wikipedia.org/wiki/Turkish-Armenian_War
http://en.wikipedia.org/wiki/Anatolia
http://en.wikipedia.org/wiki/Bursa
http://en.wikipedia.org/wiki/Armenia
http://en.wikipedia.org/wiki/Hayk
http://en.wikipedia.org/wiki/Ecbatana
http://en.wikipedia.org/wiki/Goat_meat
http://en.wikipedia.org/wiki/Kid
http://en.wikipedia.org/w/index.php?title=Special%3ASearch&search=xenophon
http://en.wikipedia.org/wiki/Xenophon
http://en.wikipedia.org/wiki/International_Mother_Language_Day
http://en.wikipedia.org/wiki/Debian
http://en.wikipedia.org/wiki/Language_Movement_Day

When I was reading about Data warehousing, this is how the hopping happened:

http://en.wikipedia.org/wiki/ROLAP
http://en.wikipedia.org/wiki/Dimension_(data_warehouse)
http://en.wikipedia.org/wiki/Extract,_transform,_load
http://en.wikipedia.org/wiki/Mondrian_OLAP_server
http://en.wikipedia.org/wiki/OLAP
http://en.wikipedia.org/wiki/Comparison_of_OLAP_Servers
http://en.wikipedia.org/wiki/Pentaho
http://en.wikipedia.org/wiki/Multidimensional_Expressions
http://en.wikipedia.org/wiki/Decision_science
http://en.wikipedia.org/wiki/Star_schema
http://en.wikipedia.org/wiki/Snowflake_schema
http://en.wikipedia.org/wiki/Sarkar
http://en.wikipedia.org/wiki/Fact_table
http://en.wikipedia.org/wiki/OLTP
http://en.wikipedia.org/wiki/Ralph_Kimball
http://en.wikipedia.org/wiki/Bill_Inmon
http://en.wikipedia.org/wiki/Decision_support
http://en.wikipedia.org/wiki/Heart_of_Midlothian_F.C.
http://en.wikipedia.org/wiki/The_Heart_of_Midlothian


I am still trying to make more sense of the links that I clicked away and the articles I read when I was page hopping.

The Ruby script that parses Chromium history and figures out the Wikipedia links can be found below.




Links:
1. http://www.urbandictionary.com/define.php?term=Wikipedia%20page-hopping
2. http://en.wikipedia.org/wiki/ETA
3. http://www.mail-archive.com/chromium-discuss@googlegroups.com/msg03891.html
4. http://en.wikipedia.org/wiki/Michael_J._Arlen

1 comment:

  1. Shlom laakh,

    one of the Interesting technical blog .. me too wiki-addictive. In my pg batch, no one give a thought to wikipedia. It is too disappointed.

    This summer plan to update the page of my place with photos :)

    and that script, i ll test how python do ..

    God bless U.

    ReplyDelete