Wednesday, August 22, 2012

How to Count the Number of Wikipedia Edits

This week I finally found some time to catch up on The Colbert Report, and I discovered a mischievous little gem. Stephen Colbert and his staff brought up an old Washington Post article, "Wikipedia Edits Forecast Vice Presidential Picks," and subtly suggested that editing is like voting, only with no photo ID checks. Five minutes of Googling turned up a torrent of news articles repeating the same point. Then I discovered this article, which somewhat substantiates the claim with an ugly Excel chart, below.


To remind those who have forgotten, Mitt Romney announced Paul Ryan as his running mate on August 11, 2012.

In all the articles I read on this claim, I was surprised that almost no news source visualized the data or told readers how to gather it. I've previously posted on some fun analytics with Wikipedia using Mathematica. However, with my recent clean installation of Mountain Lion, it has become clear to me that not everyone has access to Mathematica. In this post, I will show how to count the number of revisions/edits on a Wikipedia article using delightfully free Python.

The Python script I have written is quite easy to understand and depends on two external libraries, NumPy and python-dateutil. The script starts by fetching the raw revision data for a given article. There are two ways to get the revision history of a Wikipedia article. One is to scrape the HTML, which can take some effort. My code uses the second and simply calls the MediaWiki API. The revision data requested through the API comes back as XML, which the script parses, keeping only the date of each revision. The dates are then tallied by a small tallying function I have written and sorted by date. Finally, the script returns the revision counts as a properly formatted array, just waiting to be plotted. The commented Python code is presented below:

    import urllib2  # Python 2 module; on Python 3 use urllib.request
    import xml.etree.ElementTree as ET
    import dateutil.parser
    from collections import Counter
    import numpy as np

    def tally(datalist):
        # counts how often each item occurs; returns [item, count] pairs
        lst = Counter(datalist)
        a = []
        for i, j in lst.items():
            a.append([i, j])
        return a

    def revisions(articlename):
        articlename = articlename.replace(' ', '_')  # format article title for url
        url = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=%s&' % articlename + \
              'rvprop=user|timestamp&rvlimit=max&redirects&format=xml'  # data limit is set to max for 500 edits
        req = urllib2.urlopen(url)
        data = req.read(); req.close()  # reads url response
        root = ET.fromstring(data)  # reads xml data
        group = root.find(".//revisions")
        results = []
        ## gets revision times from xml data
        for elem in group.iter('rev'):
            timestamp = elem.get('timestamp')
            timestamp = dateutil.parser.parse(timestamp).date()  # parses timestamp and returns only date
            results.append(timestamp)
        a = tally(results)  # tallies by date
        datetally = np.array(sorted(a, key=lambda x: x[0]))  # sorts tally by date
        return datetally

The example further down shows how to plot the array that is returned.
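To make the parse-and-tally step concrete, here is a small offline sketch. It is written for Python 3 using only the standard library, so it differs slightly from the script in this post. The XML string is a hand-made sample shaped like a revisions response from the API, not real data, and `tally_by_date` is a name I made up for this illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from datetime import datetime

# Hand-made sample shaped like an API response (not real revision data).
SAMPLE_XML = """
<api>
  <query>
    <pages>
      <page pageid="1" ns="0" title="Example">
        <revisions>
          <rev user="A" timestamp="2012-08-11T13:05:00Z" />
          <rev user="B" timestamp="2012-08-11T09:30:00Z" />
          <rev user="C" timestamp="2012-08-10T22:15:00Z" />
        </revisions>
      </page>
    </pages>
  </query>
</api>
"""

def tally_by_date(xml_text):
    """Return a date-sorted list of (date, edit_count) pairs."""
    root = ET.fromstring(xml_text.strip())
    # Keep only the calendar date of each revision timestamp.
    dates = [
        datetime.strptime(rev.get("timestamp"), "%Y-%m-%dT%H:%M:%SZ").date()
        for rev in root.iter("rev")
    ]
    # Counter does the tallying; sorting the pairs orders them by date.
    return sorted(Counter(dates).items())

for day, count in tally_by_date(SAMPLE_XML):
    print(day, count)
# prints:
# 2012-08-10 1
# 2012-08-11 2
```

The two edits on August 11 collapse into a single row with a count of 2, which is the shape of the array that gets plotted later.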
I chose Tim Pawlenty and Marco Rubio to show the limitations of the MediaWiki API. I am also biased towards Pawlenty because of his amazing ads during the GOP primaries. Some Wikipedia pages see little daily revision activity for long stretches of time, while others collect a large number of revisions in a very short period. The MediaWiki API returns at most the 500 most recent revisions of an article per request unless your account has elevated API limits, so the time span the data covers varies from page to page.
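The cap applies to a single request. As far as I understand the current MediaWiki API, a client can ask for the next batch by sending back the `continue` values from the previous response. Below is a minimal Python 3 sketch of that loop using the JSON format. The function names, the `fetch` parameter, the `max_requests` guard, and the User-Agent string are my own inventions for illustration, and the response layout is my understanding of what the API returns, so check it against the API documentation before relying on it.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"

def fetch_json(params):
    """Send one API request and decode the JSON reply (needs network access)."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # The User-Agent value is a placeholder chosen for this sketch.
    req = urllib.request.Request(url, headers={"User-Agent": "edit-count-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def all_timestamps(title, fetch=fetch_json, max_requests=20):
    """Collect revision timestamps across several requests."""
    params = {
        "action": "query", "prop": "revisions", "titles": title,
        "rvprop": "timestamp", "rvlimit": "max", "format": "json",
    }
    timestamps = []
    for _ in range(max_requests):  # guard against runaway loops
        reply = fetch(params)
        # Assumed layout: pages keyed by page id, each with a revisions list.
        for page in reply.get("query", {}).get("pages", {}).values():
            for rev in page.get("revisions", []):
                timestamps.append(rev["timestamp"])
        if "continue" not in reply:
            break  # no more batches
        # Merge the continuation values into the next request.
        params = dict(params, **reply["continue"])
    return timestamps
```

Calling `all_timestamps('Tim Pawlenty')` would issue up to `max_requests` requests, and the resulting timestamps can be tallied by date exactly as in the script. Passing a different `fetch` function makes it easy to try the loop without touching the network.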


    # plots daily edit counts for two articles, using revisions() from the script
    from matplotlib import pyplot as plt

    a = revisions('Tim Pawlenty')
    b = revisions('Marco Rubio')

    fig = plt.figure()
    graph = fig.add_subplot(111)
    # column 0 holds the dates, column 1 the number of edits on each date
    graph.plot(a[:,0], a[:,1], 'r', b[:,0], b[:,1], 'b')
    fig.autofmt_xdate()  # tilts the date labels so they do not overlap
    plt.legend(('Tim Pawlenty', 'Marco Rubio'), loc='upper left')
    plt.title('Number of Wikipedia Edits')
    plt.show()

One warning about this script: please use some common sense when interpreting the results. You can download the source here.
