This week I finally found some time to catch up on The Colbert Report and discovered a mischievous little gem. Stephen Colbert and his staff brought up an old Washington Post article, "
Wikipedia Edits Forecast Vice Presidential Picks," and subtly suggested that editing is like voting - only with no photo ID checks. Five minutes of Googling turned up a torrent of news articles repeating some version of that claim. Then I discovered
this article, which somewhat substantiates the claim with an ugly Excel chart, below.
To remind those who have forgotten, Mitt Romney announced Paul Ryan as his running mate on August 11, 2012.
In all the articles I read on this claim, I was surprised that almost no news source visualized the data or told readers how to gather it. I've previously
posted on some fun analytics with Wikipedia using Mathematica. However, with my recent clean installation of Mountain Lion, it has become clear to me that not everyone has access to Mathematica. In this post, I will show how to count the number of revisions/edits on a Wikipedia article using delightfully free Python.
The Python script I have written is quite easy to understand and depends on the NumPy external library. The script starts by fetching raw revision data for a given article. There are two ways to get the revision history of a Wikipedia article: one is to scrape the HTML, which can take some effort; my code uses the second, which is simply calling the
MediaWiki API. The revision data requested through the API comes back as XML, which is read into Python and parsed for just the date of each revision. The dates are then tallied by a tallying function I have written and sorted by date. Finally, the script outputs the revision data as a properly formatted array, just waiting to be plotted.
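For reference, the XML the API sends back looks roughly like this (a trimmed, illustrative response; the page id, user, and timestamps here are made up):

<?xml version="1.0"?>
<api>
  <query>
    <pages>
      <page pageid="12345" ns="0" title="Tim Pawlenty">
        <revisions>
          <rev user="ExampleUser" timestamp="2012-08-11T14:03:05Z" />
          <rev user="ExampleUser" timestamp="2012-08-11T13:58:41Z" />
        </revisions>
      </page>
    </pages>
  </query>
</api>

The commented Python code is presented below: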
import urllib2
import xml.etree.ElementTree as ET
import dateutil.parser
from collections import Counter
import numpy as np

def tally(datalist):
    # counts how many times each date appears, returning [date, count] pairs
    lst = Counter(datalist)
    a = []
    for i, j in lst.items():
        a.append([i, j])
    return a

def revisions(articlename):
    articlename = articlename.replace(' ', '_')  # format article title for the URL
    url = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=%s&' % articlename + \
          'rvprop=user|timestamp&rvlimit=max&redirects&format=xml'  # rvlimit=max returns up to 500 edits
    req = urllib2.urlopen(url)
    data = req.read(); req.close()  # reads the API response
    root = ET.fromstring(data)  # parses the XML data
    group = root.find(".//revisions")
    results = []
    ## gets revision times from the XML data
    for elem in group.iter('rev'):
        timestamp = elem.get('timestamp')
        timestamp = dateutil.parser.parse(timestamp).date()  # parses the timestamp, keeping only the date
        results.append(timestamp)
    a = tally(results)  # tallies revisions by date
    datetally = np.array(sorted(a, key=lambda x: x[0]))  # sorts the tally by date
    return datetally
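Before plotting, here is a quick sanity check of what revisions() hands back: each row pairs a datetime.date with that day's edit count (the article name is just an example; the shape and values will depend on when you run it):

check = revisions('Paul Ryan')
print check.shape  # (number of distinct edit days, 2)
print check[0]     # earliest day in the 500-revision window and its edit count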
Here is a quick example of how to plot with the array that is returned. I chose Tim Pawlenty and Marco Rubio to show the limitations of the MediaWiki API. I am also biased towards Pawlenty because of his
amazing ads during the GOP primaries. Some Wikipedia pages have low daily revision activity for long stretches of time, while others rack up large numbers of revisions in very short periods. The MediaWiki API will return only the most recent 500 revisions of an article per request unless your account has the higher limits granted to bots.
from matplotlib import pyplot as plt

a = revisions('Tim Pawlenty')
b = revisions('Marco Rubio')

fig = plt.figure()
graph = fig.add_subplot(111)
graph.plot(a[:,0], a[:,1], 'r', b[:,0], b[:,1], 'b')  # dates on x, daily edit counts on y
fig.autofmt_xdate()  # tilts the date labels so they stay readable
plt.legend(('Tim Pawlenty', 'Marco Rubio'), loc='upper left')
plt.title('Number of Wikipedia Edits')
plt.show()
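If you need more than the most recent 500 revisions, the API supports paging through the full history via its continuation mechanism. Here is a rough sketch of how that might look; the query-continue element and the rvcontinue/rvstartid hints follow the MediaWiki documentation, but verify them against the API version you are querying:

import urllib

def all_revision_dates(articlename):
    # Sketch: pages through the full revision history instead of stopping
    # at the 500-revision cap, following the API's continuation hints.
    articlename = articlename.replace(' ', '_')
    base = ('http://en.wikipedia.org/w/api.php?action=query&prop=revisions'
            '&titles=%s&rvprop=timestamp&rvlimit=max&redirects&format=xml' % articlename)
    results = []
    cont = ''
    while True:
        req = urllib2.urlopen(base + cont)
        root = ET.fromstring(req.read()); req.close()
        for elem in root.findall('.//rev'):
            results.append(dateutil.parser.parse(elem.get('timestamp')).date())
        qc = root.find('.//query-continue/revisions')
        if qc is None:  # no continuation element means the history is exhausted
            break
        # depending on the API version, the hint is rvcontinue or rvstartid
        if qc.get('rvcontinue') is not None:
            cont = '&rvcontinue=' + urllib.quote(qc.get('rvcontinue'), safe='')
        else:
            cont = '&rvstartid=' + qc.get('rvstartid')
    return np.array(sorted(tally(results), key=lambda x: x[0]))

Be polite if you do this: a heavily edited page can have tens of thousands of revisions, so the loop may issue many requests in a row.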
One warning for this script: please use some common sense when interpreting the results. You can download the source
here.