This week I finally found some time to catch up on The Colbert Report and discovered a mischievous little gem. Stephen Colbert and his staff brought up an old Washington Post article, "
Wikipedia Edits Forecast Vice Presidential Picks," and subtly suggested that editing is like voting - only with no photo ID checks. Five minutes of Googling turned up a torrent of news articles repeating some version of that claim. Then I discovered
this article, which somewhat substantiates the claim with an ugly Excel chart, below.
To remind those who have forgotten, Mitt Romney announced Paul Ryan as his running mate on August 11, 2012.
In all the articles I read on this claim, I was surprised that almost no news source visualized the data or told readers how to gather it. I've previously
posted on some fun analytics with Wikipedia using Mathematica. However, with my recent clean installation of Mountain Lion, it has become clear to me that not everyone has access to Mathematica. In this post, I will show how to count the number of revisions/edits on a Wikipedia article using delightfully free Python.
The Python script I have written is quite easy to understand and depends on the NumPy external library. The script starts by fetching raw revision data for a given article. There are two ways to get the revision history of a Wikipedia article: one is to scrape the HTML, which can take some effort; my code uses the second, which is simply calling the
MediaWiki API. The revision data requested through the API comes back as XML, which is read into Python and parsed for just the date of each revision. The dates are then tallied by a tallying function I have written and sorted by date. Finally, the script outputs the revision data as a properly formatted array, just waiting to be plotted.
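For reference, the XML the API sends back looks roughly like this (a trimmed, illustrative response; the page id, user, and timestamps here are made up):

<?xml version="1.0"?>
<api>
  <query>
    <pages>
      <page pageid="12345" ns="0" title="Tim Pawlenty">
        <revisions>
          <rev user="ExampleUser" timestamp="2012-08-11T14:03:05Z" />
          <rev user="ExampleUser" timestamp="2012-08-11T13:58:41Z" />
        </revisions>
      </page>
    </pages>
  </query>
</api>

The commented Python code is presented below: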
import urllib2
import xml.etree.ElementTree as ET
import dateutil.parser
from collections import Counter
import numpy as np

def tally(datalist):
    # counts how many times each date appears, returning [date, count] pairs
    lst = Counter(datalist)
    a = []
    for i, j in lst.items():
        a.append([i, j])
    return a

def revisions(articlename):
    articlename = articlename.replace(' ', '_')  # format article title for the URL
    url = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=%s&' % articlename + \
          'rvprop=user|timestamp&rvlimit=max&redirects&format=xml'  # rvlimit=max returns up to 500 edits
    req = urllib2.urlopen(url)
    data = req.read(); req.close()  # reads the API response
    root = ET.fromstring(data)  # parses the XML data
    group = root.find(".//revisions")
    results = []
    ## gets revision times from the XML data
    for elem in group.iter('rev'):
        timestamp = elem.get('timestamp')
        timestamp = dateutil.parser.parse(timestamp).date()  # parses the timestamp, keeping only the date
        results.append(timestamp)
    a = tally(results)  # tallies revisions by date
    datetally = np.array(sorted(a, key=lambda x: x[0]))  # sorts the tally by date
    return datetally
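Before plotting, here is a quick sanity check of what revisions() hands back: each row pairs a datetime.date with that day's edit count (the article name is just an example; the shape and values will depend on when you run it):

check = revisions('Paul Ryan')
print check.shape  # (number of distinct edit days, 2)
print check[0]     # earliest day in the 500-revision window and its edit count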
Here is a quick example of how to plot with the array that is returned. I chose Tim Pawlenty and Marco Rubio to show the limitations of the MediaWiki API. I am also biased towards Pawlenty because of his
amazing ads during the GOP primaries. Some Wikipedia pages have low daily revision activity for long stretches of time, while others rack up large numbers of revisions in very short periods. The MediaWiki API will return only the most recent 500 revisions of an article per request unless your account has the higher limits granted to bots.
from matplotlib import pyplot as plt

a = revisions('Tim Pawlenty')
b = revisions('Marco Rubio')

fig = plt.figure()
graph = fig.add_subplot(111)
graph.plot(a[:,0], a[:,1], 'r', b[:,0], b[:,1], 'b')  # dates on x, daily edit counts on y
fig.autofmt_xdate()  # tilts the date labels so they stay readable
plt.legend(('Tim Pawlenty', 'Marco Rubio'), loc='upper left')
plt.title('Number of Wikipedia Edits')
plt.show()
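If you need more than the most recent 500 revisions, the API supports paging through the full history via its continuation mechanism. Here is a rough sketch of how that might look; the query-continue element and the rvcontinue/rvstartid hints follow the MediaWiki documentation, but verify them against the API version you are querying:

import urllib

def all_revision_dates(articlename):
    # Sketch: pages through the full revision history instead of stopping
    # at the 500-revision cap, following the API's continuation hints.
    articlename = articlename.replace(' ', '_')
    base = ('http://en.wikipedia.org/w/api.php?action=query&prop=revisions'
            '&titles=%s&rvprop=timestamp&rvlimit=max&redirects&format=xml' % articlename)
    results = []
    cont = ''
    while True:
        req = urllib2.urlopen(base + cont)
        root = ET.fromstring(req.read()); req.close()
        for elem in root.findall('.//rev'):
            results.append(dateutil.parser.parse(elem.get('timestamp')).date())
        qc = root.find('.//query-continue/revisions')
        if qc is None:  # no continuation element means the history is exhausted
            break
        # depending on the API version, the hint is rvcontinue or rvstartid
        if qc.get('rvcontinue') is not None:
            cont = '&rvcontinue=' + urllib.quote(qc.get('rvcontinue'), safe='')
        else:
            cont = '&rvstartid=' + qc.get('rvstartid')
    return np.array(sorted(tally(results), key=lambda x: x[0]))

Be polite if you do this: a heavily edited page can have tens of thousands of revisions, so the loop may issue many requests in a row.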
One warning for this script: please use some common sense when interpreting the results. You can download the source
here.