Tuesday, January 29, 2013

How to Handle Annoyingly Large CSV Files with MySQL

I recently found some time to look at the data for the Event Recommendation Engine Challenge on Kaggle. I was very annoyed when I saw that one of the files was a 1.19 GB CSV. My MacBook Pro only has 4 GB of RAM, and most of it is already being used by Chromium. In an effort to stave off the realization that my laptop is five years old, I decided to load that CSV file into a MySQL database. The conversion process was a bit trickier than I expected, so I'm writing this post as more of a note to myself.

First, make sure MacPorts is installed. Then install Python 2.7, MySQL5, the MySQL5 server, and the MySQL Python library:

sudo port install python27 mysql5 mysql5-server py27-mysql
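
If you want to be sure the Python bindings went in correctly, a quick sanity check is to import them from the MacPorts Python 2.7 interpreter. This is just a check of my own, not a required step:

# run with the MacPorts python2.7 interpreter
import MySQLdb
print MySQLdb.__version__   # prints the MySQL-python version if the install worked
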
Create the default databases by executing the following command:

sudo -u mysql mysql_install_db5
Start MySQL in the background (once the server output stops, press Control+C to get back to your prompt):

sudo /opt/local/lib/mysql5/bin/mysqld_safe &
Then secure the MySQL server with a password:

/opt/local/lib/mysql5/bin/mysqladmin -u root password [password goes here]
Here comes the tricky part. For other programs (like Python) to access MySQL, we need to create shortcuts (symlinks) to the MySQL socket in the places they expect to find it:

sudo ln -s /opt/local/var/run/mysql5/mysqld.sock /tmp/mysql.sock
sudo mkdir /var/mysql
sudo ln -s /opt/local/var/run/mysql5/mysqld.sock /var/mysql/mysql.sock
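
To make sure the shortcuts actually work, I like to connect through the socket directly from Python. This is just a quick check, with the same placeholder password as before:

import MySQLdb

# connect through the socket shortcut rather than over TCP
conn = MySQLdb.connect(unix_socket='/tmp/mysql.sock',
                       user='root',
                       passwd='[your password goes here]')
print conn.get_server_info()   # prints the MySQL server version
conn.close()
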
Now the fun of converting a large CSV file to a SQL database begins. I'm only using Python to do the conversion because the algorithm I wrote for the competition is in Python. If any readers are interested, I can write a quick post on converting in other languages. The rules of the competition forbid me from redistributing the data provided for it, so let's use a CSV file containing polling results from the 2008 presidential election for the rest of this post. The data looks something like this:

State Obama McCain Start End Pollster
AL 36 61 27-Oct 28-Oct SurveyUSA
AL 34 54 15-Oct 16-Oct Capital Survey
AL 35 62 8-Oct 9-Oct SurveyUSA


Conceptually, the script is very simple. A connection to the MySQL server is made from Python. If the database already exists, it is deleted and a new one is created. The header of the CSV file is used to define the columns of a new table, then each row of the CSV file is inserted into that table, with blank entries converted to None (stored as NULL). The last bit tests the new 'polls' table by pulling all of the polling results from Florida.

import csv
import MySQLdb

# open the connection to the MySQL server using MySQLdb
mydb = MySQLdb.connect(host='localhost', user='root',
                       passwd='[your password goes here]')
cursor = mydb.cursor()

# create the database!
# if one with the same name exists, delete it first
cursor.execute('DROP DATABASE IF EXISTS test')
cursor.execute('CREATE DATABASE test')
cursor.execute('USE test')

# read the 2008-pres-polls.csv file using the python csv module
csv_data = csv.reader(open('2008-pres-polls.csv'))

# read the header
header = csv_data.next()

# build the table fields from the header
fields = []
for col in header:
    fields.append('%s VARCHAR(64)' % col)
sqlfields = ',\n'.join(fields)

# create a table named 'polls' with fields from the header
sqltbl = 'CREATE TABLE polls (%s)' % sqlfields
cursor.execute(sqltbl)

# convert empty entries to None so they become NULL in MySQL
def nullify(d):
    def f(x):
        if x == "":
            return None
        else:
            return x
    return [f(x) for x in d]

# build the MySQL insert statement with one placeholder per field
def buildInsert(table, numfields):
    datholder = (numfields - 1) * "%s, " + "%s"
    query = ("insert into %s" % table) + (" values (%s)" % datholder)
    return query

# insert the data from each row of the csv file
numfields = len(header)
query = buildInsert('polls', numfields)
for line in csv_data:
    vals = nullify(line)
    cursor.execute(query, vals)
mydb.commit()   # make sure the inserts are saved

# test the database by pulling all of the polling results from Florida
cursor.execute("SELECT * FROM polls WHERE State = 'FL'")
results = cursor.fetchall()
print results
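
Once the script has finished, the data lives in MySQL, so later sessions can reconnect and query it without ever re-reading the big CSV. As a rough sketch (same placeholder password as above), here's how you might count the number of polls per state:

import MySQLdb

# reconnect to the 'test' database created by the conversion script
mydb = MySQLdb.connect(host='localhost', user='root',
                       passwd='[your password goes here]', db='test')
cursor = mydb.cursor()

# count the polls per state, most-polled states first
cursor.execute("""SELECT State, COUNT(*) FROM polls
                  GROUP BY State ORDER BY COUNT(*) DESC""")
for state, npolls in cursor.fetchall():
    print state, npolls

mydb.close()
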
Hope this is helpful to someone! You can download all of the files used in this post from here.
