I recently found some time to look at the data for the Event Recommendation Engine Challenge on Kaggle. I was very annoyed when I saw that one of the files was a 1.19 GB CSV. My MacBook Pro only has 4 GB of RAM, and most of it is already being used by Chromium. In an effort to stave off the realization that my laptop is five years old, I decided to create a MySQL database from that CSV file. The conversion process was a bit trickier than I expected, so I'm writing this post as more of a note to myself.
First make sure MacPorts is installed. Then install Python 2.7, MySQL5, the MySQL5 server, and the MySQL Python library:
sudo port install python27 mysql5 mysql5-server py27-mysql
Create the default databases by executing the following command:
sudo -u mysql mysql_install_db5
Start MySQL (once the server has started, press control+c to escape):
sudo /opt/local/lib/mysql5/bin/mysqld_safe &
Then secure the MySQL server with a password:
/opt/local/lib/mysql5/bin/mysqladmin -u root password [password goes here]
Here comes the tricky part. For other languages to access MySQL, we need to create symlinks to the MySQL socket in the places they expect to find it.
sudo ln -s /opt/local/var/run/mysql5/mysqld.sock /tmp/mysql.sock
sudo mkdir /var/mysql
sudo ln -s /opt/local/var/run/mysql5/mysqld.sock /var/mysql/mysql.sock
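To make sure the symlink (and the root password from the previous step) actually works, a quick sanity check from Python is handy. This is just a minimal sketch using the MySQLdb library installed above; it connects through the /tmp/mysql.sock symlink and prints the server version:
import MySQLdb

# connect through the symlinked socket instead of TCP
conn = MySQLdb.connect(unix_socket='/tmp/mysql.sock',
                       user='root',
                       passwd='[password goes here]')
cursor = conn.cursor()
cursor.execute('SELECT VERSION()')
print cursor.fetchone()  # prints the server version tuple
conn.close()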
Now the fun of converting a large CSV file to a SQL database begins. I'm only using Python to do the conversion because the algorithm I wrote for the competition was done in Python. If any readers are interested, I can write a quick post on converting in other languages. The rules of the competition forbid me from redistributing the data provided for the competition, so for the rest of this blog post let's use a CSV file containing the polling results of the 2008 presidential election. The data looks something like this:
State | Obama | McCain | Start  | End    | Pollster
AL    | 36    | 61     | 27-Oct | 28-Oct | SurveyUSA
AL    | 34    | 54     | 15-Oct | 16-Oct | Capital Survey
AL    | 35    | 62     | 8-Oct  | 9-Oct  | SurveyUSA
⋮     | ⋮     | ⋮      | ⋮      | ⋮      | ⋮
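Before wiring the file into MySQL, it's worth a quick peek from Python to confirm the header and row layout. A minimal sketch, assuming the file is saved as 2008-pres-polls.csv in the working directory:
import csv

# print the header and the first data row of the polls file
csv_data = csv.reader(open('2008-pres-polls.csv'))
print csv_data.next()  # ['State', 'Obama', 'McCain', 'Start', 'End', 'Pollster']
print csv_data.next()  # first polling result, e.g. the first Alabama row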
Conceptually, the script is very simple. A connection to the MySQL server is made from Python. If the database already exists, it is deleted and a new one is created. The header of the CSV file is then used to create the fields of a 'polls' table. The data from the CSV file is inserted row by row, with blank entries converted to None (which MySQL stores as NULL). The last bit tests the new database by pulling all of the polling results from Florida.
import csv
import MySQLdb

# open the connection to the MySQL server using MySQLdb
mydb = MySQLdb.connect(host='localhost',
                       user='root',
                       passwd='[your password goes here]')
cursor = mydb.cursor()

# create the database!
# if one with the same name exists, delete it first
cursor.execute('DROP DATABASE IF EXISTS test')
cursor.execute('CREATE DATABASE test')
cursor.execute('USE test')

# read the 2008-pres-polls.csv file using the python csv module
csv_data = csv.reader(open('2008-pres-polls.csv'))

# read the header
header = csv_data.next()

# build fields from the header
fields = []
for col in header:
    fields.append('%s VARCHAR(64)' % col)
sqlfields = ',\n'.join(fields)

# create a table named 'polls' with fields from the header
sqltbl = 'CREATE TABLE polls (%s)' % sqlfields
cursor.execute(sqltbl)

# insert None for empty entries
def nullify(d):
    def f(x):
        if x == "":
            return None
        else:
            return x
    return [f(x) for x in d]

# build the MySQL insert command
def buildInsert(table, numfields):
    datholder = (numfields - 1) * "%s, " + "%s"
    query = ("insert into %s" % table) + (" values (%s)" % datholder)
    return query

# insert data from each row of the csv file
numfields = len(header)
query = buildInsert('polls', numfields)
for line in csv_data:
    vals = nullify(line)
    cursor.execute(query, vals)

# commit the inserts (needed if the table uses a transactional engine like InnoDB)
mydb.commit()

# test the database
cursor.execute("SELECT * FROM polls WHERE State = 'FL' ")
results = cursor.fetchall()
print results
Hope this is helpful to someone! You can download all of the files used in this post from here.