Tuesday, December 4, 2012

Smelling Dark Matter

I've been using the little free time I have to compete in Kaggle's Observing Dark Worlds competition. I have to say that this machine learning competition has given me a new appreciation of what astronomers deal with in their research. It also illustrated why so many of the astronomy majors I knew in college always looked so sleep-deprived - I had attributed it to the nocturnal nature of twinkles.

In the Observing Dark Worlds competition, the goal is to pinpoint the locations of invisible dark matter halos given only the coordinates of galaxies in a sky and their respective ellipticities. Below is a quick visual example of the difficulty of this task (my code for the visualizations can be found here).

This is the visualization of the data from training sky 150.
The two blue circles are the dark halos hiding in training sky 150.
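For anyone curious how a sky like the one above can be drawn, here is a minimal sketch of the idea (not my competition code): it assumes a hypothetical array of galaxy rows (x, y, e1, e2) and renders each galaxy as an ellipse whose orientation and elongation come from the two ellipticity components.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

# hypothetical galaxy rows: x, y, e1, e2 (the two ellipticity components)
galaxies = np.array([[1200.0, 2300.0,  0.10, -0.05],
                     [3100.0,  800.0, -0.20,  0.15],
                     [2500.0, 3700.0,  0.05,  0.30]])

fig, ax = plt.subplots(figsize=(6, 6))
for x, y, e1, e2 in galaxies:
    e = np.hypot(e1, e2)                           # total ellipticity
    angle = 0.5 * np.degrees(np.arctan2(e2, e1))   # orientation of the major axis
    size = 150.0                                   # arbitrary display scale
    ax.add_patch(Ellipse((x, y), width=size * (1 + e), height=size * (1 - e),
                         angle=angle, fill=False, color='k'))
ax.set_xlim(0, 4200)
ax.set_ylim(0, 4200)
ax.set_aspect('equal')
plt.show()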

I haven't made much progress in my analysis - in fact, I have never been this stumped on any problem. So far I have only been able to determine the location of one dark matter halo exactly and generate a few probable locations for the other. Below are the visualizations of my progress for training sky 150 and training sky 124 from the competition. The red circles are my estimated locations and the blue circles are the true locations of the dark matter halos.

Training Sky 150

Training Sky 124

If you are reading this before the competition deadline and are interested in my method or code, I am more than happy to turn this into a team effort! For my regular readers, I will write a detailed post after the competition deadline.

Tuesday, September 4, 2012

What is the Probability of Getting (k) Heads in a Row for (n) Consecutive Tosses?

I asked myself a fun question after reading a post on QuantNet. What are the odds of getting a run of two, four, or six heads within five, ten, or a hundred consecutive tosses of a fair coin? It seemed like a fun high-school-level math problem, and with some quick Python I was able to generate a pretty graph to answer this question.
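The calculation behind the graph is a short dynamic program. Below is a minimal sketch of how I would set it up for a fair coin: track the probability of every current streak length after each toss, plus an absorbing state for "already saw k heads in a row."

def prob_run(n, k, p=0.5):
    """Probability of at least one run of k heads in n tosses of a p-weighted coin."""
    # state i (0 <= i < k): current streak of i heads; state k: run already achieved
    states = [0.0] * (k + 1)
    states[0] = 1.0
    for _ in range(n):
        new = [0.0] * (k + 1)
        new[k] = states[k]                  # the absorbing state keeps its mass
        for i in range(k):
            new[0] += states[i] * (1 - p)   # tails resets the streak
            if i + 1 == k:
                new[k] += states[i] * p     # the k-th consecutive head
            else:
                new[i + 1] += states[i] * p
        states = new
    return states[k]

for n in (5, 10, 100):
    print('n=%d: %s' % (n, [round(prob_run(n, k), 4) for k in (2, 4, 6)]))

As a sanity check, prob_run(5, 2) returns 19/32 = 0.59375, which is easy to verify by hand.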


Thursday, August 30, 2012

LaTeX Template for Lecture or Class Notes

A friend of mine asked me to create a LaTeX template for lecture notes. He wanted to learn LaTeX by typesetting all of his class notes. I think I created a pretty nice template and wanted to share it with everyone.



You can download the template from here! Happy TeXing!

Tuesday, August 28, 2012

Syntax Highlighting in LaTeX with Minted

Here is a quick tip for those of you who use MacPorts and want pretty syntax highlighting in your LaTeX documents. I recently discovered a great-looking package called minted, which depends on an external Python syntax highlighter called Pygments. I couldn't find any decent tutorials online for a quick and easy setup, so I decided to write one. I hope it will be helpful for someone.

To my readers who didn't come upon this article through Google, the minted package is a better alternative to the listings package. Minted allows you to present code in techno-color! This means the programs you have worked hard on, like:

int main() { printf("hello, world"); return 0; }

will no longer be restricted to marginalized black and white:


With minted your code will finally be noticed in the appendix of that report...


Disclaimer: Minted can do many things, but it will not make bad programs work. Though it will make them look like they could work.

Wednesday, August 22, 2012

How to Count the Number of Wikipedia Edits

This week I finally found some time to catch up on The Colbert Report and I discovered a mischievous little gem. Stephen Colbert and his staff brought up an old Washington Post article, "Wikipedia Edits Forecast Vice Presidential Picks", and subtly suggested that editing is like voting - only with no photo ID checks. From five minutes of Googling, I found a torrent of news articles repeating what I summarized in the previous sentence. Then I discovered this article that somewhat substantiates the claim with the ugly Excel chart below.


To remind those who have forgotten, Mitt Romney announced Paul Ryan as his running mate on August 11, 2012.

In all the articles I read on this claim, I was surprised that almost no news source visualized the data or told readers how to gather it. I've previously posted on some fun analytics with Wikipedia using Mathematica. However, my recent clean installation of Mountain Lion has reminded me that not everyone has access to Mathematica. In this post, I will show how to count the number of revisions/edits on a Wikipedia article using delightfully free Python.

The Python script I have written is quite easy to understand and depends on the NumPy external library. The script starts by requesting the raw revision data for a given article. There are two ways to get the revision history of a Wikipedia article: one is to scrape the HTML, which can take some effort; my code uses the second, which is simply calling the MediaWiki API. The revision data requested through the API comes back in XML format. This data is read into Python, parsed for only the date of each revision, tallied by a small tallying function I have written, and sorted by date. Finally, the revision/edit counts are returned by the script as a properly formatted array, just waiting to be plotted. The commented Python code is presented below:

import urllib2
import xml.etree.ElementTree as ET
import dateutil.parser
from collections import Counter
import numpy as np

def tally(datalist):
    lst = Counter(datalist)
    a = []
    for i, j in lst.items():
        a.append([i, j])
    return a

def revisions(articlename):
    articlename = articlename.replace(' ', '_')  # format article title for url
    url = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=%s&' % articlename + \
          'rvprop=user|timestamp&rvlimit=max&redirects$rvuser&format=xml'  # data limit is set to max for 500 edits
    req = urllib2.urlopen(url)
    data = req.read(); req.close()  # reads url response
    root = ET.fromstring(data)      # reads xml data
    group = root.find(".//revisions")
    results = []
    # gets revision times from xml data
    for elem in group.getiterator('rev'):
        timestamp = elem.get('timestamp')
        timestamp = dateutil.parser.parse(timestamp).date()  # parses timestamp and returns only date
        results.append(timestamp)
    a = tally(results)  # tallies by date
    datetally = np.array(sorted(a, key=lambda x: x[0]))  # sorts tally by date
    return datetally

Here is a quick example of how to plot with the array that is returned. I chose to use Tim Pawlenty and Marco Rubio to show the limitations of the MediaWiki API. I am also biased towards Pawlenty because of his amazing ads during the GOP primaries. There are Wikipedia pages that have low daily revision activity for long stretches of time and pages with high numbers of revisions in very short periods. The MediaWiki API will return only the previous 500 revisions on any article unless you have super user status.


from matplotlib import pyplot as plt

a = revisions('Tim Pawlenty')
b = revisions('Marco Rubio')

fig = plt.figure()
graph = fig.add_subplot(111)
graph.plot(a[:,0], a[:,1], 'r', b[:,0], b[:,1], 'b')
fig.autofmt_xdate()
plt.legend(('Tim Pawlenty', 'Marco Rubio'), loc='upper left')
plt.title('Number of Wikipedia Edits')
plt.show()

My warning for this script: please use some common sense to interpret the results. You can download the source here.

Wednesday, August 15, 2012

Historical Intraday Stock Price Data with Python

By popular request, this post will show how to acquire intraday stock data from Google Finance using Python. The general structure of the code is pretty simple to understand. First a request URL is created and sent. The response is then read by Python to create an array (matrix) of the financial data and a vector of the time data. This array is created with the help of the popular NumPy package, which can be downloaded from here. Then, in one if-statement, the time data is restructured into proper Unix time and translated to a more familiar date string for each financial data point. The translated time vector is then joined with the financial array to produce a single, easy-to-work-with financial time series array. Since NumPy has been ported to Python 3, the numerical side of the code should carry over to both Python 2.X and 3.X, though the urllib2 and urllib calls below are Python 2 specific. Here it is:

import urllib2
import urllib
import numpy as np
from datetime import datetime

urldata = {}
urldata['q'] = ticker = 'JPM'   # stock symbol
urldata['x'] = 'NYSE'           # exchange symbol
urldata['i'] = '60'             # interval
urldata['p'] = '1d'             # number of past trading days (max has been 15d)
urldata['f'] = 'd,o,h,l,c,v'    # requested data: d is time, o is open, c is closing, h is high, l is low, v is volume

url_values = urllib.urlencode(urldata)
url = 'http://www.google.com/finance/getprices'
full_url = url + '?' + url_values
req = urllib2.Request(full_url)
response = urllib2.urlopen(req).readlines()

getdata = response
del getdata[0:7]                # strip the header lines

numberoflines = len(getdata)
returnMat = np.zeros((numberoflines, 5))
timeVector = []

index = 0
for line in getdata:
    line = line.strip('a')
    listFromLine = line.split(',')
    returnMat[index,:] = listFromLine[1:6]
    timeVector.append(int(listFromLine[0]))
    index += 1

# convert Unix or epoch time to something more familiar
for x in timeVector:
    if x > 500:
        z = x
        timeVector[timeVector.index(x)] = datetime.fromtimestamp(x)
    else:
        y = z + x*60            # multiply by interval
        timeVector[timeVector.index(x)] = datetime.fromtimestamp(y)

tdata = np.array(timeVector)
time = tdata.reshape((len(tdata), 1))
intradata = np.concatenate((time, returnMat), axis=1)  # array of all data with the properly formatted times
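As a quick usage example (a sketch that assumes the script above has already run and produced the intradata array), the closing prices can be plotted against the translated timestamps with matplotlib. The column order is an assumption here - check the header lines that the script deletes to confirm which column actually holds the close.

from matplotlib import pyplot as plt

# assumed layout: column 0 holds the datetime objects, and the remaining
# columns follow the o,h,l,c,v order requested above (so column 4 would be the close)
plt.plot(intradata[:, 0], intradata[:, 4], 'b-')
plt.title('%s intraday closing prices' % ticker)
plt.gcf().autofmt_xdate()
plt.show()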

Saturday, August 11, 2012

The Walking Dead Model

Today, I finally discovered a useful application of my chemical engineering degree: initiating engaging philosophical and mathematical conversations about zombies with grumpy old men.

As I sat at my terminal watching the first episode of The Walking Dead, a raspy voice next to me muttered, "That would never happen."

I replied with the usual, "Yes, I know zombies don't exist." Usually that avoids the unwanted confrontation and allows for a change of subject. It is the kind of formulaic reply Dale Carnegie would use - if he were a zombie enthusiast. Then he took me by surprise when he explained that the population of a small town would not be completely zombified or dead.


In all honesty, I have wondered what I would do in certain zombie apocalypse scenarios, but I have never thought about the larger picture, like the population of a town. With a few Auntie Anne's napkins and 45 minutes before boarding, we began outlining a population model using the zombie rules from the universe of The Walking Dead.

Tuesday, August 7, 2012

Installing Scipy from MacPorts on Mountain Lion

After doing a fresh install of Mountain Lion, I proceeded with my age-old tradition of reinstalling all of my favorite libraries from MacPorts. Everything installed wonderfully except for PyObjC, Scipy, and a few of their dependencies like gcc45 and libunwind-headers. After an hour of trial and error that included playing with the pip installer and fighting the temptation to switch to Homebrew, I found a quick and easy fix for MacPorts.

If you have already attempted to install any of the ports I have listed, please clean the ones that failed to install. For example, my installation of the py27-scipy port failed, so I must use the following command to clean it:

sudo port clean --all py27-scipy
The quick fix that allows MacPorts to install these wonderful ports properly is to reconfirm the Xcode license agreement with the following command:

sudo xcodebuild -license
Continuing from my example, sudo port install py27-scipy should now execute and install without a hitch. The same goes for the other ports I listed earlier. 

Monday, July 30, 2012

Creating a Sonic Screwdriver with an Arduino Nano, Part 1

I was pretty excited to hear about the release of a Sonic Screwdriver remote. But after watching the YouTube demo, it seems kind of lame. The appeal of the Sonic Screwdriver used by the good Doctor comes from its universal applicability in helping the Doctor get out of sticky situations or rid himself of general annoyances. A programmable remote just doesn't cut it.

For those in the US who have never caught an old episode of Doctor Who at 5:30 in the morning on PBS or watched the newer episodes on Amazon Prime, a Sonic Screwdriver is akin to MacGyver's Swiss Army knife, only more futuristic - the BBC prop makers added an LED to show that it is from the future.



My disappointment in the Sonic Screwdriver remote (available on ThinkGeek) has inspired me to design a Sonic Screwdriver that is more true to fiction. I just can't see the Doctor sitting somewhere waving his screwdriver in front of an old Panasonic remote. In this post I will go over my plans to integrate a TV-B-Gone into a Sonic Screwdriver. In future posts, I may show how to integrate other functions such as 125 kHz RFID technology (to open my apartment or work doors) and possibly 13.56 MHz MIFARE technology (in DC we use SmarTrip for public transportation).

Friday, July 27, 2012

Face and Eye Tracking Fun!

I recently discovered the openframeworks toolkit. It is an open source C++ toolkit that wraps together many commonly used libraries for simple and fast experimentation. I absolutely love it!

When I shared code in the past, it took quite a bit of time to document which libraries, which versions, and sometimes (even with C++) which operating system a new user needed for the code to compile properly. With openframeworks, that annoyance is a thing of the past.


For this post I decided to make a quick and fun example adapted from one of openframeworks' provided examples on detecting faces in pictures. The example can be found under the folder examples/addons/opencvHaarFinderExample.

I decided to extend their example to detect faces from a live webcam or a QuickTime movie. I was amazed at how few lines of code it took. To get started, download the appropriate toolkit for your IDE here. Then copy the folder opencvHaarFinderExample from examples/addons/ to apps/myApps/ and replace the code in testApp.cpp with the following:
#include "testApp.h"

//--------------------------------------------------------------
void testApp::setup(){
    camWidth = 480;
    camHeight = 360;

    //uncomment and comment lines below to test different detectors
    finder.setup("haarcascade_frontalface_default.xml");
    //finder.setup("haarcascade_frontalface_alt.xml");
    //finder.setup("haarcascade_frontalface_alt2.xml");
    //finder.setup("haarcascade_eyeglasses.xml");
    //finder.setup("haarcascade_eye.xml");

    #ifdef _USE_LIVE_VIDEO
        VideoGrabber.setVerbose(true);
        VideoGrabber.initGrabber(camWidth,camHeight,OF_IMAGE_COLOR);
        colorImg.allocate(camWidth,camHeight,OF_IMAGE_COLOR);
    #else
        VideoPlayer.loadMovie("ThemeFromShaft.mov");
        VideoPlayer.play();
    #endif
}

//--------------------------------------------------------------
void testApp::update(){
    ofBackground(100,100,100);

    bool newFrame = false;

    #ifdef _USE_LIVE_VIDEO
        VideoGrabber.grabFrame();
        newFrame = VideoGrabber.isFrameNew();
    #else
        VideoPlayer.idleMovie();
        newFrame = VideoPlayer.isFrameNew();
    #endif

    if (newFrame){
        #ifdef _USE_LIVE_VIDEO
            colorImg.setFromPixels(VideoGrabber.getPixels(), camWidth, camHeight, OF_IMAGE_COLOR);
        #else
            colorImg.setFromPixels(VideoPlayer.getPixels(), camWidth, camHeight, OF_IMAGE_COLOR);
        #endif
        finder.findHaarObjects(colorImg);
    }
}

//--------------------------------------------------------------
void testApp::draw(){
    ofSetHexColor(0xffffff);
    colorImg.draw(0,0);

    ofNoFill();
    for(int i = 0; i < finder.blobs.size(); i++) {
        ofRectangle dim = finder.blobs[i].boundingRect;
        ofRect(dim.x, dim.y, dim.width, dim.height);
    }

    ofSetHexColor(0xffffff);
    char reportStr[1024];
    sprintf(reportStr, "Number Detected: %lu fps: %f", finder.blobs.size(), ofGetFrameRate());
    ofDrawBitmapString(reportStr, 20, 20+camHeight);
}

//--------------------------------------------------------------
void testApp::keyPressed (int key){
    switch (key){
        case ' ':
            break;
    }
}

//--------------------------------------------------------------
void testApp::mouseMoved(int x, int y ){
}

//--------------------------------------------------------------
void testApp::mouseDragged(int x, int y, int button){
}

//--------------------------------------------------------------
void testApp::mousePressed(int x, int y, int button){
}

//--------------------------------------------------------------
void testApp::mouseReleased(int x, int y, int button){
}

//--------------------------------------------------------------
void testApp::windowResized(int w, int h){
}

Then replace the code in testApp.h with the following:
#ifndef _TEST_APP
#define _TEST_APP

#include "ofMain.h"
#include "ofxCvHaarFinder.h"

#define _USE_LIVE_VIDEO    //comment to use video, uncomment to use webcam

class testApp : public ofBaseApp{

    public:
        void setup();
        void update();
        void draw();

        void keyPressed (int key);
        void mouseMoved(int x, int y );
        void mouseDragged(int x, int y, int button);
        void mousePressed(int x, int y, int button);
        void mouseReleased(int x, int y, int button);
        void windowResized(int w, int h);

        int camWidth, camHeight;

        #ifdef _USE_LIVE_VIDEO
            ofVideoGrabber VideoGrabber;
        #else
            ofVideoPlayer VideoPlayer;
        #endif

        ofxCvHaarFinder finder;
        ofImage colorImg;
};

#endif

For those Mac users out there, just download this, open /apps/myApps/HeadTracking/TrackingExample.xcodeproj in Xcode (3.2 and above), and hit run!

Wednesday, July 11, 2012

Historical Minute-by-Minute Stock Prices in MATLAB

This is an extension of my first post on this blog. My first post went over how to gather minute-to-minute stock price data from Google Finance using Mathematica. Using the same general programming structure, I've created a function to do this in MATLAB.

The IntraDayStockData function can be obtained here. This function takes required inputs of a stock symbol and the name of the exchange the stock is traded on. It also takes optional inputs of quote frequency (in seconds) and the number of previous trading days. By default, the optional inputs are set for 1-minute quote intervals and 15 previous trading days. The outputs of this function are the time, the time in string format, trade volume, highest prices, lowest prices, and closing prices. Here is a quick example of getting the minute-by-minute quotes of the latest trading day (today) for JP Morgan Chase:

jpm = IntraDayStockData('JPM','NYSE','60','1d');
plot(jpm.date,jpm.close,'b-'); hold on;
plot(jpm.date,jpm.high,'r-'); hold on;
plot(jpm.date,jpm.low,'g-');
datetick('x',16);

Tuesday, July 3, 2012

Where are the Homicides in Chicago?

Recently, CNN did a report on the high number of homicides plaguing Chicago. One of the sure-fire solutions suggested was to place more police in the areas where these killings occur. It got me wondering where these homicides actually occur.

Using the code from my previous post, I called the crime API that the city of Chicago provides and plotted out all the reported homicides between August 1st, 2011 and May 31st, 2012. There were 403 in total.


Now where did these incidents occur?



Initially it looks as if the homicides are widespread throughout the city. However, if we do a breakdown by neighborhood, it becomes obvious that some neighborhoods are much worse off than others. Below is a table of each Chicago neighborhood and its associated number of homicides between August 1st, 2011 and May 31st, 2012. I've also provided a more convenient graphical representation of this table. The Mathematica code and data files I used for this post can be found here.



Wednesday, June 27, 2012

Hidden Markov Model of Chicago Crimes

This is a bit of a side note from my attempts to model the number of daily crimes in the city of Chicago. I've been treating the Chicago crime data as a parametric random process where the parameters of the stochastic process can be estimated in a well-defined manner. My first few attempts used various autoregressive models that assumed the crime data was a signal correlated with its own past values. From my knowledge of statistics, this is the proper and prescribed treatment for the data. But I have been wondering how well a purely stochastic model like the Hidden Markov Model (HMM) would perform.

A hidden Markov model is not that hard of a conceptual stretch when applied to crime modeling. Let's assume that all criminal activity in Chicago evolves according to a Markov process, which may have a finite or infinite number of states. The underlying Markov process is not directly observable because these states are hidden. Luckily, each state has an associated stochastic process which produces an observable signal (a crime report). So as this criminal system in Chicago evolves over time, governed by the hidden Markov chain, it produces crime reports as observable outcomes.

For a full explanation of the mathematical theory behind HMMs, please refer to this article by Lawrence Rabiner. Even though Rabiner focuses on HMM applications to speech recognition, the article covers in great detail the theory used in my implementation. To simplify things for myself and to cut down on computation time, I created a hidden Markov model with univariate Gaussian outcomes. The model is fitted by maximum likelihood estimation, where the parameter values are those that maximize the likelihood of the observed data. Here, I used the Baum-Welch algorithm to find the maximum likelihood estimates.
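To fix notation (mirroring the parameter names that appear in the code below), the hidden state follows a Markov chain with initial distribution ι and transition matrix a, and each day's observation is drawn from a state-specific Gaussian:
\[P(s_{1}=i)=\iota_{i},\qquad P(s_{t}=j\mid s_{t-1}=i)=a_{ij},\qquad y_{t}\mid(s_{t}=i)\sim N(\mu_{i},\sigma_{i}^{2})\]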

The observed Chicago crime data is the total number of daily reported crimes between July 1, 2011 and March 31, 2012. A log-difference transformation of this data was performed to create a more normally distributed set of observables. This is all done in Mathematica with a few lines of code:
csv = Import[
   "https://data.cityofchicago.org/api/views/x2n5-8w5q/rows.csv", 
   "CSV"];

start = "07/01/2011";  end = "03/31/2012";

range = Join[
   First[Position[
      If[StringMatchQ[#, start ~~ __], 1, 0] & /@ 
       Drop[csv, 1][[All, 2]], 1]] + 1, 
   Last[Position[
      If[StringMatchQ[#, end ~~ __], 1, 0] & /@ 
       Drop[csv, 1][[All, 2]], 1]] + 1];

crimeseries = 
  Tally[DateString[
      DateList[{ToString[#], {"Month", "/", "Day", "/", "Year", " ", 
         "Hour12", ":", "Minute", " ", "AMPM"}}], {"MonthName", " ", 
       "Day", " ", "Year"}] & /@ Take[csv, range][[All, 2]]];
The HMM fitting of the crime data uses three functions I created in Mathematica:
MarkovChainEquilibrium[TransitionMatrix_] := 
  Inverse[Transpose[TransitionMatrix] - 
     IdentityMatrix[Length[TransitionMatrix]] + 1].ConstantArray[1, 
    Length[TransitionMatrix]];

RandomInitial[Data_, States_] := Module[
   {TransMat, i, vnGamma, ProbVec, StateMean, vnQ, StateStD, t, mnW},
   TransMat = (#/Total[#]) & /@ (RandomReal[{0.01, 0.99}, {States, 
        States}]);
   ProbVec = MarkovChainEquilibrium[TransMat];
   StateMean = ((#/Total[#]) &[
       RandomReal[{-0.99, 0.99}, States]]) Mean[Data];
   StateStD = 
    Sqrt[((#/Total[#]) &[RandomReal[{0.1, 0.99}, States]]) Variance[
       Data]];
   {ProbVec, TransMat, StateMean, StateStD}];

Options[HMM] = {"LikelihoodTolerance" -> 10^(-7), 
   "MaxIterations" -> 500};

HMM[Data_, {InitialIota_, InitialA_, InitialMean_, InitialStD_}, 
   OptionsPattern[]] := Module[
   {TransMat, matAlpha, matB, matBeta, nBIC, matGamma, ProbVec, 
    StateLogLikelihood, iMaxIter, StateMean, iN, StateNu, StateStD, t,
     iT, nTol, matWeights, matXi},
   nTol = OptionValue["LikelihoodTolerance"]; 
   iMaxIter = OptionValue["MaxIterations"];
   iT = Length[Data]; iN = Length[InitialA];
   ProbVec = InitialIota; TransMat = InitialA; 
   StateMean = InitialMean; StateStD = InitialStD;
   matB = 
    Table[PDF[NormalDistribution[StateMean[[#]], StateStD[[#]]], 
        Data[[t]]] & /@ Range[iN], {t, 1, iT}];
   matAlpha = Array[0. &, {iT, iN}];
   StateNu = Array[0. &, iT];
   matAlpha[[1]] = ProbVec matB[[1]];
   StateNu[[1]] = 1/Total[matAlpha[[1]]];
   matAlpha[[1]] *= StateNu[[1]];
   For[t = 2, t <= iT, t++, 
    matAlpha[[t]] = (matAlpha[[t - 1]].TransMat) matB[[t]]; 
    StateNu[[t]] = 1/Total[matAlpha[[t]]]; 
    matAlpha[[t]] *= StateNu[[t]]];
   StateLogLikelihood = {-Total[Log[StateNu]]};
   Do[matBeta = Array[0. &, {iT, iN}]; 
    matBeta[[iT, ;;]] = StateNu[[iT]];
    For[t = iT - 1, t >= 1, t--, 
     matBeta[[t]] = 
      TransMat.(matBeta[[t + 1]] matB[[t + 1]]) StateNu[[t]]];
    matGamma = (#/Total[#]) & /@ (matAlpha matBeta);
    matXi = Array[0. &, {iN, iN}];
    For[t = 1, t <= iT - 1, t++, 
     matXi += (#/Total[Flatten[#]]) &[
      TransMat KroneckerProduct[matAlpha[[t]], 
        matBeta[[t + 1]] matB[[t + 1]]]]];
    TransMat = (#/Total[#]) & /@ matXi;
    ProbVec = matGamma[[1]];
    matWeights = (#/Total[#]) & /@ Transpose[matGamma];
    StateMean = matWeights.Data;
    StateStD = Sqrt[Total /@ (matWeights (Data - # & /@ StateMean)^2)];
    matB = 
     Table[PDF[NormalDistribution[StateMean[[#]], StateStD[[#]]], 
         Data[[t]]] & /@ Range[iN], {t, 1, iT}];
    matAlpha = Array[0. &, {iT, iN}];
    StateNu = Array[0. &, iT];
    matAlpha[[1]] = ProbVec matB[[1]];
    StateNu[[1]] = 1/Total[matAlpha[[1]]];
    matAlpha[[1]] *= StateNu[[1]];
    For[t = 2, t <= iT, t++, 
     matAlpha[[t]] = (matAlpha[[t - 1]].TransMat) matB[[t]]; 
     StateNu[[t]] = 1/Total[matAlpha[[t]]]; 
     matAlpha[[t]] *= StateNu[[t]]];
    StateLogLikelihood = 
     Append[StateLogLikelihood, -Total[Log[StateNu]]];
    If[StateLogLikelihood[[-1]]/StateLogLikelihood[[-2]] - 1 <= nTol, 
     Break[]],
    {iMaxIter}];
   nBIC = -2 StateLogLikelihood[[-1]] + (iN (iN + 2) - 1) Log[iT];
   {"\[Iota]" -> ProbVec, "a" -> TransMat, "\[Mu]" -> StateMean, 
    "\[Sigma]" -> StateStD, "\[Gamma]" -> matGamma, "BIC" -> nBIC, 
    "LL" -> StateLogLikelihood}];
The first function computes the equilibrium distribution for an ergodic Markov chain. The second function generates a random parameter set from inputs of the observed data and the number of states. Output from the second function is a list containing the initial state probability vector, the transition matrix of the Markov chain, the mean of each state, and the standard deviation of each state. The third function performs the maximum likelihood fit of the HMM. It takes inputs of the observed data and a random parameter set from the second function. The output from the third function is a list that contains the initial state probability vector, the transition matrix, the mean vector, the standard deviation vector, a matrix of per-period state probabilities, the Bayesian Information Criterion (BIC) for the fit, and a history of the loglikelihood for each iteration.

As an example of how to use these functions, let's generate a two-state fit with the log-differenced crime data. The first step is to generate a random initial parameter set. To hedge against the risk of a few really bad initial guesses, I chose to generate a set of 20 initial guesses and run each for 10 iterations. From this set, I then chose the candidate with the highest loglikelihood. This candidate is then used as the starting point for fitting the model.
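One note on the data going into these calls: the crimedifference variable below is the log-differenced series of the daily counts tallied above, i.e.
\[y_{t}=\log z_{t+1}-\log z_{t}\]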
guess = HMM[crimedifference[[All, 2]], #, "MaxIterations" -> 10] & /@ 
   Table[RandomInitial[crimedifference[[All, 2]], 2], {20}];

bestguessLL = 
  Flatten[Position[#, Max[#]] &[Last["LL" /. #] & /@ guess]][[1]];

hmmfit = HMM[
   crimedifference[[All, 
    2]], {"\[Iota]", "a", "\[Mu]", 
     "\[Sigma]"} /. (guess[[bestguessLL]])];
It is usually a good idea to check the convergence of the likelihood function to make sure nothing has gone awry.
ListPlot["LL" /. hmmfit, PlotRange -> All]

To cut down on unnecessary typing, I have written a Do loop to evaluate models with different numbers of states and used a For loop to create a table of the different models with their associated BICs.
MaxGaussians = 3;

Do[InitialGuess["Crimes", i] = 
  HMM[crimedifference[[All, 2]], #, "MaxIterations" -> 10] & /@ 
   Table[RandomInitial[crimedifference[[All, 2]], i], {20}];
 BestGuessLL["Crimes", i] = 
  Flatten[Position[#, Max[#]] &[
     Last["LL" /. #] & /@ InitialGuess["Crimes", i]]][[1]];
 FittedHMM["Crimes", i] = 
  HMM[crimedifference[[All, 
    2]], {"\[Iota]", "a", "\[Mu]", 
     "\[Sigma]"} /. (InitialGuess["Crimes", i][[
      BestGuessLL["Crimes", i]]])],
 {i, 1, MaxGaussians, 1}]

bic = {}; states = {};

For[i = 1, i < MaxGaussians + 1, i++, AppendTo[states, i]; 
 j = "BIC" /. FittedHMM["Crimes", i]; AppendTo[bic, j]]

TableForm[Transpose[{states, bic}], 
 TableHeadings -> {None, {"#States", "BIC"}}, 
 TableAlignments -> Center]

By the BIC, it seems that an HMM with two univariate Gaussian states is the best model. If we examine the Markov chain equilibrium distribution, it seems crime in Chicago sits in a highly volatile state with decreasing crime about 91% of the time and in a low-volatility state with moderate increases in crime about 9% of the time. Looking at the main diagonal of the transition matrix, the large transition probabilities indicate that once crime in Chicago is in one of these states, it tends to stay there for a while.
Print["Equilibrium Distribution = ", 
 MatrixForm[MarkovChainEquilibrium["a" /. FittedHMM["Crimes", 2]]]]

Print["Transition Matrix from Data = ", 
 MatrixForm["a" /. FittedHMM["Crimes", 2]]]


That is all for now; I'll try to create a simulation with the fitted HMM for the next post. All the files used for this post can be found here.

Tuesday, June 26, 2012

First Attempt at Modeling Crime in Chicago, Part 3

I've been trying to keep my attempts to model crimes in Chicago relatively simple. I was hoping to generate some decent predictions by examining a few months of crime data as a signal using a simple autoregressive moving average (ARMA) model. And I've finally made some progress!



I found that an ARMA(2, 3) process acts as a reasonable descriptor for the first difference of the crime data between July 1st, 2011 and March 31st, 2012. The graphs above are simulations for April 2012 that were generated with my regressed ARMA(2, 3) parameters. My choices of an ARMA model and of fitting the first difference of the data are explained in a previous post. The Mathematica code that performs the fitting and diagnostic checks is quite extensive, so I may release the work for this post as a package in a later post and document all of the functions then. But if you want to play around with my code and do not mind working with Mathematica, you can find everything used to create this post here (Warning! It is still a work in progress).
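For reference, an ARMA(2, 3) model for the first difference of the daily crime counts has the standard form below, with a white noise term a_t (sign conventions for the moving average coefficients vary; they are written here as additions):
\[w_{t}=\phi_{1}w_{t-1}+\phi_{2}w_{t-2}+a_{t}+\theta_{1}a_{t-1}+\theta_{2}a_{t-2}+\theta_{3}a_{t-3},\qquad w_{t}=z_{t}-z_{t-1}\]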

Thursday, June 21, 2012

Simple Step to Create a Slicker Site

In my last few posts I had been skimping on a great deal of procedural detail and presenting only some fun results. This could be blamed on my dwindling reserves of free time and my compulsion to make things presentable. These hurdles loom large when blogging about a semi-programming language like Mathematica, a document markup language like LaTeX, or an intermediate-level language like C++. Other than displaying code as plaintext or a cropped screenshot, there are few efficient options for posting it, and even fewer for displaying the code's syntax in a manner that fits a personalized blog design on Blogger.

A fantastic solution to this problem is the combined use of google-code-prettify and SyntaxHighlighter. The setup of these JavaScript libraries on Blogger is quite simple, the configurations are nearly limitless, and the results are visually appealing.

For example, say I wanted to write a quick, crazy Mathematica post on how to gather all the historical drawings for the Mega Millions lottery. It would have taken me some time to create a Blogger-ready-sized PNG image of the code to upload, and it would have annoyed my blog readers because they would have to manually type out the code below.


Using google-code-prettify, I can copy and paste my code in nanoseconds, and my readers can use the code to pick their winning Mega Millions numbers right away.
dat[x_] := 
 Drop[Drop[
   Flatten[Import[
     "http://www.usamega.com/mega-millions-history.asp?p=" <> 
      ToString[x] <> "", "Data"], 4], 5], -3]
n = 61;
dat1 = ToString[Flatten[dat[#] & /@ Range[n], 1]];
data = ToExpression[
   StringReplace[dat1, {"\[CenterDot]" -> ",", "+" -> ","}]];
number1 = data[[All, 4]];
number2 = data[[All, 5]];
number3 = data[[All, 6]];
number4 = data[[All, 7]];
number5 = data[[All, 8]];
megaball = data[[All, 9]];
To get google-code-prettify set up on Blogger, you will need a free Dropbox or Box account and the three files I provide here. The lang-mma.js and prettify.css files I have provided for this post were forked from the Mathematica Stack Exchange. Then follow these steps:

  1. Place the uncompressed files anywhere in your preferred cloud drive account.

  2. Get the sharable hyperlinks of those three files.

  3. In Blogger (with the new 2012 interface) go to the Template tab on the left, then navigate to the Edit HTML button and proceed.

  4. Copy and paste the following code into your HTML template (the best spot is right before </head>), insert the appropriate links from step 2, and save.


<link href='INSERT LINK TO prettify.css' rel='stylesheet' type='text/css'/>
<script language='javascript' src='INSERT LINK TO prettify.js' type='text/javascript'/>
<script language='javascript' src='INSERT LINK TO lang-mma.js' type='text/javascript'/>
<script type='text/javascript'>
  document.addEventListener('DOMContentLoaded', function() { prettyPrint(); });
</script>
You can now create Mathematica-styled syntax in a Blogger post by using the "pre" tag with the "prettyprint" class in the post's HTML. Try it out with the Mega Millions Mathematica code from above.

<pre class="prettyprint lang-mma"> <!--Paste Mathematica code here--> </pre>
As seen in his post for other languages, I prefer to use SyntaxHighlighter over google-code-prettify. The instructions on how to set up and use SyntaxHighlighter on Blogger can be found here, so I will not be covering it - unless requested. You can find all the files used for this post here. Happy Blogging Everyone!

Tuesday, June 12, 2012

First Attempt at Modeling Crime in Chicago, Part 2

In the last post, my main goal was to develop a Mathematica 8 program that could regress coefficients for general autoregressive (AR) models. In my excitement, I skipped an essential step of data modeling: I forgot to check whether the data can actually be described by an AR model.

To keep things simple for the remainder of my attempts to model the number of daily crimes in Chicago, I will be using only the data between July 1st, 2011 and March 31st, 2012. This will also give me a reason to use the new Data section of my blog.



Before examining the legitimacy of an AR model selection, I need to check whether the data is stationary. The basis for any time series analysis is a stationary time series, because essentially we can only develop models and forecasts for stationary series. In my experience, there is no clear demarcation between stationary and non-stationary data. The usual approach to determine stationarity is to plot the autocorrelation function; if the plot doesn't dampen at long lags, then the data is most likely not stationary. This logic escapes me. It may be because my statistical skills come from my engineering and computational chemistry training, but I don't think the autocorrelation function is even defined for non-stationary processes.

One of the tricks to transform a non-stationary dataset into a stationary one is to take differences of the dataset. So for me, a variogram is the more logical tool for determining stationarity. A variogram gives the ratio of the variance of differences taken k time units apart to the variance of differences taken one time unit apart. If you look at the equations below, as k grows large the k-lag and (k+1)-lag differences eventually have essentially the same variance. Ideally, you can conclude a dataset is stationary when the variogram settles onto a stable asymptote.

\[G_{k}=\frac{V(z_{t+k}-z_{t})}{V(z_{t+1}-z_{t})},\; k=1,2,3,\ldots\]
where
\[V(z_{t+k}-z_{t})=\frac{\sum_{t=1}^{n-k}\left(d^{k}_{t}-(n-k)^{-1}\sum d^{k}_{t}\right)^{2}}{n-k-1}\]
\[d^{k}_{t}=z_{t+k}-z_{t}\]

In Mathematica, a sample variogram can be calculated accordingly:

LagDifferences[list_, k_Integer?Positive] /; k < Length[list] := Drop[list, k] - Drop[list, -k];

Variogram[list_, k_] := (#/First@#) &@(Variance /@ Table[LagDifferences[list, i], {i, k}]);

Now we can finally move on to model selection. As a rule of thumb, an AR(p) model can adequately describe a set of data if the autocorrelation function (Acf) plot looks like a damped exponential or sine wave that tails off, and the partial autocorrelation function (PAcf) plot cuts off after p lags. Plots of the variogram, autocorrelation function, and partial autocorrelation function for the Chicago crime data and its differences are shown below.

Variogram plots for the crime series, its 1st difference, and its 2nd difference.

Autocorrelation and partial autocorrelation plots for the crime series, its 1st difference, and its 2nd difference.

It looks like taking the first difference produces a more stationary time series. The large negative autocorrelation at lag 1 in both the Acf and PAcf plots suggests the data may be better described with a moving average component rather than with just an autoregressive model. Looks like it is bollocks on me for jumping the gun. I'll be back next post with a moving average model built in Mathematica. Until then, have fun with the source files for this post here.

Tuesday, June 5, 2012

First Attempt at Modeling Crime in Chicago

In my last post, I showed how to use the Socrata Open Data API (SODA) to download data on the crimes reported in Chicago, how to plot the locations, and how to count the number of crimes that occurred in a certain area. This post chronicles my first attempt to model that data in Mathematica.

The time series analysis add-on to Mathematica 8 costs about $295. So I've decided to write my own autoregressive (AR) model package as a start to modeling the crimes in Chicago. My AR code in Mathematica is based on the FitAR R package and its associated paper. The mathematics and derivations of autoregressive models are already heavily covered on other websites, so I will not be explaining them here. The alpha version of my code can be found here; note that it is not complete and does not yet have all the functions of FitAR or the Mathematica time series add-on.
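For context, the phi and mu parameters reported in the tables below refer to the usual mean-centered AR(p) form, with a_t a white noise term:
\[z_{t}-\mu=\sum_{i=1}^{p}\phi_{i}\,(z_{t-i}-\mu)+a_{t}\]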

My code, at this point, does provide fits that are comparable to the FitAR package. Using the default "lynx" dataset in R, the following fits and associated residual autocorrelation plots were produced:


Mathematica

AR(1) Fit
            MLE        sd          Z-ratio
phi(1)      0.71729    0.0652589   10.9914
mu          1537.94    364.477     4.21957


AR(4) Fit
            MLE        sd          Z-ratio
phi(1)      1.12428    0.0905874   12.411
phi(2)     -0.716667   0.136707    -5.24237
phi(3)      0.262661   0.136707    1.92135
phi(4)     -0.253983   0.0905874   -2.80374
mu          1537.94    135.755     11.3288


Subset AR(1,2,4,5,7,10,11) Fit
            MLE        sd          Z-ratio
phi(1)      0.820455   0.0178649   45.9256
phi(2)     -0.632818   0.0972914   -6.50436
phi(4)     -0.141951   0.0684639   -2.07337
phi(5)      0.141893   0.0748193   1.89647
phi(7)      0.202038   0.0992876   2.03488
phi(10)    -0.314105   0.0917768   -3.42249
phi(11)    -0.368658   0.0870617   -4.23444
mu          6.68591    0.100657    66.4225

R

AR(1) Fit
            MLE          sd           Z-ratio
phi(1)      0.717303     0.0652577    10.9918
mu          1538.02      363.986      4.22548


AR(4) Fit
            MLE          sd           Z-ratio
phi(1)      1.12463      0.0905803    12.4158
phi(2)     -0.717396     0.1367150    -5.24738
phi(3)      0.263355     0.136715     1.92630
phi(4)     -0.254273     0.0905803    -2.80716
mu          1538.02      135.469      11.3533


Subset AR(1,2,4,5,7,10,11) Fit
            MLE          sd           Z-ratio
phi(1)      0.8204490    0.01786005   45.937651
phi(2)     -0.6328433    0.09731087   -6.503316
phi(4)     -0.1420888    0.06847123   -2.075161
phi(5)      0.1421388    0.07481409   1.899894
phi(7)      0.2021250    0.09927376   2.036036
phi(10)    -0.3140954    0.09178210   -3.422186
phi(11)    -0.3686589    0.08706171   -4.234455
mu          6.6859329    0.09999207   66.864631



Now that I know my code is correct, the next step will be figuring out whether an autoregressive model can properly describe the crime data from Chicago. The lynx data above is almost trivial compared to the chaotic mess of crime data.

Time Series Plot of lynx data

Plot of daily totals of crimes in Chicago.

Just for fun, this is the residual autocorrelation of an AR(5) model regressed on the crime series data from above:


Not a model anyone should ever depend on. The files used in this post can be found here.

Monday, May 7, 2012

Chicago Crime - Updated

I was a bit dismayed to discover that the EveryBlock API stopped working. One of my most popular posts, Chicago Crime, depended heavily on this API.

Here I present an alternative data source for my previous post, plus a fun update: a quick and dirty method to create the popular crime density plots, as shown below.


In the spirit of transparency, the city of Chicago has taken the initiative to electronically release a great deal of data through the Socrata Open Data API (SODA). Luckily for us, this release includes all filed police reports. The records available through this API range from the year 2001 to the present, which dwarfs the two-week range provided by EveryBlock's API. However, I would not recommend downloading all of this data, as the text file is approximately 2GB in size.

In this post I use only the filed police reports from one year prior to the present. The data is available in both JSON and CSV formats. I use only the CSV format here, for the convenience of those with an earlier version of Mathematica. To import this data into Mathematica:
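For readers without Mathematica, here is a rough Python sketch of the same import-and-tally step. It assumes the one-year CSV export at the data.cityofchicago.org URL that also appears in a later post, and that the second column holds the report date in mm/dd/yyyy form - adjust the column index to match the actual file.

import csv
import urllib2
from collections import Counter
from datetime import datetime

# one-year CSV export of filed police reports (same view referenced in a later post)
url = 'https://data.cityofchicago.org/api/views/x2n5-8w5q/rows.csv'
reader = csv.reader(urllib2.urlopen(url))
header = next(reader)             # skip the column names

daily_totals = Counter()
for row in reader:
    date = row[1].split(' ')[0]   # assumed: column 2 is "mm/dd/yyyy hh:mm AM/PM"
    daily_totals[date] += 1

# print daily incident totals in chronological order
for date in sorted(daily_totals, key=lambda d: datetime.strptime(d, '%m/%d/%Y')):
    print(date, daily_totals[date])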





A plot of all reported crimes in Chicago between February 1st and March 31st of 2012 was created with the data.


Plots of incidents for any period within the year can easily be created with the following Mathematica code. Restrict the data to the range of interest by replacing the dates in the start and end strings. Dates must be in the format mm/dd/yyyy. This code simply finds the positions of the strings that match the input dates and plots the daily totals of incidents.










The data can be parsed into primary descriptions including arson, homicide, sex offense, weapons violation, crim sexual assault, interfere with public officer, gambling, public peace violation, criminal trespass, liquor law violation, assault, deceptive practice, burglary, criminal damage, robbery, motor vehicle theft, offense involving children, narcotics, theft, other offense, and battery. Below is the plot and code to filter for only narcotics incidents.






















A secondary description can also be used to parse the data. Some secondary descriptions include aggravated: handgun, poss: crack, armed: handgun, harassment by telephone, poss: heroin(white), to land, retail theft, unlawful entry, strongarm - no weapon, to property, poss: cannabis 30gms or less, from building, $500 and under, forcible entry, to vehicle, domestic battery simple, simple, automobile, over $500, telephone threat, and wireroom/sports. Below is the plot and code to filter for incidents with a primary description of narcotics and a secondary description of poss: cannabis 30gms or less.























The incidents may also be separated into ones where arrests were made and ones where no arrests were made. Below is the code to filter for incidents with a primary description of narcotics and a secondary description of poss: cannabis 30gms or less, with and without arrests. Both plots, compared against total narcotics incidents, are shown.





















The method to display where certain incidents occur in Chicago is similar to my previous post, Chicago Crime. Please refer to that post for a more detailed explanation of how to create the plot below. The plot here is a geographical visualization of where arrests occurred relating to incidents of narcotics with possession of 30 grams or less of cannabis.


The method to tabulate the number of incidents within neighborhoods is also similar to my previous post, Chicago Crime. This calculation totals the number of incidents within each neighborhood boundary. It is the most costly in computational time and power because of the generality of the counting code, which allows it to use any boundaries (such as census tracts, police beats, etc.). The tabulation of arrests made relating to incidents of narcotics with possession of 30 grams or less of cannabis in each neighborhood is presented.


While doing some initial research for this post, I found "crime hotspots" to be a popular visualization option. A similar visualization is easily created in Mathematica with three simple lines of code.
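The Mathematica lines are included in the download at the end of this post; as an illustration of the same idea for Python users (a sketch, not the code used for the plot above), a smoothed density of incident coordinates can be drawn by binning the points on a grid and blurring the counts. The coordinate arrays here are placeholders - in practice they would be parsed from the CSV.

import numpy as np
import matplotlib.pyplot as plt
from scipy.ndimage import gaussian_filter

# placeholder incident coordinates; replace with longitudes/latitudes parsed from the CSV
lons = np.random.normal(-87.65, 0.08, 5000)
lats = np.random.normal(41.85, 0.10, 5000)

counts, xedges, yedges = np.histogram2d(lons, lats, bins=200)
density = gaussian_filter(counts, sigma=3)   # smooth the raw bin counts

plt.imshow(density.T, origin='lower', cmap='hot',
           extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]])
plt.colorbar(label='smoothed incident count')
plt.show()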













The Mathematica code used in this post can be downloaded here. Some fun analysis of crimes in Chicago will follow in future posts.