Correlation Analysis with IBM’s PureData for Analytics (Netezza)

IBM PureData systems come with a lengthy list of in-database analytic functions that you can run without having to export the data.  One of the easiest functions to understand and implement is the correlation analysis.

Correlation is an association between two things.  In a statistical sense, it is the the level of dependence observed in two otherwise random variables.  Common applications of this type of analysis involve price and quantity sold, height and weight, salary and education, etc.  The value returned will fall between -1 and 1, with 0 representing two completely independent variables.  The closer to 1 or -1 the value is, the stronger the linear (either positive or negative) relationship between the two variables.

In this example, we’ll use the in-database function ‘corr’ to examine the relationship between MLB team wins and other variables (doubles, fielding percentage and home runs).  I’ve loaded my tables with data found in the Lahman database (found here: http://www.seanlahman.com/baseball-archive/statistics/).

First, let’s check home runs and wins:

MLB_STATS.ADMIN(ADMIN)=> call nza..corr('intable=teams,incolumn=wins;homeruns');
CORR
------------------
0.48312465334691
(1 row)

This shows a strong positive correlation between home runs and wins.  Now, let’s check fielding percentage.  Do defensive-minded teams fare better in the long run?

MLB_STATS.ADMIN(ADMIN)=> call nza..corr('intable=teams,incolumn=wins;fielding_pct');                                                                                          CORR
------------------
 0.64736582698989
(1 row)

This shows a significantly stronger positive relationship between fielding percentage and wins than home runs and wins.  Now, because these functions are all parameterized we can do all sorts of tests without having to really write any code.  For example, let’s say I wanted to test the hypothesis that defensive-minded teams fare better in the National League than in the American League.  I could create two tables and split the two leagues up and then run this function twice.  Or, I could simply add a new parameter to this function: by.

MLB_STATS.ADMIN(ADMIN)=> call nza..corr('intable=teams,incolumn=wins;fielding_pct,by=lgid,outtable=corr_lgid');
select * from co CORR
------
7
(1 row)

Note: not only did we add the column that we wanted the analysis to be done by, we had to also add an output table to be created.  In our case, there were 7 distinct league IDs detected (baseball has been around for a long time).  Anyway, on to the results.  Do defensive-minded teams fare better in the National League than in the American League?

CORRELATION    | LGID | MINY | MAXY
------------------+------+------+------
0.53818665342968 | FL   | 1914 | 1915
0.37506490953436 | UA   | 1884 | 1884
0.60313781153775 | AA   | 1882 | 1891
0.19473896983698 | PL   | 1890 | 1890
0.7064554926354 | NA   | 1871 | 1875
0.29908050110522 | AL   | 1901 | 2012
 0.58412825037398 | NL   | 1876 | 2012

With the exception of some older leagues that no longer exist, the National League teams with higher fielding percentages also tend to have higher wins.  Let’s revisit the home run analysis:

MLB_STATS.ADMIN(ADMIN)=> call nza..corr('intable=teams,incolumn=wins;homeruns,by=lgid,outtable=corr_lgid');                                                             CORR
------
7
(1 row)

And there results?

CORRELATION     | LGID | MINY | MAXY
--------------------+------+------+------
-0.012053699189733 | FL   | 1914 | 1915
0.6688325122043 | PL   | 1890 | 1890
0.93686282512589 | UA   | 1884 | 1884
  0.44078445166504 | NL   | 1876 | 2012
0.68876899447418 | NA   | 1871 | 1875
   0.37788414103078 | AL   | 1901 | 2012
0.65113045869159 | AA   | 1882 | 1891

Both the AL and NL show similar levels of dependence on home runs.  Each of these executions took several seconds to run and in a very short amount of time I was able to test a few different theories using data and statistical analysis.  Many other functions exist that we could use to build off of this one.  For example, assuming a strong linear relationship does exist (like with NL teams and fielding percentages), then we could build a predictive model using the linear regression function…. but not this time.

Advertisements
This entry was posted in analytics, Netezza, PureData for Analytics. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s