IBM PureData systems come with a lengthy list of in-database analytic functions that you can run without having to export the data. One of the easiest functions to understand and implement is the correlation analysis.
Correlation is an association between two things. In a statistical sense, it is the the level of dependence observed in two otherwise random variables. Common applications of this type of analysis involve price and quantity sold, height and weight, salary and education, etc. The value returned will fall between -1 and 1, with 0 representing two completely independent variables. The closer to 1 or -1 the value is, the stronger the linear (either positive or negative) relationship between the two variables.
In this example, we’ll use the in-database function ‘corr’ to examine the relationship between MLB team wins and other variables (doubles, fielding percentage and home runs). I’ve loaded my tables with data found in the Lahman database (found here: http://www.seanlahman.com/baseball-archive/statistics/).
First, let’s check home runs and wins:
MLB_STATS.ADMIN(ADMIN)=> call nza..corr('intable=teams,incolumn=wins;homeruns');
This shows a strong positive correlation between home runs and wins. Now, let’s check fielding percentage. Do defensive-minded teams fare better in the long run?
MLB_STATS.ADMIN(ADMIN)=> call nza..corr('intable=teams,incolumn=wins;fielding_pct'); CORR
This shows a significantly stronger positive relationship between fielding percentage and wins than home runs and wins. Now, because these functions are all parameterized we can do all sorts of tests without having to really write any code. For example, let’s say I wanted to test the hypothesis that defensive-minded teams fare better in the National League than in the American League. I could create two tables and split the two leagues up and then run this function twice. Or, I could simply add a new parameter to this function: by.
MLB_STATS.ADMIN(ADMIN)=> call nza..corr('intable=teams,incolumn=wins;fielding_pct,by=lgid,outtable=corr_lgid');
select * from co CORR
Note: not only did we add the column that we wanted the analysis to be done by, we had to also add an output table to be created. In our case, there were 7 distinct league IDs detected (baseball has been around for a long time). Anyway, on to the results. Do defensive-minded teams fare better in the National League than in the American League?
CORRELATION | LGID | MINY | MAXY
0.53818665342968 | FL | 1914 | 1915
0.37506490953436 | UA | 1884 | 1884
0.60313781153775 | AA | 1882 | 1891
0.19473896983698 | PL | 1890 | 1890
0.7064554926354 | NA | 1871 | 1875
0.29908050110522 | AL | 1901 | 2012
0.58412825037398 | NL | 1876 | 2012
With the exception of some older leagues that no longer exist, the National League teams with higher fielding percentages also tend to have higher wins. Let’s revisit the home run analysis:
MLB_STATS.ADMIN(ADMIN)=> call nza..corr('intable=teams,incolumn=wins;homeruns,by=lgid,outtable=corr_lgid'); CORR
And there results?
CORRELATION | LGID | MINY | MAXY
-0.012053699189733 | FL | 1914 | 1915
0.6688325122043 | PL | 1890 | 1890
0.93686282512589 | UA | 1884 | 1884
0.44078445166504 | NL | 1876 | 2012
0.68876899447418 | NA | 1871 | 1875
0.37788414103078 | AL | 1901 | 2012
0.65113045869159 | AA | 1882 | 1891
Both the AL and NL show similar levels of dependence on home runs. Each of these executions took several seconds to run and in a very short amount of time I was able to test a few different theories using data and statistical analysis. Many other functions exist that we could use to build off of this one. For example, assuming a strong linear relationship does exist (like with NL teams and fielding percentages), then we could build a predictive model using the linear regression function…. but not this time.