Linear Regression with IBM Netezza’s Analytics package

In the IBM Netezza Analytics suite exists a Linear Regression function that you can use (very easily) to identify intercept and slope between a dependent and any number of independent variables.  I’m no math nut, but the presentation of this is simple enough that just about anyone can digest it and implement.

The name of the function is LINEAR_REGRESSION and it exists in the NZA database.  To setup this example, create a table called crickets as shown below:

create table crickets (id integer, chirps integer, temp integer);

Next, populate the table with some sample data.

insert into crickets values ( 1, 41, 50 );
insert into crickets values ( 2,59, 55 );
insert into crickets values ( 3, 82, 60 );
insert into crickets values ( 4, 101, 65 );
insert into crickets values ( 5, 118, 70 );
insert into crickets values ( 6, 139, 75 );
insert into crickets values ( 7, 158, 80 );

First things first, let’s confirm a strong correlation exists between the chirps (per minute) and temperature.  Using the correlation function, this couldn’t be easier:

MYDATABASE(ADMIN)=> call nza..corr('intable=crickets,x=chirps,y=temp');
       CORR
------------------
 0.99956079739036

In other words, a very strong and clear correlation exists between the two.  Next, let’s run the linear regression function to get the slope and intercept — making it possible to predict either temp or chirps if one is known.

MYDATABASE(ADMIN)=> call nza..linear_regression('model=crickets,intable=crickets,id=id,target=chirps');                                                                 NOTICE:  Converting predictors from table crickets to Database Matrix Object...
NOTICE:  Converting predicted values from table crickets to Database Matrix Object...
NOTICE:  Performing linear regression...
NOTICE:  0 job(s) currently queued.
NOTICE:  Queueing job 498534729...
NOTICE:  Job started.
NOTICE:  Sending matrix CRICKETS_linearmodel1012764527intmodel...
NOTICE:  Sending matrix CRICKETS_linearmodel1012764527intmodel...
NOTICE:  Performing matrix multiplication...
NOTICE:  Finished multiplication after 0 seconds.

NOTICE:  Retrieving result...
NOTICE:  0 job(s) currently queued.
NOTICE:  Queueing job 573656007...
NOTICE:  Job started.
NOTICE:  Sending matrix CRICKETS_LINEARMODEL_predicted...
NOTICE:  Sending matrix CRICKETS_linearmodel1012764527intmodel...
NOTICE:  Performing matrix multiplication...
NOTICE:  Finished multiplication after 0 seconds.

NOTICE:  Retrieving result...
NOTICE:  0 job(s) currently queued.
NOTICE:  Queueing job 326060254...
NOTICE:  Job started.
NOTICE:  Sending matrix CRICKETS_linearmodel1012764527_xty...
NOTICE:  Sending matrix CRICKETS_linearmodel1012764527_xtx...
NOTICE:  Computing linear least squares solution...
NOTICE:  Finished solve after 4 seconds.

NOTICE:  Retrieving result...
NOTICE:  *** Number of estimated parameters: 2
NOTICE:  Model coefficients saved as table "MYDATABASE".."CRICKETS_LINEARMODEL".
 LINEAR_REGRESSION
-------------------
 t
(1 row)

So now what?  Well, a new table has been created called CRICKETS_LINEARMODEL and it contains both the slope and intercept — two values necessary to predict either chirps or temp when one is not known.  Looking at that table:

MYDATABASE(ADMIN)=> select var_name, value from crickets_linearmodel;l
  VAR_NAME   |      VALUE
-------------+-----------------
 TEMP        | 3.9071428571429
 (Intercept) |         -154.25
(2 rows)

So our formula becomes:

Chirps = 3.91(temp) -154.25

Let’s look at one of the known sets and see if this works out.  Since we have a table with the chirps, temp and another with slope and intercept, let’s see how close we are.

select      crickets.*,slope.value * temp + intercept.value pred_chirps, (chirps +(-1*intercept.value))/slope.value pred_temp from crickets, crickets_linearmodel slope, crickets_linearmodel intercept  where slope.var_name = ‘TEMP’ and intercept.var_name = ‘(Intercept)’ order by id;

The result:

 CHIRPS | TEMP  | ID |   PRED_CHIRPS   |    PRED_TEMP
--------+-------+----+-----------------+-----------------
  41.00 | 50.00 |  1 | 41.107142857145 | 49.972577696526
  59.00 | 55.00 |  2 | 60.642857142859 | 54.579524680073
  82.00 | 60.00 |  3 | 80.178571428574 | 60.466179159049
 101.00 | 65.00 |  4 | 99.714285714289 | 65.329067641681
 118.00 | 70.00 |  5 |          119.25 | 69.680073126142
 139.00 | 75.00 |  6 | 138.78571428572 | 75.054844606946
 158.00 | 80.00 |  7 | 158.32142857143 | 79.917733089579

Now that is not too shabby…..

Advertisements
This entry was posted in analytics, General and tagged , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s