Passing parameters to Pig scripts

The Pig scripting language, Pig Latin, supports parameter substitution at run-time.  As with any scripting language, the ability to define parameters makes it far easier to share code with other users.  To parameterize a Pig Latin script, reference each parameter with a leading $, as shown below:

ALL_PLAYER_STATS = load '/user/hduser/baseball/batting/*.csv' using PigStorage (',') as ( playerID:chararray, teamID:chararray, yearID:int);
FILTERED_TEAM = filter ALL_PLAYER_STATS by teamID == '$TEAMID';
dump FILTERED_TEAM;

Then, when you want to execute your script you specify the value like this:

pig -x mapreduce -p TEAMID=BOS batting.pig

Failure to specify the parameter at run-time will throw the following error:

ERROR org.apache.pig.Main - ERROR 2999: Unexpected internal error. Undefined parameter : TEAMID

You can specify as many parameters as you like with the -p option.  For example:

ALL_PLAYER_STATS = load '/user/hduser/baseball/batting/*.csv' using PigStorage (',') as
    ( playerID:chararray, teamID:chararray, yearID:int);
FILTERED_TEAM = filter ALL_PLAYER_STATS by teamID == '$TEAMID' and yearID >= $MIN_YEAR;
dump FILTERED_TEAM;

Notice that because yearID is defined as an int in the schema, I dropped the single quotes around $MIN_YEAR.  Leaving the quotes in place causes Pig to treat the value as a chararray, which will not match the type defined in the schema.  The error you'll see is:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1039: 
<file test.pig, line 2, column 50> In alias FILTERED_TEAM, incompatible 
types in GreaterThanEqual Operator left hand side:int right hand side:chararray

To execute the multi-parameter script:

pig -x mapreduce -p TEAMID=NYY -p MIN_YEAR=2001 batting.pig
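
If you find yourself passing the same values over and over, Pig can also read parameters from a file with the -param_file option.  A minimal sketch, assuming a hypothetical file named batting.params containing one KEY=value pair per line:

TEAMID=NYY
MIN_YEAR=2001

pig -x mapreduce -param_file batting.params batting.pig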


Could not infer the matching function for org.apache.pig.builtin.SUM (or any function for that matter)

Pig – the language – may be like Pig – the animal – when it comes to ingesting data (not very picky), but syntax certainly does matter.  I learned this tonight while experimenting with Pig.  My script was pretty simple:

1. Load some data

2. Filter that data

3. Group that data

4. Aggregate that data

5. Sort that data

6. Limit the data to the top 10

7. Dump the data

What I didn't realize is that the relation you reference inside an aggregate function matters a great deal when it comes to inferring the appropriate data type.  For example, this will fail with the error found in the post heading:

mydata = load '/user/hduser/foo.dat' using PigStorage(',') as (person:chararray,val:int);
mydata_filtered = filter mydata by val > 1;
group_mydata = group mydata_filtered by ( person );
sumval = foreach group_mydata generate group, SUM(mydata.val);
dump sumval;

The reason is that the relation used to populate group_mydata is mydata_filtered, not mydata, so mydata.val is not available inside the foreach.  This was a pretty sloppy mistake that I made when adding additional logic to filter the data AFTER the original script was written.  So if you get this error:

Could not infer the matching function for org.apache.pig.builtin.SUM

Confirm you are aggregating the column from the relation you actually grouped, and not one created earlier in the script (even if it is the parent of the grouped relation).
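
For reference, here is the corrected version of the script above; the only change is that SUM now references mydata_filtered, the relation that was actually grouped:

mydata = load '/user/hduser/foo.dat' using PigStorage(',') as (person:chararray, val:int);
mydata_filtered = filter mydata by val > 1;
group_mydata = group mydata_filtered by ( person );
-- reference the grouped relation so SUM sees a bag of ints and can resolve correctly
sumval = foreach group_mydata generate group, SUM(mydata_filtered.val);
dump sumval;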


Configuring pig to work with a remote Hadoop cluster

1. First, download a stable release of Pig from the Apache Pig website (pig.apache.org).

2. As root (or some other privileged user), untar the pig tarball to /usr/local; this will create a sub-directory like /usr/local/pig-0.11.1.

3. Create a symbolic link (to make things easier)

ln -s /usr/local/pig-0.11.1 /usr/local/pig

4. Update your .bashrc or .profile to include:

export PIG_HOME=/usr/local/pig
export PATH=$PATH:$PIG_HOME/bin

5.  Contact your Hadoop administrator (or do this yourself if you have access to the cluster) and create a tarball containing the necessary client configuration files:

cd $HADOOP_HOME/conf
tar -czvf client.tar.z core-site.xml hadoop-env.sh hdfs-site.xml log4j.properties mapred-site.xml ssl-client.xml.example

6.  Now, create a new directory (either in your home directory, or somewhere else if others will also need access to it) and extract the tarball into it:

mkdir hadoop.conf
cd hadoop.conf
tar -zxvf ../client.tar.z

7.  Now update your .profile or .bashrc to include this line:

export HADOOP_CONF_DIR=$HOME/hadoop.conf

8.  If it isn’t already, export JAVA_HOME in your .profile or .bashrc:

export JAVA_HOME=/usr/local/jdk1.7.0_17

9.  Run pig in interactive mode (using the mapreduce execution engine):

pig -x mapreduce
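
If the client configuration is being picked up correctly, the Grunt shell should be able to list files on the remote HDFS.  A quick sanity check (the path below is just an example; use a directory that exists on your cluster):

grunt> fs -ls /user/hduser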

10.  Test it all out with an actual pig script.  Copy and paste the following into wordcount.pig:

documents = LOAD '/user/hduser/foo_input/*.txt' as (line:chararray);
words = foreach documents generate flatten(TOKENIZE(line)) as word;
grpd = group words by word;
cntd = foreach grpd generate group, COUNT(words);
dump cntd;

Change the input directory to a location on your HDFS that actually contains some text files.

Run it:

pig -x mapreduce wordcount.pig

If everything is set up correctly, you'll get a listing of words encountered and the number of times each was encountered.


Correlation Analysis with IBM’s PureData for Analytics (Netezza)

IBM PureData systems come with a lengthy list of in-database analytic functions that you can run without having to export the data.  One of the easiest to understand and use is correlation analysis.

Correlation is an association between two things.  In a statistical sense, it is the level of dependence observed between two otherwise random variables.  Common applications of this type of analysis involve price and quantity sold, height and weight, salary and education, etc.  The value returned will fall between -1 and 1, with 0 indicating no linear relationship between the variables.  The closer the value is to 1 or -1, the stronger the linear (positive or negative) relationship between the two variables.
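
For reference, the statistic computed here is (presumably) the Pearson correlation coefficient: the covariance of the two variables scaled by the product of their standard deviations:

r = cov(X, Y) / ( stddev(X) * stddev(Y) )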

In this example, we'll use the in-database function 'corr' to examine the relationship between MLB team wins and other variables (doubles, fielding percentage and home runs).  I've loaded my tables with data from the Lahman baseball database (http://www.seanlahman.com/baseball-archive/statistics/).

First, let’s check home runs and wins:

MLB_STATS.ADMIN(ADMIN)=> call nza..corr('intable=teams,incolumn=wins;homeruns');
CORR
------------------
0.48312465334691
(1 row)

This shows a moderately strong positive correlation between home runs and wins.  Now, let's check fielding percentage.  Do defensive-minded teams fare better in the long run?

MLB_STATS.ADMIN(ADMIN)=> call nza..corr('intable=teams,incolumn=wins;fielding_pct');
CORR
------------------
0.64736582698989
(1 row)

This shows a noticeably stronger positive relationship between fielding percentage and wins than between home runs and wins.  Now, because these functions are all parameterized, we can run all sorts of tests without really having to write any code.  For example, let's say I wanted to test the hypothesis that defensive-minded teams fare better in the National League than in the American League.  I could create two tables, split the two leagues up, and then run this function twice.  Or, I could simply add a new parameter to the function: by.

MLB_STATS.ADMIN(ADMIN)=> call nza..corr('intable=teams,incolumn=wins;fielding_pct,by=lgid,outtable=corr_lgid');
CORR
------
7
(1 row)

Note: not only did we add the column we want the analysis grouped by, we also had to name an output table to hold the results.  In our case, there were 7 distinct league IDs detected (baseball has been around for a long time).  Anyway, on to the results, which we get by selecting from the output table.  Do defensive-minded teams fare better in the National League than in the American League?

MLB_STATS.ADMIN(ADMIN)=> select * from corr_lgid;
   CORRELATION    | LGID | MINY | MAXY
------------------+------+------+------
 0.53818665342968 | FL   | 1914 | 1915
 0.37506490953436 | UA   | 1884 | 1884
 0.60313781153775 | AA   | 1882 | 1891
 0.19473896983698 | PL   | 1890 | 1890
 0.7064554926354  | NA   | 1871 | 1875
 0.29908050110522 | AL   | 1901 | 2012
 0.58412825037398 | NL   | 1876 | 2012

Setting aside the older leagues that no longer exist, National League teams with higher fielding percentages do tend to have more wins (0.58), and the relationship is noticeably stronger than in the American League (0.30).  Let's revisit the home run analysis:

MLB_STATS.ADMIN(ADMIN)=> call nza..corr('intable=teams,incolumn=wins;homeruns,by=lgid,outtable=corr_lgid');
CORR
------
7
(1 row)

And the results?

    CORRELATION     | LGID | MINY | MAXY
--------------------+------+------+------
 -0.012053699189733 | FL   | 1914 | 1915
  0.6688325122043   | PL   | 1890 | 1890
  0.93686282512589  | UA   | 1884 | 1884
  0.44078445166504  | NL   | 1876 | 2012
  0.68876899447418  | NA   | 1871 | 1875
  0.37788414103078  | AL   | 1901 | 2012
  0.65113045869159  | AA   | 1882 | 1891

Both the AL and NL show similar levels of dependence on home runs.  Each of these executions took only a few seconds to run, so in a very short amount of time I was able to test a few different theories using data and statistical analysis.  Many other functions exist that we could use to build off of this one.  For example, assuming a strong linear relationship does exist (as with NL teams and fielding percentage), we could build a predictive model using the linear regression function... but not this time.


Getting to a blade's console from IBM's IMM (and then getting out!)

Because I don’t do this often, I always forget how to get out of console mode via IBM’s IMM (integrated management module).  I should add that there is a way to do this via the web-based GUI but I’m a CLI guy.

To connect to the blade, you first need to connect to the IMM:

ssh USERID@imm01

Enter your password when prompted.  At the prompt, you now need to change the 'environment' you are controlling.  You do this with the env command and the -T (target) switch.

system> env -T blade[1]

OK

Your command prompt now reflects the new target.

system:blade[1]>

Here, we can enter the console mode by simply typing

system:blade[1]> console -l

This will bring up the live running console on that system (assuming it is running).  Now, to exit out of console mode you have to enter this sequence of characters:

CTRL+[

SHIFT+(

So that's hold down CTRL and press the left square bracket; release, then hold down SHIFT and press the open parenthesis (SHIFT+9 on a US keyboard).  More often than not, I forget this sequence and have to Google it.


Accessing Facebook Graph search from Jaql

To use Facebook's Graph search, you first need to register as a developer and then create an app on the Facebook developer site.  Once you have done this, you need to identify two values for your app: the App ID and the App Secret.  You'll find this information on your application's basic information page.

Now, to search the Facebook graph from Jaql:

term = "Facebook+graph";
url = "https://graph.facebook.com/search?q=" + term + "&access_token=MY_APP_ID%7CMY_APP_SECRET";
results = read(http(url));

Replace MY_APP_ID with your application’s ID and MY_APP_SECRET with your application’s secret string.

Please note: the %7C between MY_APP_ID and MY_APP_SECRET is required; it is the URL-encoded pipe character.  A literal pipe is not a legal character in the query string, and your Jaql script will fail as a result.  The error will be:

java.net.URISyntaxException: Illegal character in query at index ##

You'll now be able to work with the JSON data produced by the query.


Setting custom properties for Hive databases

Hive supports the concept of a database as a logical collection of objects stored in separate catalogs or namespaces. One neat thing that you can do with Hive is add extended properties to your database that are displayed when describing a database.

For example, create a database:

hive> create database if not exists mydatabase;
OK
Time taken: 0.095 seconds

Next, describe the database and compare the output to the extended description of the database:

hive> describe database mydatabase;
OK
mydatabase hdfs://bivm:9000/biginsights/hive/warehouse/mydatabase.db
Time taken: 0.17 seconds

hive> describe database extended mydatabase;
OK
mydatabase hdfs://bivm:9000/biginsights/hive/warehouse/mydatabase.db
Time taken: 0.119 seconds

Now, if you want to add additional properties that will be displayed in the extended output, you do that by adding a dbproperties clause to your create database command.  Note that because the database already exists, create database if not exists will leave it untouched, so drop it first (or use a new name) before re-creating it with properties:

hive> create database if not exists mydatabase
with dbproperties('Experiment Name' = 'Correlation age/sentiment',
'date' = '2013-07-11',
'Lead Developer' = 'John Foo',
'Lead Developer Email' = 'jfoo@somewhere.com');
OK

Now when you display extended properties you will see the new information added:

hive> describe database extended mydatabase;
OK
mydatabase hdfs://bivm:9000/biginsights/hive/warehouse/mydatabase.db {Lead Developer Email=jfoo@somewhere.com, Lead Developer=John Foo, Experiment Name=Correlation age/sentiment, date=2013-07-11}
Time taken: 0.092 seconds

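If you'd rather not drop and re-create the database, Hive can also modify properties on an existing database with alter database ... set dbproperties (availability depends on your Hive version).  A minimal sketch, using a hypothetical new property value:

hive> alter database mydatabase set dbproperties('Lead Developer' = 'Jane Foo');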
