Using R to improve your fantasy football team

So I’ve started playing around with R and this week decided to see if I could more intelligently add a player to my team from the ranks of free agency.  The position I needed to fill?  The kicker.

The first thing I did was to grab the YTD data for kickers and store it in a space-delimited text file called kickers in the /tmp directory:

sgost 12 6 14 14 7 14 10 10 13 
nfolk 12 4 10 9 13 6 13 13 16
mprat 7 13 17 12 18 5 10 9 4
mcros 4 8 13 20 15 6 15 8 5
dbail 13 14 6 2 12 5 17 11 5
shaus 7 9 9 13 16 13 2 9 19
rsucc 4 5 13 9 15 6 5 12 12
avina 3 7 10 14 10 13 16 6 0
dcarp 3 12 13 11  8 11 4 7 4
nnova 4 18 5 13 4 15 6 6 8
ahenr 8 12 3 7 20 2 1 7 9
gcano 1 11 10 8 5 14 7 12 5
mnuge 3 9 2 7 9 10 11 10 4
pdaws 9 3 1 3 10 13 8 6 12
cstur 13 8 12 5 11 2 5 8 8

Next, I opened R.  I am running this on Linux – so I simply typed ‘R’ to launch the client.

First things first, we need to load the data into a data frame.

kickers <- read.table('/tmp/kickers', col.names=c('kicker','week1','week2','week3','week4','week5','week6','week7','week8','week9'))

This produces a data frame with the aforementioned column names.  But this doesn’t really give us any idea as to how consistent a kicker is, or how their averages may have been skewed by weeks where they significantly outperformed their typical game.  I decided to use a boxplot to visualize the data for this purpose.

To do that, I needed to transpose the data so that each kicker had their own column with YTD results.  R has the ‘t’ function for this very purpose.

tkicker = t(kickers[,-1])

This creates a matrix with all of the data values, but excludes the first column, which contained the kickers’ names; we have plans for that:

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
week1   12   12    7    4   13    7    4    3    3     4     8     1     3
week2    6    4   13    8   14    9    5    7   12    18    12    11     9
week3   14   10   17   13    6    9   13   10   13     5     3    10     2
week4   14    9   12   20    2   13    9   14   11    13     7     8     7
week5    7   13   18   15   12   16   15   10    8     4    20     5     9
week6   14    6    5    6    5   13    6   13   11    15     2    14    10
week7   10   13   10   15   17    2    5   16    4     6     1     7    11
week8   10   13    9    8   11    9   12    6    7     6     7    12    10
week9   13   16    4    5    5   19   12    0    4     8     9     5     4
      [,14] [,15]
week1     9    13
week2     3     8
week3     1    12
week4     3     5
week5    10    11
week6    13     2
week7     8     5
week8     6     8
week9    12     8

Next, let’s add the kickers’ names as column headers:

colnames(tkicker) <- kickers$kicker

Now we have something we can boxplot:

boxplot(tkicker)

This produces the following graphic:

[Boxplot of weekly scores for each kicker]

Good – but it doesn’t stand out.  Let’s add some color to this graphic:

boxplot(tkicker, col=colors())

[The same boxplot, with a different fill color for each kicker]

From this, we can see Stephen Gostkowski is by far the most consistent kicker, with a narrow IQR and a median higher than most.  Unfortunately, if your league is like my league, he isn’t available.  From the plot, however, I could see Nick Folk was a good choice.  His median is also quite high, though he has some outlying performances that brought that value up.  His IQR was still relatively narrow, and he did have some poor performances against the Steelers and Patriots.  I picked him up based on this visual; I’ll let you know how I made out next week.
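
If you want numbers to back up the visual, here is a minimal sketch (reusing the tkicker matrix built above) that computes each kicker’s median and IQR and sorts by median:

# Summarize each kicker's weekly scores: median and IQR per column of tkicker
kicker_summary <- data.frame(
  median = apply(tkicker, 2, median),
  iqr    = apply(tkicker, 2, IQR)
)

# Highest median first; ties broken by the narrower (more consistent) IQR
kicker_summary[order(-kicker_summary$median, kicker_summary$iqr), ]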


“Error occurred while loading translation library” when connecting R to IBM Netezza

When connecting my R-2.15 client to IBM Netezza v7 (NZA 2.5.4) for the first time, I got the error above.  Here was the connect call:

>nzConnectDSN("NZSQL")
Error in odbcDriverConnect("DSN=VirtualNZ") : 
  (converted from warning) [RODBC] ERROR: state HY000, code 45, message Error occurred while loading translation library

Check to make sure you are running the 64-bit version of the R client.  Once I switched to that version, the error went away.  Installing the 32-bit Netezza ODBC driver might also resolve this; in my case, I wasn’t interested in getting the 32-bit client working.
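
If you’re not sure which build of R you launched, you can check from inside R itself before connecting:

R.version$arch            # "x86_64" on a 64-bit build
.Machine$sizeof.pointer   # 8 on 64-bit, 4 on 32-bit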


Securing (and sharing) password information in Sqoop jobs

Sqoop is a utility that allows you to move data from a relational database system to an HDFS file system (or export from Hadoop to RDBMS!).  One of the things to keep in mind as you start building Sqoop jobs is that the password information shouldn’t be passed via the command line.

Sqoop has a couple of ways to secure this information, one of which is putting the sensitive parameters in an options file that you pass to Sqoop at runtime.  For example:

1. Create a file (pg.parms in this example) containing the connection string information in your UNIX/Linux home directory:

--connect jdbc:postgresql://mypostgres.server.com:5432/mydatabase
--username hduser
--password 'password'

2.  Secure that file by changing the permissions to owner read-only

chmod 400 pg.parms

3.  Modify the appropriate Sqoop jobs to use this file

sqoop import --table mytable --options-file pg.parms

Another way to secure this information is with a password file stored on the HDFS file system itself; I’ll write that one up in detail next, but the rough shape is sketched below.
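
As a preview of that approach (this assumes a Sqoop release new enough to support the --password-file option, and the paths here are just examples):

# Put the password into a locked-down file on HDFS; echo -n avoids a trailing newline
echo -n 'password' > .pg.password
hadoop dfs -put .pg.password /user/hduser/.pg.password
hadoop dfs -chmod 400 /user/hduser/.pg.password
rm .pg.password

# Reference the HDFS file instead of --password (drop --password from pg.parms if you combine the two)
sqoop import --table mytable --options-file pg.parms --password-file /user/hduser/.pg.password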


Pig workflow optimization: splitting data flows

Pig supports the concept of non-linear data flows, where you have a single input but multiple outputs.  Pig’s optimizer is smart enough to recognize when the same input is referenced multiple times and implicitly splits those data flows.  You can also do this explicitly with the split operator, as shown below.  Personally, I prefer the explicit approach because it seems slightly easier to maintain.

An example of the optimizer implicitly splitting the flow is creating multiple Pig relations from the same input using different criteria and the filter function.

state_info = load '/user/hduser/geography/*.csv' using PigStorage(',') as (stateID:chararray, population:int, timezone:chararray);
pst_states = filter state_info by timezone == 'PST';
mst_states = filter state_info by timezone == 'MST';
cst_states = filter state_info by timezone == 'CST';
est_states = filter state_info by timezone == 'EST';

The explicit approach is to use the split operator.  That would look like this:

state_info = load '/user/hduser/geography/*.csv' using PigStorage(',') as (stateID:chararray, population:int, timezone:chararray);
split state_info into
     mst_states if timezone == 'MST',
     pst_states if timezone == 'PST',
     cst_states if timezone == 'CST',
     est_states if timezone == 'EST';
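
Either way, keep in mind that Pig evaluates lazily; none of these relations produce any output until they are stored or dumped.  A quick sketch (the output paths here are just examples):

store pst_states into '/user/hduser/geography_by_tz/pst' using PigStorage(',');
store mst_states into '/user/hduser/geography_by_tz/mst' using PigStorage(',');
store cst_states into '/user/hduser/geography_by_tz/cst' using PigStorage(',');
store est_states into '/user/hduser/geography_by_tz/est' using PigStorage(',');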



To copy or move: Implications of loading Hive managed table from HDFS versus local filesystem

When using the LOAD DATA statement to populate a Hive table, it’s important to understand what Hive does with the actual data files, and how that differs depending on whether the input data resides on your local file system or on the HDFS file system.

For example, to load data from your local home directory into a Hive table:

hive> LOAD DATA LOCAL INPATH '/home/username1/weather/input' INTO TABLE weather_data;

You’ll see output messages like:

Copying data from file:/home/hduser/weather_data/input
Copying file: file:/home/hduser/weather_data/input/weather.16.csv
Copying file: file:/home/hduser/weather_data/input/weather.86.csv
...
...
Copying file: file:/home/hduser/weather_data/input/weather.52.csv
Copying file: file:/home/hduser/weather_data/input/weather.37.csv
Loading data to table default.weather_data

Under the covers, Hive will actually copy the files found in /home/username1/weather/input into the HDFS directory associated with the table weather_data (e.g. /user/hive/warehouse/weather_data/). If you want to see what that directory is, run the following Hive command:

hive> describe extended weather_data;

Look for the ‘location’ value.

If that data was already on the HDFS file system, however, Hive would employ a move and not a copy.  For example:

hduser@hadoop1:/home/hduser/$ hadoop dfs -ls /user/hduser/weather_data/ | wc -l
101
hive> load data inpath '/user/hduser/weather_data/' into table weather_data;

Now, let’s run the same dfs -ls | wc -l check again:

hduser@hadoop1:~/weather_data$ hadoop dfs -ls weather_data | wc -l
0

As you can see, the files were physically moved from /user/hduser/weather_data into the location associated with the Hive table.
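
You can confirm this by listing the table’s warehouse directory (the ‘location’ value from describe extended; /user/hive/warehouse/weather_data/ is the default mentioned earlier), where the 100 files should now appear:

hadoop dfs -ls /user/hive/warehouse/weather_data/ | wc -l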


Hive’s collection data types

Hive offers several collection data types: struct, map and array. These data types don’t necessarily make a lot of sense if you are moving data from the well-structured world of the RDBMS, but if you are working directly with application-generated data, or data that is less structured, they can be a great capability to have in your arsenal.

struct, as in most programming languages, allows you to define a structure with named fields and data types. For example, a column could be called address and be declared as:

address struct<street:string, city:string, state:string, zipcode:int>

When referring to these fields, you reference them like address.street.

map is a little less structured; instead of predefining the sub-attributes of the column, you declare just a data type for the key and a data type for the value. For example, an acceptable map could be:

preferences map<string, string>

This gives you the flexibility to store whatever keys you want, so long as each key is the right data type and each value matches its declared type as well. You then retrieve values by key:

select preferences["email_offers"] from dim_customer;

Finally, array allows you to store any number of values of the same data type, and, functionally speaking, of the same type of business object too.  In other words, you wouldn’t use an array unless every element represented the same kind of information using the same data type.  An example where an array could be used:

household_ages array<smallint>

You can put all this together into a single example to see how one might use these types, again assuming the data already arrives in this kind of structure.  You probably wouldn’t convert existing structured data into this format.

create table dim_customer
 (
     customer_id       bigint,
     customer_name     struct<fname:string, lname:string>,
     customer_addr     struct<street:string, city:string, state:string, zip:int>,
     household_ages    array<smallint>,
     email_prefs       map<string, boolean>
 )
 row format delimited
 fields terminated by '|'               -- This is how each field is separated
 collection items terminated by ','     -- This is how values in the struct, map and array are separated
 map keys terminated by ':'             -- This is how keys in the map data type are separated from their values
 lines terminated by '\n' stored as textfile;

Your input data – using the delimiters above – would then look like this:

12345|John,Smith|123 Main St,New York,NY,00000|45,40,17,13|weekly_update:true,special_clearance:true,birthday_greeting:false

And could be loaded with:

load data local inpath '/tmp/dim_customer.dat' overwrite into table dim_customer;
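
Once loaded, each collection type has its own access syntax; for example:

select customer_name.fname,           -- struct field access
       customer_addr.city,
       household_ages[0],             -- arrays are indexed from zero
       size(household_ages),          -- number of elements in the array
       email_prefs['weekly_update']   -- map lookup by key
from   dim_customer;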

Passing parameters to Hive scripts

Like Pig and other scripting languages, Hive provides the ability to create parameterized scripts, greatly increasing their reusability.  To take advantage of this, write your Hive scripts like this:

select yearid, sum(HR)
from   batting_stats
where  teamid = '${hiveconf:TEAMID}' 
group  by yearid
order  by yearid desc;

Note that the restriction on teamid is ‘${hiveconf:TEAMID}’ rather than an actual value.  This is an instruction to read this variable’s value from the hiveconf namespace.  When you execute the script, you’ll run it as shown below:

hive -f batting.hive -hiveconf TEAMID='LAA'

If you define the parameter in the script but fail to specify a value at run-time, you won’t get an error like you would with Pig.  Instead, the restriction effectively becomes "where teamid = ''".  If you have blank teamid values then you might get a result back; if not, you’ll go through all the mechanics of executing the script without any results to show for it.
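
The same mechanism works with as many variables as you like; each one just needs its own -hiveconf argument at run-time.  As a sketch, with a second (hypothetical) MIN_YEAR parameter added to the where clause:

where  teamid = '${hiveconf:TEAMID}'
and    yearid >= ${hiveconf:MIN_YEAR}

hive -f batting.hive -hiveconf TEAMID='LAA' -hiveconf MIN_YEAR=2000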
