Pig workflow optimization: splitting data flows

Pig supports non-linear data flows, where a single input feeds multiple outputs.  Pig’s optimizer recognizes when the same input is referenced multiple times and implicitly splits the data flow.  You can also split it explicitly with the split operator, as shown below.  Personally, I prefer the explicit approach because it seems slightly easier to maintain.

An example of the optimizer implicitly splitting the flow is creating multiple Pig relations from the same input, each with a different condition in the filter operator.

state_info = load '/user/hduser/geography/*.csv' using PigStorage(',') as (stateID:chararray, population:int, timezone:chararray);
pst_states = filter state_info by timezone == 'PST';
mst_states = filter state_info by timezone == 'MST';
cst_states = filter state_info by timezone == 'CST';
est_states = filter state_info by timezone == 'EST';
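
The filters above only define relations; nothing runs until the results are stored or dumped. Here is a minimal sketch of writing two of them out from the same script. The output paths are hypothetical, and I am assuming the script runs in batch mode, where Pig's multi-query execution should plan the stores together and share the single load.

```pig
-- Load once; both filters below read from the same relation.
state_info = load '/user/hduser/geography/*.csv' using PigStorage(',') as (stateID:chararray, population:int, timezone:chararray);
pst_states = filter state_info by timezone == 'PST';
est_states = filter state_info by timezone == 'EST';

-- Hypothetical output paths; adjust to your own HDFS layout.
store pst_states into '/user/hduser/output/pst_states' using PigStorage(',');
store est_states into '/user/hduser/output/est_states' using PigStorage(',');
```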

The explicit approach is to use the split operator.  That would look like this:

state_info = load '/user/hduser/geography/*.csv' using PigStorage(',') as (stateID:chararray, population:int, timezone:chararray);
split state_info into
     mst_states if timezone == 'MST',
     pst_states if timezone == 'PST',
     cst_states if timezone == 'CST',
     est_states if timezone == 'EST';
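
One thing to watch with split: a row whose timezone matches none of the listed conditions does not land in any output relation. Here is a sketch of catching those rows with the otherwise clause, assuming Pig 0.10 or later. The relation name other_states is my own and purely illustrative.

```pig
state_info = load '/user/hduser/geography/*.csv' using PigStorage(',') as (stateID:chararray, population:int, timezone:chararray);
split state_info into
     mst_states if timezone == 'MST',
     pst_states if timezone == 'PST',
     cst_states if timezone == 'CST',
     est_states if timezone == 'EST',
     -- other_states is a hypothetical name; it receives rows that matched none of the conditions above
     other_states otherwise;
```

Checking whether other_states is empty is a cheap way to spot unexpected timezone values in the input.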

 

 
