Category Archives: hadoop

Securing (and sharing) password information in Sqoop jobs

Sqoop is a utility that lets you move data from a relational database into HDFS (or export data from Hadoop back to an RDBMS!). One of the things to keep in mind as you start building Sqoop jobs … Continue reading
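The full post isn't shown here, but one way to keep the password off the command line and out of your shell history is Sqoop's --password-file option. The sketch below is a minimal, illustrative example; the JDBC URL, database, table, and paths are all placeholders.

```bash
# Minimal sketch (placeholder host, database, table, and paths).
# Keep the password in a file only you can read, then point Sqoop at it
# with --password-file instead of typing --password on the command line.
echo -n "SuperSecret" > /home/me/.dbpass
chmod 400 /home/me/.dbpass
hdfs dfs -put /home/me/.dbpass /user/me/.dbpass   # Sqoop reads the file from HDFS

sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user \
  --password-file /user/me/.dbpass \
  --table orders \
  --target-dir /user/me/orders
```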


To copy or move: Implications of loading Hive managed table from HDFS versus local filesystem

When using the LOAD DATA statement to populate a Hive table, it’s important to understand what Hive does with the actual data files depending on whether the input data resides on your local file system or on HDFS. For example, … Continue reading
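The short version, sketched below with placeholder paths and a hypothetical batting_stats table: loading from the local filesystem copies the file into the managed table's warehouse directory, while loading from an HDFS path moves it there.

```sql
-- Loading from the local filesystem: Hive COPIES the file into the
-- table's warehouse directory, so the original stays where it was.
LOAD DATA LOCAL INPATH '/tmp/batting.csv' INTO TABLE batting_stats;

-- Loading from HDFS: Hive MOVES the file into the warehouse directory,
-- so it disappears from its original HDFS location afterwards.
LOAD DATA INPATH '/user/me/staging/batting.csv' INTO TABLE batting_stats;
```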


Hive’s collection data types

Hive offers several collection data types: struct, map, and array. These data types don’t necessarily make a lot of sense if you are moving data from the well-structured world of the RDBMS, but if you are working directly with … Continue reading
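As a quick illustration (a hypothetical employees table, not taken from the post), here is how the three collection types are declared and queried:

```sql
-- Hypothetical table showing all three collection types.
CREATE TABLE employees (
  name    STRING,
  skills  ARRAY<STRING>,                                  -- ordered list, indexed from 0
  phones  MAP<STRING, STRING>,                            -- key/value pairs
  address STRUCT<street:STRING, city:STRING, zip:STRING>  -- fixed set of named fields
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY '|'
  MAP KEYS TERMINATED BY ':';

-- Element access uses [index], [key], and dot notation respectively.
SELECT skills[0], phones['home'], address.city FROM employees;
```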


Passing parameters to Hive scripts

Like Pig and other scripting languages, Hive lets you create parameterized scripts, greatly increasing their re-usability. To take advantage of this, write your Hive scripts like this: select yearid, sum(HR) from batting_stats where teamid … Continue reading
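The excerpt cuts off, but the general pattern is to reference a variable such as ${hiveconf:team} in the script and supply its value on the command line; the file name and team value below are illustrative.

```sql
-- stats.hql (illustrative name); run it with, for example:
--   hive -hiveconf team=BOS -f stats.hql
SELECT yearid, SUM(HR)
FROM   batting_stats
WHERE  teamid = '${hiveconf:team}'
GROUP  BY yearid;
```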


Passing parameters to Pig scripts

The Pig scripting language, Pig Latin, allows for parameter substitution at run time. As with any script, the ability to define parameters makes it far easier to share code with other users. To do this in Pig Latin, you simply modify your … Continue reading
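A minimal sketch of the idea (file and field names are placeholders): reference the parameter as $year inside the script and pass its value with -param when you run Pig.

```pig
-- stats.pig (illustrative); run with:
--   pig -param year=2010 stats.pig
batting  = LOAD 'batting.csv' USING PigStorage(',')
           AS (playerid:chararray, yearid:int, teamid:chararray, hr:int);
filtered = FILTER batting BY yearid == $year;
DUMP filtered;
```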


Could not infer the matching function for org.apache.pig.builtin.SUM (or any function for that matter)

Pig, the language, may be like pig, the animal, when it comes to ingesting data (not very picky), but syntax certainly does matter. I learned this tonight while experimenting with Pig. My script was pretty simple: … Continue reading
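The excerpt doesn't show the fix, but one common trigger for this error is asking SUM to operate on a column Pig has typed as a chararray, or referencing the column without projecting it through its bag after a GROUP. A hedged sketch of a version that runs cleanly (file and field names are placeholders):

```pig
-- If hr were declared as chararray, SUM(batting.hr) can fail with
-- "Could not infer the matching function for org.apache.pig.builtin.SUM".
-- Declaring a numeric type and projecting the field through the bag avoids it.
batting = LOAD 'batting.csv' USING PigStorage(',')
          AS (playerid:chararray, yearid:int, hr:int);
by_year = GROUP batting BY yearid;
totals  = FOREACH by_year GENERATE group AS yearid, SUM(batting.hr) AS total_hr;
DUMP totals;
```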


Configuring pig to work with a remote Hadoop cluster

1. First, download a stable release of Pig from here. 2. As root (or some other privileged user), untar the pig tarball to /usr/local; this will create a sub-directory like /usr/local/pig.0.11.1. 3. Create a symbolic link (to make things easier) … Continue reading
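The steps continue past the excerpt; a rough sketch of the overall sequence (version number and config path are illustrative) looks like the following, the key point being that Pig finds the remote cluster through the Hadoop client configuration on its classpath.

```bash
# Illustrative version number and paths.
tar -xzf pig-0.11.1.tar.gz -C /usr/local          # as root or via sudo
ln -s /usr/local/pig-0.11.1 /usr/local/pig        # symbolic link to keep paths simple

export PATH=$PATH:/usr/local/pig/bin
# Put the remote cluster's client config (core-site.xml, hdfs-site.xml,
# mapred-site.xml copied from the cluster) on Pig's classpath:
export PIG_CLASSPATH=/etc/hadoop/conf

pig   # launches grunt in mapreduce mode, talking to the remote cluster
```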
