Hive’s collection data types

Hive offers three collection data types: struct, map and array. These types don’t make much sense if you are moving data from the well-structured world of the RDBMS, but if you are working directly with application-generated or otherwise less-structured data, they can be a great capability to have in your arsenal.

struct, like in most programming languages, lets you define a structure with named fields, each with its own data type. For example, a column called address could be declared as:

address struct<street:string, city:string, state:string, zipcode:int>

You reference an individual field with dot notation, for example address.street.
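As a minimal sketch of dot notation in a query, the statement below assumes a hypothetical table named customers that carries the address column declared above; the table name is illustrative only.

```sql
-- Hypothetical table `customers` with the `address` struct column shown above.
-- Struct fields can be used in the select list and in the where clause.
select address.street,
       address.city
from customers
where address.state = 'NY';
```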

map is a little less structured; instead of predefining the sub-attributes of the column, you declare only the data type of the keys and the data type of the values. For example, an acceptable map could be:

preferences map<string, string>

This gives you the flexibility to add really whatever you want, so long as each key is the right data type and each value matches its declared type too. You look up a value by its key:

select preferences["email_offers"] from dim_customer;
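Beyond a single key lookup, Hive's built-in collection functions can inspect the whole map. The sketch below assumes, as the query above does, that dim_customer has a preferences column of type map<string, string>.

```sql
-- Assumes dim_customer has the `preferences` map<string, string> column.
select preferences['email_offers'] as email_offers,  -- NULL when the key is absent
       map_keys(preferences)       as pref_codes,    -- array of all keys
       map_values(preferences)     as pref_values,   -- array of all values
       size(preferences)           as pref_count     -- number of key-value pairs
from dim_customer;
```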

Finally, array lets you store any number of values of the same data type and, functionally speaking, the same type of business object. In other words, you wouldn’t use an array unless the elements represent the same kind of information and share a data type. An example where an array could be used:

household_ages array<smallint>
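Array elements are addressed by a zero-based index. A short sketch, assuming a dim_customer table that has the household_ages column:

```sql
-- Assumes dim_customer has the `household_ages` array<smallint> column.
select household_ages[0]    as first_age,       -- index starts at 0
       household_ages[10]   as eleventh_age,    -- NULL if the array is shorter
       size(household_ages) as household_size   -- number of elements
from dim_customer;
```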

You can put all this together into a single example to see how one might use these types, again given that the data already arrives in this shape. You probably wouldn’t convert existing structured data into this kind of format.

create table dim_customer
(
    customer_id      bigint,
    customer_name    struct<fname:string, lname:string>,
    customer_addr    struct<street:string, city:string, state:string, zip:int>,
    household_ages   array<smallint>,
    email_prefs      map<string, boolean>
)
row format delimited
fields terminated by '|'             -- how each field is separated
collection items terminated by ','   -- how values in the struct, map and array are separated
map keys terminated by ':'           -- how keys in the map data type are separated from their values
lines terminated by '\n'
stored as textfile;

Your input data – using the delimiters above – would then look like this:

12345|John,Smith|123 Main St,New York,NY,00000|45,40,17,13|weekly_update:true,special_clearance:true,birthday_greeting:false

And could be loaded with:

load data local inpath '/tmp/dim_customer.dat' overwrite into table dim_customer;
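Once the data is loaded, all three collection types can be read back in one query. The statements below are a sketch against the dim_customer table defined above; the column aliases and the ages/age names in the lateral view are illustrative choices.

```sql
-- One row per customer, pulling a value out of each collection column.
select customer_id,
       customer_name.fname           as first_name,
       customer_addr.city            as city,
       household_ages[0]             as first_age,
       email_prefs['weekly_update']  as weekly_update
from dim_customer;

-- explode() with lateral view turns each array element into its own row,
-- so the sample customer with four ages produces four rows.
select customer_id,
       age
from dim_customer
lateral view explode(household_ages) ages as age;
```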