Hive offers several collection data types: struct, map and array. These types don't make much sense if you are moving data from the well-structured world of the RDBMS, but if you are working directly with application-generated or otherwise less-structured data, they can be a great capability to have in your arsenal.
struct, like in most programming languages, lets you define a structure with named fields, each with its own data type. For example, a column called address could be declared as:
address struct<street:string, city:string, state:string, zipcode:int>
To refer to a field inside the struct, use dot notation, for example address.street.
map is a little less structured; instead of predefining the sub-attributes of the column, you declare one data type for the keys and one for the values. For example, an acceptable map could be:
preferences map<string, string>
This gives you the flexibility to add whatever entries you want, so long as each key matches the declared key type and each value matches the declared value type. You look up a value by its key:
select preferences["email_offers"] from dim_customer;
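Beyond looking up a single key, Hive's built-in map_keys, map_values and size functions can inspect a map as a whole. The query below is a minimal sketch; it assumes dim_customer has the preferences map<string, string> column declared above, which is used here only for illustration.

```sql
-- Assumes dim_customer has a column declared as: preferences map<string, string>
select map_keys(preferences)   as pref_codes,   -- array of all keys in the map
       map_values(preferences) as pref_values,  -- array of all values in the map
       size(preferences)       as pref_count    -- number of key-value pairs
from dim_customer;
```

Looking up a key that is not present in a given row's map should return NULL rather than raise an error, so queries need to allow for missing preferences.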
Finally, array lets you store any number of values of the same data type and, functionally speaking, the same type of business object. In other words, you would only use an array when the elements represent the same kind of information and share a data type. An example where an array could be used:
household_ages array<smallint>
You can put all this together into a single example to see how one might use these types, again assuming the data already arrives in this shape. You probably wouldn't convert existing structured data into this format.
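Array elements are read by position. As a rough sketch, assuming a table (dim_customer is used here for illustration) that has the household_ages column declared above:

```sql
-- Assumes dim_customer has a column declared as: household_ages array<smallint>
select household_ages[0]    as first_age,       -- arrays are indexed from 0
       size(household_ages) as household_size   -- number of elements in the array
from dim_customer;
```

An index past the end of the array should come back as NULL, which is worth keeping in mind when households differ in size.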
create table dim_customer (
  customer_id bigint,
  customer_name struct<fname:string, lname:string>,
  customer_addr struct<street:string, city:string, state:string, zip:int>,
  household_ages array<smallint>,
  email_prefs map<string, boolean>
)
row format delimited
  fields terminated by '|'            -- how each field is separated
  collection items terminated by ','  -- how values in the struct, map and array are separated
  map keys terminated by ':'          -- how map keys are separated from their values
  lines terminated by '\n'
stored as textfile;
Your input data – using the delimiters above – would then look like this:
12345|John,Smith|123 Main St,New York,NY,00000|45,40,17,13|weekly_update:true,special_clearance:true,birthday_greeting:false
And could be loaded with:
load data local inpath '/tmp/dim_customer.dat' overwrite into table dim_customer;
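Once the data is loaded, all three collection types can be read in a single query. The following is a minimal sketch against the dim_customer table defined above; the aliases ages and age are hypothetical names chosen for illustration.

```sql
-- One row per customer, pulling a value out of each collection type
select customer_id,
       customer_name.fname          as first_name,   -- struct field via dot notation
       customer_addr.city           as city,         -- struct field via dot notation
       household_ages[0]            as first_age,    -- array element by zero-based index
       email_prefs['weekly_update'] as weekly_update -- map value by key
from dim_customer;

-- One row per household member: explode() turns each array element into its own row
select customer_id, age
from dim_customer
lateral view explode(household_ages) ages as age;
```

The lateral view form is useful when each array element needs to be filtered or aggregated as if it were a normal row.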