When using the LOAD DATA statement to populate a Hive table, it’s important to understand what Hive does with the actual data files, because the behavior differs depending on whether the input data resides on your local file system or on HDFS.
For example, to load data from your local home directory into a Hive table:
hive> LOAD DATA LOCAL INPATH '/home/username1/weather/input' INTO TABLE weather_data;
You’ll actually see output messages like:
Copying data from file:/home/hduser/weather_data/input
Copying file: file:/home/hduser/weather_data/input/weather.16.csv
Copying file: file:/home/hduser/weather_data/input/weather.86.csv
...
Copying file: file:/home/hduser/weather_data/input/weather.52.csv
Copying file: file:/home/hduser/weather_data/input/weather.37.csv
Loading data to table default.weather_data
Under the covers, Hive copies the files found in /home/username1/weather/input into the HDFS directory associated with the table weather_data (e.g. /user/hive/warehouse/weather_data/). If you want to see what that directory is, run the following Hive command:
hive> describe extended weather_data;
Look for the ‘location’ value.
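If you want to pull the location out in a script rather than eyeballing it, the field can be extracted from the describe extended output with standard text tools. A minimal sketch: the detail string below mimics the Table(...) line Hive prints (the tableName and location field names are real, but the namenode hostname, port, and path values are placeholders); in practice you would pipe the output of something like hive -e 'describe extended weather_data;' into the same filter.

```shell
# Hypothetical sample of the Table(...) detail line from `describe extended`;
# the host/port below are placeholders, not values from a real cluster.
detail='Table(tableName:weather_data, dbName:default, location:hdfs://namenode:9000/user/hive/warehouse/weather_data, ...)'

# Grab the location:... field, then strip the leading "location:" label.
echo "$detail" | grep -oE 'location:[^,)]+' | cut -d: -f2-
# prints: hdfs://namenode:9000/user/hive/warehouse/weather_data
```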
If the data is already on the HDFS file system, however, Hive performs a move rather than a copy. For example:
hduser@hadoop1:/home/hduser/$ hadoop dfs -ls /user/hduser/weather_data/ | wc -l
101
hive> load data inpath '/user/hduser/weather_data/' into table weather_data;
Now, let’s check the output of dfs -ls | wc -l again:
hduser@hadoop1:~/weather_data$ hadoop dfs -ls weather_data | wc -l
0
As you can see, the files were physically moved from /user/hduser/weather_data into the location associated with the Hive table.
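The copy-versus-move distinction is easy to remember with a plain-filesystem analogy. In the sketch below, cp stands in for LOAD DATA LOCAL INPATH (the source files survive) and mv stands in for LOAD DATA INPATH on an HDFS path (the source files disappear); this only mimics the behavior using local temp directories, while Hive does the real work with HDFS operations.

```shell
# Analogy only: cp ~ LOAD DATA LOCAL INPATH (copy), mv ~ LOAD DATA INPATH (move).
src=$(mktemp -d)        # stands in for the input directory
warehouse=$(mktemp -d)  # stands in for /user/hive/warehouse/weather_data/
echo "2014-01-01,12.3" > "$src/weather.16.csv"

cp "$src"/*.csv "$warehouse/"   # "local" load: files are copied
ls "$src" | wc -l               # prints 1 -- the source still has its file

mv "$src"/*.csv "$warehouse/"   # "HDFS-to-HDFS" load: files are moved
ls "$src" | wc -l               # prints 0 -- the source is now empty
```

The same check is what the wc -l comparison above demonstrates on the real cluster: after the move-style load, the source directory no longer lists any files.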