Creating Hive tables on compressed files
Stuck with creating Hive tables on compressed files?
Well, the documentation on apache.org suggests that Hive natively supports compressed files – https://cwiki.apache.org/confluence/display/Hive/CompressedStorage
Let's try that out.
Store a snappy-compressed file on HDFS. … thinking, I do not have such a file… Wait! What?
Okay, so let's create one.
Upload a text file to HDFS. Create the folder /tmp/input if it does not exist –
$ hdfs dfs -mkdir /tmp/input
$ hdfs dfs -put sample.txt /tmp/input
sample.txt content:
1,a
2,b
3,c
4,d
5,e
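If you do not have a sample.txt lying around, a one-liner like the following creates the same content (just a sketch – any few comma-separated lines will do):
$ echo -e '1,a\n2,b\n3,c\n4,d\n5,e' > sample.txt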
Use the following Pig script to compress the file with the snappy compression codec. Note that mapred.output.compression.codec sets the codec for the final job output; the similarly named mapred.map.output.compress.codec only affects intermediate map output and would leave you with default .deflate files instead.
grunt> SET mapred.output.compress 'true';
grunt> SET mapred.output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;
grunt> data = LOAD '/tmp/input' USING PigStorage();
grunt> STORE data INTO '/tmp/output_snappy' USING PigStorage();
The /tmp/output_snappy directory now contains a snappy-compressed file.
… thinking, how do I know? … Wait! What?
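One quick sanity check: list the output directory and look at the file extension – SnappyCodec names its part files with a .snappy suffix.
$ hdfs dfs -ls /tmp/output_snappy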
Eh, let's check the contents.
Run the command below to fetch and print the contents:
$ hdfs dfs -cat /tmp/output_snappy/part-m-00000.snappy | head -1
Garbled? That is expected – cat dumps the raw compressed bytes. Try this instead; hdfs dfs -text picks the codec from the file extension and decompresses for you:
$ hdfs dfs -text /tmp/output_snappy/part-m-00000.snappy | head -1
Looks good? Okay!
Let's build our Hive table on top of /tmp/output_snappy as we normally would.
create external table sample_snappy(id int, name string) row format delimited fields terminated by ',' location '/tmp/output_snappy';
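Before querying, you can sanity-check the table with describe formatted – the Location line should point at /tmp/output_snappy:
0: jdbc:hive2://www.robin.eu.org:10000> describe formatted sample_snappy;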
Tempted to test?
0: jdbc:hive2://www.robin.eu.org:10000> select * from sample_snappy;
+-------------------+---------------------+--+
| sample_snappy.id  | sample_snappy.name  |
+-------------------+---------------------+--+
| 1                 | a                   |
| 2                 | b                   |
| 3                 | c                   |
| 4                 | d                   |
| 5                 | e                   |
+-------------------+---------------------+--+
Enjoy 😉
All-in-one commands and scripts
[ $ shell ]
echo -e '1,a\n2,b\n3,c\n4,d\n5,e' > sample.txt
hdfs dfs -mkdir /tmp/input
hdfs dfs -put sample.txt /tmp/input
[ grunt> ]
SET mapred.output.compress 'true';
SET mapred.output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;
data = LOAD '/tmp/input' USING PigStorage();
STORE data INTO '/tmp/output_snappy' USING PigStorage();
[ $ shell ]
hdfs dfs -cat /tmp/output_snappy/part-m-00000.snappy | head -1
hdfs dfs -text /tmp/output_snappy/part-m-00000.snappy | head -1
[ beeline> ]
create external table sample_snappy(id int, name string) row format delimited fields terminated by ',' location '/tmp/output_snappy';
select * from sample_snappy;