Adding a compression codec to the Hortonworks Data Platform
Lately I tried installing the xz/lzma codec on my local VM setup. The compression ratios are pretty awesome. I won't run a benchmark here; try it out yourself 😉
Steps
- Download the codec JAR – https://github.com/yongtang/hadoop-xz or https://mvnrepository.com/artifact/io.sensesecure/hadoop-xz
- Copy the downloaded JAR into HDP's lib folders
find /usr/hdp/ -name '*snappy*jar' | xargs -L1 dirname | xargs -L1 sudo cp ~/hadoop-xz-1.4.jar
- Set up compression in the HDFS config using Ambari
Ambari -> HDFS -> Configs -> Advanced core-site -> io.compression.codecs -> add 'io.sensesecure.hadoop.xz.XZCodec'
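After saving, the io.compression.codecs property in core-site.xml should end up looking roughly like this (an illustrative fragment; the exact codec list varies by HDP version and cluster config):

```xml
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.SnappyCodec,io.sensesecure.hadoop.xz.XZCodec</value>
</property>
```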
- Make Hive use the new JAR
Create the folder below on the server running HiveServer2, make hive its owner, and copy the hadoop-xz JAR into it:
mkdir /usr/hdp/<version>/hive/auxlib
chown -R hive /usr/hdp/<version>/hive/auxlib
cp ~/hadoop-xz-1.4.jar /usr/hdp/<version>/hive/auxlib/
- Restart HiveServer2
Testing with Hive
Create a big sample file at /tmp/sample.txt on the local filesystem.
Operations in Hive
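One quick way to produce such a file is repeating a line of text; the line content and count here are just placeholders, and repetitive input also shows xz at its best:

```shell
# Build a ~44 MB sample of repetitive text; xz compresses this extremely well
yes "the quick brown fox jumps over the lazy dog" | head -n 1000000 > /tmp/sample.txt
wc -c /tmp/sample.txt
```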
create table orig_sample(val string);
!sh hdfs dfs -put /tmp/sample.txt /tmp;
LOAD DATA INPATH '/tmp/sample.txt' OVERWRITE INTO TABLE orig_sample;

-- test lzma
set hive.exec.compress.output=true;
set io.seqfile.compression.type=BLOCK;
set mapreduce.output.fileoutputformat.compress.codec=io.sensesecure.hadoop.xz.XZCodec;
drop table test_table_lzma;
CREATE TABLE test_table_lzma
ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
LINES TERMINATED BY "\n"
STORED AS TEXTFILE
LOCATION "/tmp/test_table_lzma"
AS select * from orig_sample;
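As a quick sanity check (hypothetical queries, not part of the original steps), the row counts of the two tables should match once the CTAS completes:

```sql
select count(*) from orig_sample;
select count(*) from test_table_lzma;
-- both counts should be equal
```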
Checking results
hdfs dfs -du -s -h /tmp/sample.txt
hdfs dfs -du -s -h /tmp/test_table_lzma
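You can also preview the ratio locally with the xz CLI (assuming xz is installed; this compresses the same sample outside Hadoop, so the cluster-side ratio may differ slightly):

```shell
# Fall back to a small generated sample if /tmp/sample.txt is missing
[ -f /tmp/sample.txt ] || yes "sample line" | head -n 100000 > /tmp/sample.txt
# Compress a copy with xz and compute the achieved ratio
xz -kc /tmp/sample.txt > /tmp/sample.txt.xz
orig=$(wc -c < /tmp/sample.txt)
comp=$(wc -c < /tmp/sample.txt.xz)
awk -v o="$orig" -v c="$comp" 'BEGIN { printf "compression ratio: %.1fx\n", o/c }'
```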