Adding a compression codec to the Hortonworks Data Platform

Lately I tried installing the xz/LZMA codec on my local VM setup. The compression ratios are pretty awesome. I won't benchmark it here; try it out yourself 😉
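If you want a quick feel for the ratio before touching the cluster, the plain xz CLI on a local file gives a rough idea (the file path and contents here are purely illustrative):

```shell
# Build a ~10 MB repetitive text file (illustrative content)
yes "2023-01-01 12:00:00 INFO sample log line for compression testing" | head -c 10485760 > /tmp/xz_demo.txt

# Compress with xz; -k keeps the original, -f overwrites any old .xz
xz -kf /tmp/xz_demo.txt

# Compare the sizes: repetitive text shrinks dramatically
ls -l /tmp/xz_demo.txt /tmp/xz_demo.txt.xz
```

Real-world data won't compress this well, but xz typically still beats gzip and Snappy on ratio, at the cost of CPU time.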


Steps

  1. Download the codec JAR from https://github.com/yongtang/hadoop-xz or https://mvnrepository.com/artifact/io.sensesecure/hadoop-xz
  2. Copy the downloaded JAR into HDP's lib folders. One way is to drop it next to every existing Snappy codec JAR:
    find /usr/hdp/ -name '*snappy*jar' | xargs -L1 dirname | sort -u | xargs -L1 sudo cp ~/hadoop-xz-1.4.jar
  3. Set up compression in the HDFS config using Ambari
    Ambari -> HDFS -> Configs -> Advanced core-site -> io.compression.codecs -> add 'io.sensesecure.hadoop.xz.XZCodec'
  4. Make Hive pick up the new JAR
    Create the auxlib folder below on the server running HiveServer2, make the hive user its owner, and copy the hadoop-xz JAR into it.

    mkdir /usr/hdp/<version>/hive/auxlib
    chown -R hive /usr/hdp/<version>/hive/auxlib
    cp ~/hadoop-xz-1.4.jar /usr/hdp/<version>/hive/auxlib/
  5. Restart HiveServer2
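For reference, after step 3 the io.compression.codecs property in core-site.xml should end up looking something like this (the exact list depends on which codecs your cluster already has enabled):

```xml
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.SnappyCodec,io.sensesecure.hadoop.xz.XZCodec</value>
</property>
```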


Testing with Hive

Create a big sample file at /tmp/sample.txt on the local filesystem.
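One quick way to generate such a file (the contents are arbitrary; any large text file will do):

```shell
# ~50 MB of repetitive CSV-ish text, written to the path used below
yes "alpha,beta,gamma,delta,epsilon" | head -c 52428800 > /tmp/sample.txt
wc -c /tmp/sample.txt
```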

Operations in Hive

create table orig_sample(val string);
!sh hdfs dfs -put /tmp/sample.txt /tmp;
LOAD DATA INPATH '/tmp/sample.txt' OVERWRITE INTO TABLE orig_sample;

-- test lzma
set hive.exec.compress.output=true;
set io.seqfile.compression.type=BLOCK;
set mapreduce.output.fileoutputformat.compress.codec=io.sensesecure.hadoop.xz.XZCodec;

drop table if exists test_table_lzma;
CREATE TABLE test_table_lzma
ROW FORMAT DELIMITED FIELDS TERMINATED BY "," 
LINES TERMINATED BY "\n" 
STORED AS TEXTFILE 
LOCATION "/tmp/test_table_lzma" as 
select * from orig_sample;
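A quick sanity check, in the same Hive session, that the compressed copy didn't lose any rows:

```sql
-- the two counts should match exactly
select count(*) from orig_sample;
select count(*) from test_table_lzma;
```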

Checking results

hdfs dfs -du -s -h /tmp/sample.txt
hdfs dfs -du -s -h /tmp/test_table_lzma
