Fastest way of compressing file(s) in Hadoop


Okay, well… it may or may not be the fastest. Email me if you find a better alternative 😉

Short background:

  1. The technique uses a simple Pig script
  2. Pig runs on the Tez engine (set the queue name appropriately)
  3. You can change the codec in the Pig script (see the example right after this list)
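
For example, switching the whole job from Snappy to Gzip is a one-line change in the script. GzipCodec ships with stock Hadoop; any compression codec class installed on your cluster should work here:

set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;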


Da Pig script

[compress-snappy.pig]

/*
 * compress-snappy.pig: Pig script to compress a directory
 *
 * input:       IN_DIR: hdfs input directory to compress
 * output:     OUT_DIR: hdfs output directory
 *
 */

-- enable compression for the final output and choose the codec
set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;

-- run on Tez (also passed as -x tez on the command line) and pick the queue
-- (change DBA to your own queue name)
set exectype tez;
set tez.queue.name DBA;
set mapred.job.queue.name DBA;

-- comma-separated list of HDFS directories to compress
input0 = LOAD '$IN_DIR' USING PigStorage();

-- single output directory
STORE input0 INTO '$OUT_DIR' USING PigStorage();

Execute it using:

$ pig -p IN_DIR=/dir/large/files/dt=2017-05-01 -p OUT_DIR=/tmp/compression/test/dt=2017-05-01 -f compress-snappy.pig -x tez
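
Once the job finishes, list the output directory to see what was produced (same paths as the example run above):

$ hdfs dfs -ls /tmp/compression/test/dt=2017-05-01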


The output directory may contain multiple .snappy part files. Because Hadoop's Snappy codec writes block-compressed streams, the parts can be safely concatenated into a single file [1]:

hdfs dfs -cat /tmp/compression/test/dt=2017-05-01/part* | hdfs dfs -put - /data/final/dataset-01/dt=2017-05-01/merged-2017-05-01.snappy
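
To sanity-check the merged file, hdfs dfs -text decompresses it on the fly, picking the codec from the .snappy extension (this assumes the Snappy native libraries are available on the client):

$ hdfs dfs -text /data/final/dataset-01/dt=2017-05-01/merged-2017-05-01.snappy | head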


You like it? You share it 😉


References

  1. https://stackoverflow.com/questions/7153087/hadoop-compress-file-in-hdfs

