Clean up an HDFS directory having too many files and directories

At times a directory on HDFS accumulates so many inodes (files and directories) that it becomes really hard to delete. Some instances even lead to out-of-memory (OOM) errors such as the following:

INFO retry.RetryInvocationHandler: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: GC overhead limit exceeded, while invoking ClientNamenodeProtocolTranslatorPB.getListing over namenode-server.domain.tld/x.y.x.a:8020. Trying to failover immediately


The following set of shell commands has helped me pull out the list of files and delete them from HDFS:

export HADOOP_CLIENT_OPTS="-XX:-UseGCOverheadLimit -Xmx16000m"
hdfs dfs -ls /tmp/hive/hive | awk '/2018-[0-9][0-9]-/{print $8}' | paste -d ' ' - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - >/tmp/del_hive_01.txt
cat /tmp/del_hive_01.txt | paste -d ' ' - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - > /tmp/del_hive_02.txt
awk -F'\t' '{print "hdfs dfs -rm -r -f -skipTrash ", $1 }' /tmp/del_hive_02.txt > /tmp/del_hive_03.txt
sh /tmp/del_hive_03.txt
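
As a side note, the same batching can be approximated in a single pipeline with xargs instead of the intermediate files. This is only a sketch, assuming the paths contain no whitespace and that batches of 35 paths per rm invocation (matching the paste placeholders above) are acceptable:

export HADOOP_CLIENT_OPTS="-XX:-UseGCOverheadLimit -Xmx16000m"
# Sketch: filter the listing and delete in batches of 35 paths per hdfs invocation
hdfs dfs -ls /tmp/hive/hive | awk '/2018-[0-9][0-9]-/{print $8}' | xargs -n 35 hdfs dfs -rm -r -f -skipTrash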

The export was required to increase the client heap; without it, even listing the directory or deleting a single file/directory by wildcard failed with the GC error.
paste was used to batch many paths into a single delete command; otherwise cleaning up the directory would have taken ages. A toy illustration of the batching follows below.
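
To see what paste does here, a toy example with three '-' placeholders instead of thirty-five: each group of three input lines becomes one space-separated line, which later turns into one rm call covering three paths.

printf '/a\n/b\n/c\n/d\n/e\n/f\n' | paste -d ' ' - - -
# prints:
# /a /b /c
# /d /e /f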

You may change the awk filter pattern based on the files you are trying to clean up.
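
For example, to target a different period or a path prefix instead of the 2018 date stamp (the patterns below, including the old_ prefix, are purely illustrative and should be adapted to your own listing output):

hdfs dfs -ls /tmp/hive/hive | awk '/2019-0[1-6]-/{print $8}'                      # entries dated Jan-Jun 2019
hdfs dfs -ls /tmp/hive/hive | awk '$8 ~ /^\/tmp\/hive\/hive\/old_/{print $8}'     # paths starting with old_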


HTH