Best practices for Namenode and Datanode restarts

Problems

The following are some problems we might come across while working with a large Hadoop cluster setup:

  1. Namenode restarts take a long time (progress is visible at http://nn-host:50070/dfshealth.html#tab-startup-progress)
  2. Namenode stays in safemode for a long time after a restart

 

Best practices for Namenode & Datanode restarts

DO NOT restart all services at once. Instead, do the following in order:

  1. Restart the standby namenode first
  2. Then restart the active namenode
  3. Do a rolling restart of the datanodes: restart 2 datanodes at a time and leave 3-4 minutes between batches. It is safer that way, as running jobs should not be impacted. With a replication factor of 3, at least one replica of every block stays available while 2 datanodes are down.
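
The rolling restart in step 3 could be scripted roughly as follows. This is only a sketch: `restart_datanode` is a hypothetical hook (how a datanode is actually restarted depends on your distribution and tooling), the host names are placeholders, and hosts within a batch are restarted one after the other rather than in parallel.

```shell
#!/bin/sh
# Rolling-restart sketch: restart datanodes in small batches with a pause in between.

# Hypothetical hook: replace the body with however your cluster restarts a
# datanode (management API, ssh plus a service script, ...). Here it only logs.
restart_datanode() {
  echo "restarting datanode on $1"
}

# Usage: rolling_restart <batch-size> <pause-seconds> <host>...
rolling_restart() {
  batch_size=$1
  pause_secs=$2
  shift 2
  count=0
  for host in "$@"; do
    # Pause before each new batch, but not before the first one.
    if [ "$count" -gt 0 ] && [ $((count % batch_size)) -eq 0 ]; then
      sleep "$pause_secs"
    fi
    restart_datanode "$host"
    count=$((count + 1))
  done
}

# Example: 2 datanodes at a time, 3 minutes between batches.
# rolling_restart 2 180 dn1.example.com dn2.example.com dn3.example.com
```

Keeping the batch size below the replication factor is what preserves at least one live replica of each block during the restart.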

 

Faster namenode startup

Startup is usually slow when the namenode has a large number of edit logs to load. It is recommended to save the namespace periodically (once a month or so) to rebuild the fsimage, so that fewer edits have to be replayed at startup. Make sure no jobs are running, because HDFS does not accept writes while it is in safemode.

# For all namenodes
hdfs dfsadmin -safemode enter 
hdfs dfsadmin -saveNamespace 
hdfs dfsadmin -safemode leave

# For a specific namenode in case of HA (start with the standby first; the port is usually 8020 or 9000)
hdfs dfsadmin -fs hdfs://<namenode-host>:<port> -safemode enter
hdfs dfsadmin -fs hdfs://<namenode-host>:<port> -saveNamespace 
hdfs dfsadmin -fs hdfs://<namenode-host>:<port> -safemode leave
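
If these three commands are run from a script, it may be worth making sure safemode is left again even when `-saveNamespace` fails. A minimal sketch, assuming `hdfs` is on the PATH; the function name `save_namespace` is made up for illustration:

```shell
#!/bin/sh
# Usage: save_namespace [hdfs://<namenode-host>:<port>]
save_namespace() {
  if [ -n "${1:-}" ]; then
    set -- -fs "$1"   # target one specific namenode
  else
    set --            # no argument: use the default filesystem
  fi
  # Stop here if safemode could not be entered.
  hdfs dfsadmin "$@" -safemode enter || return 1
  hdfs dfsadmin "$@" -saveNamespace
  status=$?
  # Always leave safemode again, even if saveNamespace failed.
  hdfs dfsadmin "$@" -safemode leave
  # Report the result of saveNamespace to the caller.
  return "$status"
}
```

Calling `save_namespace` with no argument mirrors the first set of commands above; passing a namenode URI mirrors the HA variant.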

 

Exiting namenode safemode manually

DO NOT try to force the namenode out of safemode manually using the command below 😀

hdfs dfsadmin -safemode leave

This could result in missing or under-replicated blocks. Instead, go to the namenode UI, check for the datanodes that have not reported their blocks to the namenode, and restart them individually. An easy way to find those datanodes is the number of blocks reported in the UI: they will be the ones with an oddly low number of blocks.
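
Rather than forcing an exit, a script can simply check the safemode state and report it. The sketch below assumes that `hdfs dfsadmin -safemode get` prints a line containing "Safe mode is ON" while safemode is active; the function names are made up for illustration.

```shell
#!/bin/sh
# Returns 0 if the namenode reports that safemode is on, non-zero otherwise.
safemode_is_on() {
  hdfs dfsadmin -safemode get | grep -q "Safe mode is ON"
}

# Reports the state; it never forces the namenode out of safemode.
check_safemode() {
  if safemode_is_on; then
    echo "safemode is ON: check the namenode UI for datanodes with few reported blocks"
    return 1
  fi
  echo "safemode is OFF"
}

# To block until the namenode leaves safemode on its own:
# hdfs dfsadmin -safemode wait
```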

 

Switching Namenodes

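Use the command below instead of bouncing the active namenode. Failover is a function of hdfs haadmin (not dfsadmin) and takes namenode IDs such as nn1 and nn2: the first argument is the namenode that is currently active, the second is the standby that should become active.

hdfs haadmin -failover <active-namenode-id> <standby-namenode-id>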
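
When scripting a switch, it can be safer to confirm which namenode is active before asking for the failover. A rough sketch, assuming HA namenode IDs such as nn1 and nn2 and the `hdfs haadmin -getServiceState` and `hdfs haadmin -failover` subcommands; the function name `failover_namenode` is made up for illustration:

```shell
#!/bin/sh
# Usage: failover_namenode <current-active-id> <current-standby-id>
failover_namenode() {
  from=$1
  to=$2
  # getServiceState prints "active" or "standby" for the given namenode ID.
  state=$(hdfs haadmin -getServiceState "$from") || return 1
  if [ "$state" != "active" ]; then
    echo "refusing failover: $from is '$state', not active" >&2
    return 1
  fi
  # Fail over from the first namenode to the second.
  hdfs haadmin -failover "$from" "$to"
}

# Example: make nn2 active when nn1 is the current active namenode.
# failover_namenode nn1 nn2
```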

 

HTH

 

 
