DistCp from a Hadoop cluster to an AWS S3 bucket

Background

Access key IDs beginning with AKIA are long-term credentials for an IAM user or the AWS account root user. Access key IDs beginning with ASIA are temporary credentials that are created using STS operations.

An IAM role (which is my case) provides three values:
1. Access key (ASIAxxxx format)
2. Secret key
3. Session (security) token

An IAM user, by contrast, has only two:
1. Access key (AKIAxxxx format)
2. Secret key
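The prefix convention above can be checked from the shell before running any Hadoop command. A minimal sketch (the function name `key_type` is mine, not an AWS tool):

```shell
# Classify an access key ID by its prefix, per the AWS STS docs:
# AKIA = long-term IAM user/root key, ASIA = temporary STS credentials.
key_type() {
  case "$1" in
    AKIA*) echo "long-term (IAM user or root) - no session token needed" ;;
    ASIA*) echo "temporary (STS) - session token required" ;;
    *)     echo "unrecognized prefix" ;;
  esac
}
```

With an ASIA key you must also pass the session token and use the temporary-credentials provider, as in the commands below; with an AKIA key the access key and secret key alone are enough.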

Hadoop commands

# hadoop commands with temporary credentials (ASIAxxxx format)
$ hadoop fs \
-Dfs.s3a.access.key=ASIAXXXXXXXXX \
-Dfs.s3a.secret.key="xxxxxxxxxxxxxxxxx" \
-Dfs.s3a.session.token="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" \
-Dfs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider \
-ls s3a://my_bucket_name/hadoop_files


$ hadoop distcp \
-Dfs.s3a.access.key=ASIAXXXXXXXXX \
-Dfs.s3a.secret.key="xxxxxxxxxxxxxxxxx" \
-Dfs.s3a.session.token="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" \
-Dfs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider \
hdfs:///folder/file s3a://my_bucket_name/hadoop_files/
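If you run several commands, repeating the -D options gets tedious. The same four properties can instead go into core-site.xml (or a job-specific configuration file); this is a sketch using the exact property names from the commands above:

```xml
<!-- core-site.xml fragment: same properties as the -D flags above.
     Caution: a secret key stored in a cluster-wide file is readable by
     anyone with access to that file; for anything beyond testing,
     prefer a Hadoop credential provider. -->
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>ASIAXXXXXXXXX</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>xxxxxxxxxxxxxxxxx</value>
</property>
<property>
  <name>fs.s3a.session.token</name>
  <value>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx</value>
</property>
```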

You can extract the credentials from the ~/.aws/credentials file if you have the AWS CLI installed and configured.
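The credentials file is a simple INI file, so the three values can be pulled out with awk and fed into the -D options above. A sketch, assuming the profile is named `[default]` (the helper `get_cred` is mine):

```shell
# get_cred <file> <key> - print the value of <key> from the [default]
# section of an AWS shared-credentials file (INI format).
get_cred() {
  awk -F' *= *' -v key="$2" '
    /^\[/                   { in_default = ($0 == "[default]") }
    in_default && $1 == key { print $2 }
  ' "$1"
}

# Usage (commented out so it does not fail where the file is absent):
# access_key=$(get_cred ~/.aws/credentials aws_access_key_id)
# secret_key=$(get_cred ~/.aws/credentials aws_secret_access_key)
# session_token=$(get_cred ~/.aws/credentials aws_session_token)
```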

References

  1. https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html
  2. https://stackoverflow.com/questions/50242843/how-do-i-use-an-aws-sessiontoken-to-read-from-s3-in-pyspark
  3. https://docs.aws.amazon.com/STS/latest/APIReference/API_GetAccessKeyInfo.html
