Huge Data Transfer Between Cross-Account AWS S3 Glacier Buckets

Vijay Reddy G
8 min read · Jan 3, 2022

SCOPE

This document explains how to transfer a large amount of data from an AWS S3 Glacier bucket to an S3 Glacier bucket in another AWS account. The use case is to transfer 50 TB of Glacier data to a cross-account bucket. Because the source data is in Glacier and many of the files are larger than 5 GB, the following method was chosen:
1. Restore the data from Glacier using S3 Batch Operations
2. Use AWS DataSync to transfer the restored data to the cross-account bucket

RESTORE DATA FROM GLACIER

Because the source bucket is in cold storage (Glacier), the objects must be restored first. Follow the steps below to restore the data from S3 Glacier.

1. In the source account, identify the bucket to restore.

2. Create a temporary bucket to store the inventory of the bucket identified in step 1. You can skip this step if an existing bucket is used to store the inventory.

3. Create an inventory configuration for the source bucket (a CLI sketch follows the console steps below).

Amazon S3 -> {{source bucket name}} -> Management -> Inventory configurations -> Create inventory configuration
• Provide an Inventory configuration name
• If required, provide a prefix for the Inventory scope; otherwise leave it blank
• For Object versions, select Current versions only
• Select the Destination bucket for the inventory output
• Select the frequency
• Select CSV as the output format
• Set Status to Enable
• Disable server-side encryption
• Optionally, select any Additional fields
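
If you prefer the CLI, the same inventory configuration can be created with put-bucket-inventory-configuration. This is a minimal sketch with placeholder bucket names and an assumed configuration ID (glacier-restore-inventory); adjust the fields to match the console choices above.

aws s3api put-bucket-inventory-configuration \
  --bucket {{source bucket name}} \
  --id glacier-restore-inventory \
  --profile {{profile name}} --region {{region name}} \
  --inventory-configuration '{
    "Id": "glacier-restore-inventory",
    "IsEnabled": true,
    "IncludedObjectVersions": "Current",
    "Schedule": { "Frequency": "Daily" },
    "Destination": {
      "S3BucketDestination": {
        "Bucket": "arn:aws:s3:::{{inventory destination bucket}}",
        "Format": "CSV"
      }
    },
    "OptionalFields": [ "Size", "StorageClass" ]
  }'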

4. Wait until the inventory report is generated; this may take several hours (the first report can take up to 48 hours to be delivered).

5. Once the inventory file is created, verify and validate it against the source S3 bucket.

6. To create and run a batch job, the S3 Batch Operations service needs permissions on the following:

— The source S3 bucket
— The inventory bucket where the inventory report is stored
— The S3 bucket where the batch operations completion report will be stored

Create a new IAM role and attach the policy below to it (a CLI sketch for creating the role follows the trust policy):

S3 Batch Operations Policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:RestoreObject"
            ],
            "Resource": [
                "arn:aws:s3:::{{Source s3 bucket}}/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::{{Inventory location path}}/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::{{Restore completion status report s3 bucket}}/*"
            ]
        }
    ]
}
Trust policy for S3 Batch Operations:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "batchoperations.s3.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
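
For reference, the role can also be created from the CLI. This is a sketch that assumes the two policy documents above are saved locally as batch-restore-policy.json and batch-trust-policy.json, and uses a hypothetical role name (s3-batch-restore-role):

# Create the role with the S3 Batch Operations trust policy
aws iam create-role \
  --role-name s3-batch-restore-role \
  --assume-role-policy-document file://batch-trust-policy.json \
  --profile {{profile name}}

# Attach the restore permissions as an inline policy
aws iam put-role-policy \
  --role-name s3-batch-restore-role \
  --policy-name s3-batch-restore-policy \
  --policy-document file://batch-restore-policy.json \
  --profile {{profile name}}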

7. After the inventory report is verified, create a batch job (a CLI sketch follows the console steps below).

From the navigation pane, choose Batch operations -> Choose Create job.
• For Region, select the AWS Region where you want to create the job.
• Under Choose manifest, enter the following:
- For Manifest format, select CSV as your file format.
- For Path to manifest object, enter the S3 path to the manifest file. (For example, the S3 path looks like this: "s3://awsexamplebucket/manifest.csv".)
• Choose Next.
• Under Choose operation, enter the following:
- For Operation, select Restore.
- For Restore source, select Glacier or Glacier Deep Archive.
- For Number of days that the restored copy is available, enter the number of days for your use case.
- For Restore tier, select Bulk retrieval (Standard retrieval is also available if you need the data sooner; Bulk was used here).
• Choose Next.
• Under Configure additional options, enter the following:
- For Description, optionally enter a description.
- For Priority, enter any positive number.
- For Generate completion report, you can choose to keep this option selected.
- For Completion report scope, select Failed tasks only or All tasks depending on your use case.
- For Path to completion report destination, enter the path that you want the report to be sent to.
- For Permission, select Choose from existing IAM roles. Then, select the IAM role (created in step 6) that has permissions to initiate a restore and has a trust policy with S3 Batch Operations.
• Choose Next.
• On the Review page, review the details of the job. Then, choose Create job.
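
The same job can also be created from the CLI with s3control create-job. The sketch below uses placeholder values and assumes a CSV manifest with Bucket and Key columns; the manifest object's ETag (required by the API) can be fetched with head-object. It is not the exact command used for this migration.

aws s3control create-job \
  --account-id {{source account num}} \
  --region {{region name}} \
  --profile {{profile name}} \
  --priority 10 \
  --confirmation-required \
  --role-arn arn:aws:iam::{{source account num}}:role/{{S3 batch restore role}} \
  --operation '{"S3InitiateRestoreObject": {"ExpirationInDays": 30, "GlacierJobTier": "BULK"}}' \
  --manifest '{
    "Spec": { "Format": "S3BatchOperations_CSV_20180820", "Fields": ["Bucket", "Key"] },
    "Location": {
      "ObjectArn": "arn:aws:s3:::{{inventory bucket}}/manifest.csv",
      "ETag": "{{ETag of the manifest object}}"
    }
  }' \
  --report '{
    "Bucket": "arn:aws:s3:::{{report bucket}}",
    "Format": "Report_CSV_20180820",
    "Enabled": true,
    "Prefix": "batch-restore-reports",
    "ReportScope": "AllTasks"
  }'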

8. After job is created, the job’s status changes from New to Preparing to Awaiting your confirmation. To run the job, select the job and then choose Confirm and run. The job doesn’t run until you confirm it.

9. If you selected Generate completion report, then you can review the report after the job completes. You can find the report at the Path to completion report destination that you specified.

10. After the job runs, check the restore status of an object using the command below.

aws s3api head-object --bucket {{Source bucket name}} --key {{file name with path, e.g. Vijay_Test/Nov-30/wsl_update_x64.msi}} --profile {{AWSAccountProfile}} --region {{region name}}
{
    "AcceptRanges": "bytes",
    "Restore": "ongoing-request=\"false\", expiry-date=\"Mon, 03 Jan 2022 00:00:00 GMT\"",
    ...
}

In the above output, ongoing-request should be false, which indicates the restore is complete.
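
To spot-check more than one object, a small loop over the keys in the inventory CSV can query just the Restore field. This is a rough sketch; it assumes the object key is the second (quoted) column of the inventory CSV and that the keys contain no URL-encoded characters.

# Extract the key column from the inventory CSV, then check each object's restore status
cut -d, -f2 inventory.csv | tr -d '"' > keys.txt

while read -r key; do
  status=$(aws s3api head-object \
    --bucket {{Source bucket name}} \
    --key "$key" \
    --profile {{AWSAccountProfile}} \
    --region {{region name}} \
    --query Restore --output text)
  echo "$key -> $status"
done < keys.txt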

DATASYNC TO TRANSFER DATA

Once the restore has completed successfully, configure DataSync for the cross-account S3 data transfer. DataSync can be configured on either the source side or the target side; in this document it is set up on the source side.

1. Log in to the source account and create an IAM role for the AWS DataSync service to access objects in the destination S3 bucket. Attach the following to the newly created role.

• Either attach the AmazonS3FullAccess managed policy or add the policy below to the role
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:s3:::{{DESTINATIONBUCKET}}"
        },
        {
            "Action": [
                "s3:AbortMultipartUpload",
                "s3:DeleteObject",
                "s3:GetObject",
                "s3:ListMultipartUploadParts",
                "s3:PutObjectTagging",
                "s3:GetObjectTagging",
                "s3:PutObject"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:s3:::{{DESTINATIONBUCKET}}/*"
        }
    ]
}

Attach the trust policy below to the role:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "datasync.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

2. Attach the following S3 bucket policy (not an IAM policy) to the destination S3 bucket. This grants the source IAM role (the DataSync role) and the source account user access to the destination bucket (a CLI sketch for applying the policy follows the JSON).

Go to the S3 console -> select the destination bucket -> Permissions -> Bucket policy -> Edit

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "BucketPolicyForDataSync",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::{{Source Acct Num}}:role/{{SOURCEDATASYNCROLE}}",
                    "arn:aws:iam::{{Source Acct Num}}:root"
                ]
            },
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads",
                "s3:AbortMultipartUpload",
                "s3:DeleteObject",
                "s3:GetObject",
                "s3:ListMultipartUploadParts",
                "s3:PutObject",
                "s3:GetObjectTagging",
                "s3:PutObjectTagging"
            ],
            "Resource": [
                "arn:aws:s3:::{{DESTINATIONBUCKET}}",
                "arn:aws:s3:::{{DESTINATIONBUCKET}}/*"
            ]
        }
    ]
}
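
If you are setting the bucket policy from the CLI instead of the console, a one-line put-bucket-policy works; this assumes the JSON above is saved locally as destination-bucket-policy.json and that a CLI profile for the destination account exists:

aws s3api put-bucket-policy \
  --bucket {{DESTINATIONBUCKET}} \
  --policy file://destination-bucket-policy.json \
  --profile {{destination account profile}} \
  --region {{region name}}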

In the Principal section above, you can replace the "root" ARN with a specific user ARN taken from the output of the command below, if required.

$ aws sts get-caller-identity --profile {{aws profile name}} --region {{region name}}
{
    "UserId": "AIDAVOHWA6******",
    "Account": "{{Aws account num}}",
    "Arn": "arn:aws:iam::{{source acct num}}:user/***@***.com"
}

Make sure the specific user performing the DataSync operation has the privileges below.

Either the managed policies:
• AmazonS3FullAccess
• AWSDataSyncFullAccess
• CloudWatchLogsFullAccess

or, following the least-privilege principle, the specific permissions below (a sample policy sketch follows this list):
• s3:GetBucketLocation
• s3:ListBucket
• s3:ListBucketMultipartUploads
• s3:AbortMultipartUpload
• s3:DeleteObject
• s3:GetObject
• s3:ListMultipartUploadParts
• s3:PutObject
• s3:GetObjectTagging
• s3:PutObjectTagging
• datasync:CreateLocationS3
• logs:DescribeLogGroups
• logs:DescribeResourcePolicies
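
As a rough sketch, the least-privilege option could be expressed as a single IAM policy along the lines below. The resource ARNs are assumptions; tighten them to your actual buckets, and scope the DataSync and CloudWatch Logs actions further if your security standards require it.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads",
                "s3:AbortMultipartUpload",
                "s3:DeleteObject",
                "s3:GetObject",
                "s3:ListMultipartUploadParts",
                "s3:PutObject",
                "s3:GetObjectTagging",
                "s3:PutObjectTagging"
            ],
            "Resource": [
                "arn:aws:s3:::{{SOURCEBUCKET}}",
                "arn:aws:s3:::{{SOURCEBUCKET}}/*",
                "arn:aws:s3:::{{DESTINATIONBUCKET}}",
                "arn:aws:s3:::{{DESTINATIONBUCKET}}/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "datasync:CreateLocationS3",
                "logs:DescribeLogGroups",
                "logs:DescribeResourcePolicies"
            ],
            "Resource": "*"
        }
    ]
}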

3. Create a DataSync location for the destination bucket in the source account. This cannot be done from the console and can only be done through the command line. The configuration below transfers the files directly to Glacier storage; you can adjust --s3-storage-class to suit your use case.

$ aws datasync create-location-s3 --s3-bucket-arn arn:aws:s3:::{{DESTINATIONBUCKET}} --s3-storage-class GLACIER --region {{region name}} --s3-config "{\"BucketAccessRoleArn\":\"arn:aws:iam::{{aws account num}}:role/{{DATASYNC SOURCE IAM ROLE}}\"}" --profile {{profile name}}
{
    "LocationArn": "arn:aws:datasync:{{region name}}:{{aws account num}}:location/loc-035494e******"
}

4. Create a DataSync location for the source bucket in the source account.

Go to DataSync -> select Locations in the left-side pane -> click Create location
-> Location type: Amazon S3
• Select the S3 bucket and its storage class (in this use case, Glacier)
• Select a folder prefix if required
• IAM role: autogenerate it
-> Click Create location.

5. Create a DataSync task at the source to copy the data to the destination.

In the source account, go to DataSync -> click “Create task” to initiate a data transfer with the parameters below (a CLI sketch follows the list).

• Configure source location:
- Choose an existing location
- Select region
- Choose existing location, select the location created at step 4.
- Click Next
• Configure Destination location:
- Choose an existing location
- Select region
- Choose existing location, select the location created at step 3.
- Click Next
• Provide a Task Name
• Verify data - Select verify only the data transferred
• Set bandwidth limit - use available
• Queueing - enabled
• Data transfer configuration:
- Data to Scan - entire source location
- Transfer all data
- Select keep deleted files
- Select overwrite files
• Exclude Patterns
- Provide any exclusion of files
• Schedule
- Not scheduled
• Task Logging
- Log all transferred objects and files
- Select default cloudwatch log group
• Review and create task
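
For reference, roughly the same task can be created from the CLI. This is a sketch only: the task name, location ARNs and log group ARN are placeholders, and the options shown mirror the console choices above.

aws datasync create-task \
  --name glacier-cross-account-copy \
  --source-location-arn arn:aws:datasync:{{region name}}:{{source account num}}:location/{{source location id}} \
  --destination-location-arn arn:aws:datasync:{{region name}}:{{source account num}}:location/{{destination location id}} \
  --cloud-watch-log-group-arn arn:aws:logs:{{region name}}:{{source account num}}:log-group:{{log group name}}:* \
  --options '{
    "VerifyMode": "ONLY_FILES_TRANSFERRED",
    "OverwriteMode": "ALWAYS",
    "PreserveDeletedFiles": "PRESERVE",
    "TaskQueueing": "ENABLED",
    "LogLevel": "TRANSFER"
  }' \
  --region {{region name}} \
  --profile {{profile name}}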

6. Manually run the task created in step 5. The transfer time depends on the data; here, it averaged around 570 MB/s.
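
The same manual run can be started and monitored from the CLI (the ARNs below are placeholders; start-task-execution returns the execution ARN to poll):

# Start the task
aws datasync start-task-execution \
  --task-arn arn:aws:datasync:{{region name}}:{{source account num}}:task/{{task id}} \
  --region {{region name}} --profile {{profile name}}

# Poll the execution status (LAUNCHING, PREPARING, TRANSFERRING, VERIFYING, SUCCESS or ERROR)
aws datasync describe-task-execution \
  --task-execution-arn {{task execution ARN returned above}} \
  --region {{region name}} --profile {{profile name}}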

7. Verify the transferred files at the destination.
- Verify the CloudWatch logs. If the logs are large, you can export them to S3 and review them there. CloudWatch log events are limited to 256 KB, so if thousands of files were transferred this is not a practical way to verify.
- To verify further, create an inventory (with the Size field included) at the destination and compare it with the source inventory file (a comparison sketch follows).
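
One way to do that comparison, assuming both inventories are plain CSV with the key in the second column and the size in the third (the column positions depend on the optional fields you selected), is a simple sort-and-diff:

# Extract key,size from each inventory CSV and compare; no diff output means the listings match
cut -d, -f2,3 source-inventory.csv | tr -d '"' | sort > source-list.txt
cut -d, -f2,3 destination-inventory.csv | tr -d '"' | sort > destination-list.txt
diff source-list.txt destination-list.txt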

COST AND TIME

Regarding time, transferring 50 TB took around 24 hours, including the batch restore operation. Regarding cost, the Bulk restore appears to be free, and the DataSync charge for 50 TB is around $600.

CLEANUP

1. Remove the Batch Operations job
2. Remove the source inventory configuration if no longer needed
3. Remove the destination inventory configuration if no longer needed
4. Remove CloudWatch log groups if no longer needed
5. Remove the DataSync task
6. Remove roles and policies if no longer needed
7. Remove the destination S3 bucket policy
8. Remove the source and destination locations in DataSync
9. Remove the permissions granted to the user for this operation


Vijay Reddy G

Solutions Architect, interested in cloud, databases and ML