AWS S3 copy to bucket from remote location

I have a large dataset on a public server (~0.5TB, multi-part here) that I'd like to copy into my own S3 bucket. It seems that aws s3 cp only works with local files or files already in an S3 bucket?

How can I copy that file (in one piece or in parts) into S3? Can I use the AWS CLI, or do I need something else?
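
For reference, I was hoping for something like the following (a hypothetical command with a placeholder URL), but the CLI rejects an HTTP source, since both arguments must be a local path or an s3:// URI:

aws s3 cp http://example.com/dataset/part1.zip s3://my-bucket/part1.zip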

There is no way to upload to S3 directly from a remote location. However, you can stream the contents of the remote files through your machine and on to S3. This means you will still download the entire 0.5TB, but your computer only ever holds a small chunk of it in memory at a time (nothing is persisted to disk either). Here is a simple implementation in JavaScript:

const request = require('request')
const async = require('async')
const AWS = require('aws-sdk')

const s3 = new AWS.S3()
const Bucket = 'nyu_depth_v2' // note: new S3 bucket names may not contain underscores
const baseUrl = 'http://horatio.cs.nyu.edu/mit/silberman/nyu_depth_v2/'
const parallelLimit = 5 // number of files to stream to S3 simultaneously
const parts = [
  'basements.zip',
  'bathrooms_part1.zip',
  'bathrooms_part2.zip',
  'bathrooms_part3.zip',
  'bathrooms_part4.zip',
  'bedrooms_part1.zip',
  'bedrooms_part2.zip',
  'bedrooms_part3.zip',
  'bedrooms_part4.zip',
  'bedrooms_part5.zip',
  'bedrooms_part6.zip',
  'bedrooms_part7.zip',
  'bookstore_part1.zip',
  'bookstore_part2.zip',
  'bookstore_part3.zip',
  'cafe.zip',
  'classrooms.zip',
  'dining_rooms_part1.zip',
  'dining_rooms_part2.zip',
  'furniture_stores.zip',
  'home_offices.zip',
  'kitchens_part1.zip',
  'kitchens_part2.zip',
  'kitchens_part3.zip',
  'libraries.zip',
  'living_rooms_part1.zip',
  'living_rooms_part2.zip',
  'living_rooms_part3.zip',
  'living_rooms_part4.zip',
  'misc_part1.zip',
  'misc_part2.zip',
  'office_kitchens.zip',
  'offices_part1.zip',
  'offices_part2.zip',
  'playrooms.zip',
  'reception_rooms.zip',
  'studies.zip',
  'study_rooms.zip'
]

// Stream each remote file through this machine and into S3,
// with at most `parallelLimit` uploads in flight at once.
async.eachLimit(parts, parallelLimit, (Key, cb) => {
  s3.upload({
    Key,
    Bucket,
    // request() returns a readable stream, which s3.upload
    // consumes chunk by chunk without buffering the whole file
    Body: request(baseUrl + Key)
  }, cb)
}, (err) => {
  if (err) console.error(err)
  else console.log('Done')
})
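
To answer the AWS CLI part of the question: aws s3 cp can also read from standard input when you pass - as the source, so you can get the same pass-through streaming from a shell with no code at all. A rough sketch for a single part, reusing the base URL and bucket from above:

curl http://horatio.cs.nyu.edu/mit/silberman/nyu_depth_v2/basements.zip | aws s3 cp - s3://nyu_depth_v2/basements.zip

For very large streams you may need to pass --expected-size so the CLI can pick a suitable multipart chunk size. Like the JavaScript version, this still routes every byte through your machine; it just never touches the disk.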