Using Amazon's Data Pipeline to back up an S3 bucket -- how to skip existing files and avoid unnecessary overwriting?

I'm using Amazon's Data Pipeline to copy an S3 bucket to another bucket. It's a pretty simple setup that runs nightly. However, every subsequent run copies the same files over and over again. I'd rather it just skip existing files and copy only the new ones, since this backup is going to get very large over time. Is there a way to do that?

Looking at this thread, it seems that syncing is not possible with the default CopyActivity:

You can definitely use Data Pipeline to copy one S3 directory to another, with the caveat that, if you use the CopyActivity, it'll be a full copy, not an rsync. So if you're operating on a large number of files where only a small fraction have changed, the CopyActivity wouldn't be the most efficient way to do it.

You could also write your own logic to perform the diff and then only sync that, and use the CommandRunnerActivity to schedule and manage it.

I think they actually mean ShellCommandActivity, which lets you schedule shell commands to run.

I can't give you an exact configuration example, but here is an example of the command you could run as a regular cron job to sync the two buckets: aws s3 sync s3://source_bucket s3://target_bucket.
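
As a rough sketch, assuming a nightly schedule and the same placeholder bucket names, a crontab entry for this could look like the following; sync only copies objects that are new or changed and skips the rest:

# Hypothetical crontab entry: sync the buckets nightly at 2 AM,
# appending output to a log file for later inspection.
0 2 * * * aws s3 sync s3://source_bucket s3://target_bucket >> /var/log/s3-backup.log 2>&1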

It should be possible to run that using ShellCommandActivity. Also check the comments on the answer here.

Update: per @trevorhinesley's comment, here is the final solution (the default instance launched by the pipeline uses an old version of the AWS CLI that doesn't have the sync command):

For anyone who comes across this, I had to launch an EC2 instance, then copy the AMI ID it used (when you select the instance in the EC2 console, it's in the info shown below the instance list). I used that image ID in Data Pipeline and it fixed it!
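
For reference, here is a minimal sketch of what pinning that AMI on the pipeline's EC2 resource could look like; the imageId value below is a placeholder you would replace with the AMI ID copied from the console:

{
  "name": "Ec2Instance",
  "id": "Ec2Instance",
  "type": "Ec2Resource",
  "imageId": "ami-0123456789abcdef0",
  "instanceType": "t1.micro",
  "terminateAfter": "1 Hour"
}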

You can do it like this to make sure aws is updated before the activity runs. Below is the activity definition, followed by the parameter values used in the run:

{
  "name": "CliActivity",
  "id": "CliActivity",
  "runsOn": {
    "ref": "Ec2Instance"
  },
  "type": "ShellCommandActivity",
  "command": "(sudo yum -y update aws-cli) && (#{myAWSCLICmd})"
},
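
The && chaining in the command field matters here: sudo yum -y update aws-cli upgrades the CLI first, and the sync command from #{myAWSCLICmd} only runs if that update succeeds, so the activity never falls through to the outdated CLI that lacks the sync subcommand.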

"parameters": [
{
  "watermark": "aws [options] <command> <subcommand> [parameters]",
  "description": "AWS CLI command",
  "id": "myAWSCLICmd",
  "type": "String"
}],
  "values": {
"myAWSCLICmd": "aws s3 sync s3://source s3://target"}