How to download, decompress and transfer multiple files directly into an s3 bucket?

Question

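My problem is the following: I want to download a dataset hosted somewhere using a URL, decompress it, and upload the files (e.g. images) to an S3 bucket. An example dataset would be CIFAR-100: https://www.cs.toronto.edu/~kriz/cifar.html, and the dataset URL would be https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz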

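Note that in some cases the dataset is large, so downloading it to my local machine first is not an option at all. I thought of building a pipeline to keep this as simple as possible. The following command works for a single file (e.g. a single image):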

curl "url/single_image.tar.gz" | tar xvz | aws s3 cp - s3://my_bucket/single_image.jpg

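But if the compressed archive contains, for example, multiple images, the command above no longer works, because it requires the target filename and extension to be specified.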

What is the simplest solution to this problem?

Answer 1

Use GNU tar with the --to-command option, which allows you to:

Extract files and pipe their contents to the standard input of command. When this option is used, instead of creating the files specified, tar invokes command and pipes the contents of the files to its standard input.

It even supports the following:

The command can obtain the information about the file it processes from the following environment variables:

TAR_FILENAME The name of the file.
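To see how the per-file command and its environment variables behave before pointing anything at S3, here is a minimal offline sketch. It assumes GNU tar (bsdtar has no --to-command). The names workdir and listing, and the sample file names, are made up for illustration. TAR_SIZE is another variable GNU tar sets alongside TAR_FILENAME.

```shell
#!/bin/sh
set -eu

# Build a tiny throwaway archive so the behaviour can be observed offline.
# (workdir and the file names below are hypothetical.)
workdir=$(mktemp -d)
mkdir -p "$workdir/images"
printf 'aaa' > "$workdir/images/cat.jpg"
printf 'bbbbb' > "$workdir/images/dog.jpg"
tar -czf "$workdir/demo.tar.gz" -C "$workdir" images

# Work inside the temp directory so nothing touches the caller's directory.
cd "$workdir"

# tar runs the command once per regular file (directories are skipped).
# The member's contents arrive on the command's stdin and its metadata in
# TAR_* variables. Single quotes keep $TAR_FILENAME unexpanded until tar's
# own /bin/sh runs the command. cat drains stdin, standing in for an upload.
listing=$(tar -xzf demo.tar.gz \
    --to-command='cat >/dev/null; echo "$TAR_FILENAME $TAR_SIZE"')

# One "name size" line per regular file in the archive.
printf '%s\n' "$listing"
```

Nothing is written to disk for the archive members themselves; each one only ever exists as a stream handed to the command.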

The following command should do what you want:

curl https://xxxxx/test.tar.gz | tar -xz --to-command='aws s3 cp - "s3://bucket/$TAR_FILENAME"'
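If you want to check which object keys would be created before uploading anything, the one-liner can be wrapped in a small function with a dry-run switch. This is only a sketch under some assumptions: GNU tar is required, and upload_members, S3_PREFIX, DRY_RUN and the sample paths are hypothetical names invented for this example, not part of tar or the AWS CLI. The dry-run branch replaces `aws s3 cp -` with a command that discards the data and prints the destination.

```shell
#!/bin/sh
set -eu

# Per-member commands; tar runs them through /bin/sh with TAR_FILENAME set.
# S3_PREFIX is a hypothetical variable passed in through the environment.
real_upload='aws s3 cp - "$S3_PREFIX/$TAR_FILENAME"'
dry_upload='cat >/dev/null; echo "would upload to $S3_PREFIX/$TAR_FILENAME"'

# Reads a gzip-compressed tar stream on stdin and hands every regular file
# in it to the upload command, one invocation per file.
upload_members() {
    prefix=$1
    if [ "${DRY_RUN:-0}" = 1 ]; then
        cmd=$dry_upload
    else
        cmd=$real_upload
    fi
    S3_PREFIX=$prefix tar -xzf - --to-command="$cmd"
}

# Offline demo with a tiny local archive (all names are made up).
workdir=$(mktemp -d)
mkdir -p "$workdir/train"
printf 'x' > "$workdir/train/0001.png"
printf 'y' > "$workdir/train/0002.png"
tar -czf "$workdir/sample.tar.gz" -C "$workdir" train
cd "$workdir"

DRY_RUN=1
plan=$(upload_members s3://my_bucket/dataset < sample.tar.gz)
printf '%s\n' "$plan"

# Real use would stream straight from the network, for example:
#   curl -fsSL "https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz" \
#       | upload_members s3://my_bucket/cifar-100
```

Because the function reads the archive from stdin, the same code path serves both the local test and the curl pipeline, and the archive is never stored on the local disk.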

linux

etl

amazon-s3

amazon-web-services