How to extract files from a zip archive in S3

Question

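I have uploaded a zip archive to a location in S3 (say /foo/bar.zip). I would like to extract the contents of bar.zip and place them under /foo, without downloading the archive or re-uploading the extracted files. How can I do this, so that S3 is treated pretty much like a file system?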

Answer 1

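S3 isn't really designed to allow this; normally you would have to download the file, process it and upload the extracted files.

However, there are a few options: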

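1. You could mount the S3 bucket as a local file system using s3fs and FUSE (see the s3fs article and GitHub site). This still requires the files to be downloaded and uploaded, but it hides those operations behind a file system interface.
2. If your main concern is to avoid downloading data out of AWS to your local machine, you could of course download the data onto a remote EC2 instance and do the work there, with or without s3fs. This keeps the data within Amazon's data centers.
3. You could use AWS Lambda.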

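You would need to create, package and upload a small program written in Node.js (Lambda also supports other runtimes, such as Python, which the later answers use) to access, decompress and upload the files. This processing takes place on AWS infrastructure behind the scenes, so you won't need to download any files to your own machine. See the FAQs.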

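Finally, you need a way to trigger this code. Typically, in Lambda, it would be triggered automatically when a zip file is uploaded to S3. If the file is already there, you may need to trigger it manually through the invoke-async command provided by the AWS API (since deprecated in favor of invoke with the Event invocation type). See the AWS Lambda walkthroughs and API docs.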
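A manual trigger for an archive that is already in S3 could look roughly like the sketch below. It builds a minimal S3-style event and invokes the function asynchronously. The helper names and the reduced event shape are assumptions for illustration (a real S3 notification carries many more fields), and `lambda_client` is expected to be something like `boto3.client("lambda")`.

```python
import json
import urllib.parse


def build_s3_put_event(bucket, key):
    """Build a minimal S3-style event for an object that already exists.

    Only the fields a typical handler reads (bucket name and object key)
    are included; this is not the full S3 notification format.
    """
    return {
        "Records": [
            {
                "eventSource": "aws:s3",
                "eventName": "ObjectCreated:Put",
                "s3": {
                    "bucket": {"name": bucket},
                    # S3 URL-encodes keys in notifications, so do the same here
                    "object": {"key": urllib.parse.quote_plus(key, safe="/")},
                },
            }
        ]
    }


def invoke_for_existing_object(lambda_client, function_name, bucket, key):
    """Invoke the function asynchronously, as if the object had just been uploaded."""
    payload = json.dumps(build_s3_put_event(bucket, key)).encode("utf-8")
    return lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="Event",  # asynchronous: returns before the function finishes
        Payload=payload,
    )
```

For example, `invoke_for_existing_object(boto3.client("lambda"), "unzip-fn", "my-bucket", "foo/bar.zip")`, where the function and bucket names are placeholders.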

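However, this is quite an elaborate way of avoiding downloads, and is probably only worthwhile if you need to process a large number of zip files. Note also that (as of October 2018) Lambda functions are limited to a maximum duration of 15 minutes (the default timeout is 3 seconds), so they may time out if your files are extremely large. And since scratch space in /tmp is limited to about 500 MB (512 MB by default), the size of your files is limited too.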
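To make the Lambda option more concrete, here is a minimal sketch (in Python rather than Node.js) of the extract-and-upload step, working in memory instead of /tmp. The function name and the size cap are assumptions for illustration; `s3_client` is expected to behave like `boto3.client("s3")`, and only its `head_object`, `get_object` and `put_object` calls are used.

```python
import io
import posixpath
import zipfile

# Assumption: archives above this size are rejected rather than risk exhausting
# the function's memory; tune it to the function's actual configuration.
MAX_ARCHIVE_BYTES = 256 * 1024 * 1024


def unzip_to_prefix(s3_client, bucket, key, max_bytes=MAX_ARCHIVE_BYTES):
    """Extract the archive at `key` and upload each member next to it.

    For key "foo/bar.zip", a member "sub/b.txt" is written to "foo/sub/b.txt".
    Returns the list of keys written.
    """
    # Check the size first so an oversized archive fails fast
    size = s3_client.head_object(Bucket=bucket, Key=key)["ContentLength"]
    if size > max_bytes:
        raise ValueError(f"{key} is {size} bytes, above the {max_bytes} byte limit")

    body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
    prefix = posixpath.dirname(key)
    written = []
    with zipfile.ZipFile(io.BytesIO(body)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # S3 has no real directories to create
            target = posixpath.join(prefix, info.filename.lstrip("/"))
            s3_client.put_object(Bucket=bucket, Key=target, Body=archive.read(info))
            written.append(target)
    return written
```

Because this writes into the same prefix that holds the archive, an S3 trigger on that prefix should probably be restricted to the .zip suffix so the extracted files do not invoke the function again.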

Answer 2

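If keeping the data within AWS is the goal, you can use AWS Lambda to:

1. Connect to S3 (I connect the Lambda function via a trigger from S3)
2. Copy the data from S3
3. Open the archive and decompress it (no need to write to disk)
4. Do something with the data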

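If the function is started by a trigger, Lambda will suggest that you put the output in a separate S3 location to avoid accidental loops. To open the archive, process it, and then return the contents, you can do something like the following.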

import csv
import io
import json
import urllib.parse
from zipfile import ZipFile

import boto3

s3 = boto3.client("s3")

def extract_zip(input_zip):
    """Read a zip archive from a file-like object into a {name: bytes} dict."""
    contents = input_zip.read()
    with ZipFile(io.BytesIO(contents)) as archive:
        # Skip directory entries, which have no content of their own
        return {
            name: archive.read(name)
            for name in archive.namelist()
            if not name.endswith("/")
        }


def lambda_handler(event, context):
    print("Received event: " + json.dumps(event, indent=2))
    # Get the bucket name and the URL-decoded object key from the S3 event
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"], encoding="utf-8")
    try:
        response = s3.get_object(Bucket=bucket, Key=key)
        print(f"Attempting to open {key}")
        print("CONTENT TYPE: " + response["ContentType"])
        data = []
        # The archive is unpacked entirely in memory; nothing is written to disk
        contents = extract_zip(response["Body"])
        for name, raw in contents.items():
            # This example assumes every member is a small UTF-8 CSV file
            reader = csv.reader(io.StringIO(raw.decode("utf-8")), delimiter=",")
            for row in reader:
                data.append(row)
        return {"statusCode": 200, "body": data}

    except Exception as e:
        print(e)
        print(
            "Error getting object {} from bucket {}. Make sure they exist and "
            "your bucket is in the same region as this function.".format(key, bucket)
        )
        raise

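The code above accesses the file contents through response['Body'], where response is the dictionary returned by s3.get_object (the S3 trigger event itself only supplies the bucket name and key). The response body is an instance of StreamingBody, a file-like object with a few convenience functions. Use its read() method, passing an amt argument if you are processing large files or files of unknown size.

Working on an archive in memory requires a few extra steps: wrap the contents in a BytesIO object and open it with the standard library's ZipFile (see its documentation). Once the data has been passed to ZipFile, you can call read() on its members. From here you need to work out what to do for your particular use case. If the archive contains more than one file, you will need logic to handle each of them. My example assumes you have one or a few small CSV files to process, and returns a dictionary with the file names as keys and the file contents as values.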
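For larger archives, reading with an amt argument and handling members one at a time might look like the following sketch. The helper names, the size cap and the default chunk size are assumptions for illustration; `body` can be any object with a `read(amt)` method, such as the StreamingBody in `response["Body"]`.

```python
import io
import zipfile


def read_capped(body, max_bytes, chunk_size=1024 * 1024):
    """Read a file-like body in chunks, giving up once it exceeds max_bytes."""
    buffer = io.BytesIO()
    while True:
        chunk = body.read(chunk_size)  # same shape as StreamingBody.read(amt)
        if not chunk:
            break
        buffer.write(chunk)
        if buffer.tell() > max_bytes:
            raise ValueError(f"archive is larger than {max_bytes} bytes")
    buffer.seek(0)
    return buffer


def iter_members(buffer, suffix=".csv"):
    """Yield (name, bytes) for each archive member whose name ends with suffix."""
    with zipfile.ZipFile(buffer) as archive:
        for info in archive.infolist():
            # Skip directories and anything that is not the expected file type
            if info.is_dir() or not info.filename.lower().endswith(suffix):
                continue
            yield info.filename, archive.read(info)
```

Yielding members one by one means only the compressed archive plus a single decompressed member need to be in memory at a time, rather than every member at once.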

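I have included the next step of reading the CSV files and returning the data, along with a status code of 200, in the response. Keep in mind that your needs may differ. This example wraps the data in a StringIO object and uses a CSV reader to handle it. Once the result has been passed back in the response, the Lambda function can hand off processing to another AWS process.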
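If the function writes extracted files back to S3 instead of returning them, the risk of a trigger loop becomes relevant. One possible guard is sketched below; the output prefix and both helper names are assumptions for illustration, not part of the original example.

```python
import posixpath

# Assumption: extracted files are written under this prefix; the name is illustrative.
OUTPUT_PREFIX = "extracted/"


def should_process(key, output_prefix=OUTPUT_PREFIX):
    """Return True only for zip archives that are not already under the output prefix."""
    return key.lower().endswith(".zip") and not key.startswith(output_prefix)


def output_key(archive_key, member_name, output_prefix=OUTPUT_PREFIX):
    """Map an archive member to a key under the output prefix.

    The archive's own path (minus its extension) becomes a folder, so members
    of different archives cannot overwrite each other.
    """
    archive_stem = archive_key.rsplit(".", 1)[0].lstrip("/")
    return posixpath.join(output_prefix, archive_stem, member_name.lstrip("/"))
```

A handler could call `should_process(key)` first and return early when it is false, then use `output_key(key, name)` as the destination key for each extracted member.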

Answer 3

Below is an example that uses the Python s3fs package to read the files in a zip archive. Let s3_file_path be the path of the target file on S3:

import s3fs
from zipfile import ZipFile
import io

s3_file_path = '...'
fs = s3fs.S3FileSystem(anon=False)
input_zip = ZipFile(io.BytesIO(fs.cat(s3_file_path)))

encoding = 'ISO-8859-1'  # or 'utf-8'
for name in input_zip.namelist():
    data = input_zip.read(name).decode(encoding)
    print("filename: " + name)
    print("sample data: " + data[0:100])

You will need to adjust encoding for different types of files.
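When the encoding is not known in advance, one option is to try a short list of candidates in order. This is only a sketch: `decode_member` is a hypothetical helper, and the candidate list is an assumption. Note that ISO-8859-1 accepts any byte sequence, so it should come last and acts as a fallback rather than as real detection.

```python
def decode_member(raw, encodings=("utf-8", "ISO-8859-1")):
    """Decode bytes with the first encoding that works.

    Returns a (text, encoding_used) tuple, or raises ValueError if none fit.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError(f"could not decode data with any of {encodings}")
```

In the loop above, `data = input_zip.read(name).decode(encoding)` could then become `data, used = decode_member(input_zip.read(name))`.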

Answer 4

You can use AWS Lambda for this. You can write Python code that uses boto3 to connect to S3, then read the file into a buffer and unzip it using these libraries:

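import io
import zipfile

import boto3

# Assumption: zipped_file is a boto3 S3 Object; the bucket and key are placeholders
zipped_file = boto3.resource("s3").Object("my-bucket", "foo/bar.zip")

# Read the whole archive into an in-memory buffer and open it as a zip file
buffer = io.BytesIO(zipped_file.get()["Body"].read())
zipped = zipfile.ZipFile(buffer)
for file in zipped.namelist():
    ...  # per-file processing goes here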

There is also a tutorial here: https://betterprogramming.pub/unzip-and-gzip-incoming-s3-files-with-aws-lambda-f7bccf0099c9

cloud

amazon-s3

amazon-web-services