Streaming decompression of an S3 gzip source object to an S3 destination object using Python?

Question

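Given a large gzip object in S3, what is a memory-efficient (e.g. streaming) method in python3/boto3 to decompress the data and store the result back into another S3 object?

A similar question has been asked before. However, all of the answers use an approach in which the contents of the gzip file are first read into memory (e.g. BytesIO). Those solutions are not viable for objects that are too big to fit in main memory.

For large S3 objects, the contents need to be read, decompressed "on the fly", and then written to a different S3 object in some chunked fashion.

Thank you in advance for your consideration and response.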

Answer 1

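You can use streaming methods with boto / S3, but you have to define your own file-like objects, AFAIK.
Fortunately, smart_open handles that for you; it also supports GCS, Azure, HDFS, SFTP and others.
Here is an example using a large sample of sales data: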

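import boto3
from smart_open import open

# Set auth credentials here if they are not already configured in your environment.
session = boto3.Session()
# smart_open >= 5.0 expects a boto3 client; older releases took dict(session=session) instead.
transport_params = dict(client=session.client("s3"))
chunk_size = 1024 * 1024  # read up to 1 Mi characters per call (both files are opened in text mode)

char_count = 0
# smart_open infers gzip from the .gz extension and decompresses while reading.
with open("s3://mybucket/2m_sales_records.csv.gz", encoding="utf-8", transport_params=transport_params) as f_in, \
     open("s3://mybucket/2m_sales_records.csv", "w", encoding="utf-8", transport_params=transport_params) as f_out:
    while True:
        data = f_in.read(chunk_size)
        if not data:
            break
        f_out.write(data)
        char_count += len(data)
        print(f"wrote {char_count} characters so far")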
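For comparison, the "define your own file-like objects" route mentioned above can be sketched with only the standard library's zlib plus an S3 multipart upload. This is a minimal sketch under stated assumptions, not what smart_open does internally: the helper names `iter_decompressed`, `iter_parts` and `stream_gunzip_s3` are hypothetical, `s3` is assumed to be a `boto3.client("s3")`, and the source is assumed to be a single-member gzip file that decompresses to at least one byte.

```python
import zlib
from typing import Iterable, Iterator

MIN_PART_SIZE = 5 * 1024 ** 2  # S3 requires every part except the last to be >= 5 MiB


def iter_decompressed(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Incrementally gunzip an iterable of compressed byte chunks."""
    # wbits = 16 + MAX_WBITS makes zlib expect a gzip header and trailer.
    # Assumption: single-member gzip; data after the first member is ignored.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        data = decompressor.decompress(chunk)
        if data:
            yield data
    tail = decompressor.flush()
    if tail:
        yield tail


def iter_parts(chunks: Iterable[bytes], part_size: int = MIN_PART_SIZE) -> Iterator[bytes]:
    """Regroup arbitrary-sized chunks into parts of exactly part_size bytes (the last may be smaller)."""
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        while len(buffer) >= part_size:
            yield bytes(buffer[:part_size])
            del buffer[:part_size]
    if buffer:
        yield bytes(buffer)


def stream_gunzip_s3(s3, bucket: str, src_key: str, dst_key: str, read_size: int = 1024 * 1024) -> int:
    """Gunzip src_key into dst_key via a multipart upload; returns the number of parts."""
    body = s3.get_object(Bucket=bucket, Key=src_key)["Body"]
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=dst_key)["UploadId"]
    parts = []
    try:
        compressed = body.iter_chunks(chunk_size=read_size)
        for number, part in enumerate(iter_parts(iter_decompressed(compressed)), start=1):
            response = s3.upload_part(
                Bucket=bucket, Key=dst_key, UploadId=upload_id, PartNumber=number, Body=part
            )
            parts.append({"ETag": response["ETag"], "PartNumber": number})
        s3.complete_multipart_upload(
            Bucket=bucket, Key=dst_key, UploadId=upload_id, MultipartUpload={"Parts": parts}
        )
    except Exception:
        # Abort so that already-uploaded parts do not linger in the bucket.
        s3.abort_multipart_upload(Bucket=bucket, Key=dst_key, UploadId=upload_id)
        raise
    return len(parts)
```

Memory use here is roughly one part plus one decompressed chunk. Note that a highly compressible input can expand a single 1 MiB read into a much larger chunk, and that S3 caps a multipart upload at 10,000 parts, so very large outputs need a larger part size.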

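The sample file has 2 million rows and is 75 MB compressed and 238 MB uncompressed.
I uploaded the compressed file to mybucket and ran the smart_open code above, which downloads the file, decompresses the contents in memory, and uploads the uncompressed data back to S3.
On my computer, the process took about 78 seconds (highly dependent on internet connection speed) and never used more than 95 MB of memory; I think you can lower the memory requirement, if need be, by overriding the part size for S3 multipart uploads in smart_open: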

DEFAULT_MIN_PART_SIZE = 50 * 1024**2
"""Default minimum part size for S3 multipart uploads"""
MIN_MIN_PART_SIZE = 5 * 1024 ** 2
"""The absolute minimum permitted by Amazon."""
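One way to override the part size, as far as I know, is the `min_part_size` key in `transport_params`. The sketch below is an assumption-laden illustration rather than the answer's tested code: `write_transport_params` and `gunzip_s3_object` are hypothetical helper names, the `client` key assumes smart_open 5.0 or later, and it relies on smart_open decompressing `.gz` sources by extension even in binary mode.

```python
from typing import Any, Dict, Optional

MIN_MIN_PART_SIZE = 5 * 1024 ** 2  # S3's lower bound, matching the constant quoted above


def write_transport_params(min_part_size: int, client: Optional[Any] = None) -> Dict[str, Any]:
    """Build transport_params for an S3 write with a custom multipart part size."""
    if min_part_size < MIN_MIN_PART_SIZE:
        raise ValueError(f"min_part_size must be at least {MIN_MIN_PART_SIZE} bytes")
    params: Dict[str, Any] = {"min_part_size": min_part_size}
    if client is not None:
        params["client"] = client  # assumption: smart_open >= 5.0
    return params


def gunzip_s3_object(
    src_uri: str,
    dst_uri: str,
    min_part_size: int = MIN_MIN_PART_SIZE,
    chunk_size: int = 1024 * 1024,
    client: Optional[Any] = None,
) -> int:
    """Stream-decompress src_uri (.gz) into dst_uri; returns the bytes written."""
    # Imported lazily so the helper above can be used without smart_open installed.
    from smart_open import open as s3_open

    read_params = {"client": client} if client is not None else {}
    write_params = write_transport_params(min_part_size, client)
    total = 0
    with s3_open(src_uri, "rb", transport_params=read_params) as f_in, \
         s3_open(dst_uri, "wb", transport_params=write_params) as f_out:
        # read() returns b"" at end of stream, which stops the iterator.
        for chunk in iter(lambda: f_in.read(chunk_size), b""):
            f_out.write(chunk)
            total += len(chunk)
    return total
```

Calling `gunzip_s3_object("s3://mybucket/2m_sales_records.csv.gz", "s3://mybucket/2m_sales_records.csv")` would then buffer about 5 MiB per uploaded part instead of the 50 MiB default, at the cost of more upload requests.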
