Can gzip compress data without loading it all into memory, i.e. streaming/on-the-fly?

Is it possible to gzip data in a streaming fashion, i.e. without having all of the compressed data in memory at once?

For example, could I compress a 10 GB file on a machine with 2 GB of memory?

Per https://docs.python.org/3/library/gzip.html#gzip.compress, the gzip.compress function returns the gzipped bytes, so they must all be held in memory. But it is not clear how gzip.open works internally: are the compressed bytes all kept in memory at once? Does the gzip format itself make a streaming gzip implementation particularly tricky?

[This question is tagged Python, but non-Python answers are welcome too]

You don't have to compress all 10 GB at once. You can read the input data in chunks and compress each chunk separately, so it doesn't all have to fit in memory at the same time.

import gzip

chunksize = 100 * 1024 * 1024  # 100 MB chunks

with open("bigfile.txt", "rb") as infile:
    while True:
        chunk = infile.read(chunksize)
        if not chunk:
            break
        compressed = gzip.compress(chunk)
        # do something with compressed
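
Note that each gzip.compress call produces a complete, self-contained gzip member, and the gzip format allows members to be concatenated into a single valid stream, so the compressed chunks can simply be appended to the same output. A quick illustration of that property (the literal data here is just a stand-in):

import gzip

part1 = gzip.compress(b"hello ")
part2 = gzip.compress(b"world")
# Concatenated members form one valid multi-member gzip stream,
# which decompresses back to the full original data
assert gzip.decompress(part1 + part2) == b"hello world"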

If you are creating a compressed file, you can write the chunks straight to a gzip file.

with open("bigfile.txt") as infile, gzip.open("bigfile.txt.gz", "w") as gzipfile:
    while True:
        chunk = infile.read(chunksize)
        if not chunk:
            break
        gzipfile.write(chunk)
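
This copy loop is essentially what shutil.copyfileobj from the standard library does for you: it moves data between two file objects in small fixed-size chunks, so memory use stays bounded regardless of file size. An equivalent sketch, using the same placeholder file names:

import gzip
import shutil

with open("bigfile.txt", "rb") as infile, gzip.open("bigfile.txt.gz", "wb") as gzipfile:
    # copyfileobj moves data in small fixed-size chunks, never the whole file at once
    shutil.copyfileobj(infile, gzipfile)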

[This is based on … and the comments]

Streaming gzip compression is possible. The gzip module uses zlib, which is documented to support streaming compression, and peeking into the gzip module's source, it does not appear to hold all of the output bytes in memory.
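
For example, gzip.GzipFile accepts any writable file-like object via its fileobj argument and pushes compressed bytes to it as you write, so only the current chunk plus zlib's internal buffers live in memory. A rough sketch, where compress_to, sink and chunks are placeholder names:

import gzip

def compress_to(sink, chunks):
    # "sink" is any writable binary file-like object (an open file, a socket's
    # makefile('wb'), etc.); GzipFile writes compressed bytes to it as it goes
    with gzip.GzipFile(fileobj=sink, mode="wb") as gz:
        for chunk in chunks:
            gz.write(chunk)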

You can also do this directly with the zlib module, for example with a small pipeline of generators:

import zlib

def yield_uncompressed_bytes():
    # In a real case, would yield bytes pulled from the filesystem or the network
    chunk = b'*' * 65000
    for _ in range(0, 10000):
        print('In: ', len(chunk))
        yield chunk

def yield_compressed_bytes(_uncompressed_bytes):
    # wbits=MAX_WBITS | 16 makes zlib emit a gzip-format stream (header + trailer)
    compress_obj = zlib.compressobj(wbits=zlib.MAX_WBITS | 16)
    for chunk in _uncompressed_bytes:
        if compressed_bytes := compress_obj.compress(chunk):
            yield compressed_bytes

    if compressed_bytes := compress_obj.flush():
        yield compressed_bytes

uncompressed_bytes = yield_uncompressed_bytes()
compressed_bytes = yield_compressed_bytes(uncompressed_bytes)

for chunk in compressed_bytes:
    # In a real case, could save to the filesystem, or send over the network
    print('Out:', len(chunk))

You can see the In: and Out: lines interleaved, which shows that the zlib compressobj really does not hold the entire output in memory.
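
For completeness, the reverse direction streams the same way: zlib.decompressobj (with the matching wbits for gzip-wrapped input) can be used in an analogous generator to decompress chunks as they arrive, again without holding the whole stream in memory. A sketch along the same lines, with yield_decompressed_bytes as a placeholder name:

import zlib

def yield_decompressed_bytes(_compressed_bytes):
    # wbits=MAX_WBITS | 16 expects gzip-format input, matching the compressor above
    decompress_obj = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    for chunk in _compressed_bytes:
        if decompressed_bytes := decompress_obj.decompress(chunk):
            yield decompressed_bytes

    if decompressed_bytes := decompress_obj.flush():
        yield decompressed_bytes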