Can gzip compress data without loading it all into memory, i.e. streaming/on-the-fly?

Is it possible to gzip data in a streaming fashion, i.e. without having all of the compressed data in memory at once?

For example, could I compress a 10 GB file on a machine with 2 GB of memory?

Per https://docs.python.org/3/library/gzip.html#gzip.compress, the gzip.compress function returns the gzipped bytes, so they must all be held in memory. But it is not clear how gzip.open works internally: are the compressed bytes all kept in memory at once? Does the gzip format itself make a streaming gzip implementation particularly tricky?

[This question is tagged Python, but non-Python answers are welcome too]

You don't have to compress all 10 GB at once. You can read the input data in chunks and compress each chunk separately, so it doesn't all have to fit in memory at the same time.

import gzip

chunksize = 100 * 1024 * 1024  # 100 MB chunks

with open("bigfile.txt", "rb") as infile:
    while True:
        chunk = infile.read(chunksize)
        if not chunk:
            break
        compressed = gzip.compress(chunk)
        # do something with compressed
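
Note that each gzip.compress call produces a complete, self-contained gzip member, and the gzip format allows members to be concatenated into a single valid stream, so the compressed chunks can simply be appended to the same output. A quick illustration of that property (the literal data here is just a stand-in):

import gzip

part1 = gzip.compress(b"hello ")
part2 = gzip.compress(b"world")
# Concatenated members form one valid multi-member gzip stream,
# which decompresses back to the full original data
assert gzip.decompress(part1 + part2) == b"hello world"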

If you are creating a compressed file, you can write the chunks straight to a gzip file.

with open("bigfile.txt") as infile, gzip.open("bigfile.txt.gz", "w") as gzipfile:
    while True:
        chunk = infile.read(chunksize)
        if not chunk:
            break
        gzipfile.write(chunk)
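
This copy loop is essentially what shutil.copyfileobj from the standard library does for you: it moves data between two file objects in small fixed-size chunks, so memory use stays bounded regardless of file size. An equivalent sketch, using the same placeholder file names:

import gzip
import shutil

with open("bigfile.txt", "rb") as infile, gzip.open("bigfile.txt.gz", "wb") as gzipfile:
    # copyfileobj moves data in small fixed-size chunks, never the whole file at once
    shutil.copyfileobj(infile, gzipfile)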

[This is based on … and the comments]

Streaming gzip compression is possible. The gzip module uses zlib, which is documented to support streaming compression, and peeking into the gzip module's source, it does not appear to hold all of the output bytes in memory.
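
For example, gzip.GzipFile accepts any writable file-like object via its fileobj argument and pushes compressed bytes to it as you write, so only the current chunk plus zlib's internal buffers live in memory. A rough sketch, where compress_to, sink and chunks are placeholder names:

import gzip

def compress_to(sink, chunks):
    # "sink" is any writable binary file-like object (an open file, a socket's
    # makefile('wb'), etc.); GzipFile writes compressed bytes to it as it goes
    with gzip.GzipFile(fileobj=sink, mode="wb") as gz:
        for chunk in chunks:
            gz.write(chunk)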

You can also do this directly with the zlib module, for example with a small pipeline of generators:

import zlib

def yield_uncompressed_bytes():
    # In a real case, would yield bytes pulled from the filesystem or the network
    chunk = b'*' * 65000
    for _ in range(0, 10000):
        print('In: ', len(chunk))
        yield chunk

def yield_compressed_bytes(_uncompressed_bytes):
    # wbits=MAX_WBITS | 16 makes zlib emit a gzip-format stream (header + trailer)
    compress_obj = zlib.compressobj(wbits=zlib.MAX_WBITS | 16)
    for chunk in _uncompressed_bytes:
        if compressed_bytes := compress_obj.compress(chunk):
            yield compressed_bytes

    if compressed_bytes := compress_obj.flush():
        yield compressed_bytes

uncompressed_bytes = yield_uncompressed_bytes()
compressed_bytes = yield_compressed_bytes(uncompressed_bytes)

for chunk in compressed_bytes:
    # In a real case, could save to the filesystem, or send over the network
    print('Out:', len(chunk))

You can see the In: and Out: lines interleaved, which shows that the zlib compressobj really does not hold the entire output in memory.
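
For completeness, the reverse direction streams the same way: zlib.decompressobj (with the matching wbits for gzip-wrapped input) can be used in an analogous generator to decompress chunks as they arrive, again without holding the whole stream in memory. A sketch along the same lines, with yield_decompressed_bytes as a placeholder name:

import zlib

def yield_decompressed_bytes(_compressed_bytes):
    # wbits=MAX_WBITS | 16 expects gzip-format input, matching the compressor above
    decompress_obj = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    for chunk in _compressed_bytes:
        if decompressed_bytes := decompress_obj.decompress(chunk):
            yield decompressed_bytes

    if decompressed_bytes := decompress_obj.flush():
        yield decompressed_bytes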