How to efficiently allocate a file of predefined size and fill it with a non-zero value using Python?

Question

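I'm writing a program that uses dynamic programming to solve a difficult problem. The DP solution requires storing a large table. The full table occupies about 300 GB. Physically, it is stored in 40 files of roughly 7 GB each. I mark unused table entries with the byte \xFF. I'd like to allocate space for this table as quickly as possible. The program has to run under both Windows and Linux.

In short, I want to efficiently create large files filled with a specific byte, in a cross-platform way.

Here is the code I'm currently using: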

def reset_storage(self, path):
    fill = b'\xFF'

    with open(path, 'wb') as f:
        for _ in range(3715948544 * 2):
            f.write(fill)

Creating one 7 GB file takes about 40 minutes. How can I speed it up?

I have looked at other questions, but none of them seem to be relevant:

Allocate a file of particular size in Linux with python: no answers
create file of particular size in python: the file is filled with \0, or the solution is Windows-only
How to create a file with a given size in Linux?: all solutions are Linux-specific

Answer 1

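Your problem is that you call a Python method far too often (once for every byte!). What I offer is certainly not perfect, but it will be many, many times faster. Try the following: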

fill = b"\xFF" * 1024 * 1024  # 1 MiB of 0xFF bytes (all bits set), built once
...
file_size = 300 * 1024  # size in MiB now! (300 GiB here; adjust per file)
with open(path, 'wb') as f:
    for _ in range(file_size):
        f.write(fill)
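
As a minimal sketch of the same chunked idea, the helper below also handles a target size that is not a whole number of chunks. The function name `fill_file` and its parameters are hypothetical and do not come from the answer; the 1 MiB default chunk is simply the value used above and may not be optimal on every system.

```python
import os
import tempfile


def fill_file(path, size, byte=0xFF, chunk_size=1024 * 1024):
    """Create `path` with exactly `size` bytes, each equal to `byte`.

    Hypothetical helper: writes whole chunks first, then one short
    final chunk if `size` is not a multiple of `chunk_size`.
    """
    chunk = bytes([byte]) * chunk_size           # built once, reused for every write
    full_chunks, remainder = divmod(size, chunk_size)
    with open(path, 'wb') as f:
        for _ in range(full_chunks):
            f.write(chunk)
        if remainder:
            f.write(chunk[:remainder])           # incomplete trailing chunk


if __name__ == '__main__':
    # Small demonstration on a temporary file (sizes chosen only for the demo).
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = os.path.join(tmp, 'table.bin')
        fill_file(demo_path, 2500, chunk_size=1024)
        print(os.path.getsize(demo_path))
```

For one of the files in the question, the call would presumably look like `fill_file(path, 3715948544 * 2)`.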

Answer 2

Write blocks, not bytes, and avoid iterating over huge ranges for no reason.

import itertools

def reset_storage(self, path):
    total = 3715948544 * 2
    block_size = 4096  # Tune this if needed, just make sure it's a factor of the total
    fill = b'\xFF' * block_size

    with open(path, 'wb') as f:
        f.writelines(itertools.repeat(fill, total // block_size))
        # If you want to handle initialization of arbitrary totals without
        # needing to be careful that block_size evenly divides total, add
        # a single:
        # f.write(fill[:total % block_size])
        # here to write out the incomplete block.

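The ideal block size differs from system to system. One reasonable choice is io.DEFAULT_BUFFER_SIZE, which automatically matches writes to buffer flushes while still keeping memory usage low.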
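
A possible variant along those lines, taking the block size from `io.DEFAULT_BUFFER_SIZE` and writing the trailing partial block so the total need not be a multiple of it. This is a sketch only: the name `reset_storage_buffered` and its signature are assumptions made for illustration, and whether this block size is actually the fastest on a given machine would need to be measured.

```python
import io
import itertools


def reset_storage_buffered(path, total, fill_byte=b'\xFF'):
    """Fill `path` with `total` copies of `fill_byte` (hypothetical helper).

    The block size is taken from io.DEFAULT_BUFFER_SIZE, so each block
    handed to the file object matches its default buffer size.
    """
    block_size = io.DEFAULT_BUFFER_SIZE
    block = fill_byte * block_size
    full_blocks, tail = divmod(total, block_size)

    with open(path, 'wb') as f:
        # writelines consumes the iterator lazily, so only one block
        # is ever held in memory regardless of `total`.
        f.writelines(itertools.repeat(block, full_blocks))
        # Writes nothing when `total` is an exact multiple of block_size.
        f.write(block[:tail])
```

Because `itertools.repeat` returns the same bytes object each time, memory use stays at roughly one block no matter how large the file is.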
