I need to split a very large text file

Question

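I have a very large text file (bigger than my RAM), and I need to use each line of it for further processing. But if I read it, say, 4096 bytes at a time, I'm worried about splitting a line somewhere in the middle. How should I proceed?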

Answer 1

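People do this kind of thing a lot in audio encoding, where files can be huge. As far as I know, the normal approach is to keep a memory buffer and work in two stages: read a blob of arbitrary size (4096 bytes or whatever) into the buffer, then stream characters out of the buffer, reacting to line endings. Because the buffer is in RAM, streaming out of it character by character is fast. I'm not sure which data structure or call is best for this in Python; I've only actually done it in C, where it's just a block of RAM. But the same approach should work.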
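A minimal Python sketch of the two-stage approach described above, assuming a binary stream and `\n` line endings. The function name `iter_lines_buffered` and the `chunk_size` parameter are made up for illustration; the answer itself does not specify them.

```python
def iter_lines_buffered(stream, chunk_size=4096):
    """Yield complete lines (without the trailing newline) from a binary stream."""
    buffer = b""
    while True:
        # Stage 1: read an arbitrary-sized blob into memory.
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Stage 2: hand out every complete line currently in the buffer.
        # The last piece after splitting is an incomplete line (possibly
        # empty), so it stays in the buffer for the next round.
        *complete, buffer = buffer.split(b"\n")
        for line in complete:
            yield line
    if buffer:
        # The stream did not end with a newline; emit the final partial line.
        yield buffer
```

It would be used on a file opened in binary mode, e.g. `with open(path, "rb") as f: for line in iter_lines_buffered(f): ...`, so only one chunk plus one partial line is held in memory at a time.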

Answer 2

Use a generator to read the file:

def read_file(file_path):
    with open(file_path, 'r') as lines:
        for line in lines:
            yield line

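This way you never hold more than one line in memory at a time (apart from the file object's own internal read buffer), but you still read the file in order.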
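As a rough usage sketch, the generator can be consumed lazily by anything that iterates over it. The helper `count_matching` below is hypothetical and only there to show the pattern; `read_file` is repeated so the snippet runs on its own.

```python
def read_file(file_path):
    # Same generator as in the answer: the file object yields one line at a time.
    with open(file_path, 'r') as lines:
        for line in lines:
            yield line


def count_matching(file_path, needle):
    """Count the lines containing `needle` without loading the whole file."""
    return sum(1 for line in read_file(file_path) if needle in line)
```

Each yielded line still carries its trailing newline, so strip it with `line.rstrip('\n')` if the processing step needs the bare text.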

Answer 3

You can do the following:

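SIZE = 1024

with open('file.txt') as f:
    old, chunk = '', f.read(SIZE)

    # Keep going while there is new data or a held-back partial line.
    while chunk or old:
        data = old + chunk
        # (1)
        lines = data.splitlines()
        if chunk and not data.endswith('\n'):
            # The last piece may be an incomplete line: hold it back
            # and prepend it to the next chunk.
            old = lines.pop()
        else:
            # Chunk ended on a line boundary, or we reached end of file.
            old = ''

        for line in lines:
            pass  # process stuff

        chunk = f.read(SIZE)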

If you call data.splitlines(True) instead, the newline characters are kept in the resulting list.
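A small illustration of the difference, using a made-up sample string whose last line is unfinished:

```python
data = "first\nsecond\npart"

without_ends = data.splitlines()    # ['first', 'second', 'part']
with_ends = data.splitlines(True)   # ['first\n', 'second\n', 'part']

# With keepends=True an unfinished last line is directly recognisable,
# and joining the pieces gives back the original text unchanged.
last_is_complete = with_ends[-1].endswith("\n")   # False
roundtrip = "".join(with_ends)                    # equal to data
```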

Answer 4

On Linux:

Put this into a Python script, for example process.py:

import sys

for line in sys.stdin:
    #do something with the line, for example:
    output = line[:5] + line[10:15]
    sys.stdout.write("{}\n".format(output))

To run the script, use:

cat input_data | python process.py > output

python

file-handling