Reading files faster in Python

Question

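I'm writing a script that reads a TXT file in which every line is a log entry, and I need to split this log into separate files (one each for Hor, Sia, and Lmu). With my test file (80 KB), reading each line and writing it to a new file works without problems, but when I run it on the real file (177 MB, about 500k lines) it takes far too long. After more than an hour it had only reached line 80K.

The lines look like this: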

Crm|Hor|SiebelSeed

Crm|Sia|SiebelSeed

Crm|Lmu|LMU|

Is there any way to make it run faster?

My code

with open(path, "r", encoding="UTF-16") as file:
    for i, line in enumerate(file): 
    
            if i > 2: # lines 1-2 are headers
                component = re.match(r"Crm\|([A-Za-z0-9_]+)|]", line).group(1)
                
                if component not in comp_list:
                    comp_list.append(component)
                    
                    with open(f'HHR_Splitter/output/{component}.txt','w+', encoding="UTF-16") as new_file:
                        new_file.write('{}'.format(line))
                        
                        
                if component in comp_list:
                    
                    with open(f'HHR_Splitter/output/{component}.txt','a+', encoding="UTF-16") as existing_file: 
                        existing_file.write('{}'.format(line))

                else:
                    break

Answer 1

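The first thing I noticed is that you open an output file for every line. With about 500k lines, that means hundreds of thousands of open/close calls. You can open each output file once and reuse it for all lines. The same applies to the regular expression: you can compile it once with re.compile() before the for loop.

Here is an example: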

import re
from contextlib import ExitStack

# Components named in the question; lines for any other component are skipped.
COMPONENTS = ["Hor", "Sia", "Lmu"]

def process_log(input_file, output_files):
    # Compile the pattern once, outside the loop.
    prog = re.compile(r"Crm\|([A-Za-z0-9_]+)")
    for i, line in enumerate(input_file):
        if i > 2:  # same header check as in the question
            match = prog.match(line)
            if match and match.group(1) in output_files:
                output_files[match.group(1)].write(line)

def open_output_files(stack):
    # Register every file on the ExitStack so it stays open until the stack exits.
    return {
        component: stack.enter_context(
            open(f'HHR_Splitter/output/{component}.txt', 'w', encoding="UTF-16"))
        for component in COMPONENTS
    }

with ExitStack() as stack:
    input_file = stack.enter_context(open(path, "r", encoding="UTF-16"))
    output_files = open_output_files(stack)
    process_log(input_file, output_files)
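
If the set of components is not known in advance, as in the question's code, the output files can also be opened lazily, the first time each component is seen, and kept open until the whole input has been read. The following is only a sketch under a few assumptions: `split_log`, `output_dir`, and `header_lines` are hypothetical names introduced here, the default of 3 skipped lines mirrors the `i > 2` check, and the trailing `|]` in the original pattern is assumed to be a typo and is dropped.

```python
import os
import re
from contextlib import ExitStack

# Pattern from the question, minus the trailing "|]" alternative (assumed typo).
COMPONENT_RE = re.compile(r"Crm\|([A-Za-z0-9_]+)")

def split_log(input_path, output_dir, header_lines=3, encoding="UTF-16"):
    """Write each log line to <output_dir>/<component>.txt.

    Returns a dict mapping each component to the number of lines written.
    """
    counts = {}
    outputs = {}  # component name -> open file handle
    with ExitStack() as stack:
        source = stack.enter_context(open(input_path, "r", encoding=encoding))
        for i, line in enumerate(source):
            if i < header_lines:
                continue  # skip header lines
            match = COMPONENT_RE.match(line)
            if match is None:
                continue  # line does not look like a log entry
            component = match.group(1)
            if component not in outputs:
                # Open each output file exactly once; ExitStack closes them all.
                out_path = os.path.join(output_dir, f"{component}.txt")
                outputs[component] = stack.enter_context(
                    open(out_path, "w", encoding=encoding))
                counts[component] = 0
            outputs[component].write(line)
            counts[component] += 1
    return counts
```

Calling `split_log(path, "HHR_Splitter/output")` would then create one file per component found in the log. The dictionary lookup replaces the linear `comp_list` search from the question, and every file is opened once instead of once per line.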

python

performance

file