Using pool to read multiple files in parallel takes forever on Jupyter Windows:

Question

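I want to read 22 files (stored on my hard disk), each with around 300,000 rows, into a single pandas DataFrame. My code does this in 15-25 minutes. My initial thought was that I should use more CPUs to make it faster. (Please correct me if I am wrong here and all CPUs cannot read data from the same hard disk at the same time. Even in that case, the data may sit on different hard disks later, so this exercise is still useful.)

I found a few related posts and tried the code below.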

import os
import pandas as pd
from multiprocessing import Pool

def read_psv(filename):
    'reads one row of a file (pipe delimited) to a pandas dataframe'
    return pd.read_csv(filename,
                       delimiter='|',
                       skiprows=1, #need this as first row is junk
                       nrows=1, #Just one row for faster testing                    
                       encoding = "ISO-8859-1", #need this as well                       
                       low_memory=False
                      )



files = os.listdir('.') #getting all files, will use glob later
df1 = pd.concat((read_psv(f) for f in files[0:6]), ignore_index=True, axis=0, sort=False) #takes less than 1 second

pool = Pool(processes=3)
df_list = pool.map(read_psv, files[0:6]) #takes forever
#df2 =  pd.concat(df_list, ignore_index=True) #cant reach this

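This takes forever (more than 30-60 minutes, and it still had not finished when I killed the process). I also went through a similar question like mine, but it did not help.

Edit: I am using Jupyter on Windows.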

Answer 1

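Your task is IO-bound; the bottleneck is the hard drive. The CPU only has to do a little work to parse each line of the CSV.

Disk reads are fastest when they are sequential. If you want to read a large file, it is best to let the disk seek to the beginning and then read all of its bytes in order.

If you have multiple large files on the same hard drive and read them with multiple processes, the disk head has to jump back and forth between them, and each jump takes up to 10 ms.

Multiprocessing could still make your code faster, but you would need to store your files on multiple disks so that each disk head can focus on reading one file.

Another option is to buy an SSD. Its seek time is much lower, about 0.1 ms, and its throughput is around 5x faster.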
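One rough way to check whether the disk really is the bottleneck is to time how long it takes to read the files as raw bytes, with no parsing at all, and compare that with the time your pd.read_csv loop takes. The sketch below uses only the standard library. The helper name time_raw_read and the *.psv pattern are made up for illustration. Keep in mind that the operating system caches files, so a second run over the same files can look much faster than the disk actually is.

```python
import glob
import time


def time_raw_read(paths, chunk_size=1024 * 1024):
    """Read every file as raw bytes; return (total_bytes, elapsed_seconds)."""
    total = 0
    start = time.perf_counter()
    for path in paths:
        with open(path, "rb") as fh:
            while True:
                chunk = fh.read(chunk_size)
                if not chunk:  # empty bytes means end of file
                    break
                total += len(chunk)
    return total, time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical file pattern; point this at your own files.
    paths = glob.glob("*.psv")
    total, seconds = time_raw_read(paths)
    print(f"read {total} bytes from {len(paths)} files in {seconds:.2f} s")
```

If the raw read takes nearly as long as the full pandas load, more processes are unlikely to help on a single disk. If it is much faster, most of the time is going into parsing rather than I/O.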

Answer 2

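So the issue is not related to bad performance or getting stuck on I/O. It is related to Jupyter and Windows. On Windows, we need to include an if clause like this: if __name__ == '__main__': before initializing the pool. For Jupyter, we need to save the worker in a separate file and import it in the code. Jupyter is also problematic because it does not show error logs by default. I found out about the Windows issue when I ran the code in a Python shell, and about the Jupyter error when I ran it in an IPython shell. The following post helped me a lot.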
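A minimal sketch of both fixes together might look like the following. It is not the exact code from the answer. The module name psv_worker, the helpers install_worker and write_sample_files, and the sample data are all made up for illustration. To keep the example self-contained, the worker file is written from Python. In a notebook you would more likely just save that text as psv_worker.py next to the .ipynb file and import it.

```python
import importlib
import os
import sys
import tempfile
from multiprocessing import Pool

# Source of the hypothetical worker module. It must live in its own file so
# that worker processes can import read_psv instead of looking for it in the
# notebook's interactive namespace.
WORKER_SOURCE = '''\
import pandas as pd

def read_psv(filename):
    """Read one pipe-delimited file, skipping the junk first row."""
    return pd.read_csv(filename, delimiter="|", skiprows=1,
                       encoding="ISO-8859-1", low_memory=False)
'''


def install_worker(directory):
    """Write psv_worker.py into `directory` and import it from there."""
    with open(os.path.join(directory, "psv_worker.py"), "w") as fh:
        fh.write(WORKER_SOURCE)
    if directory not in sys.path:
        sys.path.insert(0, directory)  # so the module can be found by name
    importlib.invalidate_caches()
    return importlib.import_module("psv_worker")


def write_sample_files(directory, count=4):
    """Create small made-up pipe-delimited files for the demo."""
    paths = []
    for i in range(count):
        path = os.path.join(directory, f"sample_{i}.psv")
        with open(path, "w", encoding="ISO-8859-1") as fh:
            fh.write("junk header line\n")
            fh.write("id|value\n")
            fh.write(f"{i}|{i * 10}\n")
        paths.append(path)
    return paths


if __name__ == "__main__":  # guard needed on Windows before creating the pool
    import pandas as pd

    with tempfile.TemporaryDirectory() as workdir:
        worker = install_worker(workdir)
        files = write_sample_files(workdir)
        with Pool(processes=2) as pool:
            frames = pool.map(worker.read_psv, files)
        df = pd.concat(frames, ignore_index=True)
        print(df)
```

The guard matters because Windows starts each worker as a fresh Python process that imports the main module again. Without the guard, that import would try to create another pool.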

For Windows Issue

