Multiprocessing Pool with a for loop

Question

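I have a list of files that I pass to a for loop, running a whole bunch of functions on each one. What is the simplest way to parallelize this? I couldn't find this exact case anywhere, and I think my current implementation is incorrect because I only see one file being run. From some reading I've done, I think this should be a perfectly parallel case.

The old code looks like this: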

import pandas as pd
filenames = ['file1.csv', 'file2.csv', 'file3.csv', 'file4.csv']
for file in filenames:
    file1 = pd.read_csv(file)
    print('running ' + str(file))
    a = function1(file1)
    b = function2(a)
    c = function3(b)
    for d in range(1, 6):
        e = function4(c, d)
    c.to_csv('output.csv')

The (incorrectly) parallelized code:

import pandas as pd
from multiprocessing import Pool
filenames = ['file1.csv', 'file2.csv', 'file3.csv', 'file4.csv']
def multip(filenames):
    file1 = pd.read_csv(file)
    print('running ' + str(file))
    a = function1(file1)
    b = function2(a)
    c = function3(b)
    for d in range(1, 6):
        e = function4(c, d)
    c.to_csv('output.csv')

if __name__ == '__main__':
    pool = Pool(processes=4)
    runstuff = pool.map(multip(filenames))

What I (think I) want to do is process one file per core (or perhaps per process?). I also ran

multiprocessing.cpu_count()

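and got 8 (I have a quad-core machine, so it is probably counting threads). Since I have about 10 files in total, it would be great if I could put one file on each process to speed things up! I would hope the remaining 2 files also get picked up by a process once the first round of processes finishes.

Edit: for further clarity, the functions (i.e. function1, function2, etc.) also feed into other functions (i.e. function1a, function1b) within their respective files. I bring in function1 with an import statement.

I get the following error: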

OSError: Expected file path name or file-like object, got <class 'list'> type

It clearly doesn't like being passed the list, but I don't want to use filenames[0] in the if statement, because that only runs one file.

Answer 1

import multiprocessing
import time

names = ['file1.csv', 'file2.csv']

def multip(name):
    # [do stuff here]: the per-file work goes in this function
    pass

if __name__ == '__main__':
    # use one less process to be a little more stable
    p = multiprocessing.Pool(processes=multiprocessing.cpu_count() - 1)
    # timing it...
    start = time.time()
    # submit one task per file; each call receives a single name
    for file in names:
        p.apply_async(multip, [file])

    # no more tasks will be submitted; wait for the workers to finish
    p.close()
    p.join()
    print("Complete")
    end = time.time()
    print('total time (s)= ' + str(end - start))
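
One caveat with the apply_async loop above: an exception raised inside a worker stays hidden unless you call .get() on the AsyncResult that apply_async returns. Here is a minimal sketch of collecting the results so failures become visible. The multip body, the 'bad.csv' name, and the run_all helper are hypothetical stand-ins for illustration, not part of the original code.

```python
import multiprocessing

def multip(name):
    # Hypothetical stand-in for the per-file work; fails on purpose for one name.
    if name == 'bad.csv':
        raise ValueError('cannot process ' + name)
    return 'done ' + name

def run_all(names, processes=2):
    """Submit one task per name and collect (name, result-or-error) pairs."""
    outcomes = []
    with multiprocessing.Pool(processes=processes) as p:
        # Keep each AsyncResult so it can be queried afterwards.
        pending = [(name, p.apply_async(multip, [name])) for name in names]
        for name, res in pending:
            try:
                # .get() blocks until the task is done and re-raises worker exceptions.
                outcomes.append((name, res.get()))
            except Exception as exc:
                outcomes.append((name, repr(exc)))
    return outcomes

if __name__ == '__main__':
    for name, outcome in run_all(['file1.csv', 'bad.csv', 'file2.csv']):
        print(name, '->', outcome)
```

Without the .get() calls, the failing file would simply produce no output and the script would still print "Complete".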

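Edit: replace the if __name__ == '__main__' block with this. It will run all the files:

if __name__ == '__main__':
    # needs `import time` at the top of the script
    p = multiprocessing.Pool(processes=len(names))
    start = time.time()
    # map_async calls multip once per entry in names
    async_result = p.map_async(multip, names)
    p.close()
    p.join()
    print("Complete")
    end = time.time()
    print('total time (s)= ' + str(end - start))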
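
For reference, the OSError in the question is consistent with the whole list reaching pd.read_csv: pool.map(multip(filenames)) calls multip with the full list before the pool is involved at all. The pool expects the function and the iterable as separate arguments. Below is a rough sketch of that call shape with about 10 files on 4 processes. The output_name helper and its naming scheme are assumptions added for illustration, and the pandas work from the question is left as a comment so the sketch runs on its own.

```python
import os
from multiprocessing import Pool

def output_name(path):
    # Hypothetical naming scheme: 'file1.csv' -> 'file1_output.csv'
    root, ext = os.path.splitext(path)
    return root + '_output' + ext

def multip(path):
    # Receives ONE file name per call; pool.map supplies them from the list.
    print('running ' + str(path))
    # pd.read_csv(path) and function1..function4 from the question would go here.
    return output_name(path)  # where this file's result would be written

if __name__ == '__main__':
    # About 10 files, as in the question.
    filenames = ['file%d.csv' % i for i in range(1, 11)]
    with Pool(processes=4) as pool:
        # Function and iterable are separate arguments:
        # map(multip, filenames), not map(multip(filenames)).
        outputs = pool.map(multip, filenames)
    print(outputs)
```

With more files than processes, the pool keeps handing the remaining files to workers as they become free, so no second round has to be scheduled by hand. Giving each file its own output name also matters: if every worker writes to 'output.csv', as in the question, the results overwrite one another.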

python-3.x

python-multiprocessing