Processing multiple data files simultaneously using multiple cores

Question

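I have multiple data files that I process with the Python pandas library. The files are processed one after another, and Task Manager shows that only one logical processor is being used (at ~95%, while the rest stay under 5%).

Is there a way to process the data files simultaneously? If so, is there a way to use the other logical processors to do it?

(Edits welcome)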

Answer 1

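You can process the different files in separate threads or in separate processes.

The good thing about Python is that its standard library gives you the tools to do this: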

from multiprocessing import Process

def process_panda(filename):
    # This function runs in a separate process.
    # process_panda_import() and write_results() are placeholders
    # for your own loading and output code.
    process_panda_import(filename)
    write_results(filename)

if __name__ == '__main__':
    p1 = Process(target=process_panda, args=('file1',))
    # start process 1
    p1.start()
    p2 = Process(target=process_panda, args=('file2',))
    # start process 2
    p2.start()
    # wait until process 2 has finished
    p2.join()
    # wait until process 1 has finished
    p1.join()

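This program starts 2 child processes, each of which can process one of your files. Of course, you can do something similar with threads.

You can find the documentation here: https://docs.python.org/2/library/multiprocessing.html

and here: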
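For comparison, here is a minimal sketch of the thread-based variant mentioned above. The names `count_lines` and `process_in_threads` are hypothetical, and counting lines only stands in for whatever pandas work you do per file. Keep in mind that in standard CPython the global interpreter lock lets only one thread run Python bytecode at a time, so threads tend to help when the work is dominated by I/O, while CPU-heavy work usually needs separate processes to use several cores.

```python
import threading


def count_lines(filename, results):
    """Hypothetical per-file job: store the file's line count in results."""
    with open(filename) as handle:
        results[filename] = sum(1 for _ in handle)


def process_in_threads(filenames):
    """Start one thread per file and wait for all of them to finish."""
    results = {}
    threads = [
        threading.Thread(target=count_lines, args=(name, results))
        for name in filenames
    ]
    for thread in threads:
        thread.start()
    # join() blocks until the corresponding thread has finished
    for thread in threads:
        thread.join()
    return results
```

Calling `process_in_threads(['file1', 'file2'])` returns a dict that maps each filename to its line count once both threads are done.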

https://pymotw.com/2/threading/

Answer 2

If your filenames are in a list, you can use this code:

from multiprocessing import Process

def YourCode(filename, otherdata):
    # Do your stuff
    pass

if __name__ == '__main__':
    # Post-process files in parallel
    ListOfFilenames = ['file1','file2', ..., 'file1000']
    otherdata = None  # placeholder for any extra argument your code needs
    Processors = 20  # number of processes to run at once
    # Split the list of files into batches of 'Processors' files each
    Parts = [ListOfFilenames[i:i + Processors] for i in range(0, len(ListOfFilenames), Processors)]

    for part in Parts:
        # Start one process per file in this batch
        ListOfProcesses = []
        for f in part:
            p = Process(target=YourCode, args=(f, otherdata))
            p.start()
            ListOfProcesses.append(p)
        # Wait for the whole batch to finish before starting the next one
        for p in ListOfProcesses:
            p.join()
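The batch loop above waits for the slowest file in each batch before it starts the next one. As an alternative, here is a minimal sketch that uses `multiprocessing.Pool`, which hands the next file to whichever worker becomes free. The names `count_rows` and `process_files` are hypothetical, and counting CSV rows only stands in for your real per-file pandas code; the filenames are assumed to arrive as command-line arguments.

```python
import csv
import sys
from multiprocessing import Pool


def count_rows(filename):
    """Hypothetical per-file job: return (filename, number of data rows)."""
    with open(filename, newline="") as handle:
        rows = sum(1 for _ in csv.reader(handle))
    # Subtract the header row, but never report a negative count.
    return filename, max(rows - 1, 0)


def process_files(filenames, workers=None):
    """Run count_rows on every file using a pool of worker processes."""
    # workers=None lets Pool decide, which defaults to the CPU count.
    with Pool(processes=workers) as pool:
        # map() returns the results in the same order as filenames.
        return pool.map(count_rows, filenames)


if __name__ == "__main__":
    # Assumed usage: python script.py file1.csv file2.csv
    filenames = sys.argv[1:]
    if filenames:
        for name, rows in process_files(filenames):
            print(f"{name}: {rows} rows")
```

The worker function has to live at module level so that the child processes can import it, and the `if __name__ == "__main__":` guard keeps the children from re-running the launch code on platforms that start workers with a fresh interpreter.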
