How can multiple child processes write to the same shared-memory DataFrame in Python?

Question

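My Python code uses multiprocessing. A DataFrame is created in shared memory in the parent program, say ns.df, where ns is a namespace manager instance.

Multiple processes need to add rows of data to this ns.df so that all the changes are reflected in the parent program after the processes terminate.

The processes do not need to interact with each other, since no data is shared or passed between them. The data each process writes is mutually exclusive and specific to that process alone.

Would doing a simple

ns.df = pd.concat([ns.df, tempdf], axis=0, sort=True)

from each of the child processes be sufficient to achieve the desired result? Here tempdf is the DataFrame containing the data to be added to ns.df.

How can I achieve this in Python? Any help would be appreciated.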

Answer 1

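I would not add rows to ns.df individually in each child process. Instead, have each child return its rows and collect them in the parent once the children have finished. See this example: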

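from concurrent.futures import ProcessPoolExecutor

import pandas as pd


def child_process(child_id):
    # Each child builds and returns its own rows; nothing is shared.
    return pd.DataFrame({"column": [f"child_{child_id}"]})


if __name__ == "__main__":  # needed where child processes are spawned (e.g. Windows, macOS)
    df_main = pd.DataFrame({"column": ["parent"]})

    # pool.map returns the results in the order of the inputs.
    with ProcessPoolExecutor(max_workers=4) as pool:
        child_dfs = list(pool.map(child_process, range(5)))

    # Combine the parent frame and all child frames in a single concat.
    df_all = pd.concat([df_main, *child_dfs])
    print(df_all)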

Output

    column
0   parent
0  child_0
0  child_1
0  child_2
0  child_3
0  child_4
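
Every row in the output carries index 0 because each child returns a one-row frame with its own index. If a continuous index is preferred, one option (not part of the original answer) is the ignore_index parameter of pd.concat. A minimal sketch, using stand-in frames instead of real child processes:

```python
import pandas as pd

# Stand-ins for the parent frame and the frames returned by the children.
df_main = pd.DataFrame({"column": ["parent"]})
child_dfs = [pd.DataFrame({"column": [f"child_{i}"]}) for i in range(3)]

# ignore_index=True drops the per-frame indexes and renumbers the rows 0..n-1.
df_all = pd.concat([df_main, *child_dfs], ignore_index=True)
print(df_all)
```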

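If you instead modify ns.df inside every child process, it really is a shared-memory object: each ns.df = pd.concat(...) is a read-modify-write on shared state, so concurrent updates have to be serialized or rows can be lost.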
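
If writing to ns.df from the children is still preferred, the update probably needs a lock around it. The following is only a sketch under assumptions: ns is taken to be a multiprocessing.Manager().Namespace(), and the names append_rows and run are hypothetical helpers, not something from the question.

```python
import multiprocessing as mp

import pandas as pd


def append_rows(ns, lock, child_id):
    # Hypothetical worker: builds its own rows, then appends them to ns.df.
    tempdf = pd.DataFrame({"column": [f"child_{child_id}"]})
    # Reading ns.df returns a copy, so read + concat + write-back must happen
    # under the lock; otherwise two workers can overwrite each other's rows.
    with lock:
        ns.df = pd.concat([ns.df, tempdf], axis=0, sort=True)


def run(n_children=4):
    with mp.Manager() as manager:
        ns = manager.Namespace()
        ns.df = pd.DataFrame({"column": ["parent"]})
        lock = manager.Lock()
        procs = [
            mp.Process(target=append_rows, args=(ns, lock, i))
            for i in range(n_children)
        ]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        # Fetch a local copy before the manager shuts down.
        return ns.df


if __name__ == "__main__":
    print(run())
```

The order of the child rows depends on which process acquires the lock first. Every update also pickles the whole frame to and from the manager process, which is one more reason to prefer returning the rows to the parent.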

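Caveat: if the DataFrames returned by the child processes are very large, multiprocessing can add significant overhead, because each frame must be pickled before it is reloaded in the main process. Depending on what the child processes actually do (perhaps a lot of I/O, or C functions that release the GIL), multithreading may be the better choice.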
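
For the multithreading case mentioned above, the same collect-then-concat pattern can be kept by swapping in a thread pool. This is a minimal sketch; child_task is a hypothetical stand-in for the real I/O-bound work.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def child_task(child_id):
    # Hypothetical stand-in for I/O-bound work (e.g. a download or a query).
    return pd.DataFrame({"column": [f"child_{child_id}"]})


df_main = pd.DataFrame({"column": ["parent"]})

# Threads share the parent's memory, so the returned frames are not pickled.
with ThreadPoolExecutor(max_workers=4) as pool:
    child_dfs = list(pool.map(child_task, range(5)))

df_all = pd.concat([df_main, *child_dfs])
print(df_all)
```

Whether this is faster depends on the workload: CPU-bound pure-Python work in the tasks would still be serialized by the GIL.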

python

shared-memory

pandas

python-multiprocessing