How to select multiple subsets from a pandas DataFrame column using multiple boolean masks in parallel?

Question

Suppose we have a DataFrame df_example with two columns (colA and colB) and 3 rows, as in the following code:

import numpy as np
import pandas as pd

df_example = pd.DataFrame({'colA': [10, 20, 30], 'colB': [40, 50, 60]})
print(df_example.head())

Output:

   colA  colB
0    10    40
1    20    50
2    30    60

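I need to retrieve multiple subsets from colA based on boolean masks.

For example, suppose we want to extract 5 subsets from colA. We then have 5 boolean masks, each containing 3 boolean elements (because colA has 3 values/rows). I store the masks in a matrix named mask_matrix, with each mask stored as one row.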

mask_matrix = np.array([
    # we need to get 5 subsets (so we have 5 rows in the mask_matrix)
    [True,  False, True ], # 1st subset: get the 1st and 3rd value from colA,
    [False, False, True ], # 2nd subset: get the 3rd value from colA,
    [True,  True,  True ], # 3rd subset: get all values from colA,
    [False, False, False], # 4th subset: get no values from colA,
    [False, True,  False]  # 5th subset: get the 2nd value from colA,
])

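I need to apply each mask (each row of mask_matrix) to colA and store the 5 results in a numpy array with dtype='object' (because the results have different lengths).

I can perform this task sequentially with the following code: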

# I append the subsets here sequentially (this needs to be changed)
result = []

# I need to make this loop parallel
# And if possible, I need to get the result as a numpy array directly (not as a list)
for row in mask_matrix:
    result.append(df_example[row]['colA'].values)

# converting the result from a list to a numpy array (probably we won't need this in parallel solution?)
result = np.array(result, dtype='object')

Then I print result:

# printing the result (just for clarification)
print('result:', type(result))
for x in result:
    print('  ', x)
    
print()
print('result.dtype:', result.dtype)
print('result.shape:', result.shape)

The output looks like this:

result: <class 'numpy.ndarray'>
   [10 30]
   [30]
   [10 20 30]
   []
   [20]

result.dtype: object
result.shape: (5,)

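The output is correct. However, I want this code to run faster. Instead of the sequential for loop (for row in mask_matrix:), I would like to vectorize the process so that it runs in parallel, like numpy vectorized operations. My example uses very small data, of course, but in practice I will run this code on large data with a large number of masks.

Is there a way to vectorize the operation performed by the for loop above? If possible, I would prefer a solution that uses only numpy and/or pandas, without any external libraries. I would appreciate any help.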

Answer 1

To speed this up, I did the following:

import pandas as pd
import numpy as np
import time
df_example = pd.DataFrame({'colA': [10, 20, 30], 'colB': [40, 50, 60]})
print(df_example.head())

mask_matrix = np.array([
    # we need to get 5 subsets (so we have 5 rows in the mask_matrix)
    [True,  False, True ], # 1st subset: get the 1st and 3rd value from colA,
    [False, False, True ], # 2nd subset: get the 3rd value from colA,
    [True,  True,  True ], # 3rd subset: get all values from colA,
    [False, False, False], # 4th subset: get no values from colA,
    [False, True,  False]  # 5th subset: get the 2nd value from colA,
])

# I append the subsets here sequentially (this needs to be changed)
result = []

# I need to make this loop parallel
# And if possible, I need to get the result as a numpy array directly (not as a list)
start = time.process_time()
for row in mask_matrix:
    result.append(df_example[row]['colA'].values)
print("Baseline: {}".format( time.process_time() - start))


#====================================
#NEW CODE HERE
#Convert dataframe to numpy array
df_matrix = df_example.to_numpy().T
#Loop through all columns if desired
for column_idx in range(df_matrix.shape[0]):
    start = time.process_time()
    values = np.multiply(df_matrix[column_idx], mask_matrix)
    print("Vectorized: {}".format(time.process_time() - start))
    break
print(values.shape)
print(values)

This returns the following:

Baseline: 0.0007300000000000084
Vectorized: 4.500000000007276e-05
(5, 3)
[[10  0 30]
 [ 0  0 30]
 [10 20 30]
 [ 0  0  0]
 [ 0 20  0]]

One variation: instead of using False in the boolean matrix, you can use np.nan, which produces the following result:

[[10. nan 30.]
 [nan nan 30.]
 [10. 20. 30.]
 [nan nan  nan]
 [nan 20. nan]]
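
The answer does not show the code for this NaN variant. As a minimal sketch of one way to get the same matrix, assuming np.where is an acceptable substitute for editing the mask matrix by hand (the names col_values and nan_filled are illustrative and not from the original post):

```python
import numpy as np
import pandas as pd

df_example = pd.DataFrame({'colA': [10, 20, 30], 'colB': [40, 50, 60]})
mask_matrix = np.array([
    [True,  False, True ],
    [False, False, True ],
    [True,  True,  True ],
    [False, False, False],
    [False, True,  False],
])

# 1-D array of the column values, shape (3,)
col_values = df_example['colA'].to_numpy()

# Broadcast the column against every mask row: keep the value where the
# mask is True, put NaN elsewhere. The result is float because of NaN.
nan_filled = np.where(mask_matrix, col_values, np.nan)

print(nan_filled.shape)  # (5, 3)
print(nan_filled)
```

Unlike the multiplication approach, NaN keeps masked-out positions distinguishable from genuine zeros in the column.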

If you want to remove the NaNs, you will have to loop, but I think that would be inefficient.
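
A possible way to get the ragged subsets without indexing the DataFrame once per mask is sketched below. It relies on the fact that boolean indexing a 2-D array returns the selected values flattened in row-major order, so they can be cut back into per-mask pieces using the row counts. This is an illustrative sketch, not part of the original answer: the names tiled, selected, counts and pieces are assumptions, and np.split still iterates internally, so this only reduces the per-mask work and does not run anything in parallel.

```python
import numpy as np
import pandas as pd

df_example = pd.DataFrame({'colA': [10, 20, 30], 'colB': [40, 50, 60]})
mask_matrix = np.array([
    [True,  False, True ],
    [False, False, True ],
    [True,  True,  True ],
    [False, False, False],
    [False, True,  False],
])

col_values = df_example['colA'].to_numpy()

# Read-only view that repeats the column once per mask row (no copy made)
tiled = np.broadcast_to(col_values, mask_matrix.shape)

# All selected values, flattened row by row: [10 30 30 10 20 30 20]
selected = tiled[mask_matrix]

# Number of True entries in each mask: [2 1 3 0 1]
counts = mask_matrix.sum(axis=1)

# Cut the flat array at the cumulative counts to recover one piece per mask
pieces = np.split(selected, np.cumsum(counts)[:-1])

# Fill an object array element by element so that equal-length pieces
# are never merged into a 2-D array
result = np.empty(len(pieces), dtype=object)
for i, piece in enumerate(pieces):
    result[i] = piece

for x in result:
    print('  ', x)
print('result.dtype:', result.dtype)
print('result.shape:', result.shape)
```

Whether this is faster than the original loop depends on the data size and the number of masks, so it is worth timing on realistic inputs.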
