How to select multiple subsets from a pandas DataFrame column using multiple boolean masks in parallel?
Suppose we have a DataFrame df_example with two columns (colA and colB) and 3 rows, as shown in the following code:
df_example = pd.DataFrame({'colA': [10, 20, 30], 'colB': [40, 50, 60]})
print(df_example.head())
Output:
   colA  colB
0    10    40
1    20    50
2    30    60
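I need to retrieve multiple subsets from colA based on boolean masks.
For example, say we want to extract 5 subsets from colA. We then have 5 boolean masks, each containing 3 boolean elements (because colA contains 3 values/rows). I store the masks in a matrix named mask_matrix, where each mask is stored as a row.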
mask_matrix = np.array([
# we need to get 5 subsets (so we have 5 rows in the mask_matrix)
[True, False, True ], # 1st subset: get the 1st and 3rd value from colA,
[False, False, True ], # 2nd subset: get the 3rd value from colA,
[True, True, True ], # 3rd subset: get all values from colA,
[False, False, False], # 4th subset: get no values from colA,
[False, True, False] # 5th subset: get the 2nd value from colA,
])
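I need to apply each mask (each row of mask_matrix) to colA and store the 5 results in a numpy array with dtype='object' (because the returned results have different lengths).
I can do this sequentially with the following code: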
# I append the subsets here sequentially (this needs to be changed)
result = []
# I need to make this loop parallel
# And if possible, I need to get the result as a numpy array directly (not as a list)
for row in mask_matrix:
    result.append(df_example[row]['colA'].values)
# converting the result from a list to a numpy array (probably we won't need this in parallel solution?)
result = np.array(result, dtype='object')
Then I print result:
# printing the result (just for clarification)
print('result:', type(result))
for x in result:
    print(' ', x)
print()
print('result.dtype:', result.dtype)
print('result.shape:', result.shape)
The output looks like this:
result: <class 'numpy.ndarray'>
[10 30]
[30]
[10 20 30]
[]
[20]
result.dtype: object
result.shape: (5,)
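The output is correct. However, I want to make this code run faster. Instead of the sequential for loop in this line: for row in mask_matrix:
I want to vectorize the process and make it run in parallel (like numpy vectorized operations). Of course, my example uses very small data, but in practice I will run this code on large data with a large number of masks.
Is there a way to vectorize the operation performed by the for loop I mentioned? I would prefer a way that uses numpy and/or pandas without any external libraries (if possible). I would appreciate any help.
So, to speed up the process, I did the following: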
import pandas as pd
import numpy as np
import time
df_example = pd.DataFrame({'colA': [10, 20, 30], 'colB': [40, 50, 60]})
print(df_example.head())
mask_matrix = np.array([
# we need to get 5 subsets (so we have 5 rows in the mask_matrix)
[True, False, True ], # 1st subset: get the 1st and 3rd value from colA,
[False, False, True ], # 2nd subset: get the 3rd value from colA,
[True, True, True ], # 3rd subset: get all values from colA,
[False, False, False], # 4th subset: get no values from colA,
[False, True, False] # 5th subset: get the 2nd value from colA,
])
# I append the subsets here sequentially (this needs to be changed)
result = []
# I need to make this loop parallel
# And if possible, I need to get the result as a numpy array directly (not as a list)
start = time.process_time()
for row in mask_matrix:
    result.append(df_example[row]['colA'].values)
print("Baseline: {}".format( time.process_time() - start))
#====================================
#NEW CODE HERE
#Convert dataframe to numpy array
df_matrix = df_example.to_numpy().T
#Loop through all columns if desired
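for column_idx in range(df_matrix.shape[0]):
    start = time.process_time()
    # Broadcasting multiplies the column by every mask row at once; masked-out entries become 0
    values = np.multiply(df_matrix[column_idx], mask_matrix)
    print("Vectorized: {}".format(time.process_time() - start))
    break  # only the first column (colA) is processed here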
print(values.shape)
print(values)
This returns the following:
Baseline: 0.0007300000000000084
Vectorized: 4.500000000007276e-05
(5, 3)
[[10 0 30]
[ 0 0 30]
[10 20 30]
[ 0 0 0]
[ 0 20 0]]
There are ways to work with this: instead of using False in the boolean matrix, you can use np.nan, which produces the following result:
[[10. nan 30.]
[nan nan 30.]
[10. 20. 30.]
[nan nan nan]
[nan 20. nan]]
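One possible way to get that NaN-filled matrix is np.where, which broadcasts the column against every mask row. This is only a sketch: the names col_a and values_nan are illustrative and not part of the code above, and the result is float64 because NaN cannot be stored in an integer array.

```python
import numpy as np
import pandas as pd

df_example = pd.DataFrame({'colA': [10, 20, 30], 'colB': [40, 50, 60]})
mask_matrix = np.array([
    [True,  False, True ],
    [False, False, True ],
    [True,  True,  True ],
    [False, False, False],
    [False, True,  False],
])

# col_a is a hypothetical helper name for the raw values of colA
col_a = df_example['colA'].to_numpy()

# Keep the value where the mask is True, put NaN where it is False.
# col_a (shape (3,)) is broadcast against mask_matrix (shape (5, 3)).
values_nan = np.where(mask_matrix, col_a, np.nan)

print(values_nan.shape)  # (5, 3)
print(values_nan)
```

Unlike the multiplication approach, this keeps a genuine 0 in colA distinguishable from a masked-out entry.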
If you want to remove the NaNs, you will have to loop, but I think that would be inefficient.
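An alternative that might avoid per-mask DataFrame indexing is to select all masked values in a single boolean-indexing step and then cut the flat result into one piece per mask with np.split. This is a sketch under assumptions, not a benchmarked solution: the names col_a, flat, counts and subsets are illustrative, and np.split still creates one array per mask, so the gain on large data would need to be measured.

```python
import numpy as np
import pandas as pd

df_example = pd.DataFrame({'colA': [10, 20, 30], 'colB': [40, 50, 60]})
mask_matrix = np.array([
    [True,  False, True ],
    [False, False, True ],
    [True,  True,  True ],
    [False, False, False],
    [False, True,  False],
])

col_a = df_example['colA'].to_numpy()

# Repeat col_a along the rows as a read-only view (no copy), then pick every
# True position at once. Values come back in row-major order, i.e. mask by mask.
flat = np.broadcast_to(col_a, mask_matrix.shape)[mask_matrix]

# Number of selected values per mask, used to find the cut points.
counts = mask_matrix.sum(axis=1)
subsets = np.split(flat, np.cumsum(counts)[:-1])

# Pack into a 1-D object array. Filling element by element keeps the shape
# (n_masks,) even if all subsets happen to have the same length.
result = np.empty(len(subsets), dtype=object)
for i, subset in enumerate(subsets):
    result[i] = subset

for x in result:
    print(' ', x)
```

The remaining loop only stores references to the already-computed arrays; the masking itself is done by one NumPy indexing operation.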