Replacing nested loops over a dataframe with faster/more efficient alternatives
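I want to eliminate the nested loops in my code, but I can't seem to find the best way to do it.
I have explained what I am trying to do below.
I have a dataframe df: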
import pandas as pd

data = [['1A', 'apple', '35-44', 'male', ['apple', 'strawberry', 'pineapple']],
        ['1B', 'banana', '15-24', 'female', ['apple', 'banana', 'durian']],
        ['1C', 'cranberry', '35-44', 'male', ['cranberry', 'apple', 'durian']],
        ['1D', 'durian', '15-24', 'female', ['durian', 'kiwi', 'banana']],
        ['1E', 'elderberry', '35-44', 'male', ['elderberry', 'apple', 'papaya']]]
df = pd.DataFrame(data, columns=['ID', 'fav_fruit', 'age_group', 'gender', 'top3_fruits'])
   ID   fav_fruit  age_group  gender                     top3_fruits
0  1A       apple      35-44    male  [apple, strawberry, pineapple]
1  1B      banana      15-24  female         [apple, banana, durian]
2  1C   cranberry      35-44    male      [cranberry, apple, durian]
3  1D      durian      15-24  female          [durian, kiwi, banana]
4  1E  elderberry      35-44    male     [elderberry, apple, papaya]
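Now, in this dataframe, I want to compare each row with every other row under specific conditions:
- I want to check whether age_group and gender are equal.
- I want to check whether the row's fav_fruit is in the other row's top3_fruits.
If the conditions are met, I want to append the matching row's 'ID' and 'top3_fruits' as separate columns at the end of the dataframe df.
Here is the code I wrote with a nested for-loop: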
df_copy = df.copy()
matched_rows = []  # positions of the rows that found a match
matching_id = []
fruits_to_recommend = []
for i in range(len(df)):
    for j in range(len(df)):
        if (i != j) and (df.iloc[i]['fav_fruit'] in df_copy.iloc[j]['top3_fruits']) and \
           (df.iloc[i]['gender'] == df_copy.iloc[j]['gender']) and \
           (df.iloc[i]['age_group'] == df_copy.iloc[j]['age_group']):
            matched_rows.append(i)
            matching_id.append(df_copy.iloc[j]['ID'])
            fruits_to_recommend.append(df_copy.iloc[j]['top3_fruits'])
# DataFrame.append was removed in pandas 2.0, so the rows are selected in one step
sample_df = df_copy.iloc[matched_rows].copy()
sample_df['matching_id'] = matching_id
sample_df['fruits_to_recommend'] = fruits_to_recommend
The result I am looking for is shown below.
Result:
I am looking for a more feasible/faster option.
My approach is to use the .explode() method and the pandas.merge() function.
>>> df_explode = df.copy()
>>> # copy column
>>> df_explode['fruits_to_recommend'] = df['top3_fruits']
>>> # explode list and rename column
>>> df_explode = df_explode.explode('top3_fruits').rename(columns={'ID':'matching_id'})
>>> print(df_explode)
   matching_id   fav_fruit  age_group  gender  top3_fruits                   fruits_to_recommend
0           1A       apple      35-44    male        apple  ['apple', 'strawberry', 'pineapple']
0           1A       apple      35-44    male   strawberry  ['apple', 'strawberry', 'pineapple']
0           1A       apple      35-44    male    pineapple  ['apple', 'strawberry', 'pineapple']
1           1B      banana      15-24  female        apple         ['apple', 'banana', 'durian']
1           1B      banana      15-24  female       banana         ['apple', 'banana', 'durian']
1           1B      banana      15-24  female       durian         ['apple', 'banana', 'durian']
2           1C   cranberry      35-44    male    cranberry      ['cranberry', 'apple', 'durian']
2           1C   cranberry      35-44    male        apple      ['cranberry', 'apple', 'durian']
2           1C   cranberry      35-44    male       durian      ['cranberry', 'apple', 'durian']
3           1D      durian      15-24  female       durian          ['durian', 'kiwi', 'banana']
3           1D      durian      15-24  female         kiwi          ['durian', 'kiwi', 'banana']
3           1D      durian      15-24  female       banana          ['durian', 'kiwi', 'banana']
4           1E  elderberry      35-44    male   elderberry     ['elderberry', 'apple', 'papaya']
4           1E  elderberry      35-44    male        apple     ['elderberry', 'apple', 'papaya']
4           1E  elderberry      35-44    male       papaya     ['elderberry', 'apple', 'papaya']
>>> # merging
>>> df_merged = pd.merge(df, df_explode, how='left', left_on = ['age_group', 'gender', 'fav_fruit'], right_on = ['age_group', 'gender', 'top3_fruits'], suffixes=('','_'))
>>> # select columns and filter matching_id's which are equal to ID
>>> df_merged = df_merged.loc[df_merged['ID']!=df_merged['matching_id'], list(df.columns) + ['matching_id', 'fruits_to_recommend']]
>>> print(df_merged)
   ID  fav_fruit  age_group  gender                           top3_fruits  matching_id                fruits_to_recommend
1  1A      apple      35-44    male  ['apple', 'strawberry', 'pineapple']           1C   ['cranberry', 'apple', 'durian']
2  1A      apple      35-44    male  ['apple', 'strawberry', 'pineapple']           1E  ['elderberry', 'apple', 'papaya']
4  1B     banana      15-24  female         ['apple', 'banana', 'durian']           1D       ['durian', 'kiwi', 'banana']
6  1D     durian      15-24  female          ['durian', 'kiwi', 'banana']           1B      ['apple', 'banana', 'durian']
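For anyone who wants to run this outside the REPL, here is a minimal self-contained sketch of the same explode-then-merge idea. The helper name `match_by_merge` and the intermediate column name `fruit` are my own placeholders, not part of the answer above. It also assumes `how='inner'` rather than `how='left'`: with a left merge, a row that has no candidate at all would come back with a NaN `matching_id` and would pass the `ID != matching_id` filter, which is probably not what you want.

```python
import pandas as pd

def match_by_merge(df):
    """Hypothetical helper: pair each row with the other rows of the same
    age_group and gender whose top3_fruits contain this row's fav_fruit."""
    # One row per (candidate row, fruit); keep the full list for the output
    candidates = (
        df.assign(fruits_to_recommend=df['top3_fruits'])
          .explode('top3_fruits')
          .rename(columns={'ID': 'matching_id', 'top3_fruits': 'fruit'})
          [['matching_id', 'age_group', 'gender', 'fruit', 'fruits_to_recommend']]
    )
    # Equality on all three keys replaces the "fav_fruit in top3_fruits" test
    merged = df.merge(
        candidates,
        how='inner',  # assumption: rows without any candidate are dropped
        left_on=['age_group', 'gender', 'fav_fruit'],
        right_on=['age_group', 'gender', 'fruit'],
    )
    # A row must not be matched with itself
    merged = merged[merged['ID'] != merged['matching_id']]
    return merged[list(df.columns) + ['matching_id', 'fruits_to_recommend']].reset_index(drop=True)

data = [['1A', 'apple', '35-44', 'male', ['apple', 'strawberry', 'pineapple']],
        ['1B', 'banana', '15-24', 'female', ['apple', 'banana', 'durian']],
        ['1C', 'cranberry', '35-44', 'male', ['cranberry', 'apple', 'durian']],
        ['1D', 'durian', '15-24', 'female', ['durian', 'kiwi', 'banana']],
        ['1E', 'elderberry', '35-44', 'male', ['elderberry', 'apple', 'papaya']]]
df = pd.DataFrame(data, columns=['ID', 'fav_fruit', 'age_group', 'gender', 'top3_fruits'])

result = match_by_merge(df)
print(result[['ID', 'matching_id', 'fruits_to_recommend']])
```

The merge does the pairing with hash joins instead of comparing every row against every other row in Python, which is where the speed-up over the nested loop would come from.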
First check your 3 conditions, then build a dataframe that holds the matching rows for each row. Finally, join it back to the original df.
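import pandas as pd
import numpy as np

# Pairwise comparisons via broadcasting: entry [i, j] compares row i with row j
age_group_match = df.age_group.values == df.age_group.values[:, None]
gender_match = df.gender.values == df.gender.values[:, None]
# fruit_match[i, j] is True when row i's fav_fruit is in row j's top3_fruits
fruit_match = np.array([[ff in top3 for top3 in df.top3_fruits] for ff in df.fav_fruit])
match_res = age_group_match & gender_match & fruit_match
np.fill_diagonal(match_res, False)  # a row must not match itself

# For each row, collect the IDs and top3_fruits of its matches, then flatten
df_match = (
    pd.DataFrame([[df.ID[e], df.top3_fruits[e]] for e in match_res],
                 columns=['matching_id', 'fruits_to_recommend'])
    .apply(pd.Series.explode)
    .dropna()
)
df.join(df_match, how='inner')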
   ID  fav_fruit  age_group  gender                     top3_fruits  matching_id          fruits_to_recommend
0  1A      apple      35-44    male  [apple, strawberry, pineapple]           1C   [cranberry, apple, durian]
0  1A      apple      35-44    male  [apple, strawberry, pineapple]           1E  [elderberry, apple, papaya]
1  1B     banana      15-24  female         [apple, banana, durian]           1D       [durian, kiwi, banana]
3  1D     durian      15-24  female          [durian, kiwi, banana]           1B      [apple, banana, durian]
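In case the `values == values[:, None]` step above looks cryptic: it relies on NumPy broadcasting to compare every element with every other element at once. Here is a reduced sketch on a three-element array (the array contents are just illustrative values taken from the question's age groups):

```python
import numpy as np

age_group = np.array(['35-44', '15-24', '35-44'])

# Shape (3,) compared against shape (3, 1) broadcasts to a (3, 3) matrix;
# entry [i, j] is True when element i and element j are equal.
same_group = age_group == age_group[:, None]

# Clear the diagonal so that no element is paired with itself
np.fill_diagonal(same_group, False)

print(same_group)
```

The first and third elements share an age group, so only entries [0, 2] and [2, 0] stay True. Note that this builds an n-by-n matrix, so memory grows quadratically with the number of rows.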