pandas: how to aggregate two list columns when joining data frames
I have the following two data frames:
   id  websites
   --  ---
0  1   [cnn.com, bbc.com]
1  2   [ebay.com, facebook.com]
________________
   id  websites
   --  ---
0  2   [google.com, facebook.com]
1  3   [amazon.com, youtube.com]
I want to outer join them on the id column, aggregating the unique websites of matching rows. The output should look like this:
   id  websites
   --  ---
0  1   [cnn.com, bbc.com]
1  2   [ebay.com, facebook.com, google.com]
2  3   [amazon.com, youtube.com]
So far I have tried the following:
import pandas as pd
df_a = pd.DataFrame({'id':[1,2],'websites':[['cnn.com','bbc.com'],['ebay.com','facebook.com']]})
df_b = pd.DataFrame({'id':[2,3],'websites':[['google.com','facebook.com'],['amazon.com','youtube.com']]})
df_a.merge(df_b, on='id', how='outer')
This gives me the following output:
   id  websites_x                websites_y
   --  ---                       ---
0  1   [cnn.com, bbc.com]        NaN
1  2   [ebay.com, facebook.com]  [google.com, facebook.com]
2  3   NaN                       [amazon.com, youtube.com]
You can concatenate them and then group on the id column:
df_a = pd.DataFrame({'id':[1,2],'websites':[['cnn.com','bbc.com'],
['ebay.com','facebook.com']]})
df_b = pd.DataFrame({'id':[2,3],'websites':[['google.com','facebook.com'],
['amazon.com','youtube.com']]})
Solution:
Method 1:
# Expand each list so every website gets its own row (requires pandas 0.25+)
a = df_a.explode('websites')
b = df_b.explode('websites')
# Stack both frames, then keep the unique websites per id
out = pd.concat((a, b)).groupby('id')['websites'].apply(pd.unique).reset_index()
# Or, if the order of the websites does not matter:
# out = pd.concat((a, b)).groupby('id')['websites'].agg(set).reset_index()
print(out)
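Note that pd.unique yields NumPy arrays and set does not keep the original order. If plain Python lists in first-seen order are wanted, one possible variation (a sketch, not part of the answer above) is to drop duplicate (id, website) pairs before grouping:

```python
import pandas as pd

df_a = pd.DataFrame({'id': [1, 2],
                     'websites': [['cnn.com', 'bbc.com'],
                                  ['ebay.com', 'facebook.com']]})
df_b = pd.DataFrame({'id': [2, 3],
                     'websites': [['google.com', 'facebook.com'],
                                  ['amazon.com', 'youtube.com']]})

# One row per (id, website) pair, with df_a's rows stacked before df_b's
stacked = pd.concat((df_a.explode('websites'), df_b.explode('websites')))

# Keep the first occurrence of each pair, then collect the rest into lists
out = (stacked.drop_duplicates(['id', 'websites'])
              .groupby('id')['websites'].agg(list).reset_index())
print(out)
```

Here the 'websites' column holds ordinary lists, which can be easier to serialize or compare than arrays or sets.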
Method 2:
Another solution, using itertools.chain.from_iterable, does not require exploding the data frames:
from itertools import chain
# Flatten each group's lists; dict.fromkeys drops duplicates and keeps first-seen order
out = (pd.concat((df_a, df_b)).groupby('id')['websites']
       .apply(lambda x: list(dict.fromkeys(chain.from_iterable(x)))).reset_index())
print(out)
   id  websites
0  1   [cnn.com, bbc.com]
1  2   [ebay.com, facebook.com, google.com]
2  3   [amazon.com, youtube.com]
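If you would rather keep the merge from the question, the two suffixed columns can also be combined after the join. This is only a minimal sketch: it assumes the default _x/_y suffixes that merge produces, and the helper name combine_unique is made up for illustration.

```python
import pandas as pd

df_a = pd.DataFrame({'id': [1, 2],
                     'websites': [['cnn.com', 'bbc.com'],
                                  ['ebay.com', 'facebook.com']]})
df_b = pd.DataFrame({'id': [2, 3],
                     'websites': [['google.com', 'facebook.com'],
                                  ['amazon.com', 'youtube.com']]})

merged = df_a.merge(df_b, on='id', how='outer')


def combine_unique(left, right):
    """Merge two cells into one list of unique items (hypothetical helper)."""
    seen = {}
    for value in (left, right):
        # A side with no matching id holds NaN instead of a list; skip it
        if isinstance(value, list):
            seen.update(dict.fromkeys(value))
    return list(seen)


combined = [combine_unique(x, y)
            for x, y in zip(merged['websites_x'], merged['websites_y'])]
out = pd.DataFrame({'id': merged['id'],
                    'websites': pd.Series(combined, index=merged.index)})
print(out)
```

The concat-and-groupby solutions above are shorter; this variant mainly shows how the NaN cells from the outer join can be handled.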