Merge Series of a pandas column (which is a Series itself) in groups
I have a pandas DataFrame in which one column is itself a Series (each cell holds a list of values). For example:
df.head()
Col1 Col2
1 ["name1","name2","name3"]
1 ["name3","name2","name4"]
2 ["name1","name2","name3"]
2 ["name1","name5","name6"]
I need to concatenate Col2 within each Col1 group. I want something like this:
Col1 Col2
1 ["name1","name2","name3","name4"]
2 ["name1","name2","name3","name5","name6"]
I tried using groupby followed by
.agg({"Col2":lambda x: pd.Series.append(x)})
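but this raises an error saying that two arguments are required. I also tried using sum in the agg function. That failed with an error as well.
How should I do this?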
Yes, you can't use .agg() on categorical data like this. Anyway, here is my attempt at the problem, with some help from numpy (commented for clarity):
import numpy as np
import pandas as pd

# Unique values of the group-by column ("Col1")
groups = df["Col1"].unique()
# Dict that collects one entry per group on each iteration
d = {}
for val in groups:
    # Record the group key under "Col1" (e.g., 1)
    d.setdefault("Col1", []).append(val)
    # Flat list that will hold every value from "Col2" in this group
    col2_vals = []
    # Sub-DataFrame containing only the rows of this group
    sdf = df.loc[df["Col1"] == val]
    # Loop through the lists stored in "Col2" and add each of
    # their values to col2_vals
    for array in sdf["Col2"].values:
        col2_vals.extend(array)
    # Store the sorted unique values of col2_vals under "Col2"
    d.setdefault("Col2", []).append(np.unique(col2_vals))
new_df = pd.DataFrame(d)
print(new_df)
A more condensed version looks like this:
d = {}
for val in df["Col1"].unique():
    d.setdefault("Col1", []).append(val)
    # Flatten this group's lists, then keep the sorted unique values
    flat = [j for array in df.loc[df["Col1"] == val, "Col2"].values for j in array]
    d.setdefault("Col2", []).append(np.unique(flat))
new_df = pd.DataFrame(d)
print(new_df)
See this related SO question to learn more about Python's dict .setdefault() method.
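For readers unfamiliar with .setdefault(), here is a minimal sketch of the pattern the loops above rely on. The dict name and the values are only illustrative.

```python
d = {}

# Key is missing: setdefault stores the default ([]) and returns it,
# so the append lands in the newly stored list.
d.setdefault("Col1", []).append(1)

# Key is present: setdefault ignores the default and returns the
# existing list, so the append extends it.
d.setdefault("Col1", []).append(2)

print(d)  # {'Col1': [1, 2]}
```

This is why the loops never need an explicit "if key not in d" check before appending.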
You can use groupby with apply and a custom function that first flattens the nested lists with chain (the fastest solution), then removes duplicates with set, converts the result to a list, and finally sorts it:
import pandas as pd
from itertools import chain
df = pd.DataFrame({'Col1': [1, 1, 2, 2],
                   'Col2': [["name1", "name2", "name3"],
                            ["name3", "name2", "name4"],
                            ["name1", "name2", "name3"],
                            ["name1", "name5", "name6"]]})
print(df)
   Col1                   Col2
0     1  [name1, name2, name3]
1     1  [name3, name2, name4]
2     2  [name1, name2, name3]
3     2  [name1, name5, name6]
print(df.groupby('Col1')['Col2']
        .apply(lambda x: sorted(list(set(list(chain.from_iterable(x))))))
        .reset_index())
   Col1                                 Col2
0     1         [name1, name2, name3, name4]
1     2  [name1, name2, name3, name5, name6]
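To make the three steps inside the lambda easier to follow, here is a small sketch that traces them in plain Python on the two lists of the Col1 == 1 group. The variable names are only illustrative.

```python
from itertools import chain

# The two Col2 lists that belong to the group Col1 == 1
group = [["name1", "name2", "name3"], ["name3", "name2", "name4"]]

# Step 1: flatten the nested lists into one list (duplicates are kept)
flat = list(chain.from_iterable(group))

# Step 2: drop duplicates; a set has no guaranteed order
unique = set(flat)

# Step 3: sort to get a deterministic, alphabetical list
result = sorted(unique)

print(flat)
print(result)
```

Because a set does not preserve order, the final sorted call is what makes the output reproducible.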
The solution can be simpler, needing only chain, set and sorted:
print(df.groupby('Col1')['Col2']
        .apply(lambda x: sorted(set(chain.from_iterable(x))))
        .reset_index())
   Col1                                 Col2
0     1         [name1, name2, name3, name4]
1     2  [name1, name2, name3, name5, name6]
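As a possible alternative that is not part of the answers above, and assuming pandas 0.25 or newer (where DataFrame.explode is available), the same result can be sketched without itertools. The names exploded, merged and result are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": [1, 1, 2, 2],
    "Col2": [["name1", "name2", "name3"],
             ["name3", "name2", "name4"],
             ["name1", "name2", "name3"],
             ["name1", "name5", "name6"]],
})

# explode() gives every list element its own row and repeats Col1
exploded = df.explode("Col2")

# unique() keeps the distinct values per group in order of first
# appearance; sorted() then turns each array into a sorted list
merged = (exploded.groupby("Col1")["Col2"]
          .unique()
          .apply(sorted)
          .reset_index())

# Plain dict view of the result, convenient for inspection
result = dict(zip(merged["Col1"].tolist(), merged["Col2"].tolist()))
print(merged)
```

If the order of first appearance should be kept instead of alphabetical order, the .apply(sorted) step could be replaced with .apply(list).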