Pandas new column based on condition after groupby
I have a dataset in which the grouping is based on two columns: code and group. Sample data can be generated as follows:
import pandas as pd
# Sample dataframe
df = pd.DataFrame({'code': [12] * 5 + [20] * 5,
                   'group': ['A', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'B', 'B'],
                   'options': ['x,y', 'x', 'x', 'y', 'y', 'z', 'z', 'x', 'y', 'z']})
print(df)
   code group options
0    12     A     x,y
1    12     A       x
2    12     A       x
3    12     B       y
4    12     B       y
5    20     A       z
6    20     A       z
7    20     B       x
8    20     B       y
9    20     B       z
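The first thing I did was generate a new column containing all possible options for each group. I could not do it in a single step, but this is how I did it: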
# First generate a new column joining all the options by group in temporary strings
df['group_options'] = df.groupby(['code','group'])['options'].transform(lambda x: ','.join(x))
# Transform these temporary strings into lists containing unique values
df['group_options'] = df['group_options'].map(lambda x: list(set(x.split(','))))
Result:
   code group options group_options
0    12     A     x,y        [x, y]
1    12     A       x        [x, y]
2    12     A       x        [x, y]
3    12     B       y           [y]
4    12     B       y           [y]
5    20     A       z           [z]
6    20     A       z           [z]
7    20     B       x     [x, z, y]
8    20     B       y     [x, z, y]
9    20     B       z     [x, z, y]
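Now I want to generate two new columns for later use, group_a_options and group_b_options. For each code, these columns should hold the group_options of group A and group B respectively: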
   code group options group_options group_a_options group_b_options
0    12     A     x,y        [x, y]          [x, y]             [y]
1    12     A       x        [x, y]          [x, y]             [y]
2    12     A       x        [x, y]          [x, y]             [y]
3    12     B       y           [y]          [x, y]             [y]
4    12     B       y           [y]          [x, y]             [y]
5    20     A       z           [z]             [z]       [x, y, z]
6    20     A       z           [z]             [z]       [x, y, z]
7    20     B       x     [x, z, y]             [z]       [x, y, z]
8    20     B       y     [x, z, y]             [z]       [x, y, z]
9    20     B       z     [x, z, y]             [z]       [x, y, z]
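I have been trying to generate these new columns with groupby and transform, but without success. How can I add a condition on the group column to the groupby to get the desired output? Any help is appreciated.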
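First, create a Series by joining each group's values with a comma, splitting on commas, de-duplicating with set, and finally converting the result to a list: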
s = df.groupby(['code','group'])['options'].agg(lambda x: list(set(','.join(x).split(','))))
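To make the aggregation step concrete, here is a minimal sketch that rebuilds the sample frame and inspects the intermediate Series. It uses sorted(set(...)) rather than list(set(...)), which is a substitution made here only so the lists come out in a deterministic order; the names sample and unique_options are hypothetical.

```python
import pandas as pd

# Same sample data as in the question
sample = pd.DataFrame({
    'code': [12] * 5 + [20] * 5,
    'group': ['A', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'B', 'B'],
    'options': ['x,y', 'x', 'x', 'y', 'y', 'z', 'z', 'x', 'y', 'z'],
})

def unique_options(values):
    # Join all rows of one group into a single string, split on commas,
    # drop duplicates with set, and sort for a stable order (assumption of this sketch)
    return sorted(set(','.join(values).split(',')))

# One list per (code, group) pair, indexed by a two-level MultiIndex
s = sample.groupby(['code', 'group'])['options'].agg(unique_options)
print(s)
```

The result has one row per (code, group) pair, which is why it still has to be joined back onto the original rows afterwards.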
Then reshape with Series.unstack and change the column names:
df1 = s.unstack().add_prefix('group_').add_suffix('_options').rename(columns=str.lower)
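As a rough illustration of the reshape, the sketch below builds a small Series with the same (code, group) MultiIndex by hand, with values copied from the sample, and applies the same chain of calls. The variable names idx and wide are hypothetical.

```python
import pandas as pd

# Hand-built stand-in for the aggregated Series
idx = pd.MultiIndex.from_tuples(
    [(12, 'A'), (12, 'B'), (20, 'A'), (20, 'B')], names=['code', 'group'])
s = pd.Series([['x', 'y'], ['y'], ['z'], ['x', 'y', 'z']], index=idx)

# unstack() moves the inner index level (group) into the columns: A, B
wide = s.unstack()

# Rename A -> group_a_options and B -> group_b_options
wide = wide.add_prefix('group_').add_suffix('_options').rename(columns=str.lower)
print(wide.columns.tolist())
```

After the reshape the frame is indexed by code alone, so it can be joined onto the original data using only the code column.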
Finally, use DataFrame.join twice: first on both columns (code and group), then on the code column:
df = df.join(s.rename('group_options'), on=['code','group']).join(df1, on='code')
print(df)
   code group options group_options group_a_options group_b_options
0    12     A     x,y        [y, x]          [y, x]             [y]
1    12     A       x        [y, x]          [y, x]             [y]
2    12     A       x        [y, x]          [y, x]             [y]
3    12     B       y           [y]          [y, x]             [y]
4    12     B       y           [y]          [y, x]             [y]
5    20     A       z           [z]             [z]       [y, x, z]
6    20     A       z           [z]             [z]       [y, x, z]
7    20     B       x     [y, x, z]             [z]       [y, x, z]
8    20     B       y     [y, x, z]             [z]       [y, x, z]
9    20     B       z     [y, x, z]             [z]       [y, x, z]
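Both joins rely on DataFrame.join matching the caller's columns named in on against the index of the other object. A minimal sketch of that behaviour with a made-up lookup Series (left, lookup and label are hypothetical names, not part of the answer):

```python
import pandas as pd

left = pd.DataFrame({'code': [12, 12, 20], 'group': ['A', 'B', 'A']})

# A Series indexed by (code, group); join() requires the Series to have a name
lookup = pd.Series(
    ['first', 'second', 'third'],
    index=pd.MultiIndex.from_tuples(
        [(12, 'A'), (12, 'B'), (20, 'A')], names=['code', 'group']),
    name='label',
)

# The columns listed in `on` are matched against the index levels of `lookup`
joined = left.join(lookup, on=['code', 'group'])
print(joined)
```

This is why the answer renames the Series with s.rename('group_options') before joining: the Series name becomes the new column name.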
If ordering is important, remove the duplicates with the dict.fromkeys trick instead:
# Start from the original sample df (before the columns above were joined)
s = (df.groupby(['code','group'])['options']
       .agg(lambda x: list(dict.fromkeys(','.join(x).split(',')))))
df1 = s.unstack().add_prefix('group_').add_suffix('_options').rename(columns=str.lower)
df = df.join(s.rename('group_options'), on=['code','group']).join(df1, on='code')
print(df)
   code group options group_options group_a_options group_b_options
0    12     A     x,y        [x, y]          [x, y]             [y]
1    12     A       x        [x, y]          [x, y]             [y]
2    12     A       x        [x, y]          [x, y]             [y]
3    12     B       y           [y]          [x, y]             [y]
4    12     B       y           [y]          [x, y]             [y]
5    20     A       z           [z]             [z]       [x, y, z]
6    20     A       z           [z]             [z]       [x, y, z]
7    20     B       x     [x, y, z]             [z]       [x, y, z]
8    20     B       y     [x, y, z]             [z]       [x, y, z]
9    20     B       z     [x, y, z]             [z]       [x, y, z]
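The difference between the two de-duplication approaches can be seen in isolation. dict.fromkeys keeps the first occurrence of each key in insertion order (guaranteed for dict since Python 3.7), whereas the iteration order of a set of strings is not guaranteed. A small sketch with a made-up input string:

```python
raw = 'z,x,y,x,z'   # hypothetical joined options for one group
parts = raw.split(',')

ordered = list(dict.fromkeys(parts))   # duplicates dropped, first-seen order kept
unordered = list(set(parts))           # same elements, order not guaranteed

print(ordered)
print(sorted(unordered) == sorted(ordered))
```

Both lists contain the same elements; only the dict.fromkeys version is certain to follow the order in which the options first appear in the data.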