How to calculate with conditions in pandas?
I have a DataFrame like the one below, and I want to add a new column computed with this formula: Value = A(where Time=1) + A(where Time=3). I do not want to use A(where Time=5).
Type  subType  Time  A  Value
X     a        1     3  =3+9=12
X     a        3     9
X     a        5     9
X     b        1     4  =4+5=9
X     b        3     5
X     b        5     0
Y     a        1     1  =1+2=3
Y     a        3     2
Y     a        5     3
Y     b        1     4  =4+5=9
Y     b        3     5
Y     b        5     2
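I know how to do it by picking out the cells the formula needs, but is there a better way to do the calculation? I suspect I need to add a condition, but I am not sure how. Any suggestions?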
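For anyone who wants to reproduce the problem, the sample table can be rebuilt roughly as follows. This is only a sketch: the values are copied from the question, and the integer dtypes are an assumption.

```python
import pandas as pd

# Sample data copied from the table in the question (dtypes are assumed).
df = pd.DataFrame({
    "Type":    ["X"] * 6 + ["Y"] * 6,
    "subType": ["a", "a", "a", "b", "b", "b"] * 2,
    "Time":    [1, 3, 5] * 4,
    "A":       [3, 9, 9, 4, 5, 0, 1, 2, 3, 4, 5, 2],
})
print(df)
```

The desired Value column is left out on purpose, since it is what the answers compute.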
Use Series.eq with DataFrame.groupby and Series.cumsum to create the groups, then sum within each group.
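c1 = df.Time.eq(1)   # rows where Time == 1 (the start of each block)
c3 = df.Time.eq(3)   # rows where Time == 3
df['Value'] = (df.loc[c1|c3]            # keep only the Time 1 and Time 3 rows
                 .groupby(c1.cumsum())  # the running count of Time == 1 rows labels each block
                 .A
                 .transform('sum')
                 .loc[c1])              # keep the sum on the Time == 1 rows only
print(df)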
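To see why grouping by c1.cumsum() works, it can help to look at the intermediate pieces. The sketch below rebuilds the sample data from the question and splits the chain into steps. The names group_id and sums are introduced here for illustration only, and the approach assumes every block of rows starts with a Time == 1 row.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Type":    ["X"] * 6 + ["Y"] * 6,
    "subType": ["a", "a", "a", "b", "b", "b"] * 2,
    "Time":    [1, 3, 5] * 4,
    "A":       [3, 9, 9, 4, 5, 0, 1, 2, 3, 4, 5, 2],
})

c1 = df["Time"].eq(1)   # True on the first row of each block
c3 = df["Time"].eq(3)

# Each True in c1 bumps the running total, so every block gets its own label.
group_id = c1.cumsum()
print(group_id.tolist())

# Sum A over the Time 1 and Time 3 rows of each block.
sums = df.loc[c1 | c3].groupby(group_id)["A"].transform("sum")

# Keep the sum on the Time == 1 rows; index alignment leaves NaN elsewhere.
df["Value"] = sums.loc[c1]
print(df)
```

Because the labels come from row order rather than from Type and subType, the rows need to be sorted the way they are in the question.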
Or, if you would rather identify the rows by Time not being equal to 5:
c = df['Time'].eq(5)
df['Value'] = (df['A'].mask(c)                          # ignore A where Time == 5
                      .groupby(c.cumsum())
                      .transform('sum')
                      .where(c.shift(fill_value=True))  # keep only the first row of each block
              )
# Another option is map
c = df['Time'].eq(5)
c_cumsum = c.cumsum()
df['Value'] = (c_cumsum.map(df['A'].mask(c)
                                   .groupby(c_cumsum)
                                   .sum())
                       .where(c.shift(fill_value=True)))
Output
   Type subType  Time  A  Value
0     X       a     1  3   12.0
1     X       a     3  9    NaN
2     X       a     5  9    NaN
3     X       b     1  4    9.0
4     X       b     3  5    NaN
5     X       b     5  0    NaN
6     Y       a     1  1    3.0
7     Y       a     3  2    NaN
8     Y       a     5  3    NaN
9     Y       b     1  4    9.0
10    Y       b     3  5    NaN
11    Y       b     5  2    NaN
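The Time == 5 variant is easier to follow with its three helper Series laid side by side. The sketch below rebuilds the sample data from the question. The names masked, group_id, first_row and steps are made up for this illustration, and it assumes every block ends with a Time == 5 row.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Type":    ["X"] * 6 + ["Y"] * 6,
    "subType": ["a", "a", "a", "b", "b", "b"] * 2,
    "Time":    [1, 3, 5] * 4,
    "A":       [3, 9, 9, 4, 5, 0, 1, 2, 3, 4, 5, 2],
})

c = df["Time"].eq(5)

masked = df["A"].mask(c)              # A with the Time == 5 rows set to NaN
group_id = c.cumsum()                 # label increases at every Time == 5 row
first_row = c.shift(fill_value=True)  # True on row 0 and on each row right after a Time == 5 row

# Side-by-side view of the helpers.
steps = pd.DataFrame({
    "Time": df["Time"],
    "masked_A": masked,
    "group_id": group_id,
    "first_row": first_row,
})
print(steps)

# NaN is skipped by sum, so the Time == 5 rows contribute nothing.
df["Value"] = masked.groupby(group_id).transform("sum").where(first_row)
print(df)
```

Each Time == 5 row lands in the same group as the block that follows it, which is harmless here because its A value has already been masked out.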
To fill in the missing values, leave out the final filter:
c = df['Time'].eq(5)
df['value'] = (df['A'].mask(c)
                      .groupby(c.cumsum())
                      .transform('sum')
              )
# or method 1
# c1 = df.Time.eq(1)
# c3 = df.Time.eq(3)
# df['Value'] = (df.loc[c1|c3]
#                  .groupby(c1.cumsum())
#                  .A
#                  .transform('sum')
#               )
print(df)
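Output (each Time=5 row falls into the group that follows it, so it shows that group's sum; the last row has nothing after it and shows 0.0)
   Type subType  Time  A  value
0     X       a     1  3   12.0
1     X       a     3  9   12.0
2     X       a     5  9    9.0
3     X       b     1  4    9.0
4     X       b     3  5    9.0
5     X       b     5  0    3.0
6     Y       a     1  1    3.0
7     Y       a     3  2    3.0
8     Y       a     5  3    9.0
9     Y       b     1  4    9.0
10    Y       b     3  5    9.0
11    Y       b     5  2    0.0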
Or fill every row except those where Time is 5:
c = df['Time'].eq(5)
df['value'] = (df['A'].mask(c)
                      .groupby(c.cumsum())
                      .transform('sum').mask(c))
# or method 1
# c1 = df.Time.eq(1)
# c3 = df.Time.eq(3)
# df['Value'] = (df.loc[c1|c3]
#                  .groupby(c1.cumsum())
#                  .A
#                  .transform('sum')
#                  .loc[c1|c3])
print(df)
Output
   Type subType  Time  A  value
0     X       a     1  3   12.0
1     X       a     3  9   12.0
2     X       a     5  9    NaN
3     X       b     1  4    9.0
4     X       b     3  5    9.0
5     X       b     5  0    NaN
6     Y       a     1  1    3.0
7     Y       a     3  2    3.0
8     Y       a     5  3    NaN
9     Y       b     1  4    9.0
10    Y       b     3  5    9.0
11    Y       b     5  2    NaN
Why not use apply here?
It is already slower, even on a small DataFrame:
%%timeit
(
    df.groupby(by=['Type', 'subType'])
      .apply(lambda x: x.loc[x.Time != 5].A.sum())  # sum A per group, skipping Time == 5
      .to_frame('Value').reset_index()
      .pipe(lambda x: pd.merge(df, x, on=['Type', 'subType'], how='left'))
)
13.6 ms ± 2.67 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
c = df['Time'].eq(5)
df['value'] = (df['A'].mask(c)
                      .groupby(c.cumsum())
                      .transform('sum')
                      .where(c.shift(fill_value=True))
              )
3.67 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
You can use groupby to sum A over the rows where Time is not 5, then merge the result back into the original df.
(
    df.groupby(by=['Type', 'subType'])
      .apply(lambda x: x.loc[x.Time != 5].A.sum())  # sum A per group, skipping Time == 5
      .to_frame('Value').reset_index()
      .pipe(lambda x: pd.merge(df, x, on=['Type', 'subType'], how='left'))
)
   Type subType  Time  A  Value
0     X       a     1  3   12.0
1     X       a     3  9   12.0
2     X       a     5  9   12.0
3     X       b     1  4    9.0
4     X       b     3  5    9.0
5     X       b     5  0    9.0
6     Y       a     1  1    3.0
7     Y       a     3  2    3.0
8     Y       a     5  3    3.0
9     Y       b     1  4    9.0
10    Y       b     3  5    9.0
11    Y       b     5  2    9.0
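The same per-group total can probably be obtained without apply or merge by combining Series.where with a grouped transform. This is not taken from the answers above, only a sketch of a related technique on the sample data from the question; it groups on the Type and subType keys, so it does not depend on row order.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Type":    ["X"] * 6 + ["Y"] * 6,
    "subType": ["a", "a", "a", "b", "b", "b"] * 2,
    "Time":    [1, 3, 5] * 4,
    "A":       [3, 9, 9, 4, 5, 0, 1, 2, 3, 4, 5, 2],
})

# Blank out the Time == 5 rows, then sum A within each (Type, subType) group.
# transform returns a Series aligned with df, so no merge is needed.
df["Value"] = (
    df["A"].where(df["Time"].ne(5))
           .groupby([df["Type"], df["subType"]])
           .transform("sum")
)
print(df)
```

Like the merge version, this writes the group total on every row, including the Time == 5 rows.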
An answer using only indexing and conditions:
# Add A at Time == 1 and A at Time == 3 position by position (same length and order assumed)
df.loc[df['Time'] == 1, 'Value'] = (df.loc[df['Time'] == 1, 'A'].values
                                    + df.loc[df['Time'] == 3, 'A'].values)
df
   Type subType  Time  A  Value
0     X       a     1  3   12.0
1     X       a     3  9    NaN
2     X       a     5  9    NaN
3     X       b     1  4    9.0
4     X       b     3  5    NaN
5     X       b     5  0    NaN
6     Y       a     1  1    3.0
7     Y       a     3  2    NaN
8     Y       a     5  3    NaN
9     Y       b     1  4    9.0
10    Y       b     3  5    NaN
11    Y       b     5  2    NaN