Pandas sum of two columns - dealing with nan-values correctly
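When summing two pandas columns, I want to ignore nan-values when only one of the two columns holds a float. However, when nan appears in both columns, I want to keep nan in the output (instead of 0.0).
Initial dataframe: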
Surf1 Surf2
0 0
NaN 8
8 15
NaN NaN
16 14
15 7
Desired output:
Surf1 Surf2 Sum
0 0 0
NaN 8 8
8 15 23
NaN NaN NaN
16 14 30
15 7 22
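Tried code:
-> The code below ignores nan-values, but when it sums two nan-values it gives 0.0 in the output. In that particular case I want to keep NaN, so that after summing these empty values stay distinguishable from values that are actually 0.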
import pandas as pd
import numpy as np
data = pd.DataFrame({"Surf1": [10,np.nan,8,np.nan,16,15], "Surf2": [22,8,15,np.nan,14,7]})
print(data)
data.loc[:,'Sum'] = data.loc[:,['Surf1','Surf2']].sum(axis=1)
print(data)
You can mask the result by doing:
df.sum(1).mask(df.isna().all(1))
0 0.0
1 8.0
2 23.0
3 NaN
4 30.0
5 22.0
dtype: float64
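To make the two steps of the mask approach easier to follow, here is a minimal sketch that splits the one-liner into parts. It uses the data from the question's code; the intermediate names row_sum, all_missing and masked_sum are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample data taken from the question's code
df = pd.DataFrame({
    "Surf1": [10, np.nan, 8, np.nan, 16, 15],
    "Surf2": [22, 8, 15, np.nan, 14, 7],
})

# Row-wise sum; NaN is skipped by default, so an all-NaN row sums to 0.0
row_sum = df.sum(axis=1)

# True only for rows in which every column is NaN
all_missing = df.isna().all(axis=1)

# Replace the sum with NaN wherever the whole row was missing
masked_sum = row_sum.mask(all_missing)
print(masked_sum)
```

Only the row at index 3 is affected, because it is the only row where both columns are missing.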
You can do this:
df['Sum'] = df.dropna(how='all').sum(1)
Output:
Surf1 Surf2 Sum
0 10.0 22.0 32.0
1 NaN 8.0 8.0
2 8.0 15.0 23.0
3 NaN NaN NaN
4 16.0 14.0 30.0
5 15.0 7.0 22.0
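The dropna approach appears to work because column assignment aligns on the index: rows removed by dropna have no entry in the summed Series, so they come back as NaN. A small sketch of that behaviour, using the question's data (the name partial is hypothetical):

```python
import numpy as np
import pandas as pd

# Sample data taken from the question's code
df = pd.DataFrame({
    "Surf1": [10, np.nan, 8, np.nan, 16, 15],
    "Surf2": [22, 8, 15, np.nan, 14, 7],
})

# Drop rows that are entirely NaN, then sum what is left row by row
partial = df.dropna(how="all").sum(axis=1)
print(partial.index.tolist())  # index label 3 is absent

# Assignment aligns on the index, so the absent label 3 is filled with NaN
df["Sum"] = partial
print(df)
```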
You can use min_count, which sums each row as long as at least one value is not null, and returns null if all of them are null.
df['SUM']=df.sum(min_count=1,axis=1)
#df.sum(min_count=1,axis=1)
Out[199]:
0 0.0
1 8.0
2 23.0
3 NaN
4 30.0
5 22.0
dtype: float64
From the documentation for pandas.DataFrame.sum:
By default, the sum of an empty or all-NA Series is 0.
>>> pd.Series([]).sum() # min_count=0 is the default
0.0
This can be controlled with the min_count parameter. For example, if you’d like the sum of an empty series to be NaN, pass min_count=1.
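As a rough illustration of the quoted behaviour, the sketch below applies min_count to two small Series made up for this example (all_na and one_value are hypothetical names, not from the thread).

```python
import numpy as np
import pandas as pd

# Two made-up Series: one entirely missing, one with a single valid value
all_na = pd.Series([np.nan, np.nan])
one_value = pd.Series([np.nan, 8.0])

# Default min_count=0: an all-NA Series sums to 0.0
default_sum = all_na.sum()

# min_count=1 requires at least one valid value, otherwise the result is NaN
strict_sum = all_na.sum(min_count=1)

# One valid value satisfies min_count=1, so the sum is that value
partial_sum = one_value.sum(min_count=1)

# min_count=2 is not satisfied by a single valid value, so the result is NaN
too_few = one_value.sum(min_count=2)

print(default_sum, strict_sum, partial_sum, too_few)
```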
Change your code to
data.loc[:,'Sum'] = data.loc[:,['Surf1','Surf2']].sum(axis=1, min_count=1)
Output
Surf1 Surf2
0 10.0 22.0
1 NaN 8.0
2 8.0 15.0
3 NaN NaN
4 16.0 14.0
5 15.0 7.0
Surf1 Surf2 Sum
0 10.0 22.0 32.0
1 NaN 8.0 8.0
2 8.0 15.0 23.0
3 NaN NaN NaN
4 16.0 14.0 30.0
5 15.0 7.0 22.0
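I think all of the solutions listed above only work when the value in the first column is missing. If you run into a case where the first column's value is present but the second column's value is missing, try: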
df['sum'] = df['Surf1']
df.loc[(df['Surf2'].notnull()), 'sum'] = df['Surf1'].fillna(0) + df['Surf2']
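For anyone who wants to check the case described here, the sketch below builds a small frame (made up for illustration) in which the second column is missing while the first is not. It compares the min_count approach with Series.add and fill_value=0, an alternative not mentioned in this thread. The names by_min_count and by_add are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up frame covering all four cases:
# both present, first missing, second missing, both missing
df = pd.DataFrame({
    "Surf1": [10, np.nan, 8, np.nan],
    "Surf2": [22, 8, np.nan, np.nan],
})

# Row-wise sum that requires at least one valid value per row
by_min_count = df[["Surf1", "Surf2"]].sum(axis=1, min_count=1)

# Element-wise add; a value missing on one side is treated as 0,
# but the result stays NaN when both sides are missing
by_add = df["Surf1"].add(df["Surf2"], fill_value=0)

print(pd.DataFrame({"min_count": by_min_count, "add": by_add}))
```

On this data both approaches treat a missing value in either column the same way and keep NaN only where both columns are missing.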