How do I take a column of pandas data based on a scalar condition AND a column comparison?
Here is a starting DataFrame:
ipdb> df[["line_amount","modifiedAmount"]]
line_amount modifiedAmount
0 30.00
1 2.88 2.88
2 199.20 199.2
3 -105.00 -104
4 150.00 150
5 75.00
6 -450.00 -450
7 16.13 16.13
8 20.00
9 111.99 111.99
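What I want is a new column of data (or, really, to replace the modifiedAmount column with one) that contains "" wherever the original modifiedAmount is EITHER:
- already "", or
- equal to line_amount
I am having a hard time figuring out how to accomplish what I thought would be easy!
I can get this: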
ipdb> equal_test = df.modifiedAmount == df.line_amount
ipdb> blank_test = df.modifiedAmount == ""
But I can't do:
ipdb> blank_test and equal_test
*** ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I saw this option for when I want to apply a scalar result, but I can't figure out how to get df into the lambda like this:
ipdb> df.modifiedAmount.apply(lambda x: "" if x == df.line_amount else x)
*** NameError: global name 'df' is not defined
Any ideas?
The desired result looks like this:
ipdb> df[["line_amount","modifiedAmount"]]
line_amount modifiedAmount
0 30.00
1 2.88
2 199.20
3 -105.00 -104.00
4 150.00
5 75.00
6 -450.00
7 16.13
8 20.00
9 111.99
(And yes, ideally I would like to convert any remaining values to floats with two decimal places.)
You can use apply on the whole DataFrame, one row at a time.
import pandas as pd
import numpy as np
Create some dummy data and put it in a DataFrame. I use np.nan instead of "".
df = pd.DataFrame({'lineAmount': [30.00, 2.88, 199.20, -105.00, 150.00, 75.00, -450.00, 16.13, 20.00, 111.99],
                   'modifiedAmount': [np.nan, 2.88, 199.20, -104.00, 150.00, np.nan, -450.00, 16.13, np.nan, 111.99]})
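Then you can apply a lambda function over the whole DataFrame. Passing axis=1 to apply() calls the function once per row, so both columns are available on each call:
df['modifiedAmount'] = df.apply(lambda x: np.nan if x.modifiedAmount == x.lineAmount else x.modifiedAmount, axis=1)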
Output:
lineAmount modifiedAmount
0 30.00 NaN
1 2.88 NaN
2 199.20 NaN
3 -105.00 -104
4 150.00 NaN
5 75.00 NaN
6 -450.00 NaN
7 16.13 NaN
8 20.00 NaN
9 111.99 NaN
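As a possible alternative to the row-wise apply above, the same result can be sketched with a vectorized comparison and Series.mask, which replaces values with NaN wherever the condition is True. This is only a sketch on the same dummy data; the variable name same is made up for illustration.

```python
import numpy as np
import pandas as pd

# Same dummy data as above, with np.nan standing in for "".
df = pd.DataFrame({
    "lineAmount": [30.00, 2.88, 199.20, -105.00, 150.00, 75.00, -450.00, 16.13, 20.00, 111.99],
    "modifiedAmount": [np.nan, 2.88, 199.20, -104.00, 150.00, np.nan, -450.00, 16.13, np.nan, 111.99],
})

# Element-wise comparison of the two columns; NaN never compares equal, so blanks give False.
same = df["modifiedAmount"] == df["lineAmount"]

# mask() puts NaN wherever `same` is True and keeps the original value elsewhere.
# Rows that were already NaN simply stay NaN.
df["modifiedAmount"] = df["modifiedAmount"].mask(same)

print(df)
```

On large frames this avoids calling a Python function once per row, which is usually the slower part of apply with axis=1.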
Build the dataset. Note that I entered all the numbers as floats (they appear to be strings in your code).
import pandas as pd
s_dict = {'line_amount': [30, 2.88, 199.2, -105, 150, 75, -450, 16.13, 20, 111.99], 'modifiedAmount': [None,2.88,199.2,-104, 150, None, -450, 16.13, None, 111.99]}
df = pd.DataFrame.from_dict(s_dict)
print(df)
Output:
line_amount modifiedAmount
0 30.00 NaN
1 2.88 2.88
2 199.20 199.20
3 -105.00 -104.00
4 150.00 150.00
5 75.00 NaN
6 -450.00 -450.00
7 16.13 16.13
8 20.00 NaN
9 111.99 111.99
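The next line needs some explanation. It uses two boolean masks (df.modifiedAmount == df.line_amount and pd.isnull(df.modifiedAmount)), combined with | (element-wise OR); the leading ~ means NOT. Rows that the mask does not select are left as NaN in the new column.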
df['new_mod'] = df.loc[~((df.modifiedAmount == df.line_amount) | (pd.isnull(df.modifiedAmount))), 'modifiedAmount']
print(df)
Output:
line_amount modifiedAmount new_mod
0 30.00 NaN NaN
1 2.88 2.88 NaN
2 199.20 199.20 NaN
3 -105.00 -104.00 -104
4 150.00 150.00 NaN
5 75.00 NaN NaN
6 -450.00 -450.00 NaN
7 16.13 16.13 NaN
8 20.00 NaN NaN
9 111.99 111.99 NaN
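If the column really does hold strings with "" for blanks, as the question's printout suggests, something along these lines might get all the way to the desired output, including the two-decimal formatting. It is a sketch under assumptions: the string data below is a guess at what the original column contains, np.isclose is swapped in for == to sidestep tiny float differences, and the names modified and drop are made up for illustration.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's data: amounts as strings, blanks as "".
df = pd.DataFrame({
    "line_amount": [30.00, 2.88, 199.20, -105.00, 150.00, 75.00, -450.00, 16.13, 20.00, 111.99],
    "modifiedAmount": ["", "2.88", "199.2", "-104", "150", "", "-450", "16.13", "", "111.99"],
})

# Convert to floats; blank strings end up as NaN.
modified = pd.to_numeric(df["modifiedAmount"], errors="coerce")

# True where the value was blank OR matches line_amount.
# Note the element-wise | here; Python's `and`/`or` raise the ValueError from the question.
drop = modified.isna() | np.isclose(modified, df["line_amount"])

# Format every value to two decimals, then blank out the rows flagged in `drop`.
df["modifiedAmount"] = modified.map("{:.2f}".format).mask(drop, "")

print(df)
```

The column ends up holding strings, which is what makes the "" blanks and the fixed two decimals possible; keep the float Series (modified.mask(drop)) instead if further arithmetic is needed.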