How do I take a column of pandas data based on a scalar condition AND a column comparison?

Here is the starting DataFrame:

ipdb> df[["line_amount","modifiedAmount"]]
   line_amount modifiedAmount
0        30.00               
1         2.88           2.88
2       199.20          199.2
3      -105.00           -104
4       150.00            150
5        75.00               
6      -450.00           -450
7        16.13          16.13
8        20.00               
9       111.99         111.99

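What I want is a new column of data (or really, to replace the modifiedAmount column with one) that contains "" wherever the original modifiedAmount is EITHER already "" OR equal to line_amount.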

I'm having a hard time figuring out how to do something I thought would be easy!

I can get this far:

ipdb> equal_test = df.modifiedAmount == df.line_amount
ipdb> blank_test = df.modifiedAmount == ""

But I can't do this:

ipdb> blank_test and equal_test
*** ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I saw this option for when I want to apply a scalar result, but I don't know how to get df into the lambda like this:

ipdb> df.modifiedAmount.apply(lambda x: "" if x == df.line_amount else x)
*** NameError: global name 'df' is not defined

Any ideas?

The desired result looks like this:

ipdb> df[["line_amount","modifiedAmount"]]
   line_amount modifiedAmount
0        30.00               
1         2.88         
2       199.20         
3      -105.00         -104.00
4       150.00         
5        75.00               
6      -450.00         
7        16.13         
8        20.00               
9       111.99         

(Yes, ideally I would also like to convert any remaining values to floats with two decimal places.)

You can use apply on the whole DataFrame, row by row.

import pandas as pd
import numpy as np

Create some dummy data and put it in a DataFrame. I used np.nan instead of "".

df = pd.DataFrame({'lineAmount': [30.00, 2.88, 199.20, -105.00, 150.00, 75.00, -450.00, 16.13, 20.00, 111.99],
                   'modifiedAmount': [np.nan, 2.88, 199.20, -104.00, 150.00, np.nan, -450.00, 16.13, np.nan, 111.99]})

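Then you can apply a lambda function to the whole DataFrame. Passing axis=1 to apply() hands the lambda one row at a time, so it can compare the two columns within that row: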

df['modifiedAmount'] = df.apply(lambda x: np.nan if x.modifiedAmount == x.lineAmount else x.modifiedAmount, axis=1)

Output:

    lineAmount  modifiedAmount
0   30.00       NaN
1   2.88        NaN
2   199.20      NaN
3   -105.00     -104
4   150.00      NaN
5   75.00       NaN
6   -450.00     NaN
7   16.13       NaN
8   20.00       NaN
9   111.99      NaN
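
The same result can probably be had without a row-wise lambda. As a minimal sketch, assuming the same dummy data as above (with np.nan standing in for ""), Series.mask replaces values with NaN wherever a boolean condition holds:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'lineAmount': [30.00, 2.88, 199.20, -105.00, 150.00,
                   75.00, -450.00, 16.13, 20.00, 111.99],
    'modifiedAmount': [np.nan, 2.88, 199.20, -104.00, 150.00,
                       np.nan, -450.00, 16.13, np.nan, 111.99],
})

# Element-wise comparison of the two columns.
# NaN == anything is False, but those rows are already NaN, so they stay blank.
same = df['modifiedAmount'] == df['lineAmount']

# mask() sets entries to NaN where the condition is True and keeps the rest.
df['modifiedAmount'] = df['modifiedAmount'].mask(same)
print(df)
```

This works on whole columns at once, which tends to be faster than apply with axis=1 on larger frames.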

Build the dataset. Note that all the values I entered are numeric (they appear to be strings in your code).

import pandas as pd
s_dict = {'line_amount': [30, 2.88, 199.2, -105, 150, 75, -450, 16.13, 20, 111.99], 'modifiedAmount': [None,2.88,199.2,-104, 150, None, -450, 16.13, None, 111.99]}
df = pd.DataFrame.from_dict(s_dict)
print(df)

Output:

   line_amount  modifiedAmount
0        30.00             NaN
1         2.88            2.88
2       199.20          199.20
3      -105.00         -104.00
4       150.00          150.00
5        75.00             NaN
6      -450.00         -450.00
7        16.13           16.13
8        20.00             NaN
9       111.99          111.99

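The next line needs some explanation. It uses two boolean masks, (df.modifiedAmount == df.line_amount) and pd.isnull(df.modifiedAmount), combined with | (element-wise OR). The leading ~ means NOT, so only rows matching neither condition are kept; every other row gets NaN in the new column.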

df['new_mod'] = df.loc[~((df.modifiedAmount == df.line_amount) | (pd.isnull(df.modifiedAmount))), 'modifiedAmount']
print(df)

Output:

   line_amount  modifiedAmount  new_mod
0        30.00             NaN      NaN
1         2.88            2.88      NaN
2       199.20          199.20      NaN
3      -105.00         -104.00     -104
4       150.00          150.00      NaN
5        75.00             NaN      NaN
6      -450.00         -450.00      NaN
7        16.13           16.13      NaN
8        20.00             NaN      NaN
9       111.99          111.99      NaN
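
To tie this back to the question: `and` fails because Python asks each Series for a single truth value, whereas `|` and `&` combine the masks element by element. Below is a rough sketch, assuming the original column holds strings with "" for blanks, as the question's output suggests. The shortened data and the column name `cleaned` are only illustrative. It converts the strings to numbers first, then blanks out the matching rows and rounds what is left to two decimals:

```python
import pandas as pd

df = pd.DataFrame({
    'line_amount': [30.00, 2.88, 199.20, -105.00, 150.00],
    'modifiedAmount': ['', '2.88', '199.2', '-104', '150'],
})

# Assumption: blanks are empty strings; errors='coerce' turns anything
# that cannot be parsed as a number into NaN.
modified = pd.to_numeric(df['modifiedAmount'], errors='coerce')

equal_test = modified == df['line_amount']
blank_test = modified.isnull()

# Element-wise OR; `blank_test or equal_test` would raise the ValueError.
drop = blank_test | equal_test

# Keep values where the mask is False, NaN elsewhere, rounded to two decimals.
df['cleaned'] = modified.where(~drop).round(2)
print(df)
```

NaN is used here as the blank marker so the column stays numeric; writing "" back into it would turn the column into object dtype again.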