Pandas mask with composite expression behaviour

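This question was asked earlier by another user (and then deleted). I was working on a solution to post as an answer when the question disappeared, and I can't make sense of pandas' behaviour, so I would appreciate some clarity. The original question went something like this: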

How can I replace every negative value except those in a given list with NaN in a Pandas dataframe?

My setup to reproduce the scenario is as follows:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A' : [x for x in range(4)],
    'B' : [x for x in range(-2, 2)]
})

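Technically this should just be a matter of passing the right boolean expression to DataFrame.where (or to boolean indexing). My attempted solution looks like this: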

df[df >= 0 | df.isin([-2])] 

which yields:

index A B
0 0 NaN
1 1 NaN
2 2 0
3 3 1

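This also nulls out the number that is in the list!

Moreover, if I mask the dataframe with each of the two conditions on its own, I get the correct behaviour:

with df[df >= 0] (identical to the compound result):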

index A B
0 0 NaN
1 1 NaN
2 2 0
3 3 1

with df[df.isin([-2])]:

index A B
0 NaN -2.0
1 NaN NaN
2 NaN NaN
3 NaN NaN

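So it seems that either

  1. I am running into some undefined behaviour as a result of performing logic on NaN values, or
  2. I have got something wrong.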

Can anyone explain this to me?

Solution

df[(df >= 0) | (df.isin([-2]))] 
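As a quick check, here is that parenthesised mask applied to the frame from the question. This is only a minimal sketch; the names keep and result are mine and do not come from the original post.

```python
import numpy as np
import pandas as pd

# Same data as in the question
df = pd.DataFrame({
    'A': list(range(4)),
    'B': list(range(-2, 2)),
})

# Each condition is parenthesised, so | combines two boolean frames
keep = (df >= 0) | (df.isin([-2]))

# Boolean-frame indexing keeps values where keep is True, NaN elsewhere
result = df[keep]
print(result)
```

With this mask the -2 in column B should survive, while -1 (negative and not in the list) is replaced by NaN.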

Explanation

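In Python, bitwise OR, |, has higher operator precedence than comparison operators like >= (see https://docs.python.org/3/reference/expressions.html#operator-precedence). Your expression was therefore evaluated as df >= (0 | df.isin([-2])), not as the OR of two separate masks.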
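If you want to see the grouping without running any pandas code, one option is to inspect the parse tree with the standard-library ast module. In this sketch, mask is just a placeholder name standing in for df.isin([-2]); nothing is evaluated, only parsed.

```python
import ast

# Parse the unparenthesised expression without evaluating it
tree = ast.parse("df >= 0 | mask", mode="eval")
top = tree.body

# The outermost node is the comparison, so >= is applied last
print(type(top).__name__)

# The right-hand side of the comparison is the bitwise OR, which binds first
rhs = top.comparators[0]
print(type(rhs).__name__, type(rhs.op).__name__)
```

The comparison sits at the top of the tree and the | sits underneath it, which is the df >= (0 | mask) grouping.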

When filtering a pandas DataFrame on multiple boolean conditions, you need to wrap each condition in parentheses. More from the boolean indexing section of the pandas user guide:

Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses, since by default Python will evaluate an expression such as df['A'] > 2 & df['B'] < 3 as df['A'] > (2 & df['B']) < 3, while the desired evaluation order is (df['A'] > 2) & (df['B'] < 3).
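One way to avoid the precedence trap altogether is to give each condition its own name before combining them. The sketch below assumes the frame from the question; allowed, non_negative, in_allowed and cleaned are illustrative names of my own, and DataFrame.where is used in place of df[...] purely as a matter of taste.

```python
import pandas as pd

df = pd.DataFrame({
    'A': list(range(4)),
    'B': list(range(-2, 2)),
})

# Hypothetical list of negative values that should be kept
allowed = [-2]

# Build each mask separately; no comparison shares a line with |
non_negative = df >= 0
in_allowed = df.isin(allowed)

# DataFrame.where keeps values where the condition is True, NaN elsewhere
cleaned = df.where(non_negative | in_allowed)
print(cleaned)
```

Because each mask is a finished boolean DataFrame before | is applied, no parentheses are needed on the last line.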