Detecting outliers within one column for ranges of rows

Question

In a given DataFrame I have these two columns:

 neighbourhood_group
 price

The price column holds the prices for every neighbourhood_group:

    neighbourhood_group price
 0  Brooklyn            149
 1  Manhattan           225
 2  Manhattan           150
 3  Brooklyn            89
 4  Manhattan           80
 5  Manhattan           200
 6  Brooklyn            60
 7  Manhattan           79
 8  Manhattan           79
 9  Manhattan           150

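I am trying to detect the outliers within each neighbourhood_group.

So far my only idea is to group the prices by neighbourhood_group, detect the outliers in each group, and build a mask for the rows that need to be removed.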

 data.groupby('neighbourhood_group')['price']

I suspect there may be a simpler solution.

Answer 1

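You can use GroupBy.apply to subtract the group mean from each value and keep every row whose absolute deviation is at most 3 * std. Passing group_keys=False keeps the original row index, so the mask lines up with df:

m = df.groupby('neighbourhood_group', group_keys=False)['price'].apply(lambda x: x.sub(x.mean()).abs() <= (x.std()*3))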

df[m]

Output

  neighbourhood_group  price
0            Brooklyn    149
1           Manhattan    225
2           Manhattan    150
3            Brooklyn     89
4           Manhattan     80
5           Manhattan    200
6            Brooklyn     60
7           Manhattan     79
8           Manhattan     79
9           Manhattan    150

Note: in this case all rows are returned, because none of them is an outlier.
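
For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample frame from the question and builds the same kind of mask with transform rather than apply. The helper name keep_within_k_std and its parameter k are my own, introduced only for illustration; they are not part of pandas or of the answer above.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "neighbourhood_group": [
        "Brooklyn", "Manhattan", "Manhattan", "Brooklyn", "Manhattan",
        "Manhattan", "Brooklyn", "Manhattan", "Manhattan", "Manhattan",
    ],
    "price": [149, 225, 150, 89, 80, 200, 60, 79, 79, 150],
})


def keep_within_k_std(frame, group_col, value_col, k=3.0):
    """Hypothetical helper: keep rows within k group std of the group mean."""
    grouped = frame.groupby(group_col)[value_col]
    # transform() returns one value per original row, so no index juggling is needed.
    deviation = (frame[value_col] - grouped.transform("mean")).abs()
    mask = deviation <= k * grouped.transform("std")
    return frame[mask]


kept = keep_within_k_std(df, "neighbourhood_group", "price", k=3)
print(len(kept), "of", len(df), "rows kept")
```

One thing to watch for: a group with a single row has an undefined standard deviation (NaN), the comparison against NaN is False, and so this sketch drops such rows. Whether that is what you want depends on your data.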

Answer 2

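I think using groupby makes a lot of sense. From there I would pull out the individual groups, for example with the get_group method, and then run whatever analysis you need. In case you missed it, take a look at this example: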

Detect and exclude outliers in Pandas data frame

Cheers, and nice question. I'm interested in this too, so I'll be following it.
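
To make the get_group idea concrete, here is a rough sketch on the sample data from the question. The threshold of one standard deviation is an arbitrary choice of mine so that this small sample flags something; it is not a recommendation.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "neighbourhood_group": [
        "Brooklyn", "Manhattan", "Manhattan", "Brooklyn", "Manhattan",
        "Manhattan", "Brooklyn", "Manhattan", "Manhattan", "Manhattan",
    ],
    "price": [149, 225, 150, 89, 80, 200, 60, 79, 79, 150],
})

grouped = df.groupby("neighbourhood_group")

# Pull out a single group by its key and inspect it on its own.
manhattan = grouped.get_group("Manhattan")
print(manhattan["price"].describe())

# Or walk over every group and collect the rows that sit more than
# `threshold` standard deviations away from their group mean.
threshold = 1.0  # assumption: chosen only for this small example
flagged = []
for name, group in grouped:
    z = (group["price"] - group["price"].mean()) / group["price"].std()
    flagged.append(group[z.abs() > threshold])

outliers = pd.concat(flagged).sort_index()
print(outliers)
```

The explicit loop is slower than a vectorised mask on large frames, but it is handy while exploring, because each group is an ordinary DataFrame you can plot or describe separately.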

Answer 3

I would do this somewhat manually:

Suppose your df looks like this (note that I added 2 rows at the bottom):

    neighbourhood_group price
0   Brooklyn    149
1   Manhattan   225
2   Manhattan   150
3   Brooklyn    89
4   Manhattan   80
5   Manhattan   200
6   Brooklyn    60
7   Manhattan   79
8   Manhattan   79
9   Manhattan   150
10  Manhattan   28
11  Manhattan   280

Let's add 2 helper columns to make this easier:

df['mean'] = df.groupby('neighbourhood_group')['price'].transform('mean')
df['std'] = df.groupby('neighbourhood_group')['price'].transform('std')

Then we compute a True/False flag, is_outlier:

# A row is flagged when its price lies more than one std away from its group mean.
df['is_outlier'] = (df['price'] - df['mean']).abs() > df['std']

Result

    neighbourhood_group price   mean              std   is_outlier
0   Brooklyn            149     99.333333   45.390895   True
1   Manhattan           225     141.222222  82.308532   True
2   Manhattan           150     141.222222  82.308532   False
3   Brooklyn            89      99.333333   45.390895   False
4   Manhattan           80      141.222222  82.308532   False
5   Manhattan           200     141.222222  82.308532   False
6   Brooklyn            60      99.333333   45.390895   False
7   Manhattan           79      141.222222  82.308532   False
8   Manhattan           79      141.222222  82.308532   False
9   Manhattan           150     141.222222  82.308532   False
10  Manhattan           28      141.222222  82.308532   True
11  Manhattan           280     141.222222  82.308532   True

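Also: @Willem Van Onsem pointed out that an outlier is usually defined as a value more than 3 sigma above or below the mean. Keep that in mind for your own work; you can choose whatever deviation from the mean suits you (I used 1 std).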
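
To see how much the choice of threshold matters, here is a small sketch on the 12-row frame above. It stores the signed distance from the group mean in units of the group std and then tries a few cut-offs. The column name z_score and the loop variable k are my own names, used only for illustration.

```python
import pandas as pd

# The 12-row frame from this answer (two rows added at the bottom).
df = pd.DataFrame({
    "neighbourhood_group": [
        "Brooklyn", "Manhattan", "Manhattan", "Brooklyn", "Manhattan", "Manhattan",
        "Brooklyn", "Manhattan", "Manhattan", "Manhattan", "Manhattan", "Manhattan",
    ],
    "price": [149, 225, 150, 89, 80, 200, 60, 79, 79, 150, 28, 280],
})

prices = df.groupby("neighbourhood_group")["price"]

# Signed distance from the group mean, measured in group standard deviations.
df["z_score"] = (df["price"] - prices.transform("mean")) / prices.transform("std")

# Compare how many rows each cut-off would flag.
for k in (1, 2, 3):
    flagged = df.index[df["z_score"].abs() > k].tolist()
    print(f"k={k}: flagged rows {flagged}")
```

Keeping the z-score as a column, rather than only a boolean, lets you change the cut-off later without recomputing the group statistics.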

python

outliers

pandas