Getting rid of outlier rows in a multi-column pandas DataFrame

Question

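I have a pandas DataFrame with many columns (>100). I standardized all the columns, so each one is centered at 0 (mean 0, standard deviation 1). I want to drop every row that has a value below -2 or above 2 in any column. For example, suppose rows 2, 3, and 4 are outliers in the first column, and rows 3, 4, 5, and 6 are outliers in the second column. Then I want to drop rows [2, 3, 4, 5, 6].

My attempt uses a for loop to go through each column, collect the row indices of the outliers, and store them in a list. This gives me a list that contains one list of row indices per column. I then take the unique values to get the row indices I should drop. My problem is that I don't know how to slice the DataFrame so that it excludes those rows. I was thinking of using an %in% operator, but it does not accept a list-in-a-list format. My code is shown below.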

### Getting rid of the outliers
'''
We are going to get rid of the outliers that fall outside the range -2 to 2.
'''
import numpy as np

aux_features = features_scaled.values
n_cols = aux_features.shape[1]
n_rows = aux_features.shape[0]
outliers_index = []

for i in range(n_cols):
    variable = aux_features[:, i]  # Take one column at a time
    condition = (variable < -2) | (variable > 2)  # Condition that marks the outliers
    index = np.where(condition)  # Tuple holding one array of row positions
    outliers_index.append(index)

# Flatten the list of tuples, then the arrays inside them
outliers = [j for i in outliers_index for j in i]

outliers_2 = np.array([j for i in outliers for j in i])
unique_index = list(np.unique(outliers_2))  # Final list of all row indices that contain outliers

total_index = list(range(n_rows))

# Stuck here: this does not produce a per-row mask
aux = (total_index in unique_index)

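outliers_2 holds all the row indices (including repeats), and unique_index keeps only the unique values, so I end up with the indices of every row that contains an outlier. This is where I am stuck. Does anyone know how to finish this, or a better way to remove these outliers? I suspect my approach would be very slow on a very large dataset.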

Answer 1

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.standard_normal(size=(1000, 5)))  # example data
cleaned = df[~(np.abs(df) > 2).any(axis=1)]  # keep rows where no |value| exceeds 2

Explanation:

Flag every value that is below -2 or above 2. This returns a DataFrame of booleans:

np.abs(df) > 2

Check whether each row contains an outlier. This evaluates to True for every row that has at least one:

(np.abs(df) > 2).any(axis=1)
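A small sketch of these two steps on made-up data may help to see what each expression returns. The frame `toy` and its column names are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows 1 and 3 each hold one value outside [-2, 2].
toy = pd.DataFrame({
    "a": [0.1, 2.5, -0.3, 0.0],
    "b": [1.9, 0.2, -1.0, -2.1],
})

# Element-wise test: a boolean frame with the same shape as toy.
mask = np.abs(toy) > 2

# Row-wise reduction: True where at least one column is flagged.
row_has_outlier = mask.any(axis=1)

print(row_has_outlier.tolist())  # [False, True, False, True]
```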

Finally, select all rows without outliers by negating the mask with the ~ operator:

df[~(np.abs(df) > 2).any(axis=1)]
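If you would rather finish the index-based approach from the question, one possible way is to turn the collected row positions into a boolean "keep" mask with `np.isin`. The sketch below assumes `features_scaled` is a DataFrame like the one in the question; here it is filled with random example data, and the seed is arbitrary. It also replaces the column loop with a single `np.where` call on the 2-D array, which is a substitution on my part rather than the asker's code.

```python
import numpy as np
import pandas as pd

# Assumed stand-in for the question's features_scaled (random example data).
rng = np.random.default_rng(0)
features_scaled = pd.DataFrame(rng.standard_normal(size=(1000, 5)))

values = features_scaled.to_numpy()

# On a 2-D condition, np.where returns (row_positions, column_positions).
rows, _ = np.where((values < -2) | (values > 2))
unique_index = np.unique(rows)  # row positions that contain an outlier

# Boolean mask over row positions: True for rows NOT listed in unique_index.
keep = ~np.isin(np.arange(len(features_scaled)), unique_index)
cleaned_by_index = features_scaled[keep]

# The vectorized boolean-mask filter, for comparison.
cleaned_by_mask = features_scaled[~(np.abs(features_scaled) > 2).any(axis=1)]
```

Both routes should select the same rows. The mask version skips the intermediate index list, so it is usually the simpler choice for wide frames.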

