Dealing with outliers in Pandas

Good day. I am trying to remove outliers from a single column of a table using the code below.
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from scipy.stats import norm
from scipy import stats
import numpy as np

df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/LargeData/m2_survey_data.csv")
df["ConvertedComp"].plot(kind="box", figsize=(10,10))
z_scores = stats.zscore(df["ConvertedComp"])
abs_z_scores = np.abs(z_scores)
filtered_entries = (abs_z_scores < 3).all(axis=1)
new_df = df[filtered_entries]

The code crashes with the following error.

---------------------------------------------------------------------------
AxisError                                 Traceback (most recent call last)
<ipython-input-133-7811da442811> in <module>
      4 z_scores
      5 abs_z_scores = np.abs(z_scores)
----> 6 filtered_entries = (abs_z_scores < 3).all(axis=1)
      7 #new_df = df[filtered_entries]

C:\ProgramData\WatsonStudioDesktop\miniconda3\envs\desktop\lib\site-packages\numpy\core\_methods.py in _all(a, axis, dtype, out, keepdims)
     44 
     45 def _all(a, axis=None, dtype=None, out=None, keepdims=False):
---> 46     return umr_all(a, axis, dtype, out, keepdims)
     47 
     48 def _count_reduce_items(arr, axis):

AxisError: axis 1 is out of bounds for array of dimension 1

Thanks in advance for any advice; I have just about run out of ideas.

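Your zscore is computed on only one column, so the result is a one-dimensional array. A 1-D array has no axis 1, which is why `.all(axis=1)` raises the AxisError. Drop that call and use the boolean mask directly: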

z_scores = stats.zscore(df["ConvertedComp"])
new_df = df[np.abs(z_scores) < 3]
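
To make the dimensionality point concrete, here is a minimal sketch on made-up data. The `toy` frame and its values are hypothetical stand-ins for the survey column, and the z-score is computed by hand with NumPy (population standard deviation, which matches the default `ddof=0` of `stats.zscore`) so the snippet runs without SciPy.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df["ConvertedComp"]:
# 19 typical values and one extreme value.
toy = pd.DataFrame({"ConvertedComp": [50_000.0] * 19 + [5_000_000.0]})

values = toy["ConvertedComp"].to_numpy()
# Hand-rolled z-score with the population standard deviation (ddof=0).
z_scores = (values - values.mean()) / values.std()
mask = np.abs(z_scores) < 3

print(mask.ndim)  # a single column gives a 1-D mask

# Asking a 1-D array to reduce along axis 1 fails, as in the question.
axis_error = None
try:
    mask.all(axis=1)
except ValueError as err:  # NumPy's AxisError is a subclass of ValueError
    axis_error = err
print(axis_error)

# The 1-D mask already has one entry per row, so it can index the frame directly.
new_toy = toy[mask]
print(len(toy), len(new_toy))
```

The mask has one boolean per row, so there is nothing left to reduce across; indexing with it directly drops the single extreme row.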

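Now, if you run zscore on multiple columns, the result is two-dimensional and your original code works, because `.all(axis=1)` keeps only the rows where every column is within 3 standard deviations (`AnotherColumn` here is a placeholder for a second numeric column):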

z_scores = stats.zscore(df[["ConvertedComp", 'AnotherColumn']])
new_df = df[(np.abs(z_scores) < 3).all(axis=1)]
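
The multi-column case can be sketched the same way. Again the data is invented, `AnotherColumn` is only a placeholder name, and the z-scores are computed per column with NumPy as an assumed equivalent of `stats.zscore` with its default `axis=0` and `ddof=0`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 18 is extreme in AnotherColumn,
# row 19 is extreme in ConvertedComp.
toy = pd.DataFrame({
    "ConvertedComp": [50_000.0] * 19 + [5_000_000.0],
    "AnotherColumn": [1.0] * 18 + [500.0, 1.0],
})

values = toy[["ConvertedComp", "AnotherColumn"]].to_numpy()
# Column-wise z-scores (axis=0), population standard deviation (ddof=0).
z_scores = (values - values.mean(axis=0)) / values.std(axis=0)

within = np.abs(z_scores) < 3    # 2-D: one boolean per row and column
row_mask = within.all(axis=1)    # 1-D: True where every column passes

new_toy = toy[row_mask]
print(within.shape, row_mask.shape, len(new_toy))
```

Here `axis=1` is valid because the mask has a column dimension to collapse; a row is kept only if none of its selected columns is an outlier.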