How to remove outliers
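I am working on a regression problem with 10 independent variables, and I am using SVR. Despite doing feature selection and tuning the SVR parameters with grid search, I get a large MAPE of 15%. So I tried to remove the outliers, but after removing them I cannot split the data. My question is: do outliers affect the accuracy of a regression?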
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import Normalizer
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
import pandas as pd
from sklearn import preprocessing
features=pd.read_csv('selectedData.csv')
target = features['SYSLoad']
features= features.drop('SYSLoad', axis = 1)
from scipy import stats
import numpy as np
z = np.abs(stats.zscore(features))
print(z)
threshold = 3
print(np.where(z > 3))
features2 = features[(z < 3).all(axis=1)]
from sklearn.model_selection import train_test_split
train_input, test_input, train_target, test_target = train_test_split(features2, target, test_size = 0.25, random_state = 42)
Running this code produces the following error:
    "samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [33352, 35064]
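You get the error because your target variable has the same length as features (presumably 35064), due to:
target = features['SYSLoad']
whereas your features2 variable is shorter (presumably 33352), i.e. it is a subset of features, due to:
features2 = features[(z < 3).all(axis=1)]
so train_test_split rightly complains that the features and the labels have different lengths.
You should therefore subset target in the same way and pass the resulting target2 to train_test_split: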
target2 = target[(z < 3).all(axis=1)]
train_input, test_input, train_target, test_target = train_test_split(features2, target2, test_size = 0.25, random_state = 42)
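The same fix can be sketched end to end on made-up data. This is a minimal sketch, not the asker's actual pipeline: the synthetic DataFrame, the column names f1 and f2, and the injected outlier value are assumptions standing in for selectedData.csv. The point it illustrates is that the boolean row mask is computed once and applied to both features and target, so the two stay the same length and row-aligned.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.model_selection import train_test_split

# Synthetic stand-in for selectedData.csv (hypothetical data for illustration).
rng = np.random.default_rng(42)
data = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f1", "f2", "SYSLoad"])
data.iloc[::100, 0] = 50.0  # inject obvious outliers into f1 on every 100th row

target = data["SYSLoad"]
features = data.drop("SYSLoad", axis=1)

# Build the row mask once from the feature z-scores...
z = np.abs(stats.zscore(features))
mask = np.asarray((z < 3).all(axis=1))

# ...and apply the same mask to features and target.
features2 = features[mask]
target2 = target[mask]

train_input, test_input, train_target, test_target = train_test_split(
    features2, target2, test_size=0.25, random_state=42
)
print(len(features), len(features2), len(target2))
```

Storing the mask in a variable, rather than repeating the `(z < 3).all(axis=1)` expression, makes it harder for the two subsets to drift apart if the threshold is changed later.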