How to reduce the number of outliers in OneClassSVM in sklearn in Python?

Question

I am using OneClassSVM as follows.

import numpy as np
from sklearn.svm import OneClassSVM

clf = OneClassSVM(random_state=42)
clf.fit(X)
y_pred_train = clf.predict(X)

# Count the training points predicted as outliers (label -1)
print(len(np.where(y_pred_train == -1)[0]))

However, more than 50% of my data is flagged as outliers. I would like to know whether there is a way to reduce the number of outliers in OneClassSVM.

I tried contamination. However, it seems that OneClassSVM does not support contamination.

Is there another approach I can use?

I am happy to provide more details if needed.

Answer 1

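I would be interested to know the variance, dimensionality, and number of sample points you are working with, but my first suggestion would be to try: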

clf = OneClassSVM(random_state=42, gamma='scale')

From the docs:

Current default is ‘auto’ which uses 1 / n_features, if gamma='scale' is passed then it uses 1 / (n_features * X.var()) as value of gamma. The current default of gamma, ‘auto’, will change to ‘scale’ in version 0.22. ‘auto_deprecated’, a deprecated version of ‘auto’ is used as a default indicating that no explicit value of gamma was passed.
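To make the difference between the two gamma heuristics concrete, here is a minimal sketch that computes both values by hand, following the formulas in the quoted docs, and compares the share of training points flagged as outliers. The data is a synthetic stand-in (an assumption, since the asker's X is not shown), and random_state is left out because, as far as I know, recent scikit-learn releases no longer accept it for OneClassSVM.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic placeholder for the asker's X (assumption): 500 points,
# 5 features, with a variance much larger than 1.
rng = np.random.RandomState(42)
X = rng.normal(loc=0.0, scale=10.0, size=(500, 5))

# The two heuristics described in the quoted docs.
gamma_auto = 1.0 / X.shape[1]                # 'auto'
gamma_scale = 1.0 / (X.shape[1] * X.var())   # 'scale'

results = {}
for name, gamma in [("auto", gamma_auto), ("scale", gamma_scale)]:
    clf = OneClassSVM(kernel="rbf", gamma=gamma).fit(X)
    # Fraction of training points predicted as outliers (label -1)
    results[name] = float(np.mean(clf.predict(X) == -1))
    print(f"gamma={name} ({gamma:.5f}): outlier fraction = {results[name]:.2f}")
```

When the features have a large variance, the 'auto' value makes the RBF kernel very narrow relative to the spread of the data, which is one plausible reason for an unexpectedly high outlier count. Whether that applies here depends on the actual data.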

Answer 2

You can control how many points in the training data are labeled as outliers through the nu parameter of OneClassSVM.

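From the API docs, nu is "An upper bound on the fraction of training errors and a lower bound of the fraction of support vectors. Should be in the interval (0, 1]. By default 0.5 will be taken." Setting nu below the default of 0.5 therefore lowers the share of training points that can be flagged as outliers.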
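As a rough illustration of the effect of nu, the sketch below fits OneClassSVM with a few values and prints the fraction of training points predicted as outliers. The data and the particular nu values are assumptions chosen for the example, not taken from the question.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic placeholder for the asker's X (assumption)
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 2))

fractions = {}
for nu in (0.5, 0.1, 0.01):
    clf = OneClassSVM(kernel="rbf", gamma="scale", nu=nu).fit(X)
    # Fraction of training points predicted as outliers (label -1)
    fractions[nu] = float(np.mean(clf.predict(X) == -1))
    print(f"nu={nu}: fraction flagged as outliers = {fractions[nu]:.3f}")
```

The flagged fraction should roughly track nu, so a smaller nu is the closest equivalent to the contamination setting found in other scikit-learn outlier detectors.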

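I suggest you get a labeled validation set and then tune your SVM hyperparameters, such as nu, kernel, and so on, to get the best performance on that labeled validation set.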
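One way such tuning could look is sketched below: a small manual grid over nu and gamma, scored on a labeled validation set with the F1 score of the outlier class. Everything here is illustrative: the data is synthetic, the grid values are arbitrary, the kernel is fixed to RBF for brevity, and F1 is only one reasonable choice of metric.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(1)

# Hypothetical data: training set assumed to be mostly normal points,
# validation set with known labels (+1 inlier, -1 outlier).
X_train = rng.normal(size=(500, 2))
X_val = np.vstack([
    rng.normal(size=(200, 2)),          # labeled inliers
    rng.uniform(-6, 6, size=(20, 2)),   # labeled outliers
])
y_val = np.r_[np.ones(200, dtype=int), -np.ones(20, dtype=int)]

best = None  # (score, nu, gamma)
for nu in (0.01, 0.05, 0.1, 0.5):
    for gamma in ("scale", 0.1, 1.0):
        clf = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(X_train)
        # Score how well the outlier class (-1) is recovered on validation data
        score = f1_score(y_val, clf.predict(X_val), pos_label=-1)
        if best is None or score > best[0]:
            best = (score, nu, gamma)

print(f"best F1={best[0]:.3f} with nu={best[1]}, gamma={best[2]}")
```

The same loop can be extended to other kernels or to a different metric, such as precision on the outlier class, if false alarms are the main concern.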

python

outliers

scikit-learn