OCSVM in scikit: distance of outlier is always negative
I am using the one-class SVM classifier OneClassSVM from scikit-learn to detect outliers in a dataset. The dataset has 30,000 samples with 1,024 variables, and I use 10% of it as training data.
from sklearn import svm

clf = svm.OneClassSVM(nu=0.001, kernel="rbf", gamma=1e-5)
clf.fit(trset)
dist2hptr = clf.decision_function(trset)  # signed distance of each sample to the decision boundary
tr_y = clf.predict(trset)  # +1 for inliers, -1 for outliers
As shown above, I use decision_function(x) to compute each sample's distance to the decision boundary. When I compare the predictions with the distances, samples labeled +1 in the prediction output always have positive distance values, and samples labeled -1 always have negative ones.
I thought a distance should be unsigned, since it has no direction. I would like to understand how this distance is computed in scikit-learn's OneClassSVM classifier. Does the sign simply indicate that a sample lies outside the decision hyperplane computed by the SVM?
Please help.
As the scikit-learn documentation explains, sklearn's OneClassSVM implements the method from the following paper:
Bernhard Schölkopf, John C. Platt, John C. Shawe-Taylor, Alex J. Smola, and Robert C. Williamson. 2001. Estimating the Support of a High-Dimensional Distribution. Neural Comput. 13, 7 (July 2001), 1443-1471. DOI: https://doi.org/10.1162/089976601750264965
Let's look at that paper's abstract:
Suppose you are given some data set drawn from an underlying probability distribution P and you want to estimate a “simple” subset S of input space such that the probability that a test point drawn from P lies outside of S equals some a priori specified value between 0 and 1.
We propose a method to approach this problem by trying to estimate a function f that is positive on S and negative on the complement.
So the abstract defines the function f that OneClassSVM estimates, and sklearn follows this definition: f is positive on S (inliers) and negative on its complement (outliers), which is exactly why decision_function returns signed values whose sign matches the output of predict.
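A minimal sketch of this sign convention on toy data (the data and the nu/gamma values here are illustrative, not tuned): points far from the training distribution get negative decision_function values and a -1 prediction, while points inside the estimated region get positive values and +1.

```python
import numpy as np
from sklearn import svm

rng = np.random.RandomState(0)

# Training data: 200 samples from a standard normal distribution.
X_train = rng.randn(200, 2)

# Test data: 20 similar points plus 5 obvious outliers far from the training cloud.
X_test = np.vstack([rng.randn(20, 2),
                    rng.uniform(low=4.0, high=6.0, size=(5, 2))])

clf = svm.OneClassSVM(nu=0.05, kernel="rbf", gamma=0.1)
clf.fit(X_train)

scores = clf.decision_function(X_test)  # signed: positive inside S, negative outside
labels = clf.predict(X_test)            # +1 inlier, -1 outlier

# The far-away points are flagged as outliers with negative scores.
print(labels[-5:])  # all -1
print(np.all(scores[-5:] < 0))
```

So the sign is not a mistake: it encodes which side of the learned boundary a sample falls on, and the magnitude indicates how far from that boundary it is.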