scikit-learn and MLlib difference in predictions (Python)

Question

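I have a question about an SVM model that I trained for binary classification using Spark 2.0.0. I followed the same logic with scikit-learn and MLlib, using exactly the same dataset. For scikit-learn I have the following code: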

from sklearn.svm import SVC

# X_train / y_train: the training features and 0/1 labels
svc_model = SVC()
svc_model.fit(X_train, y_train)

# predict() expects a 2D array, so each sample is wrapped in an outer list
print("supposed to be 1")
print(svc_model.predict([[15, 15, 0, 15, 15, 4, 12, 8, 0, 7]]))
print(svc_model.predict([[15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0, 0.0, 12.0, 15.0]]))
print(svc_model.predict([[15.0, 15.0, 7.0, 0.0, 7.0, 0.0, 15.0, 15.0, 15.0, 15.0]]))
print(svc_model.predict([[7.0, 0.0, 15.0, 15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0]]))

print("supposed to be 0")
print(svc_model.predict([[18.0, 15.0, 7.0, 7.0, 15.0, 0.0, 15.0, 15.0, 15.0, 15.0]]))
print(svc_model.predict([[11.0, 13.0, 7.0, 10.0, 7.0, 13.0, 7.0, 19.0, 7.0, 7.0]]))
print(svc_model.predict([[15.0, 15.0, 18.0, 7.0, 15.0, 15.0, 15.0, 18.0, 7.0, 15.0]]))
print(svc_model.predict([[15.0, 15.0, 8.0, 0.0, 0.0, 8.0, 15.0, 15.0, 15.0, 7.0]]))

It returns:

supposed to be 1
[0]
[1]
[1]
[1]
supposed to be 0
[0]
[0]
[0]
[0]

With Spark, I do:

from pyspark.mllib.classification import SVMWithSGD
from pyspark.mllib.linalg import Vectors

# trainingData: an RDD of LabeledPoint built from the same dataset
model_svm = SVMWithSGD.train(trainingData, iterations=100)

print("supposed to be 1")
print(model_svm.predict(Vectors.dense(15.0, 15.0, 0.0, 15.0, 15.0, 4.0, 12.0, 8.0, 0.0, 7.0)))
print(model_svm.predict(Vectors.dense(15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0, 0.0, 12.0, 15.0)))
print(model_svm.predict(Vectors.dense(15.0, 15.0, 7.0, 0.0, 7.0, 0.0, 15.0, 15.0, 15.0, 15.0)))
print(model_svm.predict(Vectors.dense(7.0, 0.0, 15.0, 15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0)))

print("supposed to be 0")
print(model_svm.predict(Vectors.dense(18.0, 15.0, 7.0, 7.0, 15.0, 0.0, 15.0, 15.0, 15.0, 15.0)))
print(model_svm.predict(Vectors.dense(11.0, 13.0, 7.0, 10.0, 7.0, 13.0, 7.0, 19.0, 7.0, 7.0)))
print(model_svm.predict(Vectors.dense(15.0, 15.0, 18.0, 7.0, 15.0, 15.0, 15.0, 18.0, 7.0, 15.0)))
print(model_svm.predict(Vectors.dense(15.0, 15.0, 8.0, 0.0, 0.0, 8.0, 15.0, 15.0, 15.0, 7.0)))

which returns:

supposed to be 1
1
1
1
1
supposed to be 0
1
1
1
1

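I tried to keep my positive and negative classes balanced. My test data contains 3521 records and my training data contains 8356 records. For evaluation, cross-validation applied to the scikit-learn model gives 98% accuracy, while for Spark the area under ROC is 0.5, the area under PR is 0.74, and the training error is 0.47.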

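I have also tried clearing the threshold and setting it back to 0.5, but this did not return any better results. Sometimes, when I change the train-test split, I get, for example, all zeros except for one correct prediction, or all ones except for one correct zero prediction. Does anyone know how to solve this?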

As I said, I have checked multiple times that my dataset is exactly the same in both cases.

Answer 1

Your call to clearThreshold is causing the classifier to return the raw prediction scores:

clearThreshold()
Note: Experimental
Clears the threshold so that predict will output raw prediction scores. It is used for binary classification only.
New in version 1.4.0.

If you only need the predicted class, remove this function call.
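To make the effect of the threshold concrete, here is a minimal pure-Python sketch of how a linear SVM could turn a margin into either a class label or a raw score. The class `LinearSVMSketch`, its method names, and the weights are hypothetical and only mimic the documented behavior quoted above; this is not the MLlib implementation. The default threshold of 0.0 is an assumption based on my understanding that MLlib's SVM model thresholds the margin at 0.0 (0.5 is the usual default for logistic regression, not for SVM), so please check it against your Spark version.

```python
class LinearSVMSketch:
    """Hypothetical stand-in for a trained linear SVM (not the MLlib class)."""

    def __init__(self, weights, intercept=0.0, threshold=0.0):
        self.weights = list(weights)
        self.intercept = intercept
        # Assumed default: the margin is compared against 0.0
        self.threshold = threshold

    def clear_threshold(self):
        # After this call, predict() returns the raw margin
        self.threshold = None

    def set_threshold(self, value):
        self.threshold = value

    def predict(self, features):
        # Raw score: dot(weights, features) + intercept
        margin = sum(w * x for w, x in zip(self.weights, features)) + self.intercept
        if self.threshold is None:
            return margin
        return 1 if margin > self.threshold else 0


model = LinearSVMSketch(weights=[0.5, -0.25], intercept=-1.0)
sample = [4.0, 2.0]

label = model.predict(sample)      # class label while a threshold is set
model.clear_threshold()
raw_score = model.predict(sample)  # raw margin once the threshold is cleared
model.set_threshold(0.0)
restored = model.predict(sample)   # class label again

print(label, raw_score, restored)
```

If the raw scores are fed into code that expects 0/1 labels (an accuracy or training-error computation, for example), the resulting numbers will not be meaningful, which is one reason to leave the threshold in place when only the class is needed.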

Answer 2

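You are using different classifiers, so you get different results. sklearn's SVC is an SVM with an RBF kernel; SVMWithSGD is an SVM with a linear kernel trained using SGD. They are completely different.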

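If you want to match the results, I think the way to go is to use sklearn.linear_model.SGDClassifier(loss='hinge') and try to match the other parameters (regularization, whether to fit an intercept, etc.), because the defaults are not the same.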
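As a rough sketch of that suggestion, the snippet below fits both the default SVC and an SGDClassifier with hinge loss on the same data. The data here is a synthetic stand-in, not the asker's dataset, and the variable names are illustrative. The values alpha=0.01 and fit_intercept=False are chosen to mirror what I believe are the PySpark defaults of SVMWithSGD.train (regParam=0.01, intercept=False); treat them as assumptions to verify, and note that the two libraries may scale the regularization term differently, so identical predictions are not guaranteed.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the asker's dataset): 10 features in [0, 19]
rng = np.random.RandomState(0)
X_train = rng.uniform(0, 19, size=(400, 10))
y_train = (X_train[:, 0] > X_train[:, 1]).astype(int)

# What the question used: SVC() defaults to a non-linear RBF kernel
rbf_svm = SVC().fit(X_train, y_train)

# Closer analogue of SVMWithSGD: a linear SVM (hinge loss) trained with SGD
linear_sgd = SGDClassifier(
    loss="hinge",
    penalty="l2",
    alpha=0.01,           # assumed to correspond to regParam=0.01
    fit_intercept=False,  # assumed to correspond to intercept=False
    max_iter=1000,
    tol=1e-3,
    random_state=0,
).fit(X_train, y_train)

sample = [[15.0, 15.0, 0.0, 15.0, 15.0, 4.0, 12.0, 8.0, 0.0, 7.0]]
print("RBF SVC:", rbf_svm.predict(sample))
print("linear SGD:", linear_sgd.predict(sample))
```

Comparing the SGDClassifier output, rather than the SVC output, against the Spark model is the fairer test of whether the two pipelines really see the same data.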
