Scala Spark model transform returns all zeroes
Hi everyone. I'm working on a simple machine learning task using apache-spark ml (not mllib) with Scala. My build.sbt is as follows:
name := "spark"
version := "1.0"
scalaVersion := "2.11.11"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.1"
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.1.1"
libraryDependencies += "com.crealytics" %% "spark-excel" % "0.8.2"
libraryDependencies += "com.databricks" %% "spark-csv" % "1.0.1"
All stages work fine, but there is a problem with the dataset that should contain the predictions. In my case I'm classifying three classes with labels 1.0, 2.0 and 3.0, yet the prediction column contains only the label 0.0, even though no such label exists in the data at all.
Here is the original dataframe:
+--------------------+--------+
| tfIdf|estimate|
+--------------------+--------+
|(3000,[0,1,8,14,1...| 3.0|
|(3000,[0,1707,223...| 3.0|
|(3000,[1,24,33,64...| 3.0|
|(3000,[1,40,114,5...| 2.0|
|(3000,[1,363,743,...| 2.0|
|(3000,[2,20,65,88...| 3.0|
|(3000,[3,15,21,23...| 3.0|
|(3000,[3,45,53,14...| 3.0|
|(3000,[3,387,433,...| 1.0|
|(3000,[3,523,629,...| 3.0|
+--------------------+--------+
After classification, my predictions are:
+--------------------+--------+----------+
| tfIdf|estimate|prediction|
+--------------------+--------+----------+
|(3000,[0,1,8,14,1...| 3.0| 0.0|
|(3000,[0,1707,223...| 3.0| 0.0|
|(3000,[1,24,33,64...| 3.0| 0.0|
|(3000,[1,40,114,5...| 2.0| 0.0|
|(3000,[1,363,743,...| 2.0| 0.0|
|(3000,[2,20,65,88...| 3.0| 0.0|
|(3000,[3,15,21,23...| 3.0| 0.0|
|(3000,[3,45,53,14...| 3.0| 0.0|
|(3000,[3,387,433,...| 1.0| 0.0|
|(3000,[3,523,629,...| 3.0| 0.0|
+--------------------+--------+----------+
My code is as follows:
val toDouble = udf[Double, String](_.toDouble)
val kribrumData = krData.withColumn("estimate", toDouble(krData("estimate")))
.select($"text",$"estimate")
kribrumData.cache()
val tokenizer = new Tokenizer()
.setInputCol("text")
.setOutputCol("tokens")
val stopWordsRemover = new StopWordsRemover()
.setInputCol("tokens")
.setOutputCol("filtered")
.setStopWords(STOP_WORDS)
val hashingTF = new HashingTF()
.setInputCol("filtered")
.setNumFeatures(3000)
.setOutputCol("tf")
val idf = new IDF()
.setInputCol("tf")
.setOutputCol("tfIdf")
val preprocessor = new Pipeline()
.setStages(Array(tokenizer,stopWordsRemover,hashingTF,idf))
val preprocessor_model = preprocessor.fit(kribrumData)
val preprocessedKribrumData = preprocessor_model.transform(kribrumData)
.select("tfIdf", "estimate")
var Array(train, test) = preprocessedKribrumData.randomSplit(Array(0.8, 0.2), seed = 7)
test.show(10)
val logisticRegressor = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)
.setLabelCol("estimate")
.setFeaturesCol("tfIdf")
val classifier = new OneVsRest()
.setLabelCol("estimate")
.setFeaturesCol("tfIdf")
.setClassifier(logisticRegressor)
val model = classifier.fit(train)
val predictions = model.transform(test)
predictions.show(10)
val evaluator = new MulticlassClassificationEvaluator()
.setMetricName("accuracy").setLabelCol("estimate")
val accuracy = evaluator.evaluate(predictions)
println("Classification accuracy" + accuracy.toString)
This code ends up with a prediction accuracy of zero (because there is no label "0.0" in the target column "estimate"). So what exactly am I doing wrong? Any ideas would be appreciated.
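For reference, a quick way to inspect which label values actually appear in the training split; a minimal sketch against the train dataframe from the code above:

// Show which raw label values are present in the training data
// and how many rows carry each one.
train.groupBy("estimate").count().orderBy("estimate").show()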
Finally found the problem. Spark does not throw an error or exception when the label field is a double but the labels are not within the range the classifier considers valid. A StringIndexer is needed to get around this, so we just have to add the following to the pipeline:
val labelIndexer = new StringIndexer()
.setInputCol("estimate")
.setOutputCol("indexedLabel")
This step solves the problem, but it is inconvenient.
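For completeness, here is one way the training step could look with the indexer wired in; a minimal sketch, assuming the preprocessedKribrumData, train, and test dataframes from the question, and using IndexToString to map the 0-based predictions back to the original 1.0/2.0/3.0 labels:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

// Index the raw labels (1.0, 2.0, 3.0) into the 0-based range Spark ML expects.
val labelIndexer = new StringIndexer()
  .setInputCol("estimate")
  .setOutputCol("indexedLabel")
  .fit(preprocessedKribrumData)

val logisticRegressor = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

val classifier = new OneVsRest()
  .setLabelCol("indexedLabel")      // train on the indexed labels
  .setFeaturesCol("tfIdf")
  .setClassifier(logisticRegressor)

// Map the 0-based predictions back to the original label values.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedEstimate")
  .setLabels(labelIndexer.labels)

val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, classifier, labelConverter))

val model = pipeline.fit(train)
val predictions = model.transform(test)

// Evaluate against the indexed labels, since that is what the classifier predicts.
val evaluator = new MulticlassClassificationEvaluator()
  .setMetricName("accuracy")
  .setLabelCol("indexedLabel")
val accuracy = evaluator.evaluate(predictions)
println("Classification accuracy: " + accuracy)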