PySpark, Decision Trees (Spark 2.0.0)

Question

I am new to Spark (using PySpark). I tried running the decision tree tutorial from here (link). I execute the code:

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.util import MLUtils

# Load and parse the data file, converting it to a DataFrame.
data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)

# Now this line fails
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

I get the error message:

IllegalArgumentException: u'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.'

When searching online for this error, I found an answer:

use
from pyspark.ml.linalg import Vectors, VectorUDT
instead of
from pyspark.mllib.linalg import Vectors, VectorUDT

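This is strange, since I am not using that import anywhere. Moreover, adding this import to my code does not solve anything; I still get the same error.

I am not sure how to debug this. Looking at the raw data, I see: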

data.show()
+--------------------+-----+
|            features|label|
+--------------------+-----+
|(692,[127,128,129...|  0.0|
|(692,[158,159,160...|  1.0|
|(692,[124,125,126...|  1.0|
|(692,[152,153,154...|  1.0|

This looks like a list that starts with "(".

I am not sure how to solve this, or even how to debug it.

Answer 1

The root of the problem seems to be that you are running a Spark 1.5.2 example on Spark 2.0.0 (see the reference to the Spark 2.0 example below).

Difference between spark.ml and spark.mllib

As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary machine learning API for Spark is now the DataFrame-based API in the spark.ml package.

More details can be found here: http://spark.apache.org/docs/latest/ml-guide.html
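Since the question asks how to debug this, here is a minimal sketch for checking which package a DataFrame column's type comes from. The helper name vector_column_module is hypothetical and not part of Spark; it only assumes that the DataFrame exposes schema.fields and that each field has name and dataType attributes.

```python
def vector_column_module(df, column):
    """Return the Python module that defines the data type of `column`.

    Hypothetical helper (not a Spark API). It walks the DataFrame schema
    and reports where the column's type class is defined, so the two
    VectorUDT classes with the same name can be told apart.
    """
    for field in df.schema.fields:
        if field.name == column:
            return type(field.dataType).__module__
    raise KeyError(column)
```

For the DataFrame in the question, vector_column_module(data, "features") should report pyspark.mllib.linalg, whereas VectorIndexer from pyspark.ml expects a column whose type comes from pyspark.ml.linalg. That matches the two class names in the error message.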

With Spark 2.0, please try the Spark 2.0.0 example (https://spark.apache.org/docs/2.0.0/mllib-decision-tree.html)

from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils

# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a DecisionTree model.
#  Empty categoricalFeaturesInfo indicates all features are continuous.
model = DecisionTree.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     impurity='gini', maxDepth=5, maxBins=32)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
# Index into the (label, prediction) tuple; tuple-unpacking lambdas only work on Python 2.
testErr = labelsAndPredictions.filter(lambda vp: vp[0] != vp[1]).count() / float(testData.count())
print('Test Error = ' + str(testErr))
print('Learned classification tree model:')
print(model.toDebugString())

# Save and load model
model.save(sc, "target/tmp/myDecisionTreeClassificationModel")
sameModel = DecisionTreeModel.load(sc, "target/tmp/myDecisionTreeClassificationModel")

Find the full example code at "examples/src/main/python/mllib/decision_tree_classification_example.py" in the Spark repo.
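The example above stays entirely within the RDD-based API. If you would rather keep the spark.ml pipeline from the question, one option is to load the file through the DataFrame reader instead of MLUtils.loadLibSVMFile. The sketch below assumes spark is a SparkSession (it is predefined in the Spark 2.x shell); the function name load_libsvm_for_ml is hypothetical.

```python
def load_libsvm_for_ml(spark, path):
    """Load a LIBSVM file as a DataFrame for the spark.ml API.

    Assumes `spark` is a SparkSession. The "libsvm" data source returns
    a DataFrame with "label" and "features" columns, where "features"
    holds pyspark.ml.linalg vectors rather than pyspark.mllib.linalg ones.
    """
    return spark.read.format("libsvm").load(path)
```

With data = load_libsvm_for_ml(spark, "data/mllib/sample_libsvm_data.txt"), the StringIndexer and VectorIndexer calls from the question should fit without the type error. If you already have a DataFrame with pyspark.mllib.linalg vectors, MLUtils.convertVectorColumnsToML in pyspark.mllib.util (added in Spark 2.0.0) is another way to convert the column.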
