Create LabeledPoint from rdd data which has both strings and numbers - PySpark

Question

My data contains rows like this:

0,tcp,http,SF,181,5450,0,0,0.5,normal.

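I want to train a decision tree on this data. I could not create LabeledPoints, so I tried HashingTF for the strings, but I could not get it to work. "normal" is my target label. How can I create an RDD of LabeledPoint to use in PySpark? Also, a LabeledPoint label has to be a double: should I just assign some double values to the labels, or should I hash them?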

Answer 1

I figured out the solution.

First, Spark's decision tree classifier already has a parameter for this: categoricalFeaturesInfo. From the PySpark API documentation:

categoricalFeaturesInfo - Map from categorical feature index to number of categories. Any feature not in this map is treated as continuous.

Before doing that, however, we should simply replace the strings with numbers so that PySpark can understand them.
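As a rough illustration of that replacement step, here is a minimal sketch in plain Python. The two extra sample rows, the column constants, and the helper names build_index and encode are assumptions made for illustration; only the first row comes from the question.

```python
# First row is from the question; the other two are made-up rows in the same layout.
rows = [
    "0,tcp,http,SF,181,5450,0,0,0.5,normal.",
    "0,udp,domain_u,SF,105,146,0,0,0.0,normal.",
    "0,tcp,private,S0,0,0,0,0,1.0,neptune.",
]

CATEGORICAL_COLUMNS = [1, 2, 3]  # string-valued feature columns (assumed layout)
LABEL_COLUMN = 9                 # the last column holds the target label


def build_index(values):
    """Map each distinct string to an integer 0..k-1 (sorted so the result is reproducible)."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


parsed = [line.split(",") for line in rows]
feature_maps = {c: build_index(r[c] for r in parsed) for c in CATEGORICAL_COLUMNS}
label_map = build_index(r[LABEL_COLUMN] for r in parsed)


def encode(fields):
    """Return (label, features) where every value is a float."""
    features = [
        float(feature_maps[i][v]) if i in feature_maps else float(v)
        for i, v in enumerate(fields[:LABEL_COLUMN])
    ]
    return float(label_map[fields[LABEL_COLUMN]]), features


encoded = [encode(r) for r in parsed]
```

In PySpark, each (label, features) pair could then be wrapped as LabeledPoint(label, features) from pyspark.mllib.regression inside an rdd.map call. On a real RDD the distinct values would have to be gathered first, for example with distinct().collect(), which this sketch does not show. Plain index values like these also answer the label question: the label only needs to be a double in the range 0 to numClasses - 1, so no hashing is required.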

Then, for the sample data above, we define categoricalFeaturesInfo like this:

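categoricalFeaturesInfo = {1: len(feature1), 2: len(feature2), 3: len(feature3)}

Put simply, each key is the index of a categorical feature and each value is the number of categories in that feature. The label (column 9 in the raw row) is not a feature, so it does not belong in this map; the number of distinct labels, len(labels), is passed as numClasses instead.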

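Note that converting the strings to numbers is enough for the training algorithm to run, but training is faster if you declare the categorical features this way.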
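To make the declaration concrete, here is a minimal sketch of how the map and the class count might be derived and passed to the trainer. The contents of feature_maps and label_map are hypothetical, train_tree is an illustrative helper name, and the sketch assumes the label is kept out of the feature vector.

```python
# Hypothetical string-to-index maps, one per categorical feature column.
feature_maps = {
    1: {"icmp": 0, "tcp": 1, "udp": 2},
    2: {"domain_u": 0, "http": 1, "private": 2, "smtp": 3},
    3: {"S0": 0, "SF": 1},
}
label_map = {"neptune.": 0, "normal.": 1}

# Feature index -> number of categories; the label is deliberately not in this map.
categorical_features_info = {col: len(m) for col, m in feature_maps.items()}
num_classes = len(label_map)


def train_tree(labeled_points):
    """Train on an RDD of LabeledPoint; requires pyspark and an active SparkContext."""
    # Imported lazily so the dictionary logic above can run without Spark installed.
    from pyspark.mllib.tree import DecisionTree

    return DecisionTree.trainClassifier(
        labeled_points,
        numClasses=num_classes,
        categoricalFeaturesInfo=categorical_features_info,
        # maxBins has to be at least as large as the biggest category count.
        maxBins=max(32, max(categorical_features_info.values())),
    )
```

Calling train_tree(labeled_rdd) would return the trained model, where labeled_rdd stands for an RDD of LabeledPoint built from the encoded rows. The maxBins guard matters for features with many distinct values, such as a service column with dozens of categories.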

python

apache-spark

rdd

pyspark