Spark: How to transform LabeledPoint features values from int to 0/1?
I want to run Naive Bayes in Spark, but to do that I have to transform the features values in my LabeledPoint to 0/1. My LabeledPoint looks like this:
scala> transformedData.collect()
res29: Array[org.apache.spark.mllib.regression.LabeledPoint] = Array((0.0,(400036,[7744],[2.0])), (0.0,(400036,[7744,8608],[3.0,3.0])), (0.0,(400036,[7744],[2.0])), (0.0,(400036,[133,218,2162,7460,7744,9567],[1.0,1.0,2.0,1.0,42.0,21.0])), (0.0,(400036,[133,218,1589,2162,2784,2922,3274,6914,7008,7131,7460,8608,9437,9567,199999,200021,200035,200048,200051,200056,200058,200064,200070,200072,200075,200087,400008,400011],[4.0,1.0,6.0,53.0,6.0,1.0,1.0,2.0,11.0,17.0,48.0,3.0,4.0,113.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,28.0,1.0,1.0,1.0,1.0,1.0,4.0])), (0.0,(400036,[1589,3585,4830,6935,6936,7744,400008,400011],[2.0,6.0,3.0,52.0,4.0,3.0,1.0,2.0])), (0.0,(400036,[1589,2162,2784,2922,4123,7008,7131,7792,8608],[23.0,70.0,1.0,2.0,2.0,1.0,1.0,2.0,2.0])), (0.0,(400036,[4830,6935,6936,400008,400011],[1.0,36.0...
How can I transform these features values into 1 (it's a sparse representation, so there won't be any 0s)?

I guess you are looking for something like this:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
val transformedData = sc.parallelize(Seq(
LabeledPoint(1.0, Vectors.sparse(5, Array(1, 3), Array(9.0, 3.2))),
LabeledPoint(5.0, Vectors.sparse(5, Array(0, 2, 4), Array(1.0, 2.0, 3.0)))
))
def binarizeFeatures(rdd: RDD[LabeledPoint]) = rdd.map {
  case LabeledPoint(label, features) =>
    val v = features.toSparse
    // Keep the sparse structure but replace every stored value with 1.0
    LabeledPoint(label,
      Vectors.sparse(v.size, v.indices, Array.fill(v.numNonzeros)(1.0)))
}
binarizeFeatures(transformedData).collect
// Array[org.apache.spark.mllib.regression.LabeledPoint] = Array(
// (1.0,(5,[1,3],[1.0,1.0])),
// (1.0,(5,[0,2,4],[1.0,1.0,1.0])))
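Since the goal is to run Naive Bayes, note that once the features are 0/1 you can use the Bernoulli variant directly. A minimal sketch, assuming Spark 1.4+ MLlib where `NaiveBayes.train` accepts a `modelType` argument:

```scala
import org.apache.spark.mllib.classification.NaiveBayes

// "bernoulli" expects binary features, which binarizeFeatures guarantees
val model = NaiveBayes.train(
  binarizeFeatures(transformedData), lambda = 1.0, modelType = "bernoulli")
```

With the default `modelType = "multinomial"` the raw counts would have been acceptable as-is, so binarizing is only required for the Bernoulli model.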