ClassCastException: java.lang.Double cannot be cast to org.apache.spark.mllib.linalg.Vector when using LabeledPoint
I am trying to use SVMWithSGD to train my model, but I get a ClassCastException as soon as the training data is accessed.
My train_data DataFrame has the following schema:
train_data.printSchema
root
|-- label: string (nullable = true)
|-- features: vector (nullable = true)
|-- label_index: double (nullable = false)
I created a LabeledPoint RDD to use with SVMWithSGD:
val targetInd = train_data.columns.indexOf("label_index")
val featInd = Array("features").map(train_data.columns.indexOf(_))
val train_lp = train_data.rdd.map(r => LabeledPoint(r.getDouble(targetInd),
  Vectors.dense(featInd.map(r.getDouble(_)).toArray)))
But when I call
SVMWithSGD.train(train_lp, numIterations)
it gives me:
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage.apply(DAGScheduler.scala:1877)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage.apply(DAGScheduler.scala:1876)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed.apply(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
at org.apache.spark.util.EventLoop$$anon.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.rdd.RDD$$anonfun$take.apply(RDD.scala:1364)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.take(RDD.scala:1337)
at org.apache.spark.rdd.RDD$$anonfun$first.apply(RDD.scala:1378)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.first(RDD.scala:1377)
at org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.generateInitialWeights(GeneralizedLinearAlgorithm.scala:204)
at org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:234)
at org.apache.spark.mllib.classification.SVMWithSGD$.train(SVM.scala:217)
at org.apache.spark.mllib.classification.SVMWithSGD$.train(SVM.scala:255)
... 55 elided
Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to org.apache.spark.mllib.linalg.Vector
My train_data was built from labels (the file_name) and features (a JSON file describing image features).
Try this - your map reads the columns with the wrong types: the features column already holds a Vector, so rebuilding it from getDouble calls fails at runtime. Read each column with the type it actually stores.
Schema
train_data.printSchema
root
|-- label: string (nullable = true)
|-- features: vector (nullable = true)
|-- label_index: double (nullable = false)
Modify your code to -
val train_lp = train_data.rdd.map(r => LabeledPoint(r.getAs("label_index"), r.getAs("features")))
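For completeness, here is a minimal end-to-end sketch, assuming the features column stores org.apache.spark.mllib.linalg.Vector values and using a hypothetical numIterations value:

import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint

// Read each Row with the types the columns actually hold:
// label_index is a double, features is already a Vector.
val train_lp = train_data.rdd.map { r =>
  LabeledPoint(r.getAs[Double]("label_index"), r.getAs[Vector]("features"))
}

val numIterations = 100 // hypothetical setting, tune for your data
val model = SVMWithSGD.train(train_lp, numIterations)

One caveat: if train_data was assembled with spark.ml transformers (for example VectorAssembler), the column holds org.apache.spark.ml.linalg.Vector values, and casting them to the mllib Vector type will fail in a similar way; in that case convert each value with org.apache.spark.mllib.linalg.Vectors.fromML before building the LabeledPoint.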