Type Error in Gradient Boosted Trees from mllib
I am trying to run a gradient boosted trees algorithm on some data with mixed types:
[('feature1', 'bigint'),
('feature2', 'int'),
('label', 'double')]
I tried the following:
from pyspark.sql import functions as F
from pyspark.mllib.tree import GradientBoostedTrees, GradientBoostedTreesModel
from pyspark.ml.feature import VectorAssembler
from pyspark.mllib.linalg import Vector as MLLibVector, Vectors as MLLibVectors
from pyspark.mllib.regression import LabeledPoint
vectorAssembler = VectorAssembler(inputCols = ["feature1", "feature2"], outputCol = "features")
data_assembled = vectorAssembler.transform(data)
data_assembled = data_assembled.select(['features', 'label'])
data_assembled = data_assembled.select(F.col("features"), F.col("label"))\
    .rdd\
    .map(lambda row: LabeledPoint(MLLibVectors.fromML(row.label), MLLibVectors.fromML(row.features)))
(trainingData, testData) = data_assembled.randomSplit([0.9, 0.1])
model = GradientBoostedTrees.trainRegressor(trainingData,
                                            categoricalFeaturesInfo={}, numIterations=100)
However, I get the following error:
TypeError: Unsupported vector type <class 'float'>
However, none of my column types is actually float. Also, in case it is relevant, feature2 is binary.
I ended up avoiding the mllib implementation and used Spark ML instead:
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import GBTRegressor
vectorAssembler = VectorAssembler(inputCols = ["feature1", "feature2"], outputCol = "features")
data_assembled = vectorAssembler.transform(data)
data_assembled = data_assembled.select(F.col("label"), F.col("features"))
(trainingData, testData) = data_assembled.randomSplit([0.7, 0.3])
gbt_model = GBTRegressor(featuresCol="features", maxIter=10).fit(trainingData)
Python does not have the double type that the LabeledPoint object requires, so I assume the mapping from pyspark caused the conversion to float.