Spark ML - Retrieving parameters from the bestModel in CrossValidator
I am training a random forest model in Spark 2.3 using StringIndexer, OneHotEncoderEstimator and RandomForestRegressor, like this:
//Indexer
val stringIndexers = categoricalColumns.map { colName =>
  new StringIndexer()
    .setInputCol(colName)
    .setOutputCol(colName + "Idx")
    .setHandleInvalid("keep")
    .fit(training)
}

//HotEncoder
val encoders = featuresEnconding.map { colName =>
  new OneHotEncoderEstimator()
    .setInputCols(Array(colName + "Idx"))
    .setOutputCols(Array(colName + "Enc"))
    .setHandleInvalid("keep")
}

//Adding features into a feature vector column
val assembler = new VectorAssembler()
  .setInputCols(featureColumns)
  .setOutputCol("features")

val rf = new RandomForestRegressor()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxBins(1000)

val stepsRF = stringIndexers ++ encoders ++ Array(assembler, rf)
val pipelineRF = new Pipeline().setStages(stepsRF)

val paramGridRF = new ParamGridBuilder()
  .addGrid(rf.minInstancesPerNode, Array(1, 5, 15))
  .addGrid(rf.maxDepth, Array(10, 11, 12))
  .addGrid(rf.numTrees, Array(20, 50, 100))
  .build()

//Defining the evaluator
val evaluatorRF = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")

//Using cross validation to train the model
val cvRF = new CrossValidator()
  .setEstimator(pipelineRF)
  .setEvaluator(evaluatorRF)
  .setEstimatorParamMaps(paramGridRF)
  .setNumFolds(10)
  .setParallelism(3)

//Fitting the model with our training dataset
val cvRFModel = cvRF.fit(training)
I am not sure what the best combination of parameters is for this model, so I added the following parameter grid:
.addGrid(rf.minInstancesPerNode, Array(1, 5, 15))
.addGrid(rf.maxDepth, Array(10, 11, 12))
.addGrid(rf.numTrees, Array(20, 50, 100))
Then I let CrossValidator compute the best combination. What I want now is to find out which combination it chose, so I can keep tuning the model from there. So I tried to retrieve those parameters like this:
cvRFModel.bestModel.extractParamMap
But I get an empty map:
org.apache.spark.ml.param.ParamMap =
{
}
What am I missing?
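For what it's worth, the CrossValidatorModel itself records what it evaluated: `getEstimatorParamMaps` and `avgMetrics` are parallel sequences, so zipping them and taking the entry with the lowest metric (RegressionEvaluator's default metric is RMSE, where lower is better) recovers the winning combination. A minimal sketch of that selection logic, using plain-Scala stand-ins and invented metric values rather than a live Spark session:

```scala
// Stand-ins for cvRFModel.getEstimatorParamMaps and cvRFModel.avgMetrics,
// which Spark keeps in the same order. The RMSE values here are made up.
val paramMaps  = Seq("maxDepth=10", "maxDepth=11", "maxDepth=12")
val avgMetrics = Seq(2.31, 2.10, 2.44)

// RegressionEvaluator defaults to RMSE, so lower is better: use minBy.
// On the real model this would be:
//   cvRFModel.getEstimatorParamMaps.zip(cvRFModel.avgMetrics).minBy(_._2)._1
val best = paramMaps.zip(avgMetrics).minBy(_._2)._1
println(best)
```

Note that if the evaluator were configured with a metric where higher is better (e.g. r2), `maxBy(_._2)` would be needed instead.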
Building on this, I tried the following, but I am not sure whether it is the right way to do it:
val avgMetricsParamGrid = cvRFModel.avgMetrics
val combined = paramGridRF.zip(avgMetricsParamGrid)
val bestModel = cvRFModel.bestModel.asInstanceOf[PipelineModel]
val parms = bestModel.stages.last.asInstanceOf[RandomForestRegressionModel].explainParams
It gives me parameter information like this:
labelCol: label column name (default: label, current: label)
maxBins: Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature. (default: 32, current: 1000)
maxDepth: Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. (default: 5, current: 12)
maxMemoryInMB: Maximum memory in MB allocated to histogram aggregation. (default: 256)
minInfoGain: Minimum information gain for a split to be considered at a tree node. (default: 0.0)
minInstancesPerNode: Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1. (default: 1, current: 1)
numTrees: Number of trees to train (>= 1) (default: 20, current: 20)
predictionCol: prediction column name (default: prediction)
seed: random seed (default: 235498149)
subsamplingRate: Fraction of the training data used for learning each decision tree, in range (0, 1]. (default: 1.0)
I am still not sure which stage I need to select. I decided to pick the last one because the training process is iterative, but I am not 100% sure that is the right answer. Any feedback would be appreciated.
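Picking the last stage does look correct here: a fitted PipelineModel preserves its stages in the order they were assembled, and `stepsRF` appended `rf` last, so `stages.last` is the fitted RandomForestRegressionModel. A small plain-Scala sketch of that ordering argument (string stand-ins instead of real Transformers), with the corresponding real-model calls in comments:

```scala
// Stand-ins for the real pipeline stages; only the ordering matters here.
// stepsRF = stringIndexers ++ encoders ++ Array(assembler, rf), so the
// regressor is always the final element.
val stringIndexers = Seq("idxA", "idxB")
val encoders       = Seq("encA", "encB")
val stepsRF        = stringIndexers ++ encoders ++ Seq("assembler", "rf")

// A fitted PipelineModel keeps this order, so on the real model:
//   val rfModel = cvRFModel.bestModel.asInstanceOf[PipelineModel]
//     .stages.last.asInstanceOf[RandomForestRegressionModel]
// Typed getters such as rfModel.getMaxDepth, rfModel.getNumTrees and
// rfModel.getMinInstancesPerNode then avoid string-parsing explainParams.
val lastStage = stepsRF.last
println(lastStage)
```

Relying on `stages.last` is fine as long as the estimator stays the final stage of the pipeline; indexing by position would break if a post-processing stage were ever appended after it.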