Using Apache Spark ML, how do you transform (for predictions) a dataset that doesn't have a label?

Question

I'm sure there is a gap in my understanding of Spark ML's pipelines.

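I have a pipeline that trains against a dataset with the schema "label", "comment" (both strings). The pipeline transforms "label" into "indexedLabel" and vectorizes "comment" by tokenizing it and then applying HashingTF (producing "vectorizedComment"). It concludes with a LogisticRegression that uses "indexedLabel" as the label column and "vectorizedComment" as the features column.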

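And it works great! I can fit my pipeline and get a pipeline model that transforms datasets with "label" and "comment" columns all day long. However, my goal is to be able to pass in datasets that contain only "comment", since "label" is needed only for training the model.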

I'm confident that I have a gap in my understanding of how predictions with pipelines work. Could someone point it out for me?

Answer 1

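The label transformation can be done outside the pipeline (that is, before it). The label is needed only during training, not when the fitted pipeline/model is actually used. If the label transformation is performed inside the pipeline, every DataFrame passed to it is required to have a label column, which is undesirable.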

A small example:

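import org.apache.spark.ml.feature.StringIndexer

// Index the string label outside the pipeline; df is the labeled training DataFrame
val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")

val df2 = indexer.fit(df).transform(df)

// Create pipeline with other stages and use df2 to fit it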
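To make the idea above concrete, here is a minimal end-to-end sketch in spark-shell style. It assumes the stages described in the question (Tokenizer, HashingTF, LogisticRegression); the sample rows, the intermediate column name "words", and the local SparkSession settings are hypothetical and only there to make the example self-contained.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, StringIndexer, Tokenizer}
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .appName("unlabeled-prediction-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical training data with the question's schema: "label", "comment".
val training = Seq(
  ("positive", "great product works well"),
  ("negative", "terrible broke after a day"),
  ("positive", "really love it"),
  ("negative", "awful waste of money")
).toDF("label", "comment")

// Index the label BEFORE the pipeline, so the pipeline itself never reads "label".
val indexed = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(training)
  .transform(training)

// Feature stages and classifier; "words" is an assumed intermediate column name.
val tokenizer = new Tokenizer().setInputCol("comment").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("vectorizedComment")
val lr = new LogisticRegression()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("vectorizedComment")

// The label column is consumed only here, while fitting.
val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(indexed)

// Prediction time: a DataFrame with only "comment" is enough.
val unlabeled = Seq("works great", "broke immediately").toDF("comment")
val predictions = model.transform(unlabeled)
predictions.select("comment", "prediction").show(false)
```

The "prediction" column holds the label index as a double. If the original string labels are needed, the fitted StringIndexerModel would have to be kept around so its labels can be mapped back, for example with IndexToString.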

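Alternatively, you can have two separate pipelines: one that includes the label transformation and is used during training, and one that does not include it. Make sure the other stages reference the same objects in both pipelines.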

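import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")

// Create feature transformers and add to the pipelines

// Training pipeline: includes the label indexer
val pipelineTraining = new Pipeline().setStages(Array(indexer, ...))
// Usage pipeline: the same stages without the label indexer
val pipelineUsage = new Pipeline().setStages(Array(...))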
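One possible way to realize the two-pipeline idea with already-fitted stages is sketched below. This is a variation, not necessarily what the answer intended: it fits the training pipeline once, then builds the usage model from the fitted stages minus the label indexer. It relies on the assumption that fitting a Pipeline whose stages are all Transformers simply wraps them in a PipelineModel without retraining. The sample data and the column name "words" are hypothetical.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, StringIndexer, StringIndexerModel, Tokenizer}
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .appName("two-pipelines-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical labeled and unlabeled data.
val training = Seq(
  ("positive", "great product works well"),
  ("negative", "terrible broke after a day"),
  ("positive", "really love it"),
  ("negative", "awful waste of money")
).toDF("label", "comment")
val unlabeled = Seq("works great", "broke immediately").toDF("comment")

val indexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel")
val tokenizer = new Tokenizer().setInputCol("comment").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("vectorizedComment")
val lr = new LogisticRegression()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("vectorizedComment")

// Training pipeline: the label indexer is the first stage.
val pipelineTraining = new Pipeline().setStages(Array(indexer, tokenizer, hashingTF, lr))
val trainingModel = pipelineTraining.fit(training)

// Usage model: reuse the fitted stages, dropping the fitted label indexer (stage 0).
// All remaining stages are Transformers, so this fit() is expected not to retrain anything.
val usageModel = new Pipeline()
  .setStages(trainingModel.stages.drop(1))
  .fit(unlabeled)

usageModel.transform(unlabeled).select("comment", "prediction").show(false)
```

Dropping the stage by position keeps the sketch short; filtering out the StringIndexerModel by type would be more robust if the stage order might change.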


apache-spark

apache-spark-ml

apache-spark-mllib