FP growth model in Spark
I am trying to run the FP-growth algorithm in Spark 2.2 MLlib with the following code:
val fpgrowth = new FPGrowth()
.setItemsCol("items")
.setMinSupport(0.5)
.setMinConfidence(0.6)
val model = fpgrowth.fit(dataset1)
where the dataset is extracted with the following SQL query:
select items from MLtable
The output of the items column in this table looks like this:
"NFL Cricket MLB Unknown1 Unknown2 Unknown Unknown Unknown",
"Unknown Unknown Unknown Unknown Unknown CCC DDD RRR",
"Unknown Unknown Unknown Unknown CFB Unknown Unknown Unknown",
"Unknown Cricket Unknown Unknown Unknown Unknown Unknown Unknown",
"NFL Unknown MLB NBA CFB Unknown Unknown Unknown"
Whenever I try to fit my ML model, I get the following error:
Items in a transaction must be unique but got WrappedArray
I have tried multiple times but keep hitting this error. Any help here would be greatly appreciated.
As the error message tells you, the items in a transaction must be unique, so deduplicate each transaction before fitting:
import org.apache.spark.ml.fpm.FPGrowth
import org.apache.spark.sql.functions.{split, udf}
val df = Seq(
"NFL Cricket MLB Unknown1 Unknown2 Unknown Unknown Unknown",
"Unknown Unknown Unknown Unknown Unknown CCC DDD RRR",
"Unknown Unknown Unknown Unknown CFB Unknown Unknown Unknown",
"Unknown Cricket Unknown Unknown Unknown Unknown Unknown Unknown",
"NFL Unknown MLB NBA CFB Unknown Unknown Unknown"
).toDF("items")
// UDF to drop duplicate items within each transaction
val distinct = udf((xs: Seq[String]) => xs.distinct)
val items = df
.withColumn("items", split($"items", "\\s+"))
// Keep only distinct values
.withColumn("items", distinct($"items"))
new FPGrowth().fit(items).freqItemsets.show
// +-------------------+----+
// | items|freq|
// +-------------------+----+
// | [MLB]| 2|
// | [MLB, NFL]| 2|
// |[MLB, NFL, Unknown]| 2|
// | [MLB, Unknown]| 2|
// | [Unknown]| 5|
// | [NFL]| 2|
// | [NFL, Unknown]| 2|
// | [Cricket]| 2|
// | [Cricket, Unknown]| 2|
// | [CFB]| 2|
// | [CFB, Unknown]| 2|
// +-------------------+----+
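Beyond frequent itemsets, the fitted `FPGrowthModel` also exposes association rules and can score transactions via `transform`. A short sketch continuing from the deduplicated `items` DataFrame above (the thresholds here are illustrative, not required values):

```scala
import org.apache.spark.ml.fpm.FPGrowth

// Fit with explicit thresholds (values chosen for illustration)
val model = new FPGrowth()
  .setItemsCol("items")
  .setMinSupport(0.3)
  .setMinConfidence(0.6)
  .fit(items)

// Rules of the form antecedent => consequent, with confidence
model.associationRules.show()

// transform() adds a "prediction" column containing the consequents of
// all rules whose antecedents are subsets of the input transaction
model.transform(items).show(truncate = false)
```

Note that with the default `minSupport` of 0.3 on this five-row dataset, only items appearing in at least two transactions survive, which is why every itemset in the output above has `freq >= 2`.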