Spark MLlib Association Rules confidence is greater than 1.0
I am using Spark 2.0.2 to extract some association rules from my data, but when I got the result I found some strange rules, for example:
[ [MUJI, ROEM, 西单科技广场] => Bauhaus ], 2.0
The "2.0" printed with the rule is its confidence. Isn't the confidence the probability of the consequent given the antecedent, and therefore supposed to be no greater than 1.0?
KEY WORD: transactions != freqItemsets
SOLUTION: use spark.mllib.FPGrowth instead; it accepts an RDD of transactions and computes the freqItemsets automatically.
Hi, I figured it out. The reason for this phenomenon is that the FreqItemset data I passed in, freqItemsets, was wrong. Let's go into the details. I simply used three raw transactions, ("a"), ("a","b","c"), ("a","b","d"), each of which occurs exactly once.
At the beginning I thought Spark would calculate the frequencies of the sub-itemsets automatically, and that the only thing I needed to do was create the freqItemsets like this (as the official example shows us):
import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset

val freqItemsets = sc.parallelize(Seq(
  new FreqItemset(Array("a"), 1L),
  new FreqItemset(Array("a", "b", "d"), 1L),
  new FreqItemset(Array("a", "b", "c"), 1L)
))
Here is where things went wrong: the parameter of AssociationRules is an RDD of FreqItemsets, not of transactions, so I had mixed up these two definitions. AssociationRules does not count anything itself; it simply divides the frequency of a whole itemset by the frequency of its antecedent to get a confidence, so if an antecedent itemset is absent no rule is produced at all, and if its count is smaller than the count of its superset the ratio goes above 1.0.
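To make that concrete, here is a minimal sketch that reproduces a confidence of 2.0 (the item names and counts are made up just to mirror the rule above, and I assume a SparkContext sc as in the spark-shell):

import org.apache.spark.mllib.fpm.AssociationRules
import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset

// hand-made, inconsistent counts: the 4-item superset is "more frequent" than its 3-item subset,
// which real transaction data can never produce, but nothing stops us from feeding it in
val inconsistentItemsets = sc.parallelize(Seq(
  new FreqItemset(Array("MUJI", "ROEM", "西单科技广场"), 1L),
  new FreqItemset(Array("MUJI", "ROEM", "西单科技广场", "Bauhaus"), 2L)
))

new AssociationRules().setMinConfidence(0.8).run(inconsistentItemsets)
  .collect()
  .foreach(rule => println(
    s"[${rule.antecedent.mkString(",")}] => [${rule.consequent.mkString(",")}], ${rule.confidence}"))
// prints: [MUJI,ROEM,西单科技广场] => [Bauhaus], 2.0   (confidence = freq(superset) / freq(antecedent) = 2 / 1)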
Based on the three transactions, the freqItemsets should be:
new FreqItemset(Array("a"), 3),//because "a" appears three times in three transactions
new FreqItemset(Array("b"), 2),//"b" appears two times
new FreqItemset(Array("c"), 1),
new FreqItemset(Array("d"), 1),
new FreqItemset(Array("a","b"), 2),// "a" and "b" totally appears two times
new FreqItemset(Array("a","c"), 1),
new FreqItemset(Array("a","d"), 1),
new FreqItemset(Array("b","d"), 1),
new FreqItemset(Array("b","c"), 1)
new FreqItemset(Array("a","b","d"), 1),
new FreqItemset(Array("a", "b","c"), 1)
You can do this counting yourself with the following code:
import org.apache.spark.mllib.fpm.AssociationRules
import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset
import play.api.libs.json.Json // assuming Play JSON is on the classpath; it is only used to build a string key

val transactions = sc.parallelize(
  Seq(
    Array("a"),
    Array("a", "b", "c"),
    Array("a", "b", "d")
  ))

val freqItemsets = transactions
  .map(arr => {
    // generate every non-empty sub-itemset (combination) of this transaction
    (for (i <- 1 to arr.length) yield {
      arr.combinations(i).toArray
    })
      .toArray
      .flatten
  })
  .flatMap(l => l)
  // use the sorted, JSON-serialized itemset as the key and count its occurrences
  .map(a => (Json.toJson(a.sorted).toString(), 1))
  .reduceByKey(_ + _)
  .map(m => new FreqItemset(Json.parse(m._1).as[Array[String]], m._2.toLong))
// then use freqItemsets just like in the official example
val ar = new AssociationRules()
.setMinConfidence(0.8)
val results = ar.run(freqItemsets)
//....
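To inspect the output, something like the following works (just a sketch of how I print the rules; format them however you like):

results.collect().foreach { rule =>
  println(s"[${rule.antecedent.mkString(",")}] => [${rule.consequent.mkString(",")}], ${rule.confidence}")
}
// with consistent counts every confidence stays at or below 1.0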
Alternatively, we can simply use FPGrowth instead of AssociationRules; it accepts an RDD of transactions and builds the frequent itemsets for us.
import org.apache.spark.mllib.fpm.FPGrowth

val fpg = new FPGrowth()
  .setMinSupport(0.2)
  .setNumPartitions(10)
val model = fpg.run(transactions) // transactions is defined in the previous code
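The model can then produce the association rules directly via generateAssociationRules; a short sketch (0.8 mirrors the setMinConfidence value used earlier):

model.generateAssociationRules(0.8).collect().foreach { rule =>
  println(s"[${rule.antecedent.mkString(",")}] => [${rule.consequent.mkString(",")}], ${rule.confidence}")
}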
That's all.