R item combinations in groups of 3
I found a solution to a related problem, but what about triplets?
Suppose I have:
consumer=c(1,1,1,1,1,2,2,2,2,3,3,4,4,4,4,5)
items=c("apple","banana","carrot","date","eggplant","apple","banana",
"fig","grape","apple","banana","apple","carrot","date",
"eggplant","apple")
shoppinglists <- data.frame(consumer,items)
table(shoppinglists)
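Is there a simple way to find the most frequent triplet combinations? For example, the triplets "carrot"+"date"+"eggplant", "apple"+"carrot"+"date", "apple"+"carrot"+"eggplant" and "apple"+"date"+"eggplant" each appear in two lists (consumers 1 and 4).
Many more triplets tie for second place with one occurrence each, including A+B+C, A+B+D, A+B+E, B+C+D, B+C+E (consumer 1) and A+B+F, A+B+G (consumer 2).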
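As a quick sanity check on how many triplets to expect, here is a minimal base-R sketch. It relies on the fact that a list of n distinct items contributes choose(n, 3) triplets; the names n_items and n_triplets are illustrative and not from the question.

```r
consumer <- c(1,1,1,1,1,2,2,2,2,3,3,4,4,4,4,5)

# Number of items on each consumer's list
n_items <- table(consumer)

# A list of n distinct items yields choose(n, 3) triplets (0 when n < 3)
n_triplets <- choose(as.integer(n_items), 3)
names(n_triplets) <- names(n_items)

n_triplets       # triplets contributed by each consumer
sum(n_triplets)  # total triplet occurrences across all lists
```

Consumers 3 and 5 contribute nothing because their lists hold fewer than three items, so any counting approach can skip them.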
Here is a data.table answer that extends easily to quadruplets and beyond:
library(data.table)
setDT(shoppinglists)

# exclude consumers who bought fewer than 3 goods
shoppinglists[ , if (.N >= 3L)
  .(triplet =
      # get the combinations 3 at a time;
      # keep them as a list (simplify = FALSE)
      # for easy post-manipulation with sapply
      sapply(combn(items, 3L, simplify = FALSE),
             # **there should be a better way...**
             paste, collapse = ",")),
  by = consumer
  # now count the total frequency of each triplet
  ][ , .N, by = triplet
  # and sort to see the most frequent
  ][order(-N)]
# triplet N
# 1: apple,carrot,date 2
# 2: apple,carrot,eggplant 2
# 3: apple,date,eggplant 2
# 4: carrot,date,eggplant 2
# 5: apple,banana,carrot 1
# 6: apple,banana,date 1
# 7: apple,banana,eggplant 1
# 8: banana,carrot,date 1
# 9: banana,carrot,eggplant 1
# 10: banana,date,eggplant 1
# 11: apple,banana,fig 1
# 12: apple,banana,grape 1
# 13: apple,fig,grape 1
# 14: banana,fig,grape 1
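The pasted key depends on the order of items within each list, so "apple,carrot,date" and "date,carrot,apple" would be counted separately if a list were not already sorted. The following is a base-R sketch of the same counting idea that sorts and de-duplicates each list first. The helper count_combos and its argument k are hypothetical names introduced here, not part of the answer above.

```r
consumer <- c(1,1,1,1,1,2,2,2,2,3,3,4,4,4,4,5)
items <- c("apple","banana","carrot","date","eggplant","apple","banana",
           "fig","grape","apple","banana","apple","carrot","date",
           "eggplant","apple")
shoppinglists <- data.frame(consumer, items)

# Hypothetical helper: count how many lists contain each k-item combination
count_combos <- function(df, k = 3L) {
  lists <- split(as.character(df$items), df$consumer)
  combos <- unlist(lapply(lists, function(x) {
    x <- sort(unique(x))                      # canonical order, no repeats
    if (length(x) < k) return(character(0))   # list too short for a combination
    combn(x, k, FUN = paste, collapse = ",")  # one key per combination
  }), use.names = FALSE)
  counts <- table(combos)
  out <- data.frame(combo = names(counts), N = as.integer(counts),
                    stringsAsFactors = FALSE)
  out <- out[order(-out$N, out$combo), ]      # most frequent first
  rownames(out) <- NULL
  out
}

res <- count_combos(shoppinglists, k = 3L)
head(res, 4)
```

Passing FUN = paste to combn avoids the separate sapply step, at the cost of the same paste-based key.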
For pairs, we can use combn(items, 2L); for quadruplets, combn(items, 4L); and so on.
Replace order(-N) with N == max(N) to keep only the most frequent combinations.
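Putting the two remarks above together, here is a minimal sketch that counts pairs and keeps only the top-ranked rows. The sort() call and the names k, combos, counts and top are additions for illustration and are not part of the original answer.

```r
library(data.table)

shoppinglists <- data.table(
  consumer = c(1,1,1,1,1,2,2,2,2,3,3,4,4,4,4,5),
  items = c("apple","banana","carrot","date","eggplant","apple","banana",
            "fig","grape","apple","banana","apple","carrot","date",
            "eggplant","apple")
)

k <- 2L  # combination size: 2 for pairs, 4 for quadruplets, and so on

# One row per (consumer, combination); lists shorter than k are skipped
combos <- shoppinglists[ , if (.N >= k)
  .(combo = combn(sort(items), k, FUN = paste, collapse = ",")),
  by = consumer]

# Count each combination, then keep only the rows tied for the maximum
counts <- combos[ , .N, by = combo]
top <- counts[N == max(N)]
top
```

Filtering with N == max(N) keeps every combination tied for first place, so the result can have more than one row.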
I wish we didn't have to paste-collapse the combinations. I had hoped list() would work, but grouping by a list column apparently does not.
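One possible way around the pasted key, offered as a sketch and not as the answer author's method, is to spread each triplet over three ordinary columns and group by all of them. The column names item1, item2 and item3 are assumptions made for this example.

```r
library(data.table)

shoppinglists <- data.table(
  consumer = c(1,1,1,1,1,2,2,2,2,3,3,4,4,4,4,5),
  items = c("apple","banana","carrot","date","eggplant","apple","banana",
            "fig","grape","apple","banana","apple","carrot","date",
            "eggplant","apple")
)

# combn() returns a 3 x n matrix; transposing gives one triplet per row,
# which as.data.table() turns into columns V1, V2, V3
triplets <- shoppinglists[ , if (.N >= 3L)
  as.data.table(t(combn(sort(items), 3L))),
  by = consumer]
setnames(triplets, c("V1", "V2", "V3"), c("item1", "item2", "item3"))

# Group by the three item columns instead of a pasted string
counts <- triplets[ , .N, by = .(item1, item2, item3)][order(-N)]
counts
```

This keeps the items as separate values for later joins or filters, though the number of grouping columns then has to match the combination size.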
You can use the arules package. If you do a lot of this kind of work, it is worth exploring, because it:
Provides the infrastructure for representing, manipulating and
analyzing transaction data and patterns (frequent itemsets and
association rules). Also provides interfaces to C implementations of
the association mining algorithms Apriori and Eclat by C. Borgelt.
Here is a solution using the eclat algorithm:
# Set up the object you'll pass to eclat:
tbl <- table(shoppinglists)
itemList <- matrix(tbl)
dim(itemList) <- dim(tbl)
colnames(itemList) <- colnames(tbl)
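Now you can use eclat. Its support parameter specifies the minimum support an itemset needs in order to count as frequent. In this case you want every itemset regardless of frequency, so you can set support to 0. You will get a warning that setting it to 0 may cause you to run out of memory.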
library(arules)
d <- eclat(itemList, parameter = list(minlen = 3, maxlen = 3, support = 0))
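To make the support parameter concrete, here is a small base-R sketch that computes the support of one itemset by hand: the fraction of lists that contain every item in the set. The names basket, itemset, n_lists and support are illustrative only.

```r
consumer <- c(1,1,1,1,1,2,2,2,2,3,3,4,4,4,4,5)
items <- c("apple","banana","carrot","date","eggplant","apple","banana",
           "fig","grape","apple","banana","apple","carrot","date",
           "eggplant","apple")

# Logical consumer-by-item matrix: TRUE when the item is on the list
basket <- table(consumer, items) > 0

itemset <- c("apple", "carrot", "date")

# A list supports the itemset when it contains every item in it
contains_all <- rowSums(basket[, itemset, drop = FALSE]) == length(itemset)

n_lists <- sum(contains_all)          # number of lists containing the itemset
support <- n_lists / nrow(basket)     # support as a fraction of all lists
support
```

Because support is a fraction of all transactions, multiplying it by the number of transactions recovers the count.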
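You can build the data.frame you want from the data contained in d. Multiplying the support (quality(d)) by the total number of transactions (info(d)$ntransactions) gives the number of transactions that contain each itemset: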
d2 <- data.frame(items = labels(d), quality(d) * info(d)$ntransactions)
names(d2)[2] <- "N" # to rename from "support" to "N"
d2
# items N
#1 {apple,fig,grape} 1
#2 {banana,fig,grape} 1
#3 {apple,banana,grape} 1
#4 {apple,banana,fig} 1
#5 {apple,date,eggplant} 2
#6 {banana,date,eggplant} 1
#7 {carrot,date,eggplant} 2
#8 {apple,carrot,eggplant} 2
#9 {banana,carrot,eggplant} 1
#10 {apple,banana,eggplant} 1
#11 {apple,carrot,date} 2
#12 {banana,carrot,date} 1
#13 {apple,banana,date} 1
#14 {apple,banana,carrot} 1
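As an alternative setup, the sketch below builds the transactions by splitting the items by consumer and coercing the resulting list, which arules supports for lists of item vectors. It assumes arules is installed. Selecting the support column by name is a defensive choice made here, in case quality(d) carries more than one column in your version.

```r
library(arules)

consumer <- c(1,1,1,1,1,2,2,2,2,3,3,4,4,4,4,5)
items <- c("apple","banana","carrot","date","eggplant","apple","banana",
           "fig","grape","apple","banana","apple","carrot","date",
           "eggplant","apple")

# One transaction per consumer, built from a list of item vectors
trans <- as(split(items, consumer), "transactions")

# support = 0 keeps every triplet that occurs; expect a low-support warning
d <- eclat(trans, parameter = list(support = 0, minlen = 3, maxlen = 3))

# Convert support (a fraction) back to a number of transactions
res <- data.frame(items = labels(d),
                  N = round(quality(d)$support * length(trans)))
res <- res[order(-res$N), ]
res
```

The row order may differ from the output above, since only the N column is used for sorting.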