R: item combinations in groups of 3

There is already a solution for finding pairs, but what about triplets?

Suppose I have:

consumer=c(1,1,1,1,1,2,2,2,2,3,3,4,4,4,4,5)
items=c("apple","banana","carrot","date","eggplant","apple","banana",
        "fig","grape","apple","banana","apple","carrot","date",
        "eggplant","apple")
shoppinglists <- data.frame(consumer,items)
table(shoppinglists)

Is there a simple way to find the most frequent triplet combinations? For example, the triplets "carrot"+"date"+"eggplant", "apple"+"carrot"+"date", "apple"+"carrot"+"eggplant" and "apple"+"date"+"eggplant" each appear in two lists (consumers 1 and 4).

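Many triplets are tied for second place with a single occurrence each, for example A+B+C, A+B+D, A+B+E, B+C+D, B+C+E (consumer 1) and A+B+F, A+B+G (consumer 2), where each letter is the initial of an item.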

Here is a data.table answer that extends easily to quadruples and beyond:

library(data.table); setDT(shoppinglists)

#exclude consumers who bought fewer than 3 items
shoppinglists[ , if (.N >= 3L) 
  .(triplet =
      #get the combinations 3 at a time;
      #  keep them as a list (simplify=FALSE)
      #  for easy post-manipulation with sapply
      sapply(combn(items, 3L, simplify = FALSE),
             #**should be a better way...**
             paste, collapse = ",")), 
  by = consumer
  #now count the total frequency of each triplet
  ][ , .N, by = triplet
     #and sort to see the most frequent
     ][order(-N)]
#                    triplet N
#  1:      apple,carrot,date 2
#  2:  apple,carrot,eggplant 2
#  3:    apple,date,eggplant 2
#  4:   carrot,date,eggplant 2
#  5:    apple,banana,carrot 1
#  6:      apple,banana,date 1
#  7:  apple,banana,eggplant 1
#  8:     banana,carrot,date 1
#  9: banana,carrot,eggplant 1
# 10:   banana,date,eggplant 1
# 11:       apple,banana,fig 1
# 12:     apple,banana,grape 1
# 13:        apple,fig,grape 1
# 14:       banana,fig,grape 1

For pairs, we can use combn(items, 2L); for quadruples, combn(items, 4L), and so on, adjusting the .N >= 3L check to match.

Replace order(-N) with N == max(N) to keep only the most frequent triplets.

I wish we didn't have to paste-collapse here. I was hoping list() would work, but grouping by a list column apparently does not.
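For comparison, the same counting idea can be sketched in base R without data.table. This is only an illustration under a few assumptions: count_combos is a hypothetical helper name, each basket is de-duplicated and sorted so that the order of items within a shopping list does not change the triplet label, and baskets with fewer than m distinct items are skipped.

```r
# Toy data from the question
consumer <- c(1,1,1,1,1,2,2,2,2,3,3,4,4,4,4,5)
items <- c("apple","banana","carrot","date","eggplant","apple","banana",
           "fig","grape","apple","banana","apple","carrot","date",
           "eggplant","apple")

# Hypothetical helper: count how many baskets contain each m-item combination
count_combos <- function(consumer, items, m = 3L) {
  # one character vector of items per consumer
  baskets <- split(as.character(items), consumer)
  combos <- lapply(baskets, function(b) {
    b <- sort(unique(b))                      # order-independent labels
    if (length(b) < m) return(character(0))   # too few items for a combination
    combn(b, m, FUN = paste, collapse = ",")  # one label per combination
  })
  counts <- table(unlist(combos, use.names = FALSE))
  out <- setNames(as.integer(counts), names(counts))
  sort(out, decreasing = TRUE)
}

triplets <- count_combos(consumer, items, 3L)
# keep only the most frequent triplets
top_triplets <- triplets[triplets == max(triplets)]

# the same helper handles pairs (m = 2L) or quadruples (m = 4L)
pairs <- count_combos(consumer, items, 2L)
```

Passing m as an argument keeps the size check and the combn() call in sync, which is the part that has to be changed by hand in the data.table version.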

You can use the arules package. If you do a lot of this kind of work, it is worth exploring, because it:

Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules). Also provides interfaces to C implementations of the association mining algorithms Apriori and Eclat by C. Borgelt.

Here is a solution using the eclat algorithm:

# Set up the object you'll pass to eclat:
tbl <- table(shoppinglists)
itemList <- matrix(tbl)
dim(itemList) <- dim(tbl)
colnames(itemList) <- colnames(tbl)
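The object built above is a consumer-by-item matrix with one row per shopping list. As a rough sketch of what that structure looks like, the snippet below builds an equivalent logical incidence matrix in base R. The name inc is made up for this illustration, and whether eclat treats a logical matrix exactly like the integer one above is not verified here.

```r
# Toy data from the question
consumer <- c(1,1,1,1,1,2,2,2,2,3,3,4,4,4,4,5)
items <- c("apple","banana","carrot","date","eggplant","apple","banana",
           "fig","grape","apple","banana","apple","carrot","date",
           "eggplant","apple")

# Rows are consumers, columns are items; TRUE means the item is on that list
inc <- unclass(table(consumer, items)) > 0

dim(inc)               # 5 consumers by 7 distinct items
items_per_list <- rowSums(inc)  # number of distinct items on each list
```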

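Now you can use eclat. It takes a support parameter that specifies the minimum support an itemset needs in order to be considered frequent. In this case you want everything regardless of frequency, so you can set support to 0. You will get a warning that setting it to 0 may cause you to run out of memory.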

library(arules)
d <- eclat(itemList, parameter = list(minlen = 3, maxlen = 3, support = 0))

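You can build the data.frame you want from the data contained in d. Multiplying the support (quality(d)) by the total number of transactions (info(d)$ntransactions) gives the number of transactions that contain each itemset: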

d2 <- data.frame(items = labels(d), quality(d) * info(d)$ntransactions)
names(d2)[2] <- "N" # to rename from "support" to "N"
d2
#                      items N
#1         {apple,fig,grape} 1
#2        {banana,fig,grape} 1
#3      {apple,banana,grape} 1
#4        {apple,banana,fig} 1
#5     {apple,date,eggplant} 2
#6    {banana,date,eggplant} 1
#7    {carrot,date,eggplant} 2
#8   {apple,carrot,eggplant} 2
#9  {banana,carrot,eggplant} 1
#10  {apple,banana,eggplant} 1
#11      {apple,carrot,date} 2
#12     {banana,carrot,date} 1
#13      {apple,banana,date} 1
#14    {apple,banana,carrot} 1
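The relationship between support and count can be checked by hand for a single itemset. The sketch below uses only base R and assumes that support means the fraction of shopping lists containing every item in the set. The names inc, itemset and contains_all are made up for this illustration and are not part of arules.

```r
# Toy data from the question
consumer <- c(1,1,1,1,1,2,2,2,2,3,3,4,4,4,4,5)
items <- c("apple","banana","carrot","date","eggplant","apple","banana",
           "fig","grape","apple","banana","apple","carrot","date",
           "eggplant","apple")

# Logical consumer-by-item incidence matrix
inc <- unclass(table(consumer, items)) > 0

itemset <- c("apple", "carrot", "date")
# TRUE for each shopping list that contains every item of the itemset
contains_all <- rowSums(inc[, itemset, drop = FALSE]) == length(itemset)

support <- mean(contains_all)     # fraction of lists containing the itemset
n_transactions <- nrow(inc)       # total number of shopping lists
N <- support * n_transactions     # back to a count, as in d2 above
```

Here two of the five lists (consumers 1 and 4) contain the itemset, so the support is 2/5 and multiplying by 5 recovers the count of 2 shown for {apple,carrot,date}.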