How can I summarize this data with R?

Question

I am analyzing customer flow between different shopping venues. I have data like this:

df <- data.frame(customer.id=letters[seq(1,7)], 
                 shop.1=c(1,1,1,1,1,0,0),
                 shop.2=c(0,0,1,1,1,1,0),
                 shop.3=c(1,0,0,0,0,0,1))
df

#>   customer.id shop.1 shop.2 shop.3
#> 1           a      1      0      1
#> 2           b      1      0      0  
#> 3           c      1      1      0 
#> 4           d      1      1      0 
#> 5           e      1      1      0 
#> 6           f      0      1      0 
#> 7           g      0      0      1

So, for example:

Customer "a" shopped only at shops 1 and 3,
Customer "b" shopped only at shop 1,
Customer "c" shopped only at shops 1 and 2,
and so on.

I would like to summarize the data like this:

#>           shop.1 shop.2 shop.3 
#> shop.1         5      3      1
#> shop.2         3      4      0       
#> shop.3         1      0      2

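So, for example, row 1 shows:

5 people shopped at shop 1 and shop 1 (obviously a redundant observation)
3 people shopped at both shop 1 and shop 2
1 person shopped at both shop 1 and shop 3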

How can I do this? (Note: my dataset contains many shops, so a scalable approach is preferred.)

Answer 1

You want to tabulate the co-occurrence of the shop.* variables:

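# Only if the shop.* columns are "x"/"" strings; on numeric 0/1 data x=="" is never TRUE and every cell would become 1
if (is.character(df$shop.1)) df[,2:4] <- sapply(df[,2:4], function(x) { ifelse(x=="", 0, 1) } )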

1) Supposedly this can be done with ftable(xtabs(...)), but I struggled with it for a long time and could not get it to work. The closest I got was:

> ftable(xtabs(~ shop.1 + shop.2 + shop.3, df))

              shop.3 0 1
shop.1 shop.2           
0      0             0 1
       1             1 0
1      0             1 1
       1             3 0
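
The three-way table above counts each exact combination of visits, not the pairwise overlap the question asks for. As a rough sketch (the name pair is only illustrative), a two-way xtabs on a single pair of shops does expose one cell of the desired summary, though repeating this for every pair would not scale to many shops:

```r
# Same data as in the question
df <- data.frame(customer.id = letters[seq(1, 7)],
                 shop.1 = c(1, 1, 1, 1, 1, 0, 0),
                 shop.2 = c(0, 0, 1, 1, 1, 1, 0),
                 shop.3 = c(1, 0, 0, 0, 0, 0, 1))

# Two-way contingency table for one pair of shops; its dimnames are "0" and "1"
pair <- xtabs(~ shop.1 + shop.2, df)

# The ("1", "1") cell counts customers who visited both shop.1 and shop.2
both <- pair["1", "1"]
both
```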

2) As @thelatemail showed, you can also:

# Transform your df from wide-form to long-form...
library(dplyr)
library(reshape2)
occurrence_df <- reshape2::melt(df, id.vars='customer.id') %>%
                 dplyr::filter(value==1)

   customer.id variable value
1            a   shop.1     1
2            b   shop.1     1
3            c   shop.1     1
4            d   shop.1     1
5            e   shop.1     1
6            c   shop.2     1
7            d   shop.2     1
8            e   shop.2     1
9            f   shop.2     1
10           a   shop.3     1
11           g   shop.3     1

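We can actually drop the value column after the filter by piping on %>% select(-value), which leaves only the two columns that table() needs in the next step: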

   customer.id variable
1            a   shop.1
2            b   shop.1
3            c   shop.1
4            d   shop.1
5            e   shop.1
6            c   shop.2
7            d   shop.2
8            e   shop.2
9            f   shop.2
10           a   shop.3
11           g   shop.3

# Then the same crossprod step as in @thelatemail's answer:

crossprod(table(occurrence_df))

        variable
variable shop.1 shop.2 shop.3
  shop.1      5      3      1
  shop.2      3      4      0
  shop.3      1      0      2

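(Footnote:

First, your data should be numeric (or factor), not strings. You want to convert "x" to 1 and "" to 0.
If they are strings because they came from read.csv, use the read.csv argument stringsAsFactors=TRUE to make them factors, or colClasses to make them numeric; see the many duplicate questions on this.)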
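
The conversion described in the footnote could look roughly like this. It is a minimal sketch that assumes a visit is marked "x" and a non-visit is an empty string; the data frame raw and the name converted are hypothetical and not part of the original data:

```r
# Hypothetical raw data: "x" marks a visit, "" marks no visit
raw <- data.frame(customer.id = c("a", "b", "c"),
                  shop.1 = c("x", "x", ""),
                  shop.2 = c("", "x", "x"),
                  stringsAsFactors = FALSE)

# Comparing against "x" gives TRUE/FALSE; as.integer() turns that into 1/0
converted <- raw
converted[-1] <- lapply(raw[-1], function(col) as.integer(col == "x"))
converted
```

Testing for equality with "x" rather than with "" means any unexpected value is treated as "no visit" instead of silently becoming a 1.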

Answer 2

crossprod can do what you want, after some basic manipulation to reshape the data into 2 columns representing customer and shop:

tmp <- cbind(df[1],stack(df[-1]))
tmp <- tmp[tmp$values==1,]

crossprod(table(tmp[c(1,3)]))

#        ind
#ind      shop.1 shop.2 shop.3
#  shop.1      5      3      1
#  shop.2      3      4      0
#  shop.3      1      0      2
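
To see why this works, it may help to look at the intermediate table. The sketch below repeats the steps above with illustrative names (long, incidence, co): table() builds a customer-by-shop incidence table, and crossprod(M) computes t(M) %*% M, so each cell (i, j) sums, over customers, the product of their 0/1 entries for shop i and shop j.

```r
# Same data as in the question
df <- data.frame(customer.id = letters[seq(1, 7)],
                 shop.1 = c(1, 1, 1, 1, 1, 0, 0),
                 shop.2 = c(0, 0, 1, 1, 1, 1, 0),
                 shop.3 = c(1, 0, 0, 0, 0, 0, 1))

# Long form: stack() yields columns "values" and "ind"; keep only actual visits
long <- cbind(df[1], stack(df[-1]))
long <- long[long$values == 1, ]

# One row per customer, one column per shop, 1 where the customer visited
incidence <- table(long[c("customer.id", "ind")])
incidence

# Cell (i, j) counts customers who visited both shop i and shop j
co <- crossprod(incidence)
co
```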

Answer 3

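Actually, plain matrix operations seem to be enough, because the data frame contains only 0s and 1s.

First, drop the customer.id column and convert the data.frame to a matrix. This is straightforward. (mydf is the name of your data frame.)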

# base R way
as.matrix(mydf[,-1])
#>      shop.1 shop.2 shop.3
#> [1,]      1      0      1
#> [2,]      1      0      0
#> [3,]      1      1      0
#> [4,]      1      1      0
#> [5,]      1      1      0
#> [6,]      0      1      0
#> [7,]      0      0      1

library(dplyr) #dplyr way
(mymat <-
  mydf %>% 
  select(-customer.id) %>% 
  as.matrix())
#>      shop.1 shop.2 shop.3
#> [1,]      1      0      1
#> [2,]      1      0      0
#> [3,]      1      1      0
#> [4,]      1      1      0
#> [5,]      1      1      0
#> [6,]      0      1      0
#> [7,]      0      0      1

With this matrix, a single matrix multiplication is all you need:

t(mymat) %*% mymat
#>        shop.1 shop.2 shop.3
#> shop.1      5      3      1
#> shop.2      3      4      0
#> shop.3      1      0      2

This gives you the answer.
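
As a small variation, base R's crossprod() computes the same product as t(mymat) %*% mymat without forming the transpose explicitly. The sketch below rebuilds the question's data under the name mydf; co and per_shop are illustrative names only. The diagonal, which the question calls redundant, is simply the number of customers per shop.

```r
# Same data as in the question, under the name used in this answer
mydf <- data.frame(customer.id = letters[seq(1, 7)],
                   shop.1 = c(1, 1, 1, 1, 1, 0, 0),
                   shop.2 = c(0, 0, 1, 1, 1, 1, 0),
                   shop.3 = c(1, 0, 0, 0, 0, 0, 1))
mymat <- as.matrix(mydf[, -1])

# crossprod(x) is defined as t(x) %*% x
co <- crossprod(mymat)
co

# The diagonal holds the per-shop customer counts, i.e. the column sums
per_shop <- diag(co)
per_shop
```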
