Group by two columns of a data frame and reshape the counts into a matrix in R

I want to convert the following data.frame into a matrix that counts how many times each bike station ID appears in each hour.


> dim(test)
[1] 80623     5

> head(test, n = 10)
   bikeid end.station.id start.station.id diff.time hour
1   16052            244              322      6544   14
2   16052            284              432      3406   21
3   16052            461              519     33416    3
4   16052            228              519     26876   13
5   16052             72              435       388   17
6   16052            319              127     27702   11
7   16052            282             2002     33882    4
8   16052            524             2021      2525   10
9   16052            387              351      2397   12
10  16052            388              526     32507   13


The output should look like this:

> sample2
   start.station.id  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
1                72 44  1 42 22  9 33 39 47 12 30 39 52 43 45 40 62  9 35 24 43 65 59 58 34
2                79 21 11  2 42  5 18 57 64 32 47 61 43 65 38 46 61 48 29 58 22 35  4 50 31
3                82 19 44  7 52 14 19  3 30 25 60 33 60 48 54 25 24 42 62 13 51 23 43 54  7
4                83 45 60 64  5  0  3 54 16 48 67 49 20 59 21 24 38 42 62 38 24  1 35 16  4
5               116 27 62 64 44 55 65 23 13 36  0 62 54 61  6 16  7 58 41 29  1 34 58 35 67
6               119 45 30 41 26  7 39 16 55 28 53 42  9  5 31 18 16 14 37 17 14 16 17 23 50
7               120  3  2  7 53 21 33 31 48 19 50 35 47  8 17 30  9 49  4 48 28 52  9 57 55
8               127 33 44 47 42  6 46 39 30 39 28 19 57 53 41 45 55  9 27 42 19 43 24 37 55
9               137 53 11 60  1 66 37 16  5  2 58  0 46 33  0 60 54 25 66 65 40 36 47 58 40
10              143 61  1 50 62 57 33 12 15 27 19 65 48 12 55 64 14 22 13 12 57 45 13 66 56

Someone suggested I use something like the following:

matrix <- test %>% 
  group_by(start.station.id, hour)%>%
  summarise(sum = nrow) %>%
  spread(hour, nrow) 

But I don't know how to code it correctly.

Using data.table:

library(data.table) #1.9.6+
setDT(test)
dcast(test[ , .N, by = .(start.station.id, hour)],
      start.station.id ~ hour, value.var = "N")

Or (slower, but cleaner):

dcast(test, start.station.id ~ hour, fun.aggregate = length, value.var = "hour")
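A small sketch of how the two forms behave on a toy table may help. The data below is made up for illustration (the names `toy`, `wide_pre`, `wide_post` and `m` are not from the question). One difference to be aware of: when a station/hour pair never occurs, the pre-aggregated form leaves `NA` in that cell unless `fill` is supplied, whereas with `fun.aggregate = length` the empty cells come out as 0. Since the question asks for a matrix, the last step converts the wide table with `as.matrix()`, assuming a data.table version whose `as.matrix` method accepts `rownames` (1.11.0 or later, as far as I know).

```r
library(data.table)

# Hypothetical toy data: three stations, hours 1 to 3, some pairs missing
toy <- data.table(
  start.station.id = c(72, 72, 72, 79, 79, 82),
  hour             = c(1, 1, 3, 1, 2, 3)
)

# Pre-aggregate with .N, then reshape; fill = 0L replaces absent pairs with 0
wide_pre <- dcast(toy[ , .N, by = .(start.station.id, hour)],
                  start.station.id ~ hour, value.var = "N", fill = 0L)

# Aggregate inside dcast; empty cells get length(<empty vector>), i.e. 0
wide_post <- dcast(toy, start.station.id ~ hour,
                   fun.aggregate = length, value.var = "hour")

# Turn the wide table into a matrix, using station IDs as row names
m <- as.matrix(wide_pre, rownames = "start.station.id")
print(m)
```

Rows of `m` are stations and columns are hours, so `m["72", "1"]` is the number of trips that started at station 72 in hour 1.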

Testing on some fake data:

set.seed(10932)
NN <- 1e6
test <- data.table(start.station.id = sample(1000, NN, TRUE),
                   hour = sample(24, NN, TRUE))

library(microbenchmark)

microbenchmark(times = 100L,
               preagg = dcast(test[ , .N, by = .(start.station.id, hour)],
                              start.station.id ~ hour, value.var = "N"),
               postagg = dcast(test, start.station.id ~ hour, 
                               fun.aggregate = length, value.var = "hour"))

Unit: milliseconds
    expr      min       lq      mean   median        uq      max neval
  preagg 55.83240 59.88939  66.56289 61.37408  64.37049 166.8902   100
 postagg 91.16012 93.68588 101.17297 96.04823 101.20717 203.4270   100

The first is faster because the operation test[ , .N, by = vars] is optimized in data.table.
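For readers who would rather stay with the dplyr/tidyr style from the question, a rough equivalent might look like the sketch below. It is not part of the data.table approach above; it assumes dplyr plus tidyr 1.1.0 or later (for `pivot_wider()` with `names_sort`), and the toy data and the names `toy_df`, `wide_tbl` and `m2` are made up for illustration. The attempt in the question fails because `nrow` is a function rather than a per-group count; `count()` does the grouping and counting in one step.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, same shape as the relevant columns of `test`
toy_df <- data.frame(
  start.station.id = c(72, 72, 72, 79, 79, 82),
  hour             = c(1, 1, 3, 1, 2, 3)
)

wide_tbl <- toy_df %>%
  count(start.station.id, hour) %>%      # one row per station/hour pair, count in column n
  pivot_wider(names_from = hour,         # hours become columns
              values_from = n,
              values_fill = 0L,          # absent pairs become 0 instead of NA
              names_sort = TRUE)         # order the hour columns

# Convert to a plain matrix with station IDs as row names
m2 <- as.matrix(wide_tbl[ , -1])
rownames(m2) <- wide_tbl$start.station.id
print(m2)
```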