Group by and transform two columns of a data frame into a matrix in R
I want to transform the following data.frame into a matrix that counts how many times each bike station ID occurs in each hour.
> dim(test)
[1] 80623     5
> head(test, n = 10)
   bikeid end.station.id start.station.id diff.time hour
1   16052            244              322      6544   14
2   16052            284              432      3406   21
3   16052            461              519     33416    3
4   16052            228              519     26876   13
5   16052             72              435       388   17
6   16052            319              127     27702   11
7   16052            282             2002     33882    4
8   16052            524             2021      2525   10
9   16052            387              351      2397   12
10  16052            388              526     32507   13
The output should look like this:
> sample2
start.station.id 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
1 72 44 1 42 22 9 33 39 47 12 30 39 52 43 45 40 62 9 35 24 43 65 59 58 34
2 79 21 11 2 42 5 18 57 64 32 47 61 43 65 38 46 61 48 29 58 22 35 4 50 31
3 82 19 44 7 52 14 19 3 30 25 60 33 60 48 54 25 24 42 62 13 51 23 43 54 7
4 83 45 60 64 5 0 3 54 16 48 67 49 20 59 21 24 38 42 62 38 24 1 35 16 4
5 116 27 62 64 44 55 65 23 13 36 0 62 54 61 6 16 7 58 41 29 1 34 58 35 67
6 119 45 30 41 26 7 39 16 55 28 53 42 9 5 31 18 16 14 37 17 14 16 17 23 50
7 120 3 2 7 53 21 33 31 48 19 50 35 47 8 17 30 9 49 4 48 28 52 9 57 55
8 127 33 44 47 42 6 46 39 30 39 28 19 57 53 41 45 55 9 27 42 19 43 24 37 55
9 137 53 11 60 1 66 37 16 5 2 58 0 46 33 0 60 54 25 66 65 40 36 47 58 40
10 143 61 1 50 62 57 33 12 15 27 19 65 48 12 55 64 14 22 13 12 57 45 13 66 56 66 56
Someone suggested I use something like the following:
matrix <- test %>%
  group_by(start.station.id, hour) %>%
  summarise(sum = nrow) %>%
  spread(hour, nrow)
but I don't know how to code it correctly.
Using data.table:
library(data.table) #1.9.6+
setDT(test)
dcast(test[ , .N, by = .(start.station.id, hour)],
      start.station.id ~ hour, value.var = "N")
Or (slower, but cleaner):
dcast(test, start.station.id ~ hour, fun.aggregate = length, value.var = "hour")
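Both calls return a wide data.table rather than a base matrix. If a true matrix is needed, the conversion could be sketched as below. The `toy` table and its values are made up for illustration and are not from the question. The `fill = 0L` argument is an addition here: without it, the pre-aggregated version leaves NA for station-hour pairs that never occur.

```r
library(data.table)

# Hypothetical toy data: 3 stations, hours 1 to 3
toy <- data.table(start.station.id = c(72, 72, 72, 79, 79, 82),
                  hour             = c(1, 1, 2, 2, 3, 1))

# Step 1: one row per (station, hour) pair, with the count in column N
counts <- toy[ , .N, by = .(start.station.id, hour)]

# Step 2: spread hours into columns; pairs that never occur become 0
wide <- dcast(counts, start.station.id ~ hour, value.var = "N", fill = 0L)

# Step 3: drop the ID column and use it as row names instead
m <- as.matrix(wide[ , -1, with = FALSE])
rownames(m) <- wide$start.station.id
m
```

Rows can then be looked up by station, for example `m["72", "1"]` for station 72 in hour 1.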
Testing on some fake data:
set.seed(10932)
NN <- 1e6
test <- data.table(start.station.id = sample(1000, NN, TRUE),
                   hour = sample(24, NN, TRUE))
library(microbenchmark)
microbenchmark(times = 100L,
               preagg = dcast(test[ , .N, by = .(start.station.id, hour)],
                              start.station.id ~ hour, value.var = "N"),
               postagg = dcast(test, start.station.id ~ hour,
                               fun.aggregate = length, value.var = "hour"))
Unit: milliseconds
expr min lq mean median uq max neval
preagg 55.83240 59.88939 66.56289 61.37408 64.37049 166.8902 100
postagg 91.16012 93.68588 101.17297 96.04823 101.20717 203.4270 100
The first is faster because the operation test[ , .N, by = vars] is optimized internally by data.table.
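For comparison, the dplyr/tidyr pipeline attempted in the question can probably be made to work along the following lines. The original fails because `nrow` is a function rather than a per-group count, and because `spread()` is pointed at a column that was never created. This is a sketch on made-up toy data (the `toy` data frame is hypothetical), using `count()` as shorthand for `group_by()` followed by `summarise(n = n())`.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: 3 stations, hours 1 to 3
toy <- data.frame(start.station.id = c(72, 72, 72, 79, 79, 82),
                  hour             = c(1, 1, 2, 2, 3, 1))

wide <- toy %>%
  count(start.station.id, hour) %>%  # one row per station-hour pair, count in column n
  spread(hour, n, fill = 0)          # hours become columns; missing pairs become 0

# Drop the ID column and use it as row names to get a plain matrix
m <- as.matrix(wide[ , -1])
rownames(m) <- wide$start.station.id
m
```

In recent tidyr versions `spread()` is superseded by `pivot_wider()`, though it still works. If no packages are available, base R's table(test$start.station.id, test$hour) builds a comparable contingency table directly.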