data.table 按组获取 N 个最频繁的值
data.table get N most frequent values by group
假设我想为每个购买类别查找最常出现的前 3 个邮政编码。在此示例中,类别为 home、townhouse 和 condo。我有这样的交易数据:
set.seed(1234)
d <- data.table(purch_id = 1:3e6,
purch_cat = sample(x = c('home','townhouse','condo'),
size = 3e6, replace=TRUE),
purch_zip = formatC( sample(x = 1e4:9e4, size = 3e6, replace=TRUE),
width = 5, format = "d", flag = "0") )
我知道我可以做到:
# there has to be a better way...
d[,list(purch_count = length(purch_id)),
by=list(purch_cat, purch_zip)][, purch_rank := rank(-purch_count, ties.method='min'),
by=purch_cat][purch_rank<=3,][order(purch_cat, purch_rank)]
purch_cat purch_zip purch_count purch_rank
1: condo 39169 32 1
2: condo 15725 31 2
3: condo 75768 30 3
4: condo 72023 30 3
5: home 71294 30 1
6: home 56053 30 1
7: home 57971 29 3
8: home 77521 29 3
9: home 70124 29 3
10: home 25302 29 3
11: home 65292 29 3
12: home 39488 29 3
13: townhouse 39587 33 1
14: townhouse 80365 30 2
15: townhouse 37360 30 2
但这不是最优雅的data.table
方法,而且看起来有点慢。
有什么减少传递次数的建议吗?也许使用 table()
? TYVM!
编辑:改进。
我认为您的方向完全正确。但是,您缺少的一个关键是函数 frank
,它已经过优化并且应该可以显着加快您的代码速度(几乎在您的 3m 行样本数据上立即运行):
d[ , .(purch_count = .N),
by = .(purch_cat, purch_zip)
][, purch_rank := frank(-purch_count, ties.method = 'min'),
keyby = purch_cat
][purch_rank <= 3,
][order(purch_cat, purch_rank)]
purch_cat purch_zip purch_count purch_rank
1: condo 39169 32 1
2: condo 15725 31 2
3: condo 75768 30 3
4: condo 72023 30 3
5: home 71294 30 1
6: home 56053 30 1
7: home 57971 29 3
8: home 77521 29 3
9: home 70124 29 3
10: home 25302 29 3
11: home 65292 29 3
12: home 39488 29 3
13: townhouse 39587 33 1
14: townhouse 80365 30 2
15: townhouse 37360 30 2
不完整的答案table
(慢):
是的,一种方法涉及使用 table
。
d[ , {x <- table(purch_zip)
x <- x[order(-x)]
names(x[x %in% unique(x)[1:3]])
}, keyby = purch_cat]
purch_cat V1
1: condo 39169
2: condo 15725
3: condo 72023
4: condo 75768
5: home 56053
6: home 71294
7: home 25302
8: home 39488
9: home 57971
10: home 65292
11: home 70124
12: home 77521
13: home 16943
14: home 43003
15: home 43426
16: home 76501
17: home 81754
18: home 88978
19: townhouse 39587
20: townhouse 37360
21: townhouse 80365
22: townhouse 22402
23: townhouse 33518
24: townhouse 59347
25: townhouse 83099
purch_cat V1
一种方法是
d[ , .N, by=.(purch_cat, purch_zip)][
order(-N),
.SD[ N >= unique(N)[3] ]
,by=purch_cat]
这给出了
purch_cat purch_zip N
1: townhouse 39587 33
2: townhouse 80365 30
3: townhouse 37360 30
4: townhouse 83099 28
5: townhouse 33518 28
6: townhouse 59347 28
7: townhouse 22402 28
8: condo 39169 32
9: condo 15725 31
10: condo 75768 30
11: condo 72023 30
12: home 71294 30
13: home 56053 30
14: home 57971 29
15: home 77521 29
16: home 70124 29
17: home 25302 29
18: home 65292 29
19: home 39488 29
20: home 81754 28
21: home 43426 28
22: home 16943 28
23: home 88978 28
24: home 43003 28
25: home 76501 28
purch_cat purch_zip N
要实现 OP 的平局规则,可以这样做
d[ , .N, by=.(purch_cat,purch_zip)][
order(-N),
.SD[ N >= unique(N)[3] ][
.N - frank(N, ties.method='max') < 3 ]
, by=purch_cat]
这给出了
purch_cat purch_zip N
1: townhouse 39587 33
2: townhouse 80365 30
3: townhouse 37360 30
4: condo 39169 32
5: condo 15725 31
6: condo 75768 30
7: condo 72023 30
8: home 71294 30
9: home 56053 30
10: home 57971 29
11: home 77521 29
12: home 70124 29
13: home 25302 29
14: home 65292 29
15: home 39488 29
根据@MichaelChirico 的回答,此方法添加了一个 frank
步骤。
假设我想为每个购买类别查找最常出现的前 3 个邮政编码。在此示例中,类别为 home、townhouse 和 condo。我有这样的交易数据:
set.seed(1234)
d <- data.table(purch_id = 1:3e6,
purch_cat = sample(x = c('home','townhouse','condo'),
size = 3e6, replace=TRUE),
purch_zip = formatC( sample(x = 1e4:9e4, size = 3e6, replace=TRUE),
width = 5, format = "d", flag = "0") )
我知道我可以做到:
# there has to be a better way...
d[,list(purch_count = length(purch_id)),
by=list(purch_cat, purch_zip)][, purch_rank := rank(-purch_count, ties.method='min'),
by=purch_cat][purch_rank<=3,][order(purch_cat, purch_rank)]
purch_cat purch_zip purch_count purch_rank
1: condo 39169 32 1
2: condo 15725 31 2
3: condo 75768 30 3
4: condo 72023 30 3
5: home 71294 30 1
6: home 56053 30 1
7: home 57971 29 3
8: home 77521 29 3
9: home 70124 29 3
10: home 25302 29 3
11: home 65292 29 3
12: home 39488 29 3
13: townhouse 39587 33 1
14: townhouse 80365 30 2
15: townhouse 37360 30 2
但这不是最优雅的data.table
方法,而且看起来有点慢。
有什么减少传递次数的建议吗?也许使用 table()
? TYVM!
编辑:改进。
我认为您的方向完全正确。但是,您缺少的一个关键是函数 frank
,它已经过优化并且应该可以显着加快您的代码速度(几乎在您的 3m 行样本数据上立即运行):
d[ , .(purch_count = .N),
by = .(purch_cat, purch_zip)
][, purch_rank := frank(-purch_count, ties.method = 'min'),
keyby = purch_cat
][purch_rank <= 3,
][order(purch_cat, purch_rank)]
purch_cat purch_zip purch_count purch_rank
1: condo 39169 32 1
2: condo 15725 31 2
3: condo 75768 30 3
4: condo 72023 30 3
5: home 71294 30 1
6: home 56053 30 1
7: home 57971 29 3
8: home 77521 29 3
9: home 70124 29 3
10: home 25302 29 3
11: home 65292 29 3
12: home 39488 29 3
13: townhouse 39587 33 1
14: townhouse 80365 30 2
15: townhouse 37360 30 2
不完整的答案table
(慢):
是的,一种方法涉及使用 table
。
d[ , {x <- table(purch_zip)
x <- x[order(-x)]
names(x[x %in% unique(x)[1:3]])
}, keyby = purch_cat]
purch_cat V1
1: condo 39169
2: condo 15725
3: condo 72023
4: condo 75768
5: home 56053
6: home 71294
7: home 25302
8: home 39488
9: home 57971
10: home 65292
11: home 70124
12: home 77521
13: home 16943
14: home 43003
15: home 43426
16: home 76501
17: home 81754
18: home 88978
19: townhouse 39587
20: townhouse 37360
21: townhouse 80365
22: townhouse 22402
23: townhouse 33518
24: townhouse 59347
25: townhouse 83099
purch_cat V1
一种方法是
d[ , .N, by=.(purch_cat, purch_zip)][
order(-N),
.SD[ N >= unique(N)[3] ]
,by=purch_cat]
这给出了
purch_cat purch_zip N
1: townhouse 39587 33
2: townhouse 80365 30
3: townhouse 37360 30
4: townhouse 83099 28
5: townhouse 33518 28
6: townhouse 59347 28
7: townhouse 22402 28
8: condo 39169 32
9: condo 15725 31
10: condo 75768 30
11: condo 72023 30
12: home 71294 30
13: home 56053 30
14: home 57971 29
15: home 77521 29
16: home 70124 29
17: home 25302 29
18: home 65292 29
19: home 39488 29
20: home 81754 28
21: home 43426 28
22: home 16943 28
23: home 88978 28
24: home 43003 28
25: home 76501 28
purch_cat purch_zip N
要实现 OP 的平局规则,可以这样做
d[ , .N, by=.(purch_cat,purch_zip)][
order(-N),
.SD[ N >= unique(N)[3] ][
.N - frank(N, ties.method='max') < 3 ]
, by=purch_cat]
这给出了
purch_cat purch_zip N
1: townhouse 39587 33
2: townhouse 80365 30
3: townhouse 37360 30
4: condo 39169 32
5: condo 15725 31
6: condo 75768 30
7: condo 72023 30
8: home 71294 30
9: home 56053 30
10: home 57971 29
11: home 77521 29
12: home 70124 29
13: home 25302 29
14: home 65292 29
15: home 39488 29
根据@MichaelChirico 的回答,此方法添加了一个 frank
步骤。