Find rank based on frequency for each row
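My data contains a time variable and a chosen-brand variable, as shown below. shoptime is the time of the purchase and chosenbrand is the brand bought at that time.
Based on this data, I want to create the third and fourth columns of the table below. The rules are as follows: the third (fourth) column gives the rank of brand1 (brand2) by how often it was chosen within the preceding 5 days. If there is no history for that brand within those 5 days, the value should be NA.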
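For example, look at the 5th row. Its shoptime is 2013-09-05 09:11:00, so the 5-day window is 2013-08-31 09:11:00 ~ 2013-09-05 09:11:00. Within this period, brand3, brand3, brand2, and brand1 were chosen (the 5th row's own chosenbrand is excluded). Ranked by how often each brand was chosen, brand1 (third column) is 2nd and brand2 is also 2nd. So the two columns for the 5th row should be 2 and 2.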
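As another example, look at the last row of the table below. Its shoptime is 2013-09-09 09:32:00, so the 5-day window is 2013-09-04 09:32:00 ~ 2013-09-09 09:32:00. Within this period, brand1, brand2, brand6, brand2, and brand2 were chosen (that row's own chosenbrand is excluded). Ranked by how often each brand was chosen, brand1 (third column) is 2nd and brand2 is 1st. So the two columns for that row should be 2 and 1.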
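Is there a simple way to do this?
In addition, if I want to do it by individual (if each customer has several purchased history), how to do that?
The data is shown below: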
shoptime chosenbrand nth_most_freq_brand1 nth_most_freq_brand2
2013-09-01 08:35:00 brand3 NA NA
2013-09-02 08:54:00 brand3 NA NA
2013-09-03 09:07:00 brand2 NA NA
2013-09-04 09:08:00 brand1 NA 2
2013-09-05 09:11:00 brand1 2 2
2013-09-06 09:14:00 brand2 1 2
2013-09-07 09:26:00 brand6 1 1
2013-09-08 09:26:00 brand2 1 2
2013-09-09 09:29:00 brand2 2 1
2013-09-09 09:32:00 brand4 2 1
Here is the code for the data:
dat <- data.frame(shoptime = c("2013-09-01 08:35:00 UTC", "2013-09-02 08:54:00 UTC", "2013-09-03 09:07:00 UTC" ,"2013-09-04 09:08:00 UTC", "2013-09-05 09:11:00 UTC", "2013-09-06 09:14:00 UTC",
"2013-09-07 09:26:00 UTC", "2013-09-08 09:26:00 UTC" ,"2013-09-09 09:29:00 UTC", "2013-09-09 09:32:00 UTC"),
chosenbrand = c("brand3", "brand3", "brand2", "brand1", "brand1", "brand2", "brand6", "brand2" , "brand2" , "brand4" ),
nth_most_freq_brand1 = NA,
nth_most_freq_brand2 = NA,
stringsAsFactors = FALSE)
A solution using tidyverse and lubridate.
OP's first question
library(tidyverse)
library(lubridate)
Step 1: Convert the shoptime column to a date-time object
dat <- dat %>% mutate(shoptime = ymd_hms(shoptime))
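Step 2: Create a lookup table for all shoptime values.
The complete function creates all combinations of the given columns, so we can make a copy of the shoptime column (shoptime1) and generate every pair of the two. We can then use filter(shoptime1 > shoptime - hours(5 * 24), shoptime1 < shoptime) to keep only the pairs in which shoptime1 falls within the 5 days before shoptime.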
dat2 <- dat %>%
mutate(shoptime1 = shoptime) %>%
select(contains("shoptime")) %>%
complete(shoptime, shoptime1) %>%
filter(shoptime1 > shoptime - hours(5 * 24), shoptime1 < shoptime)
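To make the pairing step easier to follow, here is a minimal sketch on toy data. The column names t and t1 and the numeric "times" are hypothetical stand-ins for shoptime and shoptime1, and the window length of 5 units stands in for the 5 days.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical values, not the OP's data): event times as plain numbers
toy <- tibble(t = c(1, 3, 6))

pairs <- toy %>%
  mutate(t1 = t) %>%              # copy of the time column
  complete(t, t1) %>%             # all 3 x 3 = 9 combinations of t and t1
  filter(t1 > t - 5, t1 < t)      # keep t1 strictly before t and less than 5 units older

pairs
# Two rows remain: (t = 3, t1 = 1) and (t = 6, t1 = 3).
# The pair (t = 6, t1 = 1) is dropped because the lower bound is strict.
```

Each remaining row says "the event at t1 lies in the window of the event at t", which is exactly what the lookup table dat2 encodes for the real data.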
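Step 3: Merge dat with the lookup table, count the brands, and rank the counts.
We can join the lookup table dat2 to dat by matching shoptime1 to shoptime. The count function counts the occurrences per group. After that, we group by shoptime and use dense_rank to rank each brand within each group.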
dat3 <- dat2 %>%
left_join(dat, by = c("shoptime1" = "shoptime")) %>%
count(shoptime, chosenbrand) %>%
group_by(shoptime) %>%
mutate(rank = dense_rank(desc(n))) %>%
select(-n) %>%
spread(chosenbrand, rank) %>%
select(shoptime, brand1, brand2)
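As a small illustration of the ranking step, the sketch below applies dense_rank to the counts that the window of row 5 in the question would produce (brand3 twice, brand1 and brand2 once each). The tibble counts is built by hand for this example, and the comparison with min_rank uses a made-up count vector.

```r
library(dplyr)

# Hand-built counts for the window of row 5 in the question
counts <- tibble(chosenbrand = c("brand1", "brand2", "brand3"),
                 n = c(1L, 1L, 2L))

ranked <- counts %>%
  mutate(rank = dense_rank(desc(n)))   # largest count gets rank 1, ties share a rank

ranked$rank
# 2 2 1

# dense_rank leaves no gaps after ties, unlike min_rank (made-up counts):
dense_rank(desc(c(2, 2, 1)))   # 1 1 2
min_rank(desc(c(2, 2, 1)))     # 1 1 3
```

The question's expected output uses consecutive ranks after ties, which is why a dense ranking fits here.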
Step 4: Merge the original data frame with the dat3 data frame.
dat4 <- dat %>% left_join(dat3, by = "shoptime")
Here is the final result.
dat4
# shoptime chosenbrand brand1 brand2
# 1 2013-09-01 08:35:00 brand3 NA NA
# 2 2013-09-02 08:54:00 brand3 NA NA
# 3 2013-09-03 09:07:00 brand2 NA NA
# 4 2013-09-04 09:08:00 brand1 NA 2
# 5 2013-09-05 09:11:00 brand1 2 2
# 6 2013-09-06 09:14:00 brand2 1 2
# 7 2013-09-07 09:26:00 brand6 1 1
# 8 2013-09-08 09:26:00 brand2 1 2
# 9 2013-09-09 09:29:00 brand2 2 1
# 10 2013-09-09 09:32:00 brand4 2 1
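The result can be cross-checked with a slow but explicit base R loop. This is only a sketch for verification, not part of the solution above: window_rank is a hypothetical helper written for this example, and it assumes the same window as the filter in step 2 (strictly less than 5 days old, strictly before the current purchase).

```r
times <- as.POSIXct(c("2013-09-01 08:35:00", "2013-09-02 08:54:00",
                      "2013-09-03 09:07:00", "2013-09-04 09:08:00",
                      "2013-09-05 09:11:00", "2013-09-06 09:14:00",
                      "2013-09-07 09:26:00", "2013-09-08 09:26:00",
                      "2013-09-09 09:29:00", "2013-09-09 09:32:00"), tz = "UTC")
brands <- c("brand3", "brand3", "brand2", "brand1", "brand1",
            "brand2", "brand6", "brand2", "brand2", "brand4")

# Hypothetical helper: dense rank of `brand` among the purchases made in the
# `days`-day window before each purchase (NA if the brand is absent from it)
window_rank <- function(times, brands, brand, days = 5) {
  vapply(seq_along(times), function(i) {
    in_window <- times < times[i] & times > times[i] - days * 24 * 3600
    seen <- brands[in_window]
    n_brand <- sum(seen == brand)
    if (n_brand == 0L) return(NA_integer_)
    # distinct purchase counts of all brands seen in the window
    all_counts <- unique(tabulate(match(seen, unique(seen))))
    # dense rank = 1 + number of distinct counts that are strictly larger
    1L + sum(all_counts > n_brand)
  }, integer(1))
}

rank_brand1 <- window_rank(times, brands, "brand1")
rank_brand2 <- window_rank(times, brands, "brand2")
rank_brand1
# NA NA NA NA  2  1  1  1  2  2
rank_brand2
# NA NA NA  2  2  2  1  2  1  1
```

Both vectors match the brand1 and brand2 columns of dat4 and the expected columns in the question.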
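OP's second question
Since OP did not provide an example dataset, I will construct one. The problem can be solved with only minor changes to my answer to the first question. The key is to treat the customer column as a grouping variable in some of the steps.
Here is the code to create the example dataset. I only add as_tibble (called as.tibble in older versions of tibble) at the end to convert the data.table object to a tibble.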
library(data.table)
dat <- data.frame(shoptime = c("2013-09-01 08:35:00 UTC", "2013-09-02 08:54:00 UTC", "2013-09-03 09:07:00 UTC" ,"2013-09-04 09:08:00 UTC", "2013-09-05 09:11:00 UTC", "2013-09-06 09:14:00 UTC",
"2013-09-07 09:26:00 UTC", "2013-09-08 09:26:00 UTC" ,"2013-09-09 09:29:00 UTC", "2013-09-09 09:32:00 UTC"),
chosenbrand = c("brand3", "brand3", "brand2", "brand1", "brand1", "brand2", "brand6", "brand2" , "brand2" , "brand4" ),
stringsAsFactors = FALSE)
dat <- rbindlist(list(dat, dat[c(FALSE, TRUE), ]), idcol = "customer")
dat <- as_tibble(dat)  # as.tibble() is deprecated in current versions of tibble
Step 1: Convert the shoptime column to a date-time object
dat <- dat %>% mutate(shoptime = ymd_hms(shoptime))
Step 2: Create a lookup table for all shoptime values.
Note that the code is almost the same as before, except that we need to group by customer before applying the complete function.
dat2 <- dat %>%
mutate(shoptime1 = shoptime) %>%
select(contains("shoptime"), customer) %>%
group_by(customer) %>%
complete(shoptime, shoptime1) %>%
filter(shoptime1 > shoptime - hours(5 * 24), shoptime1 < shoptime)
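Step 3: Merge dat with the lookup table, count the brands, and rank the counts.
Again, we need to take the customer column into account when joining and when counting the brands.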
dat3 <- dat2 %>%
left_join(dat, by = c("customer", "shoptime1" = "shoptime")) %>%
count(customer, shoptime, chosenbrand) %>%
group_by(customer, shoptime) %>%
mutate(rank = dense_rank(-n)) %>%
select(-n) %>%
spread(chosenbrand, rank) %>%
select(customer, shoptime, brand1, brand2)
Step 4: Merge the original data frame with the dat3 data frame.
dat4 <- dat %>% left_join(dat3, by = c("customer", "shoptime"))
Here is the final result. I add as.data.frame only to print the output in a simpler format.
dat4 %>% as.data.frame()
# customer shoptime chosenbrand brand1 brand2
# 1 1 2013-09-01 08:35:00 brand3 NA NA
# 2 1 2013-09-02 08:54:00 brand3 NA NA
# 3 1 2013-09-03 09:07:00 brand2 NA NA
# 4 1 2013-09-04 09:08:00 brand1 NA 2
# 5 1 2013-09-05 09:11:00 brand1 2 2
# 6 1 2013-09-06 09:14:00 brand2 1 2
# 7 1 2013-09-07 09:26:00 brand6 1 1
# 8 1 2013-09-08 09:26:00 brand2 1 2
# 9 1 2013-09-09 09:29:00 brand2 2 1
# 10 1 2013-09-09 09:32:00 brand4 2 1
# 11 2 2013-09-02 08:54:00 brand3 NA NA
# 12 2 2013-09-04 09:08:00 brand1 NA NA
# 13 2 2013-09-06 09:14:00 brand2 1 NA
# 14 2 2013-09-08 09:26:00 brand2 1 1
# 15 2 2013-09-09 09:32:00 brand4 NA 1
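OP has asked a very similar question. If I understand correctly, the only differences are
- an extended time window of 5 days instead of 36 hours (note that OP refers to a time period, not a date period)
- only brand1 and brand2 are considered (rather than all values of chosenbrand).
Therefore, the approach from that answer can be reused here with a few adjustments and improvements: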
library(data.table)
library(lubridate)
setDT(dat)[, shoptime := as_datetime(shoptime)]
setorder(dat, shoptime) # not required, just for convenience of observers
selected_brands <- c("brand1", "brand2")
result <- dat[
.(lb = shoptime - hours(5 * 24), ub = shoptime),
on = .(shoptime >= lb, shoptime < ub),
nomatch = 0L, by = .EACHI,
.SD[, .N, by = chosenbrand][, rank := frank(-N, ties.method="dense")]][
chosenbrand %in% selected_brands,
dcast(unique(.SD[, -1]), shoptime ~ paste0("nth_most_freq_", chosenbrand),
value.var = "rank")][
dat, on = "shoptime"]
# change column order to make it look more similar to the expected answer
setcolorder(result, c(1, 4, 2:3))
result
shoptime chosenbrand nth_most_freq_brand1 nth_most_freq_brand2
1: 2013-09-01 08:35:00 brand3 NA NA
2: 2013-09-02 08:54:00 brand3 NA NA
3: 2013-09-03 09:07:00 brand2 NA NA
4: 2013-09-04 09:08:00 brand1 NA 2
5: 2013-09-05 09:11:00 brand1 2 2
6: 2013-09-06 09:14:00 brand2 1 2
7: 2013-09-07 09:26:00 brand6 1 1
8: 2013-09-08 09:26:00 brand2 1 2
9: 2013-09-09 09:29:00 brand2 2 1
10: 2013-09-09 09:32:00 brand4 2 1
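The core of this solution is a non-equi join that is aggregated per lookup row with by = .EACHI. The sketch below shows the same pattern on toy data; the tables purchases and windows and the numeric "times" are hypothetical and only stand in for dat and the lb/ub bounds.

```r
library(data.table)

# Toy data (hypothetical): purchase times as plain numbers
purchases <- data.table(t = c(1, 2, 4, 7), brand = c("a", "a", "b", "a"))
# One window [lb, ub) per lookup row; the last window contains no purchase
windows <- data.table(lb = c(0, 2, 5), ub = c(3, 5, 6))

res <- purchases[
  windows,
  on = .(t >= lb, t < ub),      # purchases with lb <= t < ub
  nomatch = 0L, by = .EACHI,    # drop empty windows, evaluate j once per window
  .SD[, .N, by = brand]]        # count purchases per brand inside each window

res$brand
# "a" "a" "b"
res$N
# 2 1 1
```

The first window contributes one row (brand a, twice), the second contributes two rows (a and b, once each), and the empty third window is dropped by nomatch = 0L. As far as I understand the join semantics, the result carries one column per join condition, both named after the column of the outer table and holding the bound values, which is presumably why the solution above removes the duplicate with .SD[, -1] before reshaping.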
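Answer to OP's second question
OP asked an additional question:
In addition, if I want to do it by individual (if each customer has several purchased history), how to do that?
Unfortunately, OP did not provide an example dataset for this case. So we need to make up a dataset for two customers from the data that was provided: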
dat <- data.frame(shoptime = c("2013-09-01 08:35:00 UTC", "2013-09-02 08:54:00 UTC", "2013-09-03 09:07:00 UTC" ,"2013-09-04 09:08:00 UTC", "2013-09-05 09:11:00 UTC", "2013-09-06 09:14:00 UTC",
"2013-09-07 09:26:00 UTC", "2013-09-08 09:26:00 UTC" ,"2013-09-09 09:29:00 UTC", "2013-09-09 09:32:00 UTC"),
chosenbrand = c("brand3", "brand3", "brand2", "brand1", "brand1", "brand2", "brand6", "brand2" , "brand2" , "brand4" ),
stringsAsFactors = FALSE)
dat <- rbindlist(list(dat, dat[c(FALSE, TRUE), ]), idcol = "customer")
dat
customer shoptime chosenbrand
1: 1 2013-09-01 08:35:00 UTC brand3
2: 1 2013-09-02 08:54:00 UTC brand3
3: 1 2013-09-03 09:07:00 UTC brand2
4: 1 2013-09-04 09:08:00 UTC brand1
5: 1 2013-09-05 09:11:00 UTC brand1
6: 1 2013-09-06 09:14:00 UTC brand2
7: 1 2013-09-07 09:26:00 UTC brand6
8: 1 2013-09-08 09:26:00 UTC brand2
9: 1 2013-09-09 09:29:00 UTC brand2
10: 1 2013-09-09 09:32:00 UTC brand4
11: 2 2013-09-02 08:54:00 UTC brand3
12: 2 2013-09-04 09:08:00 UTC brand1
13: 2 2013-09-06 09:14:00 UTC brand2
14: 2 2013-09-08 09:26:00 UTC brand2
15: 2 2013-09-09 09:32:00 UTC brand4
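The way the second customer is derived may not be obvious, so here is a minimal sketch of the two ingredients on toy data. The data frame one_customer and its values are hypothetical.

```r
library(data.table)

# Toy purchase history (hypothetical values)
one_customer <- data.frame(shoptime = c("t1", "t2", "t3", "t4"),
                           chosenbrand = c("a", "b", "a", "c"),
                           stringsAsFactors = FALSE)

# c(FALSE, TRUE) is recycled over the rows, so it keeps rows 2, 4, ...
every_second <- one_customer[c(FALSE, TRUE), ]

# idcol adds a column that numbers the list elements: 1 for the first, 2 for the second
two_customers <- rbindlist(list(one_customer, every_second), idcol = "customer")

two_customers$customer
# 1 1 1 1 2 2
two_customers$shoptime
# "t1" "t2" "t3" "t4" "t2" "t4"
```

Applied to the OP's data, this gives customer 1 the full history and customer 2 every second purchase, which is the 15-row table shown above.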
Now, we can modify the existing solution to take the different customers into account:
setDT(dat)[, shoptime := as_datetime(shoptime)]
setorder(dat, customer, shoptime) # not required, just for convenience of observers
selected_brands <- c("brand1", "brand2")
result <- dat[
.(customer = customer, lb = shoptime - hours(5 * 24), ub = shoptime),
on = .(customer, shoptime >= lb, shoptime < ub),
nomatch = 0L, by = .EACHI,
.SD[, .N, by = chosenbrand][, rank := frank(-N, ties.method="dense")]][
chosenbrand %in% selected_brands,
dcast(unique(.SD[, -2]), customer + shoptime ~ paste0("nth_most_freq_", chosenbrand),
value.var = "rank")][
dat, on = .(customer, shoptime)]
# change column order to make it look more similar to the expected answer
setcolorder(result, c(1:2, 5, 3:4))
result
customer shoptime chosenbrand nth_most_freq_brand1 nth_most_freq_brand2
1: 1 2013-09-01 08:35:00 brand3 NA NA
2: 1 2013-09-02 08:54:00 brand3 NA NA
3: 1 2013-09-03 09:07:00 brand2 NA NA
4: 1 2013-09-04 09:08:00 brand1 NA 2
5: 1 2013-09-05 09:11:00 brand1 2 2
6: 1 2013-09-06 09:14:00 brand2 1 2
7: 1 2013-09-07 09:26:00 brand6 1 1
8: 1 2013-09-08 09:26:00 brand2 1 2
9: 1 2013-09-09 09:29:00 brand2 2 1
10: 1 2013-09-09 09:32:00 brand4 2 1
11: 2 2013-09-02 08:54:00 brand3 NA NA
12: 2 2013-09-04 09:08:00 brand1 NA NA
13: 2 2013-09-06 09:14:00 brand2 1 NA
14: 2 2013-09-08 09:26:00 brand2 1 1
15: 2 2013-09-09 09:32:00 brand4 NA 1