Fastest way to look up value in previous period?

Question

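I have a data set of household purchases of a retail product. For each household shopping trip, I would like to check whether any of the brands the household bought on that trip were also bought in the previous period (the preceding trip): if so, loyal=1, otherwise loyal=0. The data set is large, with billions of observations, so the more efficient the better. :)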

library(data.table)
household <-  as.integer(c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3))
trip      <- as.integer(c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9))
brand     <- as.integer(c(1,2,3,4,5,6,7,5,1,6,8,9,9,2,8,1,3,4,5,6,7,8,9,1,1,2,3,4,1,5,6,7,1,8,9,2))
DT <- data.table(household,trip,brand)

Desired output:

> DT
     household trip loyal
[1,]         1    1    NA
[2,]         1    2     0
[3,]         1    3     1
[4,]         2    4    NA
[5,]         2    5     0
[6,]         2    6     0
[7,]         3    7    NA
[8,]         3    8     1
[9,]         3    9     1

I tried something like this, but it obviously does not produce the desired output.

DT$loyal <- 0
for (h in unique(DT$household)){
  for (t in unique(DT$trip)){
    DT[brand %in% (DT[trip=t-1]$brand)]$loyal <- 1
  }}

Answer 1

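You could do a self join to get an index of the matching trips, and then join that index back onto the unique combinations of household and trip. I was thinking of something like this: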

# Create a column of the previous trip
DT[, prev_trip := trip - 1L]

# Self join
indx <- 
  DT[DT 
   ,.(household, trip)
   ,on = .(household, prev_trip = trip, brand)
   ,nomatch = 0L]

# Unique combinations of `household` and `trip`, flagged via a join with the index
res <- unique(DT[, .(household, trip)])[indx, on = .(household, trip), loyal := 1L]
res
#    household trip loyal
# 1:         1    1    NA
# 2:         1    2    NA
# 3:         1    3     1
# 4:         2    4    NA
# 5:         2    5    NA
# 6:         2    6    NA
# 7:         3    7    NA
# 8:         3    8     1
# 9:         3    9     1

I'm not sure whether the 0s there matter, since they don't seem very informative to me, but they can easily be added afterwards if needed.
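If the 0s are wanted, one possible way to add them is sketched below. It repeats the steps above on the example data and then sets loyal to 0 for every trip that is not the household's first. The helper column `first_trip` is a hypothetical name introduced only for this illustration, and the sketch relies on the same assumption as `prev_trip`: trip numbers within a household are consecutive integers.

```r
library(data.table)

# Example data from the question
DT <- data.table(
  household = rep(1:3, each = 12L),
  trip      = rep(1:9, each = 4L),
  brand     = as.integer(c(1,2,3,4, 5,6,7,5, 1,6,8,9,
                           9,2,8,1, 3,4,5,6, 7,8,9,1,
                           1,2,3,4, 1,5,6,7, 1,8,9,2))
)

# Same steps as above: previous-trip column, self join, flag matching trips
DT[, prev_trip := trip - 1L]
indx <- DT[DT, .(household, trip),
           on = .(household, prev_trip = trip, brand), nomatch = 0L]
res <- unique(DT[, .(household, trip)])
res[indx, on = .(household, trip), loyal := 1L]

# Assumption: a household's first trip has no previous period, so it stays NA.
# `first_trip` is a hypothetical helper column, dropped again afterwards.
res[, first_trip := trip == min(trip), by = household]
res[is.na(loyal) & !first_trip, loyal := 0L]
res[, first_trip := NULL]
res
```

On the example data this should reproduce the loyal column from the desired output in the question. If trip numbers can have gaps within a household, both `prev_trip` and this fill step would need a different definition of "previous trip".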


lookup

benchmarking

r

data.table