Easily check whether an objective was recorded in another variable?

Question

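I have a customer ID, product_id1, and product_id2. The data holds each customer's purchase information and is sorted by customer and time, so the first row for each customer is the oldest record.
product_id1 contains the item that was purchased. product_id2 contains the item for which I want to know whether it was purchased before (looked up in product_id1).

For each item in product_id2, within each customer, I want to create a dummy variable indicating whether that item was purchased in the past.
In other words, if the value of product_id2 in row n appears at least once in rows 1 to (n-1) of product_id1, purchased_before is 1; otherwise it is 0.

So I want to create the "purchased_before" column.

I can do this with a for loop, but is there a more efficient way?

The data looks like this: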

customer_id      product_id1   product_id2     purchased_before
    1             112             113                 0
    1             115             114                 0
    1             113             113                 0
    1             114             113                 1
    1             115             114                 1
    ....
    2             112             115                 0
    2             115             112                 1
    2             113             113                 0

Answer 1

Try the following:

dplyr:

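library(dplyr)

df %>%
    group_by(customer_id) %>%
    mutate(purchased_before = sapply(row_number(), function(x) {
               # seq_len(x - 1) is empty for the first row; 1:(x - 1) would be c(1, 0)
               # and wrongly compare row 1 against itself
               ifelse(product_id2[x] %in% product_id1[seq_len(x - 1)], 1, 0)
           })
    )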

base R:

do.call(rbind, lapply(split(df, df$customer_id), function(x) {
    x$purchased_before <- sapply(seq_len(nrow(x)), function(y) {
        # seq_len(y - 1) selects only the earlier rows (none for the first row)
        ifelse(x$product_id2[y] %in% x$product_id1[seq_len(y - 1)], 1, 0)
    })
    x
}))

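The main idea is to iterate over the row numbers within each customer. Each row number is used to access the product_id2 value at that index, together with the product_id1 values from row 1 up to the row just before it. With those values in hand, a simple membership test with %in% inside ifelse does the rest: if there is a match, assign 1; otherwise assign 0.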
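As a minimal, self-contained sketch of that row-by-row check, the following base R snippet rebuilds the sample rows from the question and computes the column. The helper name seen_before and the trick of passing row indices through ave() are assumptions made for illustration; they are not part of the original answer.

```r
# Sample rows from the question (the elided "...." rows are left out)
df <- data.frame(
  customer_id = c(1, 1, 1, 1, 1, 2, 2, 2),
  product_id1 = c(112, 115, 113, 114, 115, 112, 115, 113),
  product_id2 = c(113, 114, 113, 113, 114, 115, 112, 113)
)

# Hypothetical helper: for each position i, is id2[i] among id1[1..(i-1)]?
# seq_len(i - 1) is empty when i == 1, so the first row is never a match.
seen_before <- function(id1, id2) {
  vapply(seq_along(id2), function(i) {
    as.integer(id2[i] %in% id1[seq_len(i - 1)])
  }, integer(1))
}

# ave() applies FUN per group and keeps the original row order.
# It only passes one vector, so row indices are passed and used for lookup.
idx <- seq_len(nrow(df))
df$purchased_before <- ave(idx, df$customer_id, FUN = function(i) {
  seen_before(df$product_id1[i], df$product_id2[i])
})

print(df)
```

The first-row case is where the two index forms differ: 1:0 evaluates to c(1, 0), which selects the first element, while seq_len(0) selects nothing.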

Hope this helps.

Answer 2

This can be solved using a non-equi join and aggregating in the join:

library(data.table)
setDT(DT)[
  # add "time variable", i.e., row id to identify earlier purchases
  , rn := .I][
    # create new column with ...
    , cnt_of_earlier_purchases := 
      # ... the result of the non-equi join aggregate
      DT[DT, on = .(customer_id, product_id1 = product_id2, rn < rn), .N, by = .EACHI]$N][]

   customer_id product_id1 product_id2 rn cnt_of_earlier_purchases
1:           1         112         113  1                        0
2:           1         115         114  2                        0
3:           1         113         113  3                        0
4:           1         114         113  4                        1
5:           1         115         114  5                        1
6:           2         112         115  6                        0
7:           2         115         112  7                        1
8:           2         112         113  8                        0
9:           2         115         112  9                        2

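The new column contains the number of earlier purchases of the item, i.e., how many times it was bought before the current row.

Note that a modified sample dataset that includes repeated purchases has been used to demonstrate the effect of counting purchases.

Alternatively, a logical value can be appended instead of the count: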

setDT(DT)[, rn := .I][
  , purchased_before := 
    DT[DT, on = .(customer_id, product_id1 = product_id2, rn < rn), .N, by = .EACHI]$N > 0][]

   customer_id product_id1 product_id2 rn purchased_before
1:           1         112         113  1            FALSE
2:           1         115         114  2            FALSE
3:           1         113         113  3            FALSE
4:           1         114         113  4             TRUE
5:           1         115         114  5             TRUE
6:           2         112         115  6            FALSE
7:           2         115         112  7             TRUE
8:           2         112         113  8            FALSE
9:           2         115         112  9             TRUE
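The question asks for a 1/0 dummy rather than TRUE/FALSE. A possible sketch, assuming the same modified sample data as in this answer, wraps the join result in as.integer() and then drops the helper column rn. This is an illustrative variation, not part of the original answer.

```r
library(data.table)

# Same modified sample data as in this answer (first three columns only)
DT <- fread(
"customer_id product_id1 product_id2
1 112 113
1 115 114
1 113 113
1 114 113
1 115 114
2 112 115
2 115 112
2 112 113
2 115 112")

# Row id acts as the time variable
DT[, rn := .I]

# Count earlier matching purchases per row, then turn "count > 0" into 1/0
DT[, purchased_before := as.integer(
  DT[DT, on = .(customer_id, product_id1 = product_id2, rn < rn),
     .N, by = .EACHI]$N > 0
)]

# Drop the helper column once it is no longer needed
DT[, rn := NULL]

print(DT)
```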

Data

library(data.table)
DT <- fread(
"customer_id      product_id1   product_id2     purchased_before
    1             112             113                 0
    1             115             114                 0
    1             113             113                 0
    1             114             113                 1
    1             115             114                 1
    2             112             115                 0
    2             115             112                 1
    2             112             113                 0
    2             115             112                 0", select = 1:3)
