R Lag/Lead on Date Column Identification

Question

I want to create a new identifier column in a dataset I have.

ex <- structure(list(id = c("8210109300002", "8210109300002", "8210109300002", 
 "8210109300002", "8210109300002", "8210109300002", "8210109300002", 
 "8210109300002", "8210109300002"), serv_from_dt = structure(c(18262, 
 18263, 18267, 18267, 18268, 18269, 18269, 18275, 18276), class = "Date"), 
 serv_to_dt = structure(c(18262, 18263, 18267, 18267, 18268, 
 18269, 18269, 18275, 18276), class = "Date"), date_plus1 = structure(c(18263, 
 18264, 18268, 18268, 18269, 18270, 18270, 18276, 18277), class = "Date")), 
 row.names = c(NA, -9L), class = c("data.table", "data.frame"))

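The identifier will be based on the serv_to_dt, serv_from_dt and date_plus1 columns. The data are sorted by serv_from_dt. If a row's serv_to_dt equals the previous row's serv_from_dt, or equals the previous row's serv_from_dt + 1 (that is, the previous row's date_plus1), then those rows should be flagged with the same identifier.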

The final output I want is:

want <- structure(list(id = c("8210109300002", "8210109300002", "8210109300002", 
 "8210109300002", "8210109300002", "8210109300002", "8210109300002", 
 "8210109300002", "8210109300002"), serv_from_dt = structure(c(18262, 
 18263, 18267, 18267, 18268, 18269, 18269, 18275, 18276), class = "Date"), 
 serv_to_dt = structure(c(18262, 18263, 18267, 18267, 18268, 
 18269, 18269, 18275, 18276), class = "Date"), date_plus1 = structure(c(18263, 
 18264, 18268, 18268, 18269, 18270, 18270, 18276, 18277), class = "Date"),
 identifier = c("1", "1", "2", 
 "2", "2", "2", "2", 
 "3", "3")), row.names = c(NA, -9L), class = c("data.table", "data.frame"))

My first step was to create a column that flags whether a row's dates match the lagged date from the previous row:

ex %>% 
  mutate(NewCol = ifelse((lag(serv_from_dt) == date_plus1 | lag(serv_from_dt) == serv_to_dt), "yes", "no"))

However, this code does not correctly say "yes" for a serv_from_dt that matches the previous row's date_plus1.

Thanks in advance for any help!

Answer 1

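The logic below uses cumsum, which increments only when serv_to_dt equals neither the lagged serv_from_dt nor the lagged date_plus1. The row_number() == 1 condition makes the cumulative sum start at 1.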

library(dplyr)

ex %>% 
  mutate(identifier = cumsum((serv_to_dt != lag(serv_from_dt) & serv_to_dt != lag(date_plus1)) | row_number() == 1))

Output

             id serv_from_dt serv_to_dt date_plus1 identifier
1 8210109300002   2020-01-01 2020-01-01 2020-01-02          1
2 8210109300002   2020-01-02 2020-01-02 2020-01-03          1
3 8210109300002   2020-01-06 2020-01-06 2020-01-07          2
4 8210109300002   2020-01-06 2020-01-06 2020-01-07          2
5 8210109300002   2020-01-07 2020-01-07 2020-01-08          2
6 8210109300002   2020-01-08 2020-01-08 2020-01-09          2
7 8210109300002   2020-01-08 2020-01-08 2020-01-09          2
8 8210109300002   2020-01-14 2020-01-14 2020-01-15          3
9 8210109300002   2020-01-15 2020-01-15 2020-01-16          3
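
To see why a cumulative sum over "break" flags produces group numbers, here is a minimal base-R sketch of the same idea. It assumes only the day numbers from the question's data; the helper prev() is a hypothetical stand-in for dplyr's lag(), and the final as.character() is only there because the desired output stores the identifier as text.

```r
# Day numbers copied from the question's serv_from_dt column.
serv_from_dt <- as.Date(c(18262, 18263, 18267, 18267, 18268,
                          18269, 18269, 18275, 18276), origin = "1970-01-01")
serv_to_dt <- serv_from_dt      # identical in the example data
date_plus1 <- serv_from_dt + 1

# Hypothetical helper: previous row's value, NA for the first row.
prev <- function(x) x[c(NA, seq_len(length(x) - 1))]

# TRUE where a row matches neither of the previous row's dates.
new_group <- serv_to_dt != prev(serv_from_dt) & serv_to_dt != prev(date_plus1)
new_group[1] <- TRUE            # the first row always opens group 1

# Each TRUE bumps the running total, so rows between breaks share a number.
identifier <- as.character(cumsum(new_group))
identifier
```

Every TRUE in new_group opens a new group, and the FALSE rows inherit the running total of the row above them.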

Answer 2

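Your logic is good, you just missed the last step: we need a cumulative count of the "yes" values, using cumsum.

It is actually easier if we skip the ifelse and keep the result as TRUE/FALSE instead of "yes"/"no", and use a sensible default to make sure the first row is TRUE.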

want %>% 
  mutate(NewCol = cumsum(
    lag(serv_from_dt, default = first(date_plus1)) == date_plus1 |
      lag(serv_from_dt) == serv_to_dt)
  )
#              id serv_from_dt serv_to_dt date_plus1 identifier NewCol
# 1 8210109300002   2020-01-01 2020-01-01 2020-01-02          1      1
# 2 8210109300002   2020-01-02 2020-01-02 2020-01-03          1      1
# 3 8210109300002   2020-01-06 2020-01-06 2020-01-07          2      1
# 4 8210109300002   2020-01-06 2020-01-06 2020-01-07          2      2
# 5 8210109300002   2020-01-07 2020-01-07 2020-01-08          2      2
# 6 8210109300002   2020-01-08 2020-01-08 2020-01-09          2      2
# 7 8210109300002   2020-01-08 2020-01-08 2020-01-09          2      3
# 8 8210109300002   2020-01-14 2020-01-14 2020-01-15          3      3
# 9 8210109300002   2020-01-15 2020-01-15 2020-01-16          3      3
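
In the printed result, NewCol runs 1, 1, 1, 2, 2, 2, 3, 3, 3 and so does not line up with the identifier column. One possible adjustment, offered as a sketch rather than as this answer's own code: lag the previous row's columns instead of comparing against the current row's date_plus1, and count the rows that do not continue the previous one. The data frame below is rebuilt from the question's day numbers, and the continues column is a hypothetical helper name.

```r
library(dplyr)

# Rebuild the date columns from the question's day numbers.
ex <- data.frame(
  serv_from_dt = as.Date(c(18262, 18263, 18267, 18267, 18268,
                           18269, 18269, 18275, 18276), origin = "1970-01-01")
)
ex$serv_to_dt <- ex$serv_from_dt
ex$date_plus1 <- ex$serv_from_dt + 1

out <- ex %>%
  mutate(
    # TRUE when this row continues the previous one; NA on the first row.
    continues = serv_to_dt == lag(serv_from_dt) | serv_to_dt == lag(date_plus1),
    # Count the breaks; the first row's NA is treated as a break.
    NewCol = cumsum(!coalesce(continues, FALSE))
  )
out
```

Counting breaks rather than matches is what makes consecutive matching rows share a number.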

Answer 3

With data.table:

library(data.table)

setDT(ex)

ex[,identifier:=cumsum(!(serv_to_dt == shift(serv_from_dt,1,fill = FALSE)|serv_to_dt == shift(serv_from_dt,1,fill=FALSE)+1))][]

              id serv_from_dt serv_to_dt date_plus1 identifier
1: 8210109300002   2020-01-01 2020-01-01 2020-01-02          1
2: 8210109300002   2020-01-02 2020-01-02 2020-01-03          1
3: 8210109300002   2020-01-06 2020-01-06 2020-01-07          2
4: 8210109300002   2020-01-06 2020-01-06 2020-01-07          2
5: 8210109300002   2020-01-07 2020-01-07 2020-01-08          2
6: 8210109300002   2020-01-08 2020-01-08 2020-01-09          2
7: 8210109300002   2020-01-08 2020-01-08 2020-01-09          2
8: 8210109300002   2020-01-14 2020-01-14 2020-01-15          3
9: 8210109300002   2020-01-15 2020-01-15 2020-01-16          3
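
The example has a single id. If the real data hold several ids (an assumption, since the question does not say), the counter probably needs to restart for each one. A minimal sketch with made-up ids "A" and "B" and made-up dates; the gap column is a hypothetical helper:

```r
library(data.table)

# Made-up data: two ids, already sorted by serv_from_dt within each id.
dt <- data.table(
  id = rep(c("A", "B"), each = 3),
  serv_from_dt = as.Date(c("2020-01-01", "2020-01-02", "2020-01-06",
                           "2020-01-02", "2020-01-03", "2020-01-10"))
)
dt[, serv_to_dt := serv_from_dt]

# Days between this row's serv_to_dt and the previous row's serv_from_dt,
# computed within each id (NA on the first row of every id).
dt[, gap := as.integer(serv_to_dt - shift(serv_from_dt)), by = id]

# A gap of 0 or 1 continues the group; anything else (including NA) starts one.
dt[, identifier := cumsum(!(gap %in% 0:1)), by = id][]
```

Because `NA %in% 0:1` is FALSE, the first row of each id opens group 1 without needing a fill value for shift().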

if-statement

r

dataframe

data.table

tidyverse