R Lag/Lead on Date Column Identification
I want to create a new identifier column in a dataset I have.
ex <- structure(list(id = c("8210109300002", "8210109300002", "8210109300002",
"8210109300002", "8210109300002", "8210109300002", "8210109300002",
"8210109300002", "8210109300002"), serv_from_dt = structure(c(18262,
18263, 18267, 18267, 18268, 18269, 18269, 18275, 18276), class = "Date"),
serv_to_dt = structure(c(18262, 18263, 18267, 18267, 18268,
18269, 18269, 18275, 18276), class = "Date"), date_plus1 = structure(c(18263,
18264, 18268, 18268, 18269, 18270, 18270, 18276, 18277), class = "Date")),
row.names = c(NA, -9L), class = c("data.table", "data.frame"))
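The identifier should be based on the serv_to_dt, serv_from_dt and date_plus1 columns. The data is sorted by serv_from_dt. If a row's serv_to_dt equals the previous row's serv_from_dt, or equals the previous row's serv_from_dt + 1 (i.e. the date_plus1 column), those rows should share one identifier.
The final output I want is: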
want <- structure(list(id = c("8210109300002", "8210109300002", "8210109300002",
"8210109300002", "8210109300002", "8210109300002", "8210109300002",
"8210109300002", "8210109300002"), serv_from_dt = structure(c(18262,
18263, 18267, 18267, 18268, 18269, 18269, 18275, 18276), class = "Date"),
serv_to_dt = structure(c(18262, 18263, 18267, 18267, 18268,
18269, 18269, 18275, 18276), class = "Date"), date_plus1 = structure(c(18263,
18264, 18268, 18268, 18269, 18270, 18270, 18276, 18277), class = "Date"),
identifier = c("1", "1", "2",
"2", "2", "2", "2",
"3", "3")), row.names = c(NA, -9L), class = c("data.table", "data.frame"))
My first step was to create a column that compares the lagged date from the previous row with the dates in the current row:
ex %>%
mutate(NewCol = ifelse((lag(serv_from_dt) == date_plus1 | lag(serv_from_dt) == serv_to_dt), "yes", "no"))
However, this code does not correctly say "yes" for a serv_from_dt that matches the previous row's date_plus1.
Thanks in advance for any help!
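The logic below uses cumsum, which increments only when serv_to_dt is not equal to the lagged value of either serv_from_dt or date_plus1. The row_number() == 1 term makes the cumulative sum start at 1.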
library(dplyr)
ex %>%
mutate(identifier = cumsum((serv_to_dt != lag(serv_from_dt) & serv_to_dt != lag(date_plus1)) | row_number() == 1))
Output
id serv_from_dt serv_to_dt date_plus1 identifier
1 8210109300002 2020-01-01 2020-01-01 2020-01-02 1
2 8210109300002 2020-01-02 2020-01-02 2020-01-03 1
3 8210109300002 2020-01-06 2020-01-06 2020-01-07 2
4 8210109300002 2020-01-06 2020-01-06 2020-01-07 2
5 8210109300002 2020-01-07 2020-01-07 2020-01-08 2
6 8210109300002 2020-01-08 2020-01-08 2020-01-09 2
7 8210109300002 2020-01-08 2020-01-08 2020-01-09 2
8 8210109300002 2020-01-14 2020-01-14 2020-01-15 3
9 8210109300002 2020-01-15 2020-01-15 2020-01-16 3
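To see which rows actually start a new group, the same condition can be split into intermediate columns. This is only a sketch of the cumsum approach above: the names ex_demo, steps, same_day, next_day and new_group are made up for illustration, and the data is rebuilt compactly from the dates in the question.

```r
library(dplyr)

# Rebuild the example data compactly (same dates as `ex` in the question).
dates <- as.Date(c(18262, 18263, 18267, 18267, 18268, 18269, 18269, 18275, 18276),
                 origin = "1970-01-01")
ex_demo <- data.frame(id = "8210109300002",
                      serv_from_dt = dates,
                      serv_to_dt = dates,
                      date_plus1 = dates + 1)

steps <- ex_demo %>%
  mutate(
    # Does this row end on the same day the previous row started? (NA on row 1)
    same_day = serv_to_dt == lag(serv_from_dt),
    # Does this row end the day after the previous row started? (NA on row 1)
    next_day = serv_to_dt == lag(date_plus1),
    # A group starts on the first row and on every row that matches neither test.
    new_group = row_number() == 1 | !(same_day | next_day),
    # Counting the group starts gives the identifier.
    identifier = cumsum(new_group)
  )

print(steps)
```

Rows where new_group is TRUE are the ones where the identifier increases; every other row inherits the identifier of the row above it.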
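Your logic is nearly there; two things are missing. The comparison has to use the previous row's values, so date_plus1 needs to be lagged as well, and the last step is a cumulative count with cumsum. A new identifier starts at every row that does not continue the previous one, so it is the "no" rows that get counted.
It's actually easier to skip the ifelse and leave the result as TRUE/FALSE instead of "yes"/"no", and to use a sensible default in lag() so that the first row always starts a group.
library(dplyr)
want %>%
  mutate(NewCol = cumsum(
    # TRUE when the row matches neither the previous serv_from_dt nor the previous date_plus1
    !(serv_to_dt == lag(serv_from_dt, default = first(serv_to_dt) - 1) |
      serv_to_dt == lag(date_plus1, default = first(serv_to_dt) - 1))
  ))
# id serv_from_dt serv_to_dt date_plus1 identifier NewCol
# 1 8210109300002 2020-01-01 2020-01-01 2020-01-02 1 1
# 2 8210109300002 2020-01-02 2020-01-02 2020-01-03 1 1
# 3 8210109300002 2020-01-06 2020-01-06 2020-01-07 2 2
# 4 8210109300002 2020-01-06 2020-01-06 2020-01-07 2 2
# 5 8210109300002 2020-01-07 2020-01-07 2020-01-08 2 2
# 6 8210109300002 2020-01-08 2020-01-08 2020-01-09 2 2
# 7 8210109300002 2020-01-08 2020-01-08 2020-01-09 2 2
# 8 8210109300002 2020-01-14 2020-01-14 2020-01-15 3 3
# 9 8210109300002 2020-01-15 2020-01-15 2020-01-16 3 3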
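With data.table:
library(data.table)
setDT(ex)
# A row starts a new group unless its serv_to_dt equals the previous row's
# serv_from_dt or that date plus one. The fill value used for the first row
# matches no date in this data, so the first row starts group 1.
ex[, identifier := cumsum(!(serv_to_dt == shift(serv_from_dt, 1, fill = FALSE) |
                            serv_to_dt == shift(serv_from_dt, 1, fill = FALSE) + 1))][]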
id serv_from_dt serv_to_dt date_plus1 identifier
1: 8210109300002 2020-01-01 2020-01-01 2020-01-02 1
2: 8210109300002 2020-01-02 2020-01-02 2020-01-03 1
3: 8210109300002 2020-01-06 2020-01-06 2020-01-07 2
4: 8210109300002 2020-01-06 2020-01-06 2020-01-07 2
5: 8210109300002 2020-01-07 2020-01-07 2020-01-08 2
6: 8210109300002 2020-01-08 2020-01-08 2020-01-09 2
7: 8210109300002 2020-01-08 2020-01-08 2020-01-09 2
8: 8210109300002 2020-01-14 2020-01-14 2020-01-15 3
9: 8210109300002 2020-01-15 2020-01-15 2020-01-16 3
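For comparison, the same rule can be written in base R without any packages by treating it as a gap in days: a row continues the previous group when its serv_to_dt is 0 or 1 days after the previous row's serv_from_dt. This is a sketch under that reading of the question; ex_base, prev_from, gap_days and new_group are illustrative names, not part of the original answers.

```r
# Rebuild the example data compactly (same dates as `ex` in the question).
dates <- as.Date(c(18262, 18263, 18267, 18267, 18268, 18269, 18269, 18275, 18276),
                 origin = "1970-01-01")
ex_base <- data.frame(id = "8210109300002",
                      serv_from_dt = dates,
                      serv_to_dt = dates,
                      date_plus1 = dates + 1)

# Previous row's serv_from_dt; the first row has no predecessor, so it gets NA.
prev_from <- c(as.Date(NA), head(ex_base$serv_from_dt, -1))

# Days between this row's serv_to_dt and the previous row's serv_from_dt.
gap_days <- as.numeric(ex_base$serv_to_dt - prev_from)

# A new group starts on the first row (NA gap) or when the gap is not 0 or 1 days.
new_group <- is.na(gap_days) | !(gap_days %in% c(0, 1))

ex_base$identifier <- cumsum(new_group)
print(ex_base)
```

Because date_plus1 is just serv_from_dt + 1 in this data, checking for a gap of 0 or 1 days covers both conditions from the question in one test.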