Using foverlaps to truncate episodes
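I'm trying to wrap my head around how to use data.table::foverlaps() to generate a new data.table. In one application, I want to use foverlaps to identify gaps and then use that information to truncate my original data.table.
Suppose I have a dataset (df1) covering 2 employees (id) of a company, with the date ranges (start_date and end_date) during which they worked on different projects (proj_id; "A", "B", or "C").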
    library(data.table)
    library(lubridate)

    df1 <- data.table(id = rep(1:2, each = 3),
                      start_date = ymd(c("1998-04-03", "1999-03-08", "2000-08-13",
                                         "2005-03-03", "2007-10-12", "2014-02-23")),
                      end_date = ymd(c("1999-03-07", "2000-08-12", "2021-04-23",
                                       "2007-09-05", "2014-02-22", "2019-05-04")),
                      proj_id = c("A", "B", "A", "B", "C", "A"))

    > df1
       id start_date   end_date proj_id
    1:  1 1998-04-03 1999-03-07       A
    2:  1 1999-03-08 2000-08-12       B
    3:  1 2000-08-13 2021-04-23       A
    4:  2 2005-03-03 2007-09-05       B
    5:  2 2007-10-12 2014-02-22       C
    6:  2 2014-02-23 2019-05-04       A
Now I have another dataset (df2) that specifies the periods I want to truncate from df1.
    df2 <- data.table(id = 1:2,
                      start_date = ymd("1998-07-20", "2006-06-12"),
                      end_date = ymd("1998-08-15", "2016-04-08"))

    > df2
       id start_date   end_date
    1:  1 1998-07-20 1998-08-15
    2:  2 2006-06-12 2016-04-08
I can then use data.table::foverlaps() to identify the overlapping episodes:
    > setkey(df1, id, start_date, end_date)
    > foverlaps(df2, df1, type = "any",
    +           by.x = c("id", "start_date", "end_date"))
       id start_date   end_date proj_id i.start_date i.end_date
    1:  1 1998-04-03 1999-03-07       A   1998-07-20 1998-08-15
    2:  2 2005-03-03 2007-09-05       B   2006-06-12 2016-04-08
    3:  2 2007-10-12 2014-02-22       C   2006-06-12 2016-04-08
    4:  2 2014-02-23 2019-05-04       A   2006-06-12 2016-04-08
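To make the overlap itself visible, the matched rows can be clipped to the part of each episode that falls inside the df2 window. This is only a sketch: the data is rebuilt with base as.Date() so the block runs without lubridate, and the names ov, overlap_start, overlap_end and overlap_days are made up for illustration and do not appear in the original post.

```r
library(data.table)

# Same example data as in the question, rebuilt with base as.Date()
df1 <- data.table(
  id         = rep(1:2, each = 3),
  start_date = as.Date(c("1998-04-03", "1999-03-08", "2000-08-13",
                         "2005-03-03", "2007-10-12", "2014-02-23")),
  end_date   = as.Date(c("1999-03-07", "2000-08-12", "2021-04-23",
                         "2007-09-05", "2014-02-22", "2019-05-04")),
  proj_id    = c("A", "B", "A", "B", "C", "A")
)
df2 <- data.table(
  id         = 1:2,
  start_date = as.Date(c("1998-07-20", "2006-06-12")),
  end_date   = as.Date(c("1998-08-15", "2016-04-08"))
)

# foverlaps() needs the second table keyed, with the interval columns last
setkey(df1, id, start_date, end_date)
ov <- foverlaps(df2, df1, type = "any",
                by.x = c("id", "start_date", "end_date"))

# Clip each matched episode to the df2 window (illustrative column names)
ov[, overlap_start := pmax(start_date, i.start_date)]
ov[, overlap_end   := pmin(end_date, i.end_date)]

# Inclusive number of days covered by the overlap
ov[, overlap_days := as.integer(overlap_end - overlap_start) + 1L]
ov[]
```

A row where overlap_start equals start_date and overlap_end equals end_date is an episode that lies entirely inside the window, which is the case for project C of employee 2.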
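I now want to use this result to generate a new version of df1 in which the episodes are truncated by cutting out the periods identified above. The data.table I want is: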
       id start_date   end_date proj_id
    1:  1 1998-04-03 1998-07-19       A
    2:  1 1998-08-16 1999-03-07       A
    3:  1 1999-03-08 2000-08-12       B
    4:  1 2000-08-13 2021-04-23       A
    5:  2 2005-03-03 2006-06-11       B
    6:  2 2016-04-09 2019-05-04       A
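There may be better alternatives, but this might work based on your foverlaps result.
Suppose you create another data.table named df3 from the foverlaps result:

    df3 <- foverlaps(df2, df1, type = "any", by.x = c("id", "start_date", "end_date"))

You can then loop over each row and, depending on the overlap, add 0, 1, or 2 date ranges (the episode is truncated at its end, at its start, or the whole range is covered).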
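    # Empty result table; column types match df1 (proj_id is character, id is integer)
    dt <- data.table(start_date = as.Date(character()), end_date = as.Date(character()),
                     id = integer(), proj_id = character())
    for (i in seq_len(nrow(df3))) {
      # Keep the part of the episode that lies before the truncation window
      if (df3$start_date[i] < df3$i.start_date[i]) {
        dt <- rbind(dt, data.table(start_date = df3$start_date[i], end_date = df3$i.start_date[i] - 1,
                                   id = df3$id[i], proj_id = df3$proj_id[i]))
      }
      # Keep the part of the episode that lies after the truncation window
      if (df3$end_date[i] > df3$i.end_date[i]) {
        dt <- rbind(dt, data.table(start_date = df3$i.end_date[i] + 1, end_date = df3$end_date[i],
                                   id = df3$id[i], proj_id = df3$proj_id[i]))
      }
    }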
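The same two conditions can also be applied to whole columns at once, which avoids growing a table row by row. The following is a sketch rather than the answer's own method: df3 is written out by hand so the block is self-contained, and left, right and dt_vec are illustrative names. Like the loop, it assumes each df1 episode is matched by at most one df2 window.

```r
library(data.table)

# df3 as produced by the foverlaps() call above, written out by hand
df3 <- data.table(
  id           = c(1L, 2L, 2L, 2L),
  start_date   = as.Date(c("1998-04-03", "2005-03-03", "2007-10-12", "2014-02-23")),
  end_date     = as.Date(c("1999-03-07", "2007-09-05", "2014-02-22", "2019-05-04")),
  proj_id      = c("A", "B", "C", "A"),
  i.start_date = as.Date(c("1998-07-20", "2006-06-12", "2006-06-12", "2006-06-12")),
  i.end_date   = as.Date(c("1998-08-15", "2016-04-08", "2016-04-08", "2016-04-08"))
)

# Piece of each episode that lies before the truncation window
left <- df3[start_date < i.start_date,
            .(id, start_date, end_date = i.start_date - 1, proj_id)]

# Piece of each episode that lies after the truncation window
right <- df3[end_date > i.end_date,
             .(id, start_date = i.end_date + 1, end_date, proj_id)]

# Episodes fully covered by the window appear in neither piece
dt_vec <- rbind(left, right)[order(id, start_date)]
dt_vec
```

If one episode could be hit by several windows, both this sketch and the loop would need extra handling, since each matched row is truncated against a single window only.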
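Finally, you can remove the rows matched by foverlaps from the initial df1 (using fsetdiff), since new ranges have already been worked out for them. Then you can add the new ranges.

    rbind(fsetdiff(df1, df3[, 1:4]), dt)[order(id, start_date)]

Output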
       id start_date   end_date proj_id
    1:  1 1998-04-03 1998-07-19       A
    2:  1 1998-08-16 1999-03-07       A
    3:  1 1999-03-08 2000-08-12       B
    4:  1 2000-08-13 2021-04-23       A
    5:  2 2005-03-03 2006-06-11       B
    6:  2 2016-04-09 2019-05-04       A
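One way to check a result like this is to run foverlaps once more against the truncated table and confirm that no episode still touches a df2 window. This check is not part of the original answer; it is a sketch in which the output above is typed in by hand, and result, recheck and leftover are illustrative names.

```r
library(data.table)

# The truncated episodes from the output above, typed in by hand
result <- data.table(
  id         = c(1L, 1L, 1L, 1L, 2L, 2L),
  start_date = as.Date(c("1998-04-03", "1998-08-16", "1999-03-08",
                         "2000-08-13", "2005-03-03", "2016-04-09")),
  end_date   = as.Date(c("1998-07-19", "1999-03-07", "2000-08-12",
                         "2021-04-23", "2006-06-11", "2019-05-04")),
  proj_id    = c("A", "A", "B", "A", "B", "A")
)
df2 <- data.table(
  id         = 1:2,
  start_date = as.Date(c("1998-07-20", "2006-06-12")),
  end_date   = as.Date(c("1998-08-15", "2016-04-08"))
)

setkey(result, id, start_date, end_date)

# With the default nomatch = NA, a df2 row without any overlap comes back
# with NA in the columns taken from `result`
recheck <- foverlaps(df2, result, type = "any",
                     by.x = c("id", "start_date", "end_date"))

# Rows with a non-missing proj_id would be episodes that still overlap a window
leftover <- recheck[!is.na(proj_id)]
nrow(leftover)
```

An empty leftover table means every remaining episode starts after or ends before the corresponding window.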