R 滞后于有条件的日期
R lagging through dates with conditions
我有一个数据集,其中基本上是与个人相关的时间片段,可以重叠(即一个片段可能开始得晚,但比前一个片段结束得早)。由于这个重叠问题,一旦它们按 start_date.
排序,我就很难获得序列中的最新 end_date
我一直在使用的代码在一定程度上有效,但我必须按照下面的代码所示重复。出于这个原因,我想我需要一些循环函数来完成一个过程,直到满足条件(end_date 晚于上一行的 end_date,或者 id 表示一个新的个体) .
library(dplyr)
## creates example dataframe
id <- c("A","A","A","A","A","A","A","A","A","A",
"A","A","A","B","B","B","B","B","B")
start_date <- as.Date(c("2004-01-23","2005-03-31","2005-03-31","2005-12-20","2005-12-20",
"2006-04-03","2007-11-26","2010-10-12","2011-08-08","2012-06-26",
"2012-06-26","2012-09-11","2012-10-03","2003-12-01","2006-02-28",
"2012-04-16","2012-08-30","2012-09-19","2012-09-28"))
end_date <- as.Date(c("2009-06-30","2005-09-17","2005-09-19","2005-12-30","2005-12-30",
"2006-06-19","2009-06-30","2010-11-05","2011-11-18","2012-06-26",
"2012-06-26","2012-09-11","2014-04-01","2012-08-29","2006-02-28",
"2012-04-16","2012-09-28","2013-10-11","2013-07-19"))
target_date <- as.Date(c(NA,"2009-06-30","2009-06-30","2009-06-30","2009-06-30","2009-06-30",
"2009-06-30","2009-06-30","2010-11-05","2011-11-18","2012-06-26",
"2012-06-26","2012-09-11",NA,"2012-08-29","2012-08-29","2012-08-29",
"2012-09-28","2013-10-11"))
df <- data.frame(id, start_date, end_date, target_date)
使用 让我很接近,但我认为它需要在某处添加延迟以复制 target_date...
df <- df %>%
arrange(id, start_date) %>%
group_by(id) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start_date)) >
cummax(as.numeric(end_date)))[-n()])) %>%
group_by(id, indx) %>%
mutate(latest_date = max(end_date)) %>%
ungroup()
我会给这个问题一个不同于使用 lag
的方法。问题是您的数据中存在可以具有多个级别的层次结构。
在下面的代码中,我尝试查找当前行所属的其他剧集(即完全位于另一个剧集中)。
然后我用 min(start_date)
和 max(end_date)
来定义最外面的情节。
library(dplyr)
library(tidyr)
library(purrr)
df <- data.frame(id, start_date, end_date, target_date) %>%
mutate(episode = row_number())
df %>%
select(id, episode,start_date, end_date) %>%
inner_join(df %>% select(id, start_date_outer = start_date, end_date_outer = end_date,outer_episode = episode), by = 'id') %>%
group_by(id,episode,start_date, end_date) %>%
nest() %>%
mutate(match = pmap(list(data,start_date,end_date), ~ ..1 %>% filter(start_date_outer <= ..2,
end_date_outer >= ..3))) %>%
mutate(start_date_parent = as.Date(map_dbl(match, ~ min(.x$start_date_outer)),origin = '1970-01-01'),
end_date_parent = as.Date(map_dbl(match, ~max(.x$end_date_outer)),origin = '1970-01-01'))
这导致
# A tibble: 19 x 8
id episode start_date end_date data match start_date_parent end_date_parent
<fct> <int> <date> <date> <list> <list> <date> <date>
1 A 1 2004-01-23 2009-06-30 <tibble [13 x 3]> <tibble [1 x 3]> 2004-01-23 2009-06-30
2 A 2 2005-03-31 2005-09-17 <tibble [13 x 3]> <tibble [3 x 3]> 2004-01-23 2009-06-30
3 A 3 2005-03-31 2005-09-19 <tibble [13 x 3]> <tibble [2 x 3]> 2004-01-23 2009-06-30
4 A 4 2005-12-20 2005-12-30 <tibble [13 x 3]> <tibble [3 x 3]> 2004-01-23 2009-06-30
5 A 5 2005-12-20 2005-12-30 <tibble [13 x 3]> <tibble [3 x 3]> 2004-01-23 2009-06-30
6 A 6 2006-04-03 2006-06-19 <tibble [13 x 3]> <tibble [2 x 3]> 2004-01-23 2009-06-30
7 A 7 2007-11-26 2009-06-30 <tibble [13 x 3]> <tibble [2 x 3]> 2004-01-23 2009-06-30
8 A 8 2010-10-12 2010-11-05 <tibble [13 x 3]> <tibble [1 x 3]> 2010-10-12 2010-11-05
9 A 9 2011-08-08 2011-11-18 <tibble [13 x 3]> <tibble [1 x 3]> 2011-08-08 2011-11-18
10 A 10 2012-06-26 2012-06-26 <tibble [13 x 3]> <tibble [2 x 3]> 2012-06-26 2012-06-26
11 A 11 2012-06-26 2012-06-26 <tibble [13 x 3]> <tibble [2 x 3]> 2012-06-26 2012-06-26
12 A 12 2012-09-11 2012-09-11 <tibble [13 x 3]> <tibble [1 x 3]> 2012-09-11 2012-09-11
13 A 13 2012-10-03 2014-04-01 <tibble [13 x 3]> <tibble [1 x 3]> 2012-10-03 2014-04-01
14 B 14 2003-12-01 2012-08-29 <tibble [6 x 3]> <tibble [1 x 3]> 2003-12-01 2012-08-29
15 B 15 2006-02-28 2006-02-28 <tibble [6 x 3]> <tibble [2 x 3]> 2003-12-01 2012-08-29
16 B 16 2012-04-16 2012-04-16 <tibble [6 x 3]> <tibble [2 x 3]> 2003-12-01 2012-08-29
17 B 17 2012-08-30 2012-09-28 <tibble [6 x 3]> <tibble [1 x 3]> 2012-08-30 2012-09-28
18 B 18 2012-09-19 2013-10-11 <tibble [6 x 3]> <tibble [1 x 3]> 2012-09-19 2013-10-11
19 B 19 2012-09-28 2013-07-19 <tibble [6 x 3]> <tibble [2 x 3]> 2012-09-19 2013-10-11
我们可以看到id A的前7集是第1集的一部分
其余的都是独立的。
另一种选择是使用 sqldf
,例如,如果数据集变大。
require(sqldf)
result <- sqldf("select
df1.id, df1.episode, min(df2.start_date) AS start_date, max(df2.end_date) AS end_date
from df AS df1
inner join df AS df2
on df1.id = df2.id
and df1.start_date >= df2.start_date
and df1.end_date <= df2.end_date
group by df1.id, df1.episode
")
result %>%
select(id, start_date, end_date) %>%
distinct()
结果:
id start_date end_date
1 A 2004-01-23 2009-06-30
2 A 2010-10-12 2010-11-05
3 A 2011-08-08 2011-11-18
4 A 2012-06-26 2012-06-26
5 A 2012-09-11 2012-09-11
6 A 2012-10-03 2014-04-01
7 B 2003-12-01 2012-08-29
8 B 2012-08-30 2012-09-28
9 B 2012-09-19 2013-10-11
如果我理解正确,OP 想要识别重叠的剧集,这些剧集被较长的剧集完全包含。此外,拥抱期的结束日期应出现在下一行(在 id
内)
这可以通过 :
的变体来实现
df %>%
arrange(id, start_date) %>% # df must be ordered appropriately
group_by(id) %>% # create new grouping variable
mutate(grp = cumsum(cummax(lag(as.integer(end_date), default = 0)) < as.integer(end_date))) %>%
group_by(id, grp) %>%
mutate(target_date_new = max(end_date)) %>%
group_by(id) %>% # re-group ...
mutate(target_date_new = lag(target_date_new)) # ... for lagging
# A tibble: 19 x 6
# Groups: id [2]
id start_date end_date target_date grp target_date_new
<fct> <date> <date> <date> <int> <date>
1 A 2004-01-23 2009-06-30 NA 1 NA
2 A 2005-03-31 2005-09-17 2009-06-30 1 2009-06-30
3 A 2005-03-31 2005-09-19 2009-06-30 1 2009-06-30
4 A 2005-12-20 2005-12-30 2009-06-30 1 2009-06-30
5 A 2005-12-20 2005-12-30 2009-06-30 1 2009-06-30
6 A 2006-04-03 2006-06-19 2009-06-30 1 2009-06-30
7 A 2007-11-26 2009-06-30 2009-06-30 1 2009-06-30
8 A 2010-10-12 2010-11-05 2009-06-30 2 2009-06-30
9 A 2011-08-08 2011-11-18 2010-11-05 3 2010-11-05
10 A 2012-06-26 2012-06-26 2011-11-18 4 2011-11-18
11 A 2012-06-26 2012-06-26 2012-06-26 4 2012-06-26
12 A 2012-09-11 2012-09-11 2012-06-26 5 2012-06-26
13 A 2012-10-03 2014-04-01 2012-09-11 6 2012-09-11
14 B 2003-12-01 2012-08-29 NA 1 NA
15 B 2006-02-28 2006-02-28 2012-08-29 1 2012-08-29
16 B 2012-04-16 2012-04-16 2012-08-29 1 2012-08-29
17 B 2012-08-30 2012-09-28 2012-08-29 2 2012-08-29
18 B 2012-09-19 2013-10-11 2012-09-28 3 2012-09-28
19 B 2012-09-28 2013-07-19 2013-10-11 3 2013-10-11
此处比较了 end_date
s,因为 OP 想要检测完全拥抱的时期。因此,每当出现比之前任何 end_date
都大的 end_date
时,剧集计数器 grp
就会提前,因为当前剧集未完全包含在之前的期间中。
由于 cummax()
没有针对类型 Date
的对象的方法,日期被强制转换为整数值。
我有一个数据集,其中基本上是与个人相关的时间片段,可以重叠(即一个片段可能开始得晚,但比前一个片段结束得早)。由于这个重叠问题,一旦它们按 start_date.
排序,我就很难获得序列中的最新 end_date我一直在使用的代码在一定程度上有效,但我必须按照下面的代码所示重复。出于这个原因,我想我需要一些循环函数来完成一个过程,直到满足条件(end_date 晚于上一行的 end_date,或者 id 表示一个新的个体) .
library(dplyr)
## creates example dataframe
id <- c("A","A","A","A","A","A","A","A","A","A",
"A","A","A","B","B","B","B","B","B")
start_date <- as.Date(c("2004-01-23","2005-03-31","2005-03-31","2005-12-20","2005-12-20",
"2006-04-03","2007-11-26","2010-10-12","2011-08-08","2012-06-26",
"2012-06-26","2012-09-11","2012-10-03","2003-12-01","2006-02-28",
"2012-04-16","2012-08-30","2012-09-19","2012-09-28"))
end_date <- as.Date(c("2009-06-30","2005-09-17","2005-09-19","2005-12-30","2005-12-30",
"2006-06-19","2009-06-30","2010-11-05","2011-11-18","2012-06-26",
"2012-06-26","2012-09-11","2014-04-01","2012-08-29","2006-02-28",
"2012-04-16","2012-09-28","2013-10-11","2013-07-19"))
target_date <- as.Date(c(NA,"2009-06-30","2009-06-30","2009-06-30","2009-06-30","2009-06-30",
"2009-06-30","2009-06-30","2010-11-05","2011-11-18","2012-06-26",
"2012-06-26","2012-09-11",NA,"2012-08-29","2012-08-29","2012-08-29",
"2012-09-28","2013-10-11"))
df <- data.frame(id, start_date, end_date, target_date)
使用
df <- df %>%
arrange(id, start_date) %>%
group_by(id) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start_date)) >
cummax(as.numeric(end_date)))[-n()])) %>%
group_by(id, indx) %>%
mutate(latest_date = max(end_date)) %>%
ungroup()
我会给这个问题一个不同于使用 lag
的方法。问题是您的数据中存在可以具有多个级别的层次结构。
在下面的代码中,我尝试查找当前行所属的其他剧集(即完全位于另一个剧集中)。
然后我用 min(start_date)
和 max(end_date)
来定义最外面的情节。
library(dplyr)
library(tidyr)
library(purrr)
df <- data.frame(id, start_date, end_date, target_date) %>%
mutate(episode = row_number())
df %>%
select(id, episode,start_date, end_date) %>%
inner_join(df %>% select(id, start_date_outer = start_date, end_date_outer = end_date,outer_episode = episode), by = 'id') %>%
group_by(id,episode,start_date, end_date) %>%
nest() %>%
mutate(match = pmap(list(data,start_date,end_date), ~ ..1 %>% filter(start_date_outer <= ..2,
end_date_outer >= ..3))) %>%
mutate(start_date_parent = as.Date(map_dbl(match, ~ min(.x$start_date_outer)),origin = '1970-01-01'),
end_date_parent = as.Date(map_dbl(match, ~max(.x$end_date_outer)),origin = '1970-01-01'))
这导致
# A tibble: 19 x 8
id episode start_date end_date data match start_date_parent end_date_parent
<fct> <int> <date> <date> <list> <list> <date> <date>
1 A 1 2004-01-23 2009-06-30 <tibble [13 x 3]> <tibble [1 x 3]> 2004-01-23 2009-06-30
2 A 2 2005-03-31 2005-09-17 <tibble [13 x 3]> <tibble [3 x 3]> 2004-01-23 2009-06-30
3 A 3 2005-03-31 2005-09-19 <tibble [13 x 3]> <tibble [2 x 3]> 2004-01-23 2009-06-30
4 A 4 2005-12-20 2005-12-30 <tibble [13 x 3]> <tibble [3 x 3]> 2004-01-23 2009-06-30
5 A 5 2005-12-20 2005-12-30 <tibble [13 x 3]> <tibble [3 x 3]> 2004-01-23 2009-06-30
6 A 6 2006-04-03 2006-06-19 <tibble [13 x 3]> <tibble [2 x 3]> 2004-01-23 2009-06-30
7 A 7 2007-11-26 2009-06-30 <tibble [13 x 3]> <tibble [2 x 3]> 2004-01-23 2009-06-30
8 A 8 2010-10-12 2010-11-05 <tibble [13 x 3]> <tibble [1 x 3]> 2010-10-12 2010-11-05
9 A 9 2011-08-08 2011-11-18 <tibble [13 x 3]> <tibble [1 x 3]> 2011-08-08 2011-11-18
10 A 10 2012-06-26 2012-06-26 <tibble [13 x 3]> <tibble [2 x 3]> 2012-06-26 2012-06-26
11 A 11 2012-06-26 2012-06-26 <tibble [13 x 3]> <tibble [2 x 3]> 2012-06-26 2012-06-26
12 A 12 2012-09-11 2012-09-11 <tibble [13 x 3]> <tibble [1 x 3]> 2012-09-11 2012-09-11
13 A 13 2012-10-03 2014-04-01 <tibble [13 x 3]> <tibble [1 x 3]> 2012-10-03 2014-04-01
14 B 14 2003-12-01 2012-08-29 <tibble [6 x 3]> <tibble [1 x 3]> 2003-12-01 2012-08-29
15 B 15 2006-02-28 2006-02-28 <tibble [6 x 3]> <tibble [2 x 3]> 2003-12-01 2012-08-29
16 B 16 2012-04-16 2012-04-16 <tibble [6 x 3]> <tibble [2 x 3]> 2003-12-01 2012-08-29
17 B 17 2012-08-30 2012-09-28 <tibble [6 x 3]> <tibble [1 x 3]> 2012-08-30 2012-09-28
18 B 18 2012-09-19 2013-10-11 <tibble [6 x 3]> <tibble [1 x 3]> 2012-09-19 2013-10-11
19 B 19 2012-09-28 2013-07-19 <tibble [6 x 3]> <tibble [2 x 3]> 2012-09-19 2013-10-11
我们可以看到id A的前7集是第1集的一部分 其余的都是独立的。
另一种选择是使用 sqldf
,例如,如果数据集变大。
require(sqldf)
result <- sqldf("select
df1.id, df1.episode, min(df2.start_date) AS start_date, max(df2.end_date) AS end_date
from df AS df1
inner join df AS df2
on df1.id = df2.id
and df1.start_date >= df2.start_date
and df1.end_date <= df2.end_date
group by df1.id, df1.episode
")
result %>%
select(id, start_date, end_date) %>%
distinct()
结果:
id start_date end_date
1 A 2004-01-23 2009-06-30
2 A 2010-10-12 2010-11-05
3 A 2011-08-08 2011-11-18
4 A 2012-06-26 2012-06-26
5 A 2012-09-11 2012-09-11
6 A 2012-10-03 2014-04-01
7 B 2003-12-01 2012-08-29
8 B 2012-08-30 2012-09-28
9 B 2012-09-19 2013-10-11
如果我理解正确,OP 想要识别重叠的剧集,这些剧集被较长的剧集完全包含。此外,拥抱期的结束日期应出现在下一行(在 id
内)
这可以通过
df %>%
arrange(id, start_date) %>% # df must be ordered appropriately
group_by(id) %>% # create new grouping variable
mutate(grp = cumsum(cummax(lag(as.integer(end_date), default = 0)) < as.integer(end_date))) %>%
group_by(id, grp) %>%
mutate(target_date_new = max(end_date)) %>%
group_by(id) %>% # re-group ...
mutate(target_date_new = lag(target_date_new)) # ... for lagging
# A tibble: 19 x 6 # Groups: id [2] id start_date end_date target_date grp target_date_new <fct> <date> <date> <date> <int> <date> 1 A 2004-01-23 2009-06-30 NA 1 NA 2 A 2005-03-31 2005-09-17 2009-06-30 1 2009-06-30 3 A 2005-03-31 2005-09-19 2009-06-30 1 2009-06-30 4 A 2005-12-20 2005-12-30 2009-06-30 1 2009-06-30 5 A 2005-12-20 2005-12-30 2009-06-30 1 2009-06-30 6 A 2006-04-03 2006-06-19 2009-06-30 1 2009-06-30 7 A 2007-11-26 2009-06-30 2009-06-30 1 2009-06-30 8 A 2010-10-12 2010-11-05 2009-06-30 2 2009-06-30 9 A 2011-08-08 2011-11-18 2010-11-05 3 2010-11-05 10 A 2012-06-26 2012-06-26 2011-11-18 4 2011-11-18 11 A 2012-06-26 2012-06-26 2012-06-26 4 2012-06-26 12 A 2012-09-11 2012-09-11 2012-06-26 5 2012-06-26 13 A 2012-10-03 2014-04-01 2012-09-11 6 2012-09-11 14 B 2003-12-01 2012-08-29 NA 1 NA 15 B 2006-02-28 2006-02-28 2012-08-29 1 2012-08-29 16 B 2012-04-16 2012-04-16 2012-08-29 1 2012-08-29 17 B 2012-08-30 2012-09-28 2012-08-29 2 2012-08-29 18 B 2012-09-19 2013-10-11 2012-09-28 3 2012-09-28 19 B 2012-09-28 2013-07-19 2013-10-11 3 2013-10-11
此处比较了 end_date
s,因为 OP 想要检测完全拥抱的时期。因此,每当出现比之前任何 end_date
都大的 end_date
时,剧集计数器 grp
就会提前,因为当前剧集未完全包含在之前的期间中。
由于 cummax()
没有针对类型 Date
的对象的方法,日期被强制转换为整数值。