Lag function in R. Count previous occurrences of an event in a data frame
I'm new to R.
I have a data frame like this:
p_id start_date ch end_date
5713729 01/10/2014 1 20/03/2015
5713729 01/04/2016 0 NA
5713731 01/12/2010 1 03/02/2012
5713731 01/04/2013 1 30/10/2014
5713731 01/01/2015 0 NA
5713735 01/07/2012 0 NA
5713736 01/07/2007 1 30/06/2012
5713736 01/04/2016 0 NA
5713737 01/06/2016 0 NA
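For each p_id, I need to count, in every row, how many times the event "ch" has occurred in earlier rows.
The data frame therefore has to be sorted by p_id and date (ascending).
First I tried the ifelse function: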
# sort by p_id, then by date
library(dplyr)
data <- data %>% arrange(p_id, start_date, end_date)
# initialize count
data$count_ch_prev <- 0
# count (not good...)
data$count_ch_prev <- ifelse(data$p_id == lag(data$p_id, 1),
                             lag(data$count_ch_prev, 1) + lag(data$ch, 1),
                             data$count_ch_prev)
The result is:
p_id start_date ch end_date count_ch_prev
5713729 01/10/2014 1 20/03/2015 NA
5713729 01/04/2016 0 NA 1
5713731 01/12/2010 1 03/02/2012 0
5713731 01/04/2013 1 30/10/2014 1
5713731 01/01/2015 0 NA 1
5713735 01/07/2012 0 NA 0
5713736 01/07/2007 1 30/06/2012 0
5713736 01/04/2016 0 NA 1
5713737 01/06/2016 0 NA 0
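Looking at similar questions, I realized that this function is vectorized: it does not work through the data row by row, it evaluates all rows at once. That is why the count never goes above 1 (see the third row of p_id 5713731).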
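To make the vectorized behaviour concrete, here is a minimal base R sketch for the three rows of p_id 5713731. `prev()` is a hypothetical helper, not part of the original code; it stands in for dplyr's `lag()` so the snippet runs without any packages.

```r
# Hypothetical stand-in for dplyr::lag(): shift a vector down by one position
prev <- function(x) c(NA, head(x, -1))

ch    <- c(1, 1, 0)   # ch values for p_id 5713731, in date order
count <- c(0, 0, 0)   # the freshly initialised count_ch_prev column

# The whole right-hand side is evaluated once, while count is still all zeros,
# so the result is just the previous row's ch, never a running total.
one_pass <- prev(count) + prev(ch)
print(one_pass)
```

The last element is 1 rather than the expected 2, because the updated count from the second row is never fed into the third.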
My expected result is this:
p_id start_date ch end_date count_ch_prev
5713729 01/10/2014 1 20/03/2015 0
5713729 01/04/2016 0 NA 1
5713731 01/12/2010 1 03/02/2012 0
5713731 01/04/2013 1 30/10/2014 1
5713731 01/01/2015 0 NA 2
5713735 01/07/2012 0 NA 0
5713736 01/07/2007 1 30/06/2012 0
5713736 01/04/2016 0 NA 1
5713737 01/06/2016 0 NA 0
I also tried a while loop:
data$count_ch_prev <- 0
while (data$p_id == lag(data$p_id,1)) {
data$count_ch_prev <- lag(data$count_ch_prev) + lag(data$ch)
}
But I got the same "as a whole" result. Which function should I use?
Code to reproduce:
p_id <- c(5713729,5713729,5713731,5713731,5713731,5713735,5713736,5713736,5713737)
start_date <- as.Date(c('2014-10-01','2016-04-01','2010-12-01','2013-04-01',
                        '2015-01-01','2012-07-01','2007-07-01','2016-04-01','2016-06-01'))
end_date <- as.Date(c('2015-03-20',NA,'2012-02-03','2014-10-30',NA,NA,
                      '2012-06-30',NA,NA))
ch <- c(1,0,1,1,0,0,1,0,0)
data <- data.frame(p_id,start_date,ch,end_date)
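I think you can use dplyr: group by p_id, then combine lag and cumsum. cumsum(ch) gives the running count including the current row, and lag shifts it down one row within each group, so each row only sees the events before it: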
library(dplyr)
data %>%
group_by(p_id) %>%
mutate(count_ch_prev = lag(cumsum(ch), default = 0))
Output:
# A tibble: 9 x 5
# Groups: p_id [5]
p_id start_date ch end_date count_ch_prev
<dbl> <date> <dbl> <date> <dbl>
1 5713729 2014-10-01 1 2015-03-20 0
2 5713729 2016-04-01 0 NA 1
3 5713731 2010-12-01 1 2012-02-03 0
4 5713731 2013-04-01 1 2014-10-30 1
5 5713731 2015-01-01 0 NA 2
6 5713735 2012-07-01 0 NA 0
7 5713736 2007-07-01 1 2012-06-30 0
8 5713736 2016-04-01 0 NA 1
9 5713737 2016-06-01 0 NA 0
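As a rough illustration of what happens inside one group, here is the same computation done by hand in base R for p_id 5713731. The shift is written with `c(0, head(..., -1))`, which is only a stand-in for `lag(..., default = 0)` so that the sketch needs no packages.

```r
# ch values for p_id 5713731, in date order
ch <- c(1, 1, 0)

# Running total of events up to and including each row
running <- cumsum(ch)

# Shift down by one and start at 0, so each row only counts earlier events
count_ch_prev <- c(0, head(running, -1))
print(count_ch_prev)
```

Note that the grouped version relies on the rows already being in date order within each p_id, so sort first (for example with arrange) if they are not.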
A data.table alternative:
library(data.table)
dt <- data.table(data)
dt[, count_ch_prev := shift(cumsum(ch), fill = 0), by = p_id]
Output:
> dt
p_id start_date ch end_date count_ch_prev
1: 5713729 2014-10-01 1 2015-03-20 0
2: 5713729 2016-04-01 0 <NA> 1
3: 5713731 2010-12-01 1 2012-02-03 0
4: 5713731 2013-04-01 1 2014-10-30 1
5: 5713731 2015-01-01 0 <NA> 2
6: 5713735 2012-07-01 0 <NA> 0
7: 5713736 2007-07-01 1 2012-06-30 0
8: 5713736 2016-04-01 0 <NA> 1
9: 5713737 2016-06-01 0 <NA> 0
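If you would rather avoid extra packages, something along these lines should also work in base R. This is a sketch, not one of the approaches above: it uses `ave()` to apply a function per p_id, and relies on the fact that `cumsum(x) - x` is the running total excluding the current row. The end_date column is left out because it is not needed for the count.

```r
p_id <- c(5713729, 5713729, 5713731, 5713731, 5713731,
          5713735, 5713736, 5713736, 5713737)
start_date <- as.Date(c('2014-10-01', '2016-04-01', '2010-12-01',
                        '2013-04-01', '2015-01-01', '2012-07-01',
                        '2007-07-01', '2016-04-01', '2016-06-01'))
ch <- c(1, 0, 1, 1, 0, 0, 1, 0, 0)
data <- data.frame(p_id, start_date, ch)

# Sort by p_id and date so "previous" means "earlier in time"
data <- data[order(data$p_id, data$start_date), ]

# Per p_id: running total of ch minus the current row's own ch
data$count_ch_prev <- ave(data$ch, data$p_id,
                          FUN = function(x) cumsum(x) - x)
print(data)
```

Subtracting `x` avoids the explicit shift, which only works here because the count is a plain sum of the ch column.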