In R: subset so that I only have the observations 3 years prior to and after an event
I found the following link with an answer that I should be able to apply, but it does not seem to work:
Below is an example of my dataset:
companyID year status
1 2000 1
1 2001 1
1 2002 1
1 2003 1
1 2004 0
1 2005 2
1 2006 2
1 2007 2
2 2012 1
2 2013 0
2 2014 2
2 2015 2
2 2016 2
3 2008 1
3 2009 1
3 2010 1
3 2011 1
3 2012 1
3 2013 0
3 2014 2
3 2015 2
3 2016 2
3 2017 2
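I would like to obtain the following observations, so that I keep only the 3 years before the event, the event year itself (status 0), and the 3 years after the event: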
companyID year status
1 2001 1
1 2002 1
1 2003 1
1 2004 0
1 2005 2
1 2006 2
1 2007 2
3 2010 1
3 2011 1
3 2012 1
3 2013 0
3 2014 2
3 2015 2
3 2016 2
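Would it be easier if I provided a variable showing the event date? That variable would hold the date in the same observation (year) where status is 0.
Thanks in advance for your help!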
Try this with dplyr and tidyr:
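library(dplyr)
library(tidyr)
df %>%
  # record the event year on the row where status is 0
  mutate(ref_yr = case_when(status == 0 ~ year,
                            TRUE ~ NA_integer_)) %>%
  group_by(companyID) %>%
  # copy the event year to every row of the same company
  fill(ref_yr, .direction = "downup") %>%
  mutate(yr_diff = abs(ref_yr - year)) %>%
  # keep rows within 3 years of the event; companies with an
  # incomplete window (company 2 here) are kept as well
  filter(yr_diff <= 3) %>%
  select(-c(ref_yr, yr_diff))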
#> # A tibble: 19 x 3
#> # Groups: companyID [3]
#> companyID year status
#> <int> <int> <int>
#> 1 1 2001 1
#> 2 1 2002 1
#> 3 1 2003 1
#> 4 1 2004 0
#> 5 1 2005 2
#> 6 1 2006 2
#> 7 1 2007 2
#> 8 2 2012 1
#> 9 2 2013 0
#> 10 2 2014 2
#> 11 2 2015 2
#> 12 2 2016 2
#> 13 3 2010 1
#> 14 3 2011 1
#> 15 3 2012 1
#> 16 3 2013 0
#> 17 3 2014 2
#> 18 3 2015 2
#> 19 3 2016 2
Data
df <- structure(list(companyID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L),
year = c(2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L,
2007L, 2012L, 2013L, 2014L, 2015L, 2016L, 2008L, 2009L, 2010L,
2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L), status = c(1L,
1L, 1L, 1L, 0L, 2L, 2L, 2L, 1L, 0L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 0L, 2L, 2L, 2L, 2L)), class = "data.frame", row.names = c(NA,
-23L))
Created on 2021-04-25 by the reprex package (v2.0.0)
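If your data frame is df:
zeros <- which(df$status == 0)
# rows 3 before to 3 after each event (assumes one row per year, sorted by company and year)
calcrows <- lapply(zeros, function(x) (x - 3):(x + 3))
# drop windows that run past the data or span more than one company
ok <- sapply(calcrows, function(r) all(r %in% seq_len(nrow(df))) && length(unique(df$companyID[r])) == 1)
df2 <- df[unlist(calcrows[ok]), ]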
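The which()-based approach works on row positions, so it relies on there being one row per year, sorted by company and year. A minimal base R sketch that compares year values within each company instead, and drops companies whose window is incomplete, might look like this. keep_event_window and its argument k are hypothetical names used only for illustration, and the sketch assumes exactly one status == 0 row per company:

```r
# Sample data from the question
df <- data.frame(
  companyID = rep(c(1L, 2L, 3L), times = c(8L, 5L, 10L)),
  year = c(2000:2007, 2012:2016, 2008:2017),
  status = c(1L, 1L, 1L, 1L, 0L, 2L, 2L, 2L,
             1L, 0L, 2L, 2L, 2L,
             1L, 1L, 1L, 1L, 1L, 0L, 2L, 2L, 2L, 2L)
)

# keep_event_window() is a hypothetical helper, not part of any package.
# Assumption: every company has exactly one row with status == 0.
keep_event_window <- function(d, k = 3L) {
  pieces <- lapply(split(d, d$companyID), function(g) {
    event_year <- g$year[g$status == 0]
    # rows of this company within k years of the event
    win <- g[abs(g$year - event_year) <= k, ]
    # keep the company only if every year of the window is present
    if (all((event_year - k):(event_year + k) %in% win$year)) win else NULL
  })
  out <- do.call(rbind, pieces)  # NULL pieces are dropped by rbind
  rownames(out) <- NULL
  out
}

df2 <- keep_event_window(df)
```

Because the comparison is on year rather than on row index, this version should not depend on the row order of df, and a company with a gap inside its window is dropped rather than padded with rows from a neighbouring company.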
Does this work:
library(dplyr)
df %>%
  group_by(companyID) %>%
  mutate(flag1 = year[status == 0] - min(year),    # years available before the event
         flag2 = max(year) - year[status == 0]) %>%  # years available after the event
  filter(flag1 > 2 & flag2 > 2 &
           between(year, year[status == 0] - 3, year[status == 0] + 3)) %>%
  select(-flag1, -flag2)
# A tibble: 14 x 3
# Groups: companyID [2]
companyID year status
<int> <int> <int>
1 1 2001 1
2 1 2002 1
3 1 2003 1
4 1 2004 0
5 1 2005 2
6 1 2006 2
7 1 2007 2
8 3 2010 1
9 3 2011 1
10 3 2012 1
11 3 2013 0
12 3 2014 2
13 3 2015 2
14 3 2016 2
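This can be done with group_by, arrange and filter:
library(dplyr)
df %>%
  group_by(companyID) %>%
  # sort so that the status == 0 row comes first within each company
  arrange(status, year, .by_group = TRUE) %>%
  # first(year) is now the event year; keep rows within 3 years of it
  filter(year >= first(year) - 3 & year <= first(year) + 3) %>%
  # drop companies that lack the full 7-year window
  filter(n() >= 7) %>%
  arrange(year, .by_group = TRUE)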
Output:
companyID year status
<int> <int> <int>
1 1 2001 1
2 1 2002 1
3 1 2003 1
4 1 2004 0
5 1 2005 2
6 1 2006 2
7 1 2007 2
8 3 2010 1
9 3 2011 1
10 3 2012 1
11 3 2013 0
12 3 2014 2
13 3 2015 2
14 3 2016 2