In R: subset so that I only have the observations 3 years prior to and after an event

Question

I found an answer at the following link that I should be able to apply, but it does not seem to work:

Here is a sample of my dataset:

companyID   year   status
    1       2000     1
    1       2001     1
    1       2002     1
    1       2003     1
    1       2004     0
    1       2005     2
    1       2006     2
    1       2007     2
    2       2012     1
    2       2013     0
    2       2014     2
    2       2015     2
    2       2016     2
    3       2008     1
    3       2009     1
    3       2010     1
    3       2011     1
    3       2012     1
    3       2013     0
    3       2014     2
    3       2015     2
    3       2016     2
    3       2017     2

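I would like to end up with the following observations, so that for each company I only keep the 3 years before the event, the event year itself (status 0), and the 3 years after the event: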

companyID   year   status
    1       2001     1
    1       2002     1
    1       2003     1
    1       2004     0
    1       2005     2
    1       2006     2
    1       2007     2
    3       2010     1
    3       2011     1
    3       2012     1
    3       2013     0
    3       2014     2
    3       2015     2
    3       2016     2

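Would it be easier if I provided a variable that shows the event date? That variable would hold the date in the same observation (year) where status is 0.

Thanks in advance for your help!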

Answer 1

Try this with dplyr and tidyr:

library(dplyr)
library(tidyr)

df %>% 
  # Record the event year on the row where status == 0, NA elsewhere
  mutate(ref_yr = case_when(status == 0 ~ year,
                            TRUE ~ NA_integer_)) %>%
  group_by(companyID) %>% 
  # Copy the event year to every row of the same company
  fill(ref_yr, .direction = "downup") %>% 
  # Keep rows at most 3 years away from the event year
  mutate(yr_diff = abs(ref_yr - year)) %>% 
  filter(yr_diff <= 3) %>% 
  select(-c(ref_yr, yr_diff))
#> # A tibble: 19 x 3
#> # Groups:   companyID [3]
#>    companyID  year status
#>        <int> <int>  <int>
#>  1         1  2001      1
#>  2         1  2002      1
#>  3         1  2003      1
#>  4         1  2004      0
#>  5         1  2005      2
#>  6         1  2006      2
#>  7         1  2007      2
#>  8         2  2012      1
#>  9         2  2013      0
#> 10         2  2014      2
#> 11         2  2015      2
#> 12         2  2016      2
#> 13         3  2010      1
#> 14         3  2011      1
#> 15         3  2012      1
#> 16         3  2013      0
#> 17         3  2014      2
#> 18         3  2015      2
#> 19         3  2016      2
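
The output above still contains company 2, which has only one pre-event year and is absent from the desired result in the question. Below is a minimal sketch of one way to drop such companies by adding a completeness check to the same pipeline. It assumes one row per company-year and exactly one status == 0 row per company; complete_windows is an illustrative name, and the data frame is rebuilt from the sample in the question.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
df <- data.frame(
  companyID = rep(1:3, times = c(8, 5, 10)),
  year      = c(2000:2007, 2012:2016, 2008:2017),
  status    = c(1, 1, 1, 1, 0, 2, 2, 2,  1, 0, 2, 2, 2,  1, 1, 1, 1, 1, 0, 2, 2, 2, 2)
)

complete_windows <- df %>%
  group_by(companyID) %>%
  # Event year on the status == 0 row, then copied to the whole company
  mutate(ref_yr = if_else(status == 0, year, NA_integer_)) %>%
  fill(ref_yr, .direction = "downup") %>%
  # Rows within 3 years of the event
  filter(abs(ref_yr - year) <= 3) %>%
  # Assumption: a complete window is exactly 7 rows (3 before, event, 3 after)
  filter(n() == 7) %>%
  ungroup() %>%
  select(-ref_yr)
```

With the sample data this should leave only companies 1 and 3, matching the layout requested in the question.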

Data

df <- structure(list(companyID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
    2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L),
    year = c(2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L,
    2012L, 2013L, 2014L, 2015L, 2016L, 2008L, 2009L, 2010L, 2011L,
    2012L, 2013L, 2014L, 2015L, 2016L, 2017L),
    status = c(1L, 1L, 1L, 1L, 0L, 2L, 2L, 2L, 1L, 0L, 2L, 2L, 2L,
    1L, 1L, 1L, 1L, 1L, 0L, 2L, 2L, 2L, 2L)),
    class = "data.frame", row.names = c(NA, -23L))

Created on 2021-04-25 by the reprex package (v2.0.0)

Answer 2

If your data frame is df:

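# Row positions of the event years (status == 0)
zeros <- which(df$status == 0)
# Rows 3 before to 3 after each event. This works on row positions, so it assumes
# df is sorted by companyID and year; windows are not clipped at company boundaries.
calcrows <- unlist(lapply(zeros, function(x) (x - 3):(x + 3)))
df2 <- df[calcrows, ]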
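
Because the snippet above selects by row position, a window can reach into the rows of a neighbouring company (in the sample data, the window around company 2's event picks up rows from companies 1 and 3). Here is a rough base R sketch of a per-company variant that works on the year values instead. It assumes one row per company-year; keep_window and its argument k are hypothetical names introduced for illustration.

```r
# Sample data from the question
df <- data.frame(
  companyID = rep(1:3, times = c(8, 5, 10)),
  year      = c(2000:2007, 2012:2016, 2008:2017),
  status    = c(1, 1, 1, 1, 0, 2, 2, 2,  1, 0, 2, 2, 2,  1, 1, 1, 1, 1, 0, 2, 2, 2, 2)
)

# Return the rows within k years of the event for one company,
# or zero rows if the company lacks a single event or a complete window
keep_window <- function(d, k = 3) {
  event_year <- d$year[d$status == 0]
  if (length(event_year) != 1) return(d[0, ])
  win <- d[abs(d$year - event_year) <= k, ]
  # Assumption: one row per year, so a complete window has 2 * k + 1 rows
  if (nrow(win) == 2 * k + 1) win else d[0, ]
}

# Apply the helper to each company separately and stack the results
df2 <- do.call(rbind, lapply(split(df, df$companyID), keep_window))
rownames(df2) <- NULL
```

Splitting by companyID first means a window can never cross a company boundary, and the row-count check drops companies such as company 2 that do not have 3 years on both sides of the event.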

Answer 3

Does this help:

library(dplyr)

df %>%
  group_by(companyID) %>%
  # flag1 / flag2: years of data available before / after the event year
  mutate(flag1 = year[status == 0] - min(year),
         flag2 = max(year) - year[status == 0]) %>%
  filter(flag1 > 2 & flag2 > 2 &
           between(year, year[status == 0] - 3, year[status == 0] + 3)) %>%
  select(-flag1, -flag2)
# A tibble: 14 x 3
# Groups:   companyID [2]
   companyID  year status
       <int> <int>  <int>
 1         1  2001      1
 2         1  2002      1
 3         1  2003      1
 4         1  2004      0
 5         1  2005      2
 6         1  2006      2
 7         1  2007      2
 8         3  2010      1
 9         3  2011      1
10         3  2012      1
11         3  2013      0
12         3  2014      2
13         3  2015      2
14         3  2016      2
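
Both flags above rely on year[status == 0] returning exactly one value per company. If a company had no event row, or more than one, the pipeline would fail or give misleading results. This is a quick base R check of that assumption, offered as a sketch; events_per_company and bad are illustrative names.

```r
# Sample data from the question
df <- data.frame(
  companyID = rep(1:3, times = c(8, 5, 10)),
  year      = c(2000:2007, 2012:2016, 2008:2017),
  status    = c(1, 1, 1, 1, 0, 2, 2, 2,  1, 0, 2, 2, 2,  1, 1, 1, 1, 1, 0, 2, 2, 2, 2)
)

# Number of status == 0 rows for each company
events_per_company <- tapply(df$status == 0, df$companyID, sum)

# Companies that do not have exactly one event row
bad <- names(events_per_company)[events_per_company != 1]
```

If bad is empty, as it should be for the sample data, every company has a single event year and the flag logic applies as written.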

Answer 4

This can be done with group_by, arrange, and filter:

library(dplyr)
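df %>% group_by(companyID) %>% 
  # Sorting by status puts the event row (status == 0) first in each company
  arrange(status, year, .by_group = TRUE) %>% 
  # first(year) is now the event year; keep rows within 3 years of it
  filter(year >= first(year) - 3 & year <= first(year) + 3) %>% 
  # Drop companies without a complete 7-year window
  filter(n() >= 7) %>% 
  # Restore chronological order within each company
  arrange(year, .by_group = TRUE)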

Output:

   companyID  year status
       <int> <int>  <int>
 1         1  2001      1
 2         1  2002      1
 3         1  2003      1
 4         1  2004      0
 5         1  2005      2
 6         1  2006      2
 7         1  2007      2
 8         3  2010      1
 9         3  2011      1
10         3  2012      1
11         3  2013      0
12         3  2014      2
13         3  2015      2
14         3  2016      2

