Create pre and post flags based on group and year

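I am trying to label observations as pre- or post- relative to a flag that we record for each employee.

Below is some dummy data. I am having trouble figuring this out, but I imagine there is an elegant solution that groups by employee and uses the year of the flag.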

employee <- c('a','a','a','a','b','b','b','b','b','c', 'c', 'c', 'c')
year <- c('2001','2002','2003','2004','2001','2002','2003','2004','2005','2001','2002','2003','2004')
flag <- c('NA','NA','1','NA','NA','1','NA','NA','NA','1','NA','NA','NA')


start <- data.frame(employee, year, flag)

   employee year flag
1         a 2001   NA
2         a 2002   NA
3         a 2003    1
4         a 2004   NA
5         b 2001   NA
6         b 2002    1
7         b 2003   NA
8         b 2004   NA
9         b 2005   NA
10        c 2001    1
11        c 2002   NA
12        c 2003   NA
13        c 2004   NA

The desired result adds a prepo column:

prepo <- c('pre','pre','po','po','pre','po','po','po','po','po','po','po','po')

end <- data.frame(employee, year, flag, prepo)

   employee year flag prepo
1         a 2001   NA   pre
2         a 2002   NA   pre
3         a 2003    1    po
4         a 2004   NA    po
5         b 2001   NA   pre
6         b 2002    1    po
7         b 2003   NA    po
8         b 2004   NA    po
9         b 2005   NA    po
10        c 2001    1    po
11        c 2002   NA    po
12        c 2003   NA    po
13        c 2004   NA    po

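We convert the string "NA" to a real NA (na_if), group by 'employee', and use case_when to build a condition on the position of the first non-NA value in 'flag': rows before it become 'pre', all other rows become 'po'.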

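library(dplyr)
start %>% 
    # turn the string 'NA' into a real missing value
    mutate(flag = na_if(flag, 'NA')) %>%
    group_by(employee) %>%
    # rows before the first non-missing flag are 'pre', the rest are 'po'
    mutate(prepo =  case_when(row_number() < which(!is.na(flag))[1] 
          ~ 'pre', TRUE ~ 'po')) %>%
    ungroup()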

-output

# A tibble: 13 x 4
#   employee year  flag  prepo
#   <chr>    <chr> <chr> <chr>
# 1 a        2001  <NA>  pre  
# 2 a        2002  <NA>  pre  
# 3 a        2003  1     po   
# 4 a        2004  <NA>  po   
# 5 b        2001  <NA>  pre  
# 6 b        2002  1     po   
# 7 b        2003  <NA>  po   
# 8 b        2004  <NA>  po   
# 9 b        2005  <NA>  po   
#10 c        2001  1     po   
#11 c        2002  <NA>  po   
#12 c        2003  <NA>  po   
#13 c        2004  <NA>  po   
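
To see why the comparison against the first non-NA position works, here is a minimal base R sketch on a single group. The vector `flag_a` is a hypothetical stand-in for employee 'a' after na_if(), and `ifelse()` is used here only as a simple substitute for case_when; it is an illustration, not part of the answer above.

```r
# Hypothetical single-group vector, mirroring employee 'a' after na_if()
flag_a <- c(NA, NA, "1", NA)

# Position of the first non-missing flag within the group
first_flag <- which(!is.na(flag_a))[1]

# Rows strictly before that position are 'pre'; the flagged row and later rows are 'po'
prepo_a <- ifelse(seq_along(flag_a) < first_flag, "pre", "po")
print(prepo_a)
```

If a group had no flag at all, `first_flag` would be NA. In the case_when version the comparison then evaluates to NA, which does not match the first condition, so every row of that group falls through to 'po'.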

Another option is to create an index with cumsum and use that index to look up the replacement values

start %>% 
  arrange(employee, year) %>%
  group_by(employee) %>% 
  # 'NA' -> 0, then a running sum: 0 before the flag, 1 from the flag onward
  mutate(prepo = c('pre', 'po')[cumsum(as.numeric(replace(flag, 
          flag == "NA", 0)))+1]) %>% 
  ungroup()
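
The indexing trick can be traced step by step on one group. This is a small sketch in base R; `flag_b` is a hypothetical vector mirroring employee 'b', with the string "NA" as in the question's data.

```r
# Hypothetical single-group vector, mirroring employee 'b' (string "NA", not a real NA)
flag_b <- c("NA", "1", "NA", "NA", "NA")

# 0/1 indicator of the flag, accumulated: 0 before the flag, 1 from the flag onward
idx <- cumsum(as.integer(flag_b == "1"))

# Adding 1 maps 0 to the first label ('pre') and 1 to the second label ('po')
prepo_b <- c("pre", "po")[idx + 1]
print(prepo_b)
```

This relies on each group having at most one flag. With two flags in a group the running sum would reach 2, and `c("pre", "po")[3]` is NA.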

I am not sure whether this works for your general case, but here is a data.table option

> library(data.table)
> setDT(start)[, prepo := c("pre", "po")[cumsum(flag == "1") + 1], employee][]
    employee year flag prepo
 1:        a 2001   NA   pre
 2:        a 2002   NA   pre
 3:        a 2003    1    po
 4:        a 2004   NA    po
 5:        b 2001   NA   pre
 6:        b 2002    1    po
 7:        b 2003   NA    po
 8:        b 2004   NA    po
 9:        b 2005   NA    po
10:        c 2001    1    po
11:        c 2002   NA    po
12:        c 2003   NA    po
13:        c 2004   NA    po
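
As for the general case, two things could break the cumsum approach: a real NA in flag (then `flag == "1"` is NA and the running sum becomes NA) and more than one flag per employee. Below is a hedged base R sketch that guards against both. The data frame `start2` is a hypothetical variant of the question's data with real NA values, and the rows are assumed to be sorted by year within each employee.

```r
# Hypothetical variant of the question's data, with real NA instead of the string 'NA'
employee <- c('a','a','a','a','b','b','b','b','b','c','c','c','c')
year <- c('2001','2002','2003','2004','2001','2002','2003','2004','2005',
          '2001','2002','2003','2004')
flag <- c(NA, NA, '1', NA, NA, '1', NA, NA, NA, '1', NA, NA, NA)
start2 <- data.frame(employee, year, flag)

# %in% returns FALSE (not NA) for missing values, so the indicator is always 0 or 1
hit <- as.integer(start2$flag %in% "1")

# Running count per employee, capped at 1 so that repeated flags still map to 'po'
idx <- ave(hit, start2$employee, FUN = function(x) pmin(cumsum(x), 1L))

start2$prepo <- c("pre", "po")[idx + 1]
print(start2)
```

Unlike the case_when answer, an employee with no flag at all ends up entirely 'pre' here; which behaviour is wanted depends on the real data.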