Lagging in R when a condition is matched
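I have a data frame containing only visit dates and infection status (yes/no), and I want to add a third column giving the date of the last infection. If the patient has no previous infection, the new last_infection column should be NA. If they have been infected before, it should show the date of the most recent earlier visit at which they tested "yes" for infection.
I would like the output to look like this: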
date infection last_infection
01-01-18 no NA
06-01-18 no NA
07-01-18 yes NA
09-01-18 no 07-01-18
01-01-19 no 07-01-18
02-01-19 yes 07-01-18
03-01-19 yes 02-01-19
04-01-19 no 03-01-19
05-01-19 no 03-01-19
How can I do this in R? Can a function like lag() check a condition, or should I be doing something else entirely?
We can create a grouping variable from the logical vector infection == "yes" and use it to lag the date column. Here we load only dplyr and no other packages.
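library(dplyr)

df1 %>%
  # start a new group at every "yes" visit (rows before the first "yes" are group 0)
  group_by(grp = cumsum(infection == "yes")) %>%
  # for groups 1 and up, the first date is the "yes" visit that opened the group
  mutate(new = first(date)) %>%
  ungroup() %>%
  # shift down one row, then blank every row up to and including the first "yes"
  mutate(new = replace(lag(new), seq_len(match(1, grp)), NA)) %>%
  select(-grp)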
# A tibble: 9 x 4
# date infection last_infection new
# <chr> <chr> <chr> <chr>
#1 01-01-18 no <NA> <NA>
#2 06-01-18 no <NA> <NA>
#3 07-01-18 yes <NA> <NA>
#4 09-01-18 no 07-01-18 07-01-18
#5 01-01-19 no 07-01-18 07-01-18
#6 02-01-19 yes 07-01-18 07-01-18
#7 03-01-19 yes 02-01-19 02-01-19
#8 04-01-19 no 03-01-19 03-01-19
#9 05-01-19 no 03-01-19 03-01-19
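To make the grouping step easier to follow, here is a minimal base R sketch of the same cumsum idea with the intermediate vectors printed. The names visits, is_yes, grp and group_start are made up for illustration and are not part of the answer above; the data is the sample from the question.

```r
# Sample data from the question (date and infection only)
visits <- data.frame(
  date = c("01-01-18", "06-01-18", "07-01-18", "09-01-18", "01-01-19",
           "02-01-19", "03-01-19", "04-01-19", "05-01-19"),
  infection = c("no", "no", "yes", "no", "no", "yes", "yes", "no", "no"),
  stringsAsFactors = FALSE
)

is_yes <- visits$infection == "yes"

# Running count of "yes" visits: 0 before the first infection,
# then it increases by one at every "yes" row
grp <- cumsum(is_yes)

# Date of the "yes" visit that opened each group; group 0 has none, so NA
group_start <- c(NA, visits$date[is_yes])[grp + 1]

# Shift down one row so a "yes" visit reports the previous infection, not itself
visits$last_infection <- c(NA, head(group_start, -1))

# Show the intermediate vectors next to the result
print(cbind(visits, grp, group_start))
```

Because group 0 maps to NA directly, this sketch should also return all NA when no visit is a "yes", a case the sample data does not cover.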
Data
df1 <- structure(list(date = c("01-01-18", "06-01-18", "07-01-18", "09-01-18",
"01-01-19", "02-01-19", "03-01-19", "04-01-19", "05-01-19"),
infection = c("no", "no", "yes", "no", "no", "yes", "yes",
"no", "no"), last_infection = c(NA, NA, NA, "07-01-18", "07-01-18",
"07-01-18", "02-01-19", "03-01-19", "03-01-19")),
class = "data.frame", row.names = c(NA,
-9L))
I would suggest this instead. If you use fill from the tidyr package, there is no reason to use cumsum or grouping.
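library(tidyverse)

# Assumption: df is the question's data (date and infection only) as a tibble,
# built here from df1 so the snippet runs on its own
df <- as_tibble(df1[c("date", "infection")])

df %>%
  mutate(
    # keep the previous visit's date only where that visit was a "yes"
    last_infection = if_else(lag(infection) == "yes", lag(date), NA_character_)
  ) %>%
  # carry the last non-missing date down into the NA rows
  fill(last_infection)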
#> # A tibble: 9 x 3
#> date infection last_infection
#> <chr> <chr> <chr>
#> 1 01-01-18 no <NA>
#> 2 06-01-18 no <NA>
#> 3 07-01-18 yes <NA>
#> 4 09-01-18 no 07-01-18
#> 5 01-01-19 no 07-01-18
#> 6 02-01-19 yes 07-01-18
#> 7 03-01-19 yes 02-01-19
#> 8 04-01-19 no 03-01-19
#> 9 05-01-19 no 03-01-19
Created on 2020-01-25 by the reprex package (v0.3.0)
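For readers who want to see what the two steps do without tidyr, here is a rough base R sketch of the same mark-then-fill idea. The variable names are hypothetical, and the loop is only a stand-in for the downward fill; it is not how tidyr implements it.

```r
# Sample data from the question
date <- c("01-01-18", "06-01-18", "07-01-18", "09-01-18", "01-01-19",
          "02-01-19", "03-01-19", "04-01-19", "05-01-19")
infection <- c("no", "no", "yes", "no", "no", "yes", "yes", "no", "no")

# Step 1: mark. Keep the previous visit's date only if that visit was a "yes".
prev_date <- c(NA, head(date, -1))
prev_yes  <- c(FALSE, head(infection, -1) == "yes")
marked    <- ifelse(prev_yes, prev_date, NA)

# Step 2: fill. Copy the last non-missing value down into each NA row.
last_infection <- marked
for (i in seq_along(last_infection)[-1]) {
  if (is.na(last_infection[i])) {
    last_infection[i] <- last_infection[i - 1]
  }
}

print(data.frame(date, infection, marked, last_infection))
```

The marked column shows the state before filling: only rows 4, 7 and 8 carry a date, and the fill step spreads those values down to the rows after them.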