`purrr` alternative to row-wise function that determines event date based on complex rule set
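I am working with a client who wants to provide an input spreadsheet containing text descriptions of when particular events should occur in a given year. Each event (there are at least 200) is its own row, with a complex set of rules about when it occurs, e.g. "the first Saturday before October 1" or "the Friday closest to December 1". A few events simply occur on a specific date, but that is rare. The actual spreadsheet, however, has about 15 columns that govern each event's start date, so the logic I need to compute start dates is quite involved.
I came up with a way to compute start dates using a function and a loop over each row of my data.frame, but I wonder whether there is a more efficient way to solve this with the tidyverse or purrr. Is it possible (or advisable) to vectorize the solution to this problem?
Here is my current (working) solution for the smallest, most compact example I could think of. Can I make it more efficient and more readable for more complex real-world input?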
library(tidyverse)
library(lubridate)
# Bring in demo data that describes 3 events, and when they should each start.
demo <- structure(list(Event = c("Gala", "Celebration", "Wrap-up"), date_start
= structure(c(18871, NA, NA), class = "Date"), weekday_near = c(NA,
"Saturday", "Friday" ), near_description = c(NA, "before", "closest to"),
near_date = structure(c(NA, 18901, 18962), class = "Date")), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
This is what the demo data looks like:
Event        date_start  weekday_near  near_description  near_date
Gala         2021-09-01  NA            NA                NA
Celebration  NA          Saturday      before            2021-10-01
Wrap-up      NA          Friday        closest to        2021-12-01
Now, determine the start date for each event: Gala, Celebration, and Wrap-up.
# Create a tibble that contains all possible dates for these events this year.
datedb <- tibble(date = seq(make_date(2021, 9, 1), make_date(2021, 12, 31), by = 1),
                 wday = wday(date, label = TRUE, abbr = FALSE))
# Write function meant to determine event date for each row of the dataframe.
determine_date <- function(df){
  # define variables that are easier to read
  # this part makes me squeamish -
  # there's gotta be a better way to do this with the tidyverse
  event_date_exact <- df[["date_start"]]
  event_near_wday <- df[["weekday_near"]]
  event_near_desc <- df[["near_description"]]
  event_near_date <- df[["near_date"]]
  # Event date - if there is an exact date for the event, choose it as the event date.
  if (!is.na(event_date_exact)) {
    event_date <- event_date_exact
  # Otherwise, if the date is dependent on another date, figure out when it should be:
  } else {
    event_date_vec <- datedb %>% filter(wday == event_near_wday) %>% pull(date)
    event_date <-
      case_when(
        # If you're looking for the closest weekday to a particular date:
        event_near_desc == "closest to" ~ event_date_vec[which(abs(event_date_vec - event_near_date) ==
                                                                 min(abs(event_date_vec - event_near_date), na.rm = TRUE))],
        # If you're looking for the last such weekday before that date
        event_near_desc == "before" ~ rev(event_date_vec[which(event_date_vec - event_near_date < 0)])[1],
        # If neither of these worked, output NA to check why
        TRUE ~ NA_Date_
      )
  }
  # return the result explicitly
  event_date
}
# create empty vector to store results
start_dates <- lubridate::ymd()
for (i in seq_len(nrow(demo))) {
  start_dates[i] <- determine_date(demo[i,])
}
# add start dates back to original demo dataframe
demo$start_date <- start_dates
Desired output (note the new start_date column):
demo
Event        date_start  weekday_near  near_description  near_date   start_date
Gala         2021-09-01  NA            NA                NA          2021-09-01
Celebration  NA          Saturday      before            2021-10-01  2021-09-25
Wrap-up      NA          Friday        closest to        2021-12-01  2021-12-03
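If you ever want to Vectorize a function, it is really just calling mapply under the hood. So if you want to use purrr-style coding, you will probably just want to change your function so that it takes one argument per column, as follows.
Setup: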
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
# Bring in demo data that describes 3 events, and when they should each start.
demo <- structure(list(Event = c("Gala", "Celebration", "Wrap-up"), date_start
= structure(c(18871, NA, NA), class = "Date"), weekday_near = c(NA,
"Saturday", "Friday" ), near_description = c(NA, "before", "closest to"),
near_date = structure(c(NA, 18901, 18962), class = "Date")), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
datedb <- tibble(date = seq(make_date(2021, 9, 1), make_date(2021, 12, 31), by = 1),
                 wday = wday(date, label = TRUE, abbr = FALSE))
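Here is a refactored version of your function.
Whether you use case_when or a switch statement is entirely up to you. I chose switch because this function is meant to be called inside a pmap call, i.e. we expect it to inspect only a single value at a time.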
# write a function that expects 4 (scalar) input values
# vectorize/pmap over each.
determine_date2 <- function(date_start, weekday_near, near_desc, near_date){
  event_vec <- datedb %>% filter(wday == weekday_near) %>% pull(date)
  event_date <-
    if(!is.na(date_start)){
      date_start
    } else if(!is.na(near_desc)){
      switch(
        near_desc,
        `closest to` = event_vec[which(abs(event_vec - near_date) == min(abs(event_vec - near_date), na.rm = TRUE))],
        before = rev(event_vec[which(event_vec - near_date < 0)])[1],
        NA_Date_
      )
    } else {
      NA_Date_
    }
  event_date
}
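To make the switch choice a bit more concrete, here is a small base R sketch. The helper offset_for and the numbers it returns are made up for illustration and are not part of the solution above; the only point is that switch looks at a single value, uses its last unnamed argument as the default, and refuses a longer vector.

```r
# Hypothetical helper (illustration only): map a rule keyword to a number.
offset_for <- function(rule) {
  switch(
    rule,
    `closest to` = 0L,  # backticks are needed because the name contains a space
    before = -1L,
    NA_integer_         # last unnamed argument acts as the default
  )
}

offset_for("before")    # one value in, one value out
offset_for("unknown")   # no match, so the default is returned

# switch() expects a length-1 value, so passing a whole column is an error:
res <- try(offset_for(c("before", "closest to")), silent = TRUE)
inherits(res, "try-error")
```

That is why the function is written for one row at a time and the looping is left to pmap. If you would rather pass whole columns, case_when is the vectorized counterpart.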
I actually just found out that there is no real pmap_date variant, but the one I made below should be a good enough substitute.
pmap_date <- function(.l, .f, ...){
  res <- pmap(.l, .f, ...)
  # every element must be a single Date
  check_res <- map_lgl(res, ~ is.Date(.x) && length(.x) == 1L)
  if(!all(check_res)){
    # report the positions that failed the check
    bad <- paste(which(!check_res), collapse = ", ")
    rlang::abort(paste0("all results must return a scalar date. offending entries: ", bad))
  }
  # c() on Date objects keeps the Date class
  return(reduce(res, c))
}
Now we should be able to use pmap_date inside mutate:
demo %>%
  mutate(
    start_dates = pmap_date(list(date_start, weekday_near, near_description, near_date), determine_date2)
  )
#> # A tibble: 3 x 6
#>   Event       date_start weekday_near near_description near_date  start_dates
#>   <chr>       <date>     <chr>        <chr>            <date>     <date>
#> 1 Gala        2021-09-01 <NA>         <NA>             NA         2021-09-01
#> 2 Celebration NA         Saturday     before           2021-10-01 2021-09-25
#> 3 Wrap-up     NA         Friday       closest to       2021-12-01 2021-12-03
If you like, you can make a "vectorized" wrapper function, just as if you had called Vectorize yourself:
v_determine_date2 <- function(date_start, weekday_near, near_desc, near_date) {
  pmap_date(list(date_start, weekday_near, near_desc, near_date), determine_date2)
}
demo %>%
  mutate(
    start_dates = v_determine_date2(date_start, weekday_near, near_description, near_date)
  )
#> # A tibble: 3 x 6
#>   Event       date_start weekday_near near_description near_date  start_dates
#>   <chr>       <date>     <chr>        <chr>            <date>     <date>
#> 1 Gala        2021-09-01 <NA>         <NA>             NA         2021-09-01
#> 2 Celebration NA         Saturday     before           2021-10-01 2021-09-25
#> 3 Wrap-up     NA         Friday       closest to       2021-12-01 2021-12-03
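For comparison, here is roughly what the base R route through mapply looks like. This is only a sketch: pick_date is a toy stand-in I made up for illustration, not the real determine_date2. The thing to watch with dates is that letting mapply simplify the result drops the Date class, so I keep the results as a list and combine them with c().

```r
# Toy scalar function (hypothetical, illustration only):
# return `exact` when it is present, otherwise `fallback`.
pick_date <- function(exact, fallback) {
  if (!is.na(exact)) exact else fallback
}

exact    <- as.Date(c("2021-09-01", NA))
fallback <- as.Date(c("2021-01-01", "2021-10-01"))

# SIMPLIFY = FALSE returns a list of Date objects; simplifying to a
# plain vector would leave bare day counts instead of dates.
res <- mapply(pick_date, exact, fallback, SIMPLIFY = FALSE)

# c() on Date objects keeps the class, so this yields a Date vector.
out <- do.call(c, res)
out
```

This is the same idea as the pmap_date helper above: run the scalar function once per row, then stitch the pieces back together in a way that preserves the Date class.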
Created on 2021-05-11 by the reprex package (v1.0.0)
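On the question of whether the rules can be truly vectorized: for the two rules in the demo I think they can, using weekday arithmetic instead of a per-row lookup. The sketch below is a different technique from the pmap approach above, not a drop-in replacement. resolve_start and weekday_names are names I made up, it uses base R only, it assumes English weekday names in weekday_near, and unlike the datedb lookup it is not limited to dates between 2021-09-01 and 2021-12-31.

```r
# Assumption: weekday_near holds English weekday names spelled like this.
weekday_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                   "Thursday", "Friday", "Saturday")

# Hypothetical vectorized resolver covering only the "before" and "closest to" rules.
resolve_start <- function(date_start, weekday_near, near_description, near_date) {
  target <- match(weekday_near, weekday_names) - 1L  # 0 = Sunday ... 6 = Saturday
  anchor <- as.POSIXlt(near_date)$wday               # same 0-6 coding
  forward <- (target - anchor) %% 7                  # days ahead to the target weekday (0-6)
  back <- ((anchor - target - 1) %% 7) + 1           # days back, strictly before (1-7)

  is_before  <- !is.na(near_description) & near_description == "before"
  is_closest <- !is.na(near_description) & near_description == "closest to"

  offset <- rep(NA_real_, length(near_date))
  offset[is_before] <- -back[is_before]
  # the nearest occurrence is at most 3 days away on either side, so no ties
  offset[is_closest] <- ifelse(forward <= 3, forward, forward - 7)[is_closest]

  out <- near_date + offset
  # an exact date, when given, takes priority over any rule
  has_exact <- !is.na(date_start)
  out[has_exact] <- date_start[has_exact]
  out
}

# Same three events as the demo data, rebuilt with base R
demo2 <- data.frame(
  date_start       = as.Date(c("2021-09-01", NA, NA)),
  weekday_near     = c(NA, "Saturday", "Friday"),
  near_description = c(NA, "before", "closest to"),
  near_date        = as.Date(c(NA, "2021-10-01", "2021-12-01"))
)
demo2$start_date <- with(demo2, resolve_start(date_start, weekday_near,
                                              near_description, near_date))
demo2$start_date
# expected: "2021-09-01" "2021-09-25" "2021-12-03"
```

Whether this scales to the real spreadsheet with about 15 controlling columns is a judgment call. Each new rule needs its own piece of arithmetic, so the row-wise pmap version may well stay easier to read even if it is slower.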