Recursively add dates until condition is met

I have a dataset as follows:

library(tidyverse)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

set.seed(2021)
df <- tibble(
  customer = 1:6,
  start_date = sample(seq(as.Date('2020-01-01'),
                          as.Date('2020-12-31'),
                          by = "day"), 6),
  end_date = c(sample(seq(as.Date('2021-01-01'),
                          as.Date('2021-02-28'),
                          by = "day"), 3), NA, NA, NA))
> df
# A tibble: 6 x 3
  customer start_date end_date  
     <int> <date>     <date>    
1        1 2020-06-14 2021-02-15
2        2 2020-08-18 2021-01-05
3        3 2020-03-10 2021-02-16
4        4 2020-07-10 NA        
5        5 2020-09-07 NA        
6        6 2020-04-11 NA 

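The objective of this function is to check, month by month, whether each customer has churned. This has already been done in PostgreSQL, but I am trying to translate it to R (preferably tidyverse).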

These are the parameters:

obs_start <- as.Date(start_date)
obs_interval <- months(1)
lead_time <- weeks(1)

obs_date <- obs_start + obs_interval - lead_time
obs_end <- obs_date %m+% months(3)
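
One detail worth checking in the parameters above is that they mix plain `+` with `%m+%`. The sketch below uses a hypothetical obs_start (the question does not state one) to show how the two differ in lubridate when the target day does not exist in the following month.

```r
library(lubridate)

# Hypothetical observation start, chosen only to expose the month-end case.
obs_start <- as.Date("2020-10-31")
obs_interval <- months(1)
lead_time <- weeks(1)

# Plain `+` gives NA here because there is no 31 November.
naive_next <- obs_start + obs_interval

# `%m+%` rolls back to the last valid day of the month (2020-11-30).
# `%m+%` binds tighter than `-`, so the lead time is subtracted afterwards.
obs_date <- obs_start %m+% obs_interval - lead_time
obs_end <- obs_date %m+% months(3)

print(naive_next)
print(obs_date)
print(obs_end)
```

If obs_start can fall near the end of a month, using `%m+%` for the first step as well avoids an NA observation date.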

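For a given observation period (obs_start to obs_end), I would like to insert observation dates and check whether the customer has churned. The date insertion continues until either

  1. the obs_end date is reached, or
  2. end_date >= obs_date & end_date < the next obs_date, in which case the customer is flagged is_churn = TRUE and the next obs_date is not printed.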
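
The two stopping rules above can also be sketched without recursion, as a cross join followed by a filter. This is only an illustration under assumptions: the observation dates are copied from the desired output rather than generated, the toy df holds three of the six customers, and the last observation window is treated as open-ended because no later obs_date is known.

```r
library(dplyr)
library(tidyr)

# Toy input mirroring customers 1, 2 and 4 from the question.
df <- tibble(
  customer = c(1L, 2L, 4L),
  start_date = as.Date(c("2020-06-14", "2020-08-18", "2020-07-10")),
  end_date = as.Date(c("2021-02-15", "2021-01-05", NA))
)

# Assumption: observation dates taken verbatim from the desired output.
obs_dates <- as.Date(c("2020-11-17", "2020-12-18", "2021-01-18", "2021-02-15"))

obs_grid <- tibble(
  obs_date = obs_dates,
  # Assumption: the last window has no successor, so it is left open-ended.
  next_obs = lead(obs_dates, default = as.Date("9999-12-31"))
)

result <- crossing(df, obs_grid) %>%
  arrange(customer, obs_date) %>%
  mutate(
    # Rule 2: end_date falls inside [obs_date, next_obs).
    is_churn = !is.na(end_date) & end_date >= obs_date & end_date < next_obs
  ) %>%
  group_by(customer) %>%
  # Keep rows up to and including the first churn row, drop the rest.
  filter(lag(cumsum(is_churn), default = 0L) == 0L) %>%
  ungroup() %>%
  select(-next_obs)

print(result)
```

Because the last window is open-ended in this sketch, a customer whose end_date lies well after obs_end would still be flagged at the final obs_date; a real next observation date would be needed to avoid that.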

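I did some digging, and it seems purrr::accumulate() can be used to add dates recursively, with done() to terminate early, but I have no idea how to combine this into one (or several smaller) functions.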
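
For reference, a minimal sketch of the accumulate() plus done() idea mentioned above might look like the following. The helper name obs_dates_for and its max_steps argument are hypothetical, and the helper steps by calendar month with `%m+%`, so its dates (for example 2020-12-17) will not exactly match the desired output shown in this question (2020-12-18).

```r
library(purrr)
library(lubridate)

# Hypothetical helper: walk forward from first_obs one month at a time and
# stop once obs_end is passed or the current window contains end_date.
obs_dates_for <- function(first_obs, obs_end, end_date, max_steps = 24L) {
  steps <- accumulate(
    seq_len(max_steps),
    function(prev, i) {
      # Offset from first_obs (not prev) so month-end rollbacks do not drift.
      nxt <- first_obs %m+% months(i)
      churned <- !is.na(end_date) && end_date >= prev && end_date < nxt
      # An empty done() ends the accumulation without appending nxt.
      if (churned || nxt > obs_end) done() else nxt
    },
    .init = first_obs
  )
  # Older purrr versions may simplify Dates to numbers, so restore the class.
  as.Date(unlist(steps), origin = "1970-01-01")
}

churner <- obs_dates_for(as.Date("2020-11-17"), as.Date("2021-02-17"),
                         as.Date("2021-01-05"))
stayer <- obs_dates_for(as.Date("2020-11-17"), as.Date("2021-02-17"),
                        as.Date(NA))
print(churner)
print(stayer)
```

The last element returned for a churned customer is the obs_date that would carry is_churn = TRUE; the helper could be applied per row with map2() or rowwise() and then unnested.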

This is my desired output:

# A tibble: 22 x 5
   customer start_date end_date   obs_date   is_churn
      <int> <chr>      <chr>      <chr>      <lgl>   
 1        1 2020-06-14 2021-02-15 2020-11-17 FALSE   
 2        1 2020-06-14 2021-02-15 2020-12-18 FALSE   
 3        1 2020-06-14 2021-02-15 2021-01-18 FALSE   
 4        1 2020-06-14 2021-02-15 2021-02-15 TRUE    
 5        2 2020-08-18 2021-01-05 2020-11-17 FALSE   
 6        2 2020-08-18 2021-01-05 2020-12-18 TRUE    
 7        3 2020-03-10 2021-02-16 2020-11-17 FALSE   
 8        3 2020-03-10 2021-02-16 2020-12-18 FALSE   
 9        3 2020-03-10 2021-02-16 2021-01-18 FALSE   
10        3 2020-03-10 2021-02-16 2021-02-15 TRUE    
11        4 2020-07-10 NA         2020-11-17 FALSE   
12        4 2020-07-10 NA         2020-12-18 FALSE   
13        4 2020-07-10 NA         2021-01-18 FALSE   
14        4 2020-07-10 NA         2021-02-15 FALSE   
15        5 2020-09-07 NA         2020-11-17 FALSE   
16        5 2020-09-07 NA         2020-12-18 FALSE   
17        5 2020-09-07 NA         2021-01-18 FALSE   
18        5 2020-09-07 NA         2021-02-15 FALSE   
19        6 2020-04-11 NA         2020-11-17 FALSE   
20        6 2020-04-11 NA         2020-12-18 FALSE   
21        6 2020-04-11 NA         2021-01-18 FALSE   
22        6 2020-04-11 NA         2021-02-15 FALSE 

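Does this at least partly answer your question? For each customer, it gives you the status on every date in the date series (i.e. "not yet customer", "customer", "churned").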

# date series by month starting from the min date until the max date
tibble(
  reference_date = seq(min(df$start_date), max(df$end_date, na.rm = TRUE), by = "months")
) %>%
  # the original df is assigned to each date from the date series
  crossing(df) %>%
  # the status of the customer is checked for each date
  mutate(
    status = case_when(
      reference_date < start_date ~ "not yet customer",
      is.na(end_date) | reference_date <= end_date ~ "customer",
      TRUE ~ "churned"
    )
  )