Recursively add dates until condition is met
I have a dataset as follows:
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#>     date, intersect, setdiff, union

set.seed(2021)
df <- tibble(
  customer = 1:6,
  start_date = sample(seq(as.Date('2020-01-01'),
                          as.Date('2020-12-31'),
                          by = "day"), 6),
  end_date = c(sample(seq(as.Date('2021-01-01'),
                          as.Date('2021-02-28'),
                          by = "day"), 3), NA, NA, NA)
)
> df
# A tibble: 6 x 3
  customer start_date end_date
     <int> <date>     <date>
1        1 2020-06-14 2021-02-15
2        2 2020-08-18 2021-01-05
3        3 2020-03-10 2021-02-16
4        4 2020-07-10 NA
5        5 2020-09-07 NA
6        6 2020-04-11 NA
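The objective is to check, month by month, whether each customer has churned. This was done in PostgreSQL (linked from the original post), but I am trying to translate it to R (preferably tidyverse).
These are the parameters: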
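obs_start <- as.Date(start_date)                  # start of the observation window
obs_interval <- months(1)                         # spacing between observation dates
lead_time <- weeks(1)                             # offset subtracted from each interval
obs_date <- obs_start + obs_interval - lead_time  # first observation date
obs_end <- obs_date %m+% months(3)                # end of the observation window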
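For a given observation period (obs_start to obs_end), I want to insert dates and check whether the customer has churned. Dates keep being inserted until either:
- the obs_end date is reached, or
- the customer is flagged is_churn = TRUE, which happens when end_date >= obs_date & end_date < next obs_date (the next obs_date is not printed).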
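I did some digging, and it seems purrr::accumulate() can be used to add dates recursively, with done() to terminate early, but I have no idea how to combine this into one (or several smaller) functions.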
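As a minimal sketch of the accumulate()/done() idea mentioned above: the value of obs_start below is an assumption (the post does not give one), chosen so that the first observation date matches the first obs_date in the desired output. The sketch steps forward by plain calendar months, so the later dates will not necessarily match the desired output exactly.

```r
library(purrr)
library(lubridate)

# Assumed window start; hypothetical value, not taken from the post.
obs_start    <- as.Date("2020-10-24")
obs_interval <- months(1)
lead_time    <- weeks(1)
obs_date     <- obs_start + obs_interval - lead_time
obs_end      <- obs_date %m+% obs_interval * 0 + (obs_date %m+% months(3) - obs_date)
obs_end      <- obs_date %m+% months(3)

# Add one interval per step. Returning an empty done() stops the
# accumulation and keeps only the dates collected so far.
obs_dates <- accumulate(
  seq_len(100),  # upper bound on the number of steps; values are unused
  function(current, step) {
    nxt <- current %m+% obs_interval
    if (nxt > obs_end) done() else nxt
  },
  .init = obs_date
)

obs_dates
```

Under these assumptions the result holds four monthly dates, from 2020-11-17 through 2021-02-17. An empty done() is used rather than done(current) so that the last date is not appended a second time.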
This is my desired output:
# A tibble: 22 x 5
   customer start_date end_date   obs_date   is_churn
      <int> <chr>      <chr>      <chr>      <lgl>
 1        1 2020-06-14 2021-02-15 2020-11-17 FALSE
 2        1 2020-06-14 2021-02-15 2020-12-18 FALSE
 3        1 2020-06-14 2021-02-15 2021-01-18 FALSE
 4        1 2020-06-14 2021-02-15 2021-02-15 TRUE
 5        2 2020-08-18 2021-01-05 2020-11-17 FALSE
 6        2 2020-08-18 2021-01-05 2020-12-18 TRUE
 7        3 2020-03-10 2021-02-16 2020-11-17 FALSE
 8        3 2020-03-10 2021-02-16 2020-12-18 FALSE
 9        3 2020-03-10 2021-02-16 2021-01-18 FALSE
10        3 2020-03-10 2021-02-16 2021-02-15 TRUE
11        4 2020-07-10 NA         2020-11-17 FALSE
12        4 2020-07-10 NA         2020-12-18 FALSE
13        4 2020-07-10 NA         2021-01-18 FALSE
14        4 2020-07-10 NA         2021-02-15 FALSE
15        5 2020-09-07 NA         2020-11-17 FALSE
16        5 2020-09-07 NA         2020-12-18 FALSE
17        5 2020-09-07 NA         2021-01-18 FALSE
18        5 2020-09-07 NA         2021-02-15 FALSE
19        6 2020-04-11 NA         2020-11-17 FALSE
20        6 2020-04-11 NA         2020-12-18 FALSE
21        6 2020-04-11 NA         2021-01-18 FALSE
22        6 2020-04-11 NA         2021-02-15 FALSE
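Does this answer your question, at least in part? It gives you each customer's status ("not yet customer", "customer", or "churned") at every date in a monthly date series.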
# date series by month starting from the min date until the max date
tibble(
  reference_date = seq(min(df$start_date), max(df$end_date, na.rm = TRUE), by = "months")
) %>%
  # the original df is assigned to each date from the date series
  crossing(df) %>%
  # the status of the customer is checked for each date
  mutate(
    status = case_when(
      reference_date < start_date ~ "not yet customer",
      is.na(end_date) | reference_date <= end_date ~ "customer",
      TRUE ~ "churned"
    )
  )
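The same crossing() approach could be pushed further to produce the is_churn flag from the question. The following is only a sketch under stated assumptions: the observation dates are copied by hand from the desired output rather than computed, and the final date (2021-03-15) is a hypothetical upper bound for the last window, since the question does not say what the unprinted "next obs_date" is. The data frame is typed in directly so the block runs on its own.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Data from the question, entered literally.
df <- tibble(
  customer   = 1:6,
  start_date = as.Date(c("2020-06-14", "2020-08-18", "2020-03-10",
                         "2020-07-10", "2020-09-07", "2020-04-11")),
  end_date   = as.Date(c("2021-02-15", "2021-01-05", "2021-02-16",
                         NA, NA, NA))
)

# Observation dates taken from the desired output. The last value is an
# assumed bound for the final window and never appears as an obs_date.
obs_dates <- as.Date(c("2020-11-17", "2020-12-18", "2021-01-18",
                       "2021-02-15", "2021-03-15"))

# Pair each observation date with the one that follows it.
obs <- tibble(obs_date = obs_dates, next_obs_date = lead(obs_dates)) %>%
  filter(!is.na(next_obs_date))

result <- crossing(df, obs) %>%
  # stop observing a customer once they have already left
  filter(is.na(end_date) | end_date >= obs_date) %>%
  # churn: end_date falls in [obs_date, next_obs_date)
  mutate(is_churn = !is.na(end_date) & end_date < next_obs_date) %>%
  select(customer, start_date, end_date, obs_date, is_churn) %>%
  arrange(customer, obs_date)

result
```

The filter() step plays the role of the early termination: rows after a customer's churn window are dropped instead of being generated and then stopped. The sketch does not check start_date, because every customer in this sample started before the first observation date.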