Use dplyr to aggregate counts by month from start-stop ranged variables?

Suppose I have school enrollment data stored in this format, with start date and end date fields:

unique_name  enrollment_start  enrollment_end
Amy          1, Jan, 2017      30, Sep, 2018
Franklin     1, Jan, 2017      19, Feb, 2017
Franklin     5, Jun, 2017      4, Feb, 2018
Franklin     21, Oct, 2018     9, Mar, 2019
Samir        1, Jun, 2017      4, Feb, 2017
Samir        5, Apr, 2017      12, Sep, 2018
...          ...               ...

I would like to produce enrollment totals by month, like this:

month      enrollment_count
Jan, 2017  25
Feb, 2017  31
Mar, 2017  19
Apr, 2017  34
May, 2017  29
Jun, 2017  32
...        ...

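Is there a simple way to do this with dplyr?

The only approach I can think of is to loop over a list of all months from month_min to month_max and, for each month, count the rows whose start or end date falls within it. I am hoping for simpler code.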

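Create a list column containing the sequence of months between each pair of dates, then unnest it and count.

Notes:

  1. I use lubridate::floor_date() to round enrollment_start down to the first day of the month. Otherwise, seq() could skip months if enrollment_start falls on the 29th of the month or later.
  2. The fifth row of your example data has enrollment_start later than enrollment_end; I assumed this was an error and removed it.
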
library(tidyverse)
library(lubridate)

enrollments %>% 
  mutate(
    across(c(enrollment_start, enrollment_end), dmy),  # convert to date
    month = map2(
      floor_date(enrollment_start, unit = "month"),    # round to 1st day
      enrollment_end,
      ~ seq(.x, .y, by = "month")
    )
  ) %>% 
  unnest_longer(month) %>% 
  count(month, name = "enrollment_count")

#> # A tibble: 27 x 2
#>    month      enrollment_count
#>    <date>                <int>
#>  1 2017-01-01                2
#>  2 2017-02-01                2
#>  3 2017-03-01                1
#>  4 2017-04-01                2
#>  5 2017-05-01                2
#>  6 2017-06-01                3
#>  7 2017-07-01                3
#>  8 2017-08-01                3
#>  9 2017-09-01                3
#> 10 2017-10-01                3
#> # ... with 17 more rows

Created on 2022-03-25 by the reprex package (v2.0.1)
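
To illustrate note 1, here is a minimal base R sketch of why the start date is floored first. The dates are made up for the illustration and are not taken from the question. As far as I know, when seq() steps by month from a day that the next month lacks, the out-of-range day rolls over into the following month, so a month can be skipped.

```r
# Stepping by month from the 31st: there is no 31st of February, so the
# second element is expected to roll over into March
from_31st <- seq(as.Date("2017-01-31"), by = "month", length.out = 3)
months_from_31st <- format(from_31st, "%Y-%m")
print(months_from_31st)

# Stepping from the first of the month visits each month exactly once
from_1st <- seq(as.Date("2017-01-01"), by = "month", length.out = 3)
months_from_1st <- format(from_1st, "%Y-%m")
print(months_from_1st)
```

Flooring only the start date is enough here, because every element of the sequence then falls on the first of a month, which is never later than an end date in that same month.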

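Here is my take using dplyr and tidyr.

  1. Pivot the data to create multiple rows per enrollment and format the dates.
  2. Group by student and enrollment, and use complete to generate the missing months.
  3. Group by the resulting months and count.

library(dplyr)
library(tidyr)

data <- data %>%
  # Tag each enrollment so that separate enrollments of the same student
  # are not bridged when the missing months are filled in
  mutate(enrollment_id = row_number()) %>%
  pivot_longer(cols = c(enrollment_start, enrollment_end)) %>%
  mutate(value = as.Date(value, format = "%d, %b, %Y")) %>%

#   unique_name enrollment_id name             value
#   <chr>               <int> <chr>            <date>
# 1 Amy                     1 enrollment_start 2017-01-01
# 2 Amy                     1 enrollment_end   2018-09-30
# 3 Franklin                2 enrollment_start 2017-01-01
# 4 Franklin                2 enrollment_end   2017-02-19
#   ..etc.

  # Floor every date to the first of its month so the months line up
  mutate(value = lubridate::floor_date(value, "month")) %>%
  group_by(unique_name, enrollment_id) %>%
  complete(value = seq.Date(min(value), max(value), by = "month")) %>%
  arrange(unique_name, value) %>%
  ungroup()

enrollment_count <- data %>%
  # A start and end in the same month should count only once
  distinct(enrollment_id, value) %>%
  count(value, name = "enrollment_count")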

Edit: I forgot to floor the dates so that each month aggregates correctly at the end. I added floor_date from lubridate to do this.

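I think this can be done quite elegantly with the clock and ivs packages. You seem to want monthly counts, so you can use the year-month type from clock. ivs is a package dedicated to working with interval data, which is exactly what you have here. Here we assume that if your enrollment start or end falls within a month, you should be counted as active in that month.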

library(ivs)
library(clock)
library(dplyr, warn.conflicts = FALSE)

enrollments <- tribble(
  ~unique_name, ~enrollment_start, ~enrollment_end,
  "Amy",        "1, Jan, 2017",    "30, Sep, 2018",
  "Franklin",   "1, Jan, 2017",    "19, Feb, 2017",
  "Franklin",   "5, Jun, 2017",    "4, Feb, 2018",
  "Franklin",   "21, Oct, 2018",   "9, Mar, 2019",
  "Samir",      "1, Jan, 2017",    "4, Feb, 2017",
  "Samir",      "5, Apr, 2017",    "12, Sep, 2018"
)

# Parse these into "day" precision year-month-day objects, then restrict
# them to just "month" precision because that is all we need
enrollments <- enrollments %>%
  mutate(
    start = enrollment_start %>%
      year_month_day_parse(format = "%d, %b, %Y") %>%
      calendar_narrow("month"),
    end = enrollment_end %>%
      year_month_day_parse(format = "%d, %b, %Y") %>%
      calendar_narrow("month") %>%
      add_months(1),
    .keep = "unused"
  )

enrollments
#> # A tibble: 6 × 3
#>   unique_name start        end         
#>   <chr>       <ymd<month>> <ymd<month>>
#> 1 Amy         2017-01      2018-10     
#> 2 Franklin    2017-01      2017-03     
#> 3 Franklin    2017-06      2018-03     
#> 4 Franklin    2018-10      2019-04     
#> 5 Samir       2017-01      2017-03     
#> 6 Samir       2017-04      2018-10

# Create an interval vector, note that these are half-open intervals.
# The month on the RHS is not included, which is why we added 1 to `end` above.
enrollments <- enrollments %>%
  mutate(active = iv(start, end), .keep = "unused")

enrollments
#> # A tibble: 6 × 2
#>   unique_name             active
#>   <chr>         <iv<ymd<month>>>
#> 1 Amy         [2017-01, 2018-10)
#> 2 Franklin    [2017-01, 2017-03)
#> 3 Franklin    [2017-06, 2018-03)
#> 4 Franklin    [2018-10, 2019-04)
#> 5 Samir       [2017-01, 2017-03)
#> 6 Samir       [2017-04, 2018-10)

# We'll generate a sequence of months that will be part of the final result
bounds <- range(enrollments$active)
lower <- iv_start(bounds[[1]])
upper <- iv_end(bounds[[2]]) - 1L

months <- tibble(month = seq(lower, upper, by = 1))
months
#> # A tibble: 27 × 1
#>    month       
#>    <ymd<month>>
#>  1 2017-01     
#>  2 2017-02     
#>  3 2017-03     
#>  4 2017-04     
#>  5 2017-05     
#>  6 2017-06     
#>  7 2017-07     
#>  8 2017-08     
#>  9 2017-09     
#> 10 2017-10     
#> # … with 17 more rows

# To actually compute the counts, use `iv_count_between()`, which counts up all
# instances where `month[i]` is between any interval in `enrollments$active`
months %>%
  mutate(count = iv_count_between(month, enrollments$active))
#> # A tibble: 27 × 2
#>    month        count
#>    <ymd<month>> <int>
#>  1 2017-01          3
#>  2 2017-02          3
#>  3 2017-03          1
#>  4 2017-04          2
#>  5 2017-05          2
#>  6 2017-06          3
#>  7 2017-07          3
#>  8 2017-08          3
#>  9 2017-09          3
#> 10 2017-10          3
#> # … with 17 more rows

Created on 2022-04-05 by the reprex package (v2.0.1)
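
As a cross-check of the counting rule, here is a rough base R sketch of the same idea without clock or ivs. It is not how ivs works internally. The month_index() helper is hypothetical and only serves this illustration; it turns a date into a running month number so that an enrollment is active in month m when start <= m <= end. Both ends are inclusive here, which plays the same role as adding one month to the half-open end above. Parsing with %b assumes English month abbreviations in the current locale.

```r
# Sample data copied from the tribble above
starts <- c("1, Jan, 2017", "1, Jan, 2017", "5, Jun, 2017",
            "21, Oct, 2018", "1, Jan, 2017", "5, Apr, 2017")
ends   <- c("30, Sep, 2018", "19, Feb, 2017", "4, Feb, 2018",
            "9, Mar, 2019", "4, Feb, 2017", "12, Sep, 2018")

# Hypothetical helper: running month number (year * 12 + zero-based month)
month_index <- function(x) {
  d <- as.POSIXlt(as.Date(x, format = "%d, %b, %Y"))
  as.integer((d$year + 1900) * 12 + d$mon)
}

s <- month_index(starts)
e <- month_index(ends)

# Every month from the earliest start to the latest end
months <- seq(min(s), max(e))

# An enrollment is active in month m when s <= m <= e
counts <- vapply(months, function(m) sum(s <= m & m <= e), integer(1))

result <- data.frame(
  month = sprintf("%d-%02d", months %/% 12L, months %% 12L + 1L),
  enrollment_count = counts
)
print(head(result))
```

This compares every month against every enrollment, which is fine for a handful of rows; for larger data the interval-based functions in ivs are likely the better fit.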