Use dplyr to aggregate counts by month from start-stop ranged variables?
Suppose I have school enrollment data stored in this format, with start-date and end-date fields:
unique_name | enrollment_start | enrollment_end |
---|---|---|
Amy | 1, Jan, 2017 | 30, Sep 2018 |
Franklin | 1, Jan, 2017 | 19, Feb, 2017 |
Franklin | 5, Jun, 2017 | 4, Feb, 2018 |
Franklin | 21, Oct, 2018 | 9, Mar, 2019 |
Samir | 1, Jun, 2017 | 4, Feb, 2017 |
Samir | 5, Apr, 2017 | 12, Sep, 2018 |
... | ... | ... |
I'd like to produce enrollment totals by month, like this:
month | enrollment_count |
---|---|
Jan, 2017 | 25 |
Feb, 2017 | 31 |
Mar, 2017 | 19 |
Apr, 2017 | 34 |
May, 2017 | 29 |
Jun, 2017 | 32 |
... | ... |
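Is there a simple way to do this with dplyr?
The only approach I can think of is to loop over a list of all months from month_min to month_max and count, for each month, the rows whose start or end date falls within it. I'm hoping for simpler code.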
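Create a list column containing the sequence of months between each pair of dates, then unnest and count.
Notes:
- I use lubridate::floor_date() to round enrollment_start down to the first day of the month. Otherwise, seq() may skip months if enrollment_start falls on the 29th of the month or later.
- In the fifth row of your example data, enrollment_start is later than enrollment_end. I assumed this was an error and removed the row.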
library(tidyverse)
library(lubridate)
enrollments %>%
  mutate(
    across(c(enrollment_start, enrollment_end), dmy), # convert to date
    month = map2(
      floor_date(enrollment_start, unit = "month"), # round to 1st day
      enrollment_end,
      ~ seq(.x, .y, by = "month")
    )
  ) %>%
  unnest_longer(month) %>%
  count(month, name = "enrollment_count")
#> # A tibble: 27 x 2
#> month enrollment_count
#> <date> <int>
#> 1 2017-01-01 2
#> 2 2017-02-01 2
#> 3 2017-03-01 1
#> 4 2017-04-01 2
#> 5 2017-05-01 2
#> 6 2017-06-01 3
#> 7 2017-07-01 3
#> 8 2017-08-01 3
#> 9 2017-09-01 3
#> 10 2017-10-01 3
#> # ... with 17 more rows
Created on 2022-03-25 by the reprex package (v2.0.1)
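The first note above can be checked with a small base-R sketch. The dates and the floor_month() helper are made up for illustration; the helper is assumed to do the same job as lubridate::floor_date(x, "month") in this context.

```r
# Case 1: a start date late in the month. "Feb 31" does not exist, so seq()
# rolls it forward into March and February never appears in the sequence.
late_start <- seq(as.Date("2017-01-31"), as.Date("2017-03-15"), by = "month")
late_months <- format(late_start, "%Y-%m")

# Case 2: the start day-of-month is later than the end day-of-month, so the
# final month is dropped (2017-03-20 lies past the end date).
short_end <- seq(as.Date("2017-01-20"), as.Date("2017-03-05"), by = "month")
short_months <- format(short_end, "%Y-%m")

# Hypothetical helper: floor a Date to the first day of its month (base R).
floor_month <- function(x) as.Date(format(x, "%Y-%m-01"))

# Flooring the start date first yields every month in the range.
fixed <- seq(floor_month(as.Date("2017-01-31")), as.Date("2017-03-15"), by = "month")
fixed_months <- format(fixed, "%Y-%m")
```

Comparing late_months and short_months with fixed_months shows which months go missing when the start date is not floored.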
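Here is my take with dplyr and tidyr.
- Pivot the data to create multiple rows per student and format the dates.
- Group by student and use complete to generate the missing months.
- Group by the resulting periods and count.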
# Assign the result back to `data` so the count below sees the monthly rows
data <- data %>%
  pivot_longer(cols = c('enrollment_start', 'enrollment_end')) %>%
  mutate(value = as.Date(value, format = "%d, %B, %Y")) %>%
  mutate(value = lubridate::floor_date(value, 'month')) %>%
  # unique_name name             value
  # <chr>       <chr>            <date>
  # 1 Amy       enrollment_start 2017-01-01
  # 2 Amy       enrollment_end   2018-09-01
  # 3 Franklin  enrollment_start 2017-01-01
  # 4 Franklin  enrollment_end   2017-02-01
  # ..etc.
  group_by(unique_name) %>%
  complete(value = seq.Date(min(value), max(value), by = "month")) %>%
  arrange(unique_name, value)
enrollment_count <- group_by(data, value) %>%
  count()
Edit: I forgot to floor the dates so that each period is aggregated correctly at the end. Added floor_date from lubridate to do this.
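One thing to watch for with this approach: grouping by unique_name alone makes complete fill every month between a student's earliest and latest date, including any gap between separate enrollment spells. The following is a sketch of one possible variation, not the answer author's code. It assumes each input row is a separate spell, uses a hypothetical spell_id column, and assumes an English locale for month-name parsing.

```r
library(dplyr)
library(tidyr)

# Toy data: Franklin has two spells with a gap (Mar-May 2017) between them.
data <- tribble(
  ~unique_name, ~enrollment_start, ~enrollment_end,
  "Franklin",   "1, Jan, 2017",    "19, Feb, 2017",
  "Franklin",   "5, Jun, 2017",    "4, Aug, 2017"
)

monthly <- data %>%
  mutate(spell_id = row_number()) %>%            # hypothetical helper column
  pivot_longer(cols = c(enrollment_start, enrollment_end)) %>%
  mutate(
    value = as.Date(value, format = "%d, %b, %Y"),
    value = as.Date(format(value, "%Y-%m-01"))   # floor to the 1st of the month
  ) %>%
  group_by(unique_name, spell_id) %>%            # complete within each spell
  complete(value = seq(min(value), max(value), by = "month")) %>%
  ungroup()

enrollment_count <- monthly %>%
  distinct(unique_name, value) %>%               # one row per student-month
  count(value, name = "enrollment_count")
```

The distinct() step keeps a student from being counted twice in a month, for example when a spell starts and ends in the same month.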
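I think this can be done quite elegantly with the clock and ivs packages. You seem to want monthly counts, so you can use the year-month type from clock. ivs is a package dedicated to working with interval data, which is exactly what you have here. Here we assume that if your enrollment start or end falls within a month, you should be considered active in that month.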
library(ivs)
library(clock)
library(dplyr, warn.conflicts = FALSE)
enrollments <- tribble(
  ~unique_name, ~enrollment_start, ~enrollment_end,
  "Amy",        "1, Jan, 2017",    "30, Sep, 2018",
  "Franklin",   "1, Jan, 2017",    "19, Feb, 2017",
  "Franklin",   "5, Jun, 2017",    "4, Feb, 2018",
  "Franklin",   "21, Oct, 2018",   "9, Mar, 2019",
  "Samir",      "1, Jan, 2017",    "4, Feb, 2017",
  "Samir",      "5, Apr, 2017",    "12, Sep, 2018"
)
# Parse these into "day" precision year-month-day objects, then restrict
# them to just "month" precision because that is all we need
enrollments <- enrollments %>%
  mutate(
    start = enrollment_start %>%
      year_month_day_parse(format = "%d, %b, %Y") %>%
      calendar_narrow("month"),
    end = enrollment_end %>%
      year_month_day_parse(format = "%d, %b, %Y") %>%
      calendar_narrow("month") %>%
      add_months(1),
    .keep = "unused"
  )
enrollments
#> # A tibble: 6 × 3
#> unique_name start end
#> <chr> <ymd<month>> <ymd<month>>
#> 1 Amy 2017-01 2018-10
#> 2 Franklin 2017-01 2017-03
#> 3 Franklin 2017-06 2018-03
#> 4 Franklin 2018-10 2019-04
#> 5 Samir 2017-01 2017-03
#> 6 Samir 2017-04 2018-10
# Create an interval vector, note that these are half-open intervals.
# The month on the RHS is not included, which is why we added 1 to `end` above.
enrollments <- enrollments %>%
  mutate(active = iv(start, end), .keep = "unused")
enrollments
#> # A tibble: 6 × 2
#> unique_name active
#> <chr> <iv<ymd<month>>>
#> 1 Amy [2017-01, 2018-10)
#> 2 Franklin [2017-01, 2017-03)
#> 3 Franklin [2017-06, 2018-03)
#> 4 Franklin [2018-10, 2019-04)
#> 5 Samir [2017-01, 2017-03)
#> 6 Samir [2017-04, 2018-10)
# We'll generate a sequence of months that will be part of the final result
bounds <- range(enrollments$active)
lower <- iv_start(bounds[[1]])
upper <- iv_end(bounds[[2]]) - 1L
months <- tibble(month = seq(lower, upper, by = 1))
months
#> # A tibble: 27 × 1
#> month
#> <ymd<month>>
#> 1 2017-01
#> 2 2017-02
#> 3 2017-03
#> 4 2017-04
#> 5 2017-05
#> 6 2017-06
#> 7 2017-07
#> 8 2017-08
#> 9 2017-09
#> 10 2017-10
#> # … with 17 more rows
# To actually compute the counts, use `iv_count_between()`, which counts up all
# instances where `month[i]` is between any interval in `enrollments$active`
months %>%
  mutate(count = iv_count_between(month, enrollments$active))
#> # A tibble: 27 × 2
#> month count
#> <ymd<month>> <int>
#> 1 2017-01 3
#> 2 2017-02 3
#> 3 2017-03 1
#> 4 2017-04 2
#> 5 2017-05 2
#> 6 2017-06 3
#> 7 2017-07 3
#> 8 2017-08 3
#> 9 2017-09 3
#> 10 2017-10 3
#> # … with 17 more rows
Created on 2022-04-05 by the reprex package (v2.0.1)
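To make the half-open counting rule concrete, here is a base-R sketch of the same logic using integer month indices. It is only an illustration of the rule "start <= month < end" and is not how ivs implements iv_count_between(); the month_index() helper and the variable names are made up for this example.

```r
# Interval bounds at month precision, copied from the output above.
# `end_excl` is exclusive: it is the month after the last enrolled month.
start    <- c("2017-01", "2017-01", "2017-06", "2018-10", "2017-01", "2017-04")
end_excl <- c("2018-10", "2017-03", "2018-03", "2019-04", "2017-03", "2018-10")

# Hypothetical helper: map "YYYY-MM" to a running integer month index.
month_index <- function(ym) {
  parts <- strsplit(ym, "-", fixed = TRUE)
  vapply(parts, function(p) as.integer(p[1]) * 12L + as.integer(p[2]) - 1L,
         integer(1))
}

s <- month_index(start)
e <- month_index(end_excl)

# Every month from the earliest start to the last included month.
grid <- seq(min(s), max(e) - 1L)

# A month is counted for an interval when start <= month < end.
counts <- vapply(grid, function(m) sum(s <= m & m < e), integer(1))

result <- data.frame(
  month = sprintf("%d-%02d", grid %/% 12L, grid %% 12L + 1L),
  count = counts
)
```

On this sample data the first rows of result should line up with the iv_count_between() output shown above.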