Given a date range, how to expand to the number of days per month in that range?
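Use case:
The given data frame df has (among others) a startDate and an endDate column. My goal is to get one row per month contained in each range, with the columns year, month and numberOfDaysInMonth, all of type int.
Example:
Input: df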
id startDate endDate someOtherCol
1 2017-09-23 2018-02-01 val1
2 2018-01-01 2018-03-31 val2
... ... ... ...
Desired output: df_res
id year month numberOfDaysInMonth someOtherCol
1 2017 9 8 val1
1 2017 10 31 val1
1 2017 11 30 val1
1 2017 12 31 val1
1 2018 1 31 val1
1 2018 2 1 val1
2 2018 1 31 val2
2 2018 2 28 val2
2 2018 3 31 val2
... ... ... ... ...
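Background:
I am relatively new to R, but I am aware of the great dplyr and lubridate packages. I just have not managed to achieve the above in a neat way, even when using those packages. The closest I got was this: Expand rows by date range using start and end date, but that does not yield the number of days per month contained in the range.
Any help is much appreciated.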
If you do not mind a data.table solution, you can create a sequence of consecutive dates between startDate and endDate, then aggregate by id, someOtherCol, year and month, as follows:
dat[, .(Dates=seq(startDate, endDate, by="1 day")), by=.(id, someOtherCol)][,
.N, by=.(id, someOtherCol, year(Dates), month(Dates))]
Output:
id someOtherCol year month N
1: 1 val1 2017 9 8
2: 1 val1 2017 10 31
3: 1 val1 2017 11 30
4: 1 val1 2017 12 31
5: 1 val1 2018 1 31
6: 1 val1 2018 2 1
7: 2 val2 2018 1 31
8: 2 val2 2018 2 28
9: 2 val2 2018 3 31
Data:
library(data.table)
dat <- fread("id startDate endDate someOtherCol
1 2017-09-23 2018-02-01 val1
2 2018-01-01 2018-03-31 val2")
datecols <- c("startDate", "endDate")
dat[, (datecols) := lapply(.SD, as.Date, format="%Y-%m-%d"), .SDcols=datecols]
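For long ranges, expanding to one row per day can be avoided. The following is a minimal base R sketch, not part of the original answer, that steps through each range month by month and clips every month to the range. The helper name days_per_month is hypothetical, and the sketch assumes Date-typed columns with startDate <= endDate.

```r
# Example data from the question, with Date-typed columns
df <- data.frame(
  id = c(1L, 2L),
  startDate = as.Date(c("2017-09-23", "2018-01-01")),
  endDate = as.Date(c("2018-02-01", "2018-03-31")),
  someOtherCol = c("val1", "val2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: days per month for a single range (assumes start <= end)
days_per_month <- function(start, end) {
  # First day of the month that contains the start date
  first <- as.Date(format(start, "%Y-%m-01"))
  # First day of every month touched by the range
  month_starts <- seq(first, end, by = "month")
  # First day of the month following each of those months
  next_starts <- seq(first, by = "month",
                     length.out = length(month_starts) + 1)[-1]
  # Clip each month to the range, then count days inclusively
  from <- pmax(month_starts, start)
  to <- pmin(next_starts - 1, end)
  data.frame(
    year = as.integer(format(month_starts, "%Y")),
    month = as.integer(format(month_starts, "%m")),
    numberOfDaysInMonth = as.integer(as.numeric(to - from)) + 1L
  )
}

# Apply the helper to every row and stack the results
res <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
  m <- days_per_month(df$startDate[i], df$endDate[i])
  data.frame(id = df$id[i], m, someOtherCol = df$someOtherCol[i],
             stringsAsFactors = FALSE)
}))
res
```

This produces one row per month directly, so the intermediate result stays at the size of the final output instead of one row per day.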
A tidyverse solution:
# example data
df = read.table(text = "
id startDate endDate someOtherCol
1 2017-09-23 2018-02-01 val1
2 2018-01-01 2018-03-31 val2
", header=T, stringsAsFactors=F)
library(tidyverse)
library(lubridate)
df %>%
  mutate_at(vars(startDate, endDate), ymd) %>%                  # convert to date columns (if needed)
  group_by(id) %>%                                              # for each id
  mutate(d = list(seq(startDate, endDate, by="1 day"))) %>%     # create a sequence of dates (as a list)
  unnest(d) %>%                                                 # unnest the list column (tidyr >= 1.0.0 warns if it is not named)
  group_by(id, year=year(d), month=month(d), someOtherCol) %>%  # group by those variables (while getting year and month of each date in the sequence)
  summarise(numberOfDaysInMonth = n()) %>%                      # count days
  ungroup()                                                     # forget the grouping
# # A tibble: 9 x 5
# id year month someOtherCol numberOfDaysInMonth
# <int> <dbl> <dbl> <chr> <int>
# 1 1 2017 9 val1 8
# 2 1 2017 10 val1 31
# 3 1 2017 11 val1 30
# 4 1 2017 12 val1 31
# 5 1 2018 1 val1 31
# 6 1 2018 2 val1 1
# 7 2 2018 1 val2 31
# 8 2 2018 2 val2 28
# 9 2 2018 3 val2 31
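Both answers rely on the same idea: expand each range to one row per day, then count rows per id, year and month. As a rough illustration, here is a base R sketch of that idea with no extra packages, followed by a sanity check that the per-month counts of each id add up to the inclusive length of its range. The variable names days, counts, totals and expected are made up for this example.

```r
# Example data from the question, with Date-typed columns
df <- data.frame(
  id = c(1L, 2L),
  startDate = as.Date(c("2017-09-23", "2018-01-01")),
  endDate = as.Date(c("2018-02-01", "2018-03-31")),
  stringsAsFactors = FALSE
)

# One row per day in each range
days <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
  data.frame(id = df$id[i],
             d = seq(df$startDate[i], df$endDate[i], by = "1 day"))
}))

# Count rows per id, year and month
days$year <- as.integer(format(days$d, "%Y"))
days$month <- as.integer(format(days$d, "%m"))
days$n <- 1L
counts <- aggregate(n ~ id + year + month, data = days, FUN = sum)
counts <- counts[order(counts$id, counts$year, counts$month), ]

# Sanity check: counts per id must sum to endDate - startDate + 1
totals <- tapply(counts$n, counts$id, sum)
expected <- as.numeric(df$endDate - df$startDate) + 1
stopifnot(all(as.vector(totals[as.character(df$id)]) == expected))
```

The same check can be run against the output of either answer, which is a cheap way to catch off-by-one errors at the range boundaries.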