Loop through date intervals to create multiple variables
Suppose I have the following dataset:
library(lubridate)
library(tidyverse)
df <- data.frame(date1 = c("2011-09-18", "2013-03-06", "2013-08-08"),
date2 = c("2012-02-18", "2014-03-06", "2015-02-03"))
df$date1 <- as.Date(parse_date_time(df$date1, "ymd"))
df$date2 <- as.Date(parse_date_time(df$date2, "ymd"))
df
# date1 date2
# 1 2011-09-18 2012-02-18
# 2 2013-03-06 2014-03-06
# 3 2013-08-08 2015-02-03
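I want to create indicator variables that show whether a given year is associated with the interval between the two dates. For example, the third observation is associated with 2013, 2014, and 2015. I also want variables that indicate whether a particular date, e.g. April 1 of each year, falls within the interval.
Desired output: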
date1 date2 y_2011 y_2012 y_2013 y_2014 y_2015 y_1st_2011 y_1st_2012 y_1st_2013 y_1st_2014 y_1st_2015
1 2011-09-18 2012-02-18 1 1 0 0 0 0 0 0 0 0
2 2013-03-06 2014-03-06 0 0 1 1 0 0 0 1 0 0
3 2013-08-08 2015-02-03 0 0 1 1 1 0 0 0 1 0
Manually, I can do it like this:
# Is 2011 associated with the dates?
df$y_2011 <- if_else(year(df$date1) == 2011, 1, 0, as.numeric(NA))
# Is 2014 associated with the dates? (plain comparisons are vectorized over rows)
df$y_2014 <- if_else(year(df$date1) <= 2014 & 2014 <= year(df$date2), 1, 0, as.numeric(NA))
# Is a particular date (2014-04-01) within the interval? Compare as Date, not as character.
d <- as.Date("2014-04-01")
df$y_1st_2014 <- if_else(df$date1 <= d & d <= df$date2, 1, 0, as.numeric(NA))
I would like to put this into a function so that it is more automated:
# Particular date: April 1 of each year
b <- seq(as.Date("2011-04-01"), by = "year", length.out = 5)
b
# [1] "2011-04-01" "2012-04-01" "2013-04-01" "2014-04-01" "2015-04-01"
# Years
a <- 2011:2015
a
# [1] 2011 2012 2013 2014 2015
# Use the lambda argument x (not the whole vector a) inside the function
df[paste0("y_", a)] <- lapply(a, function(x) if_else(year(df$date1) <= x & x <= year(df$date2), 1, 0, as.numeric(NA)))
Any suggestions? Preferably a dplyr/purrr solution.
Answer:
Here is a solution that creates a matrix of the years associated with each date range:
library(lubridate)
library(tidyr)
library(dplyr)
df <- data.frame(date1 = c("2011-09-18", "2013-03-06", "2013-08-08"),
date2 = c("2012-02-18", "2014-03-06", "2015-02-03"))
df$date1 <- as.Date(parse_date_time(df$date1, "ymd"))
df$date2 <- as.Date(parse_date_time(df$date2, "ymd"))
# Identify the years associated with each row.
df$year <- sapply(seq_len(nrow(df)), function(i) {
  paste(seq(as.numeric(format(df$date1[i], "%Y")),
            as.numeric(format(df$date2[i], "%Y"))), collapse = ",")
})
# Separate into one row per year, then convert to wide format.
df %>%
  separate_rows(year, sep = ",") %>%
  mutate(value = 1) %>%
  spread(key = year, value = value, fill = 0)
# date1 date2 2011 2012 2013 2014 2015
# 1 2011-09-18 2012-02-18 1 1 0 0 0
# 2 2013-03-06 2014-03-06 0 0 1 1 0
# 3 2013-08-08 2015-02-03 0 0 1 1 1
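If the `y_` prefix from the desired output is wanted, the same year matrix can probably also be built without the reshape step. The following is only a sketch, assuming the set of years should span the earliest `date1` to the latest `date2`; the names `years`, `year_flags`, and `result` are illustrative and not part of the original post.

```r
library(dplyr)
library(purrr)
library(lubridate)

df <- data.frame(
  date1 = as.Date(c("2011-09-18", "2013-03-06", "2013-08-08")),
  date2 = as.Date(c("2012-02-18", "2014-03-06", "2015-02-03"))
)

# Assumption: test every year from the earliest start to the latest end
years <- seq(min(year(df$date1)), max(year(df$date2)))

# One 0/1 vector per year: 1 when the year lies within
# [year(date1), year(date2)] for that row
year_flags <- years %>%
  set_names(paste0("y_", years)) %>%
  map(function(y) as.integer(year(df$date1) <= y & y <= year(df$date2)))

# Attach the indicator columns to the original data
result <- bind_cols(df, as.data.frame(year_flags))
result
```

Mapping over the years rather than over the rows keeps each step vectorized, and the column names come directly from the names set on `years`.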
Using the between function is a viable option for testing whether a particular date falls within the range.
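As a rough illustration of the date test, here is a sketch for the April 1 indicators. It uses plain inclusive comparisons instead of `between()`, so it does not depend on how a given dplyr version handles vector bounds. The year range 2011:2015 is taken from the question, while `years`, `date_flags`, and `result` are illustrative names.

```r
library(dplyr)
library(purrr)
library(lubridate)

df <- data.frame(
  date1 = as.Date(c("2011-09-18", "2013-03-06", "2013-08-08")),
  date2 = as.Date(c("2012-02-18", "2014-03-06", "2015-02-03"))
)

# Years taken from the question
years <- 2011:2015

# For each year, build April 1 with make_date() and check whether it
# falls inside [date1, date2] (inclusive) for every row
date_flags <- years %>%
  set_names(paste0("y_1st_", years)) %>%
  map(function(y) {
    d <- make_date(y, 4, 1)
    as.integer(df$date1 <= d & d <= df$date2)
  })

# Attach the indicator columns to the original data
result <- bind_cols(df, as.data.frame(date_flags))
result
```

The date is constructed inside the function from the integer year, which avoids iterating over a `Date` vector directly and keeps the comparison strictly between `Date` values.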