Unstack lubridate's interval class

I am trying to convert a data frame df, which consists of a value column, two date columns (start and end), and an interval column (duration), into long format by unnesting/unstacking the duration column.

library(dplyr)
library(lubridate)

df <- data.frame(value = letters[1:3], start = as_date(1:3), end = as_date(3:1)+3) %>% 
          mutate(duration = interval(start, end))

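The expected result is a data frame in which value, start and end are replicated for each day defined by duration. For example, the value 'a' would appear 6 times, each time on a different day (January 2, 3, 4, 5, 6 and 7, 1970).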

I tried the unnest function from the tidyr package, but nothing happened.

tidyr::unnest(df, duration) 

Any help is much appreciated :)

I don't think interval helps here; seq.Date is probably a better fit...

library(purrr) # in addition to the packages you already loaded
library(tidyr) # provides unnest()

df <- data.frame(value = letters[1:3], start = as_date(1:3), end = as_date(3:1)+3) %>% 
   mutate(day = map2(start, end, seq.Date, by = "day")) %>% 
   unnest(day)

df
# A tibble: 12 x 4
   value start      end        day       
   <chr> <date>     <date>     <date>    
 1 a     1970-01-02 1970-01-07 1970-01-02
 2 a     1970-01-02 1970-01-07 1970-01-03
 3 a     1970-01-02 1970-01-07 1970-01-04
 4 a     1970-01-02 1970-01-07 1970-01-05
 5 a     1970-01-02 1970-01-07 1970-01-06
 6 a     1970-01-02 1970-01-07 1970-01-07
 7 b     1970-01-03 1970-01-06 1970-01-03
 8 b     1970-01-03 1970-01-06 1970-01-04
 9 b     1970-01-03 1970-01-06 1970-01-05
10 b     1970-01-03 1970-01-06 1970-01-06
11 c     1970-01-04 1970-01-05 1970-01-04
12 c     1970-01-04 1970-01-05 1970-01-05

To extract the start and end dates from the interval you can use int_start and int_end, then use map2 and unnest to create a sequence of dates.

library(dplyr)
library(purrr)
library(tidyr)
library(lubridate)

df %>%
  mutate(date = map2(int_start(duration), int_end(duration), 
                ~seq(as.Date(.x), as.Date(.y), by = 'day'))) %>%
  #This will also work but would return date of class POSIXct
  #mutate(date = map2(int_start(duration), int_end(duration),seq,by = 'day')) %>%
  unnest(date) %>%
  select(-duration)

#    value start      end        date      
#   <chr> <date>     <date>     <date>    
# 1 a     1970-01-02 1970-01-07 1970-01-02
# 2 a     1970-01-02 1970-01-07 1970-01-03
# 3 a     1970-01-02 1970-01-07 1970-01-04
# 4 a     1970-01-02 1970-01-07 1970-01-05
# 5 a     1970-01-02 1970-01-07 1970-01-06
# 6 a     1970-01-02 1970-01-07 1970-01-07
# 7 b     1970-01-03 1970-01-06 1970-01-03
# 8 b     1970-01-03 1970-01-06 1970-01-04
# 9 b     1970-01-03 1970-01-06 1970-01-05
#10 b     1970-01-03 1970-01-06 1970-01-06
#11 c     1970-01-04 1970-01-05 1970-01-04
#12 c     1970-01-04 1970-01-05 1970-01-05
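
As a small aside on why the as.Date() calls are there, here is a minimal sketch on a single interval (the names iv, s, e and days are made up for illustration). It assumes the interval was built from Date values, as in the question:

```r
library(lubridate)

# One interval built from two Dates, mirroring row 'a' of df
iv <- interval(as_date(1), as_date(6))

# int_start()/int_end() return POSIXct date-times, not Dates
s <- int_start(iv)
e <- int_end(iv)

# Converting to Date before seq() yields a Date sequence, one element per day
days <- seq(as.Date(s), as.Date(e), by = "day")
```

Without the conversion, seq() would step through POSIXct values, which is what the commented-out line in the answer refers to.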

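You can also use the following solution. Since we are creating duplicate rows, we can wrap the operation in a list and then use unnest_longer. purrr functions have always been my first choice, but you can use this as an alternative.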

library(dplyr)
library(tidyr)
library(lubridate)


df %>% 
  group_by(value) %>%
  mutate(date = list(start + 0:(duration/ddays(1)))) %>%
  unnest_longer(date) %>%
  select(-duration)


# A tibble: 12 x 4
# Groups:   value [3]
   value start      end        date      
   <chr> <date>     <date>     <date>    
 1 a     1970-01-02 1970-01-07 1970-01-02
 2 a     1970-01-02 1970-01-07 1970-01-03
 3 a     1970-01-02 1970-01-07 1970-01-04
 4 a     1970-01-02 1970-01-07 1970-01-05
 5 a     1970-01-02 1970-01-07 1970-01-06
 6 a     1970-01-02 1970-01-07 1970-01-07
 7 b     1970-01-03 1970-01-06 1970-01-03
 8 b     1970-01-03 1970-01-06 1970-01-04
 9 b     1970-01-03 1970-01-06 1970-01-05
10 b     1970-01-03 1970-01-06 1970-01-06
11 c     1970-01-04 1970-01-05 1970-01-04
12 c     1970-01-04 1970-01-05 1970-01-05
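
To make the duration/ddays(1) step easier to follow, here is a minimal sketch on one interval (iv, n_days and dates are illustrative names). It assumes the interval was built from Date values, so it spans a whole number of days:

```r
library(lubridate)

# One interval built from two Dates, mirroring row 'a' of df
iv <- interval(as_date(1), as_date(6))

# Dividing an interval by a one-day duration gives a plain number of days
n_days <- iv / ddays(1)

# Adding the offsets 0..n_days to the start date yields every day in the interval
dates <- as_date(1) + 0:n_days
```

For intervals that cross a daylight saving change in a non-UTC time zone the division may not come out as a whole number, so this is probably best kept to date-only data.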

You can't unnest a column of intervals and expect it to generate all the dates in between, but you can generate them yourself with seq. Try this:

library(tidyverse)
library(lubridate)

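df %>%
  rowwise() %>% 
  # summarise() returns several rows per input row here;
  # in dplyr >= 1.1.0 reframe() is the recommended verb for that
  summarise(
    value, dates = seq(start, end, by = 1)
  )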

#> # A tibble: 12 x 2
#>    value dates     
#>    <chr> <date>    
#>  1 a     1970-01-02
#>  2 a     1970-01-03
#>  3 a     1970-01-04
#>  4 a     1970-01-05
#>  5 a     1970-01-06
#>  6 a     1970-01-07
#>  7 b     1970-01-03
#>  8 b     1970-01-04
#>  9 b     1970-01-05
#> 10 b     1970-01-06
#> 11 c     1970-01-04
#> 12 c     1970-01-05

Created on 2021-05-18 by the reprex package (v1.0.0)

A data.table approach

library(data.table)
setDT(df)[, .(date = seq(start, end, by = 1)), by = .(value)]
# value       date
# 1:     a 1970-01-02
# 2:     a 1970-01-03
# 3:     a 1970-01-04
# 4:     a 1970-01-05
# 5:     a 1970-01-06
# 6:     a 1970-01-07
# 7:     b 1970-01-03
# 8:     b 1970-01-04
# 9:     b 1970-01-05
#10:     b 1970-01-06
#11:     c 1970-01-04
#12:     c 1970-01-05

Another option is tidyr's uncount:

df %>% uncount(as.integer(duration/(24*60*60)) +1) %>%
  group_by(value) %>%
  mutate(date = row_number() -1 + start)

# A tibble: 12 x 5
# Groups:   value [3]
   value start      end        duration                       date      
   <chr> <date>     <date>     <Interval>                     <date>    
 1 a     1970-01-02 1970-01-07 1970-01-02 UTC--1970-01-07 UTC 1970-01-02
 2 a     1970-01-02 1970-01-07 1970-01-02 UTC--1970-01-07 UTC 1970-01-03
 3 a     1970-01-02 1970-01-07 1970-01-02 UTC--1970-01-07 UTC 1970-01-04
 4 a     1970-01-02 1970-01-07 1970-01-02 UTC--1970-01-07 UTC 1970-01-05
 5 a     1970-01-02 1970-01-07 1970-01-02 UTC--1970-01-07 UTC 1970-01-06
 6 a     1970-01-02 1970-01-07 1970-01-02 UTC--1970-01-07 UTC 1970-01-07
 7 b     1970-01-03 1970-01-06 1970-01-03 UTC--1970-01-06 UTC 1970-01-03
 8 b     1970-01-03 1970-01-06 1970-01-03 UTC--1970-01-06 UTC 1970-01-04
 9 b     1970-01-03 1970-01-06 1970-01-03 UTC--1970-01-06 UTC 1970-01-05
10 b     1970-01-03 1970-01-06 1970-01-03 UTC--1970-01-06 UTC 1970-01-06
11 c     1970-01-04 1970-01-05 1970-01-04 UTC--1970-01-05 UTC 1970-01-04
12 c     1970-01-04 1970-01-05 1970-01-04 UTC--1970-01-05 UTC 1970-01-05
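
If the row_number() bookkeeping feels indirect, uncount can also number the copies itself through its .id argument. A minimal sketch on a toy tibble (toy, n and k are hypothetical names, not from the question), assuming the number of days per row is already available as an integer column:

```r
library(dplyr)
library(tidyr)

# Toy data: n is the number of days each row should expand to
toy <- tibble(
  value = c("a", "b"),
  start = as.Date(c("1970-01-02", "1970-01-03")),
  n     = c(3L, 2L)
)

long <- toy %>%
  uncount(n, .id = "k") %>%        # repeat each row n times; k counts 1..n within each original row
  mutate(date = start + k - 1) %>% # shift the start date by the copy index
  select(-k)
```

With the data from the question, n would be the day count computed from duration, as in the answer above.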