Get all possible combinations in time-series data with variable daily readings

I have a time-series dataset of daily consumption, as shown below:

library(dplyr)

consumption <- data.frame(
  date = as.Date(c('2020-06-01','2020-06-02','2020-06-03','2020-06-03',
                   '2020-06-03','2020-06-04','2020-06-05','2020-06-05')),
  val = c(10,20,31,32,33,40,51,52)
)

consumption <- consumption %>%
  group_by(date) %>%
  mutate(n = n(), record = row_number()) %>%
  ungroup()

consumption

# A tibble: 8 × 4
  date         val     n record
  <date>     <dbl> <int>  <int>
1 2020-06-01    10     1      1
2 2020-06-02    20     1      1
3 2020-06-03    31     3      1
4 2020-06-03    32     3      2
5 2020-06-03    33     3      3
6 2020-06-04    40     1      1
7 2020-06-05    51     2      1
8 2020-06-05    52     2      2

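Some days have more than one row in the dataset. I would like to convert it into split groups covering all possible combinations, for example: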

Group 1:

        date val record
1 2020-06-01  10      1
2 2020-06-02  20      1
3 2020-06-03  31      1
4 2020-06-04  40      1
5 2020-06-05  51      1

Group 2:

        date val record
1 2020-06-01  10      1
2 2020-06-02  20      1
3 2020-06-03  31      1
4 2020-06-04  40      1
5 2020-06-05  52      2

Group 3:

        date val record
1 2020-06-01  10      1
2 2020-06-02  20      1
3 2020-06-03  32      2
4 2020-06-04  40      1
5 2020-06-05  51      1

Group 4:

        date val record
1 2020-06-01  10      1
2 2020-06-02  20      1
3 2020-06-03  32      2
4 2020-06-04  40      1
5 2020-06-05  52      2

Group 5:

        date val record
1 2020-06-01  10      1
2 2020-06-02  20      1
3 2020-06-03  33      3
4 2020-06-04  40      1
5 2020-06-05  51      1

Group 6:

        date val record
1 2020-06-01  10      1
2 2020-06-02  20      1
3 2020-06-03  33      3
4 2020-06-04  40      1
5 2020-06-05  52      2
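
As a sanity check on the number of groups I expect: each group takes exactly one reading per date, so the count should be the product of the per-date reading counts (1 x 1 x 3 x 1 x 2 = 6 here). A minimal base R sketch of that arithmetic; the names `per_date` and `n_groups` are only for illustration:

```r
# Same dates as in the example data above
dates <- as.Date(c('2020-06-01','2020-06-02','2020-06-03','2020-06-03',
                   '2020-06-03','2020-06-04','2020-06-05','2020-06-05'))

# Number of readings on each date
per_date <- table(dates)

# One reading is picked per date, so the group count is the product
n_groups <- prod(per_date)
n_groups
```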

I tried the following solution, but it did not produce the expected result.

library(dplyr)
library(purrr)
out <- consumption %>% 
   filter(n > 1) %>%
    group_split(date, rn = row_number()) %>% 
    map(~ bind_rows(consumption %>%
          filter(n == 1), .x %>%
             select(-rn)) %>% 
         arrange(date))

I would greatly appreciate any help with this problem.

Many thanks,

We can filter the rows where 'n' is greater than 1, group_split them by 'date' and a row_number() index, and then bind each piece with the filtered rows where 'n' is 1

library(dplyr)
library(purrr)
out <- consumption %>% 
   filter(n > 1) %>%
    group_split(date, rn = row_number()) %>% 
    map(~ bind_rows(consumption %>%
          filter(n == 1), .x %>%
             select(-rn)) %>% 
         arrange(date))

-output

> out
[[1]]
# A tibble: 4 x 4
  date         val     n record
  <date>     <dbl> <int>  <int>
1 2020-06-01    10     1      1
2 2020-06-02    20     1      1
3 2020-06-03    31     3      1
4 2020-06-04    40     1      1

[[2]]
# A tibble: 4 x 4
  date         val     n record
  <date>     <dbl> <int>  <int>
1 2020-06-01    10     1      1
2 2020-06-02    20     1      1
3 2020-06-03    32     3      2
4 2020-06-04    40     1      1

[[3]]
# A tibble: 4 x 4
  date         val     n record
  <date>     <dbl> <int>  <int>
1 2020-06-01    10     1      1
2 2020-06-02    20     1      1
3 2020-06-03    33     3      3
4 2020-06-04    40     1      1
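
Note that this only works when a single date has repeated readings. On the data shown in the question, where two dates have repeats, the same code returns one list element per repeated row rather than one per combination, and each element is missing one of the dates. A small check, assuming the `consumption` tibble is built as in the question:

```r
library(dplyr)
library(purrr)

# Rebuild the data from the question, including the n and record columns
consumption <- data.frame(
  date = as.Date(c('2020-06-01','2020-06-02','2020-06-03','2020-06-03',
                   '2020-06-03','2020-06-04','2020-06-05','2020-06-05')),
  val = c(10,20,31,32,33,40,51,52)
) %>%
  group_by(date) %>%
  mutate(n = n(), record = row_number()) %>%
  ungroup()

# Same pipeline as above: one piece per repeated row, plus the unique-date rows
out <- consumption %>%
  filter(n > 1) %>%
  group_split(date, rn = row_number()) %>%
  map(~ bind_rows(consumption %>% filter(n == 1),
                  .x %>% select(-rn)) %>%
        arrange(date))

length(out)         # one element per repeated row (3 + 2), not per combination
map_int(out, nrow)  # rows per element; each covers 4 of the 5 dates
```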

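With the updated data, we create a row_number() index, split it by the 'date' column (as in @ThomasIsCoding's solution), expand the pieces with crossing (from tidyr), and then loop over the rows of the result with pmap, using each row's indices to slice the rows of the original data
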
library(tidyr)
library(tibble)
consumption %>%
     transmute(date, rn = row_number()) %>%
     deframe %>%
     split(names(.)) %>%
     # do.call() stands in for purrr::invoke(), which is deprecated as of purrr 1.0.0
     do.call(crossing, .) %>%
     pmap(~ consumption %>% 
        slice(c(...))) %>%
     unname

-output

[[1]]
# A tibble: 5 x 4
  date         val     n record
  <date>     <dbl> <int>  <int>
1 2020-06-01    10     1      1
2 2020-06-02    20     1      1
3 2020-06-03    31     3      1
4 2020-06-04    40     1      1
5 2020-06-05    51     2      1

[[2]]
# A tibble: 5 x 4
  date         val     n record
  <date>     <dbl> <int>  <int>
1 2020-06-01    10     1      1
2 2020-06-02    20     1      1
3 2020-06-03    31     3      1
4 2020-06-04    40     1      1
5 2020-06-05    52     2      2

[[3]]
# A tibble: 5 x 4
  date         val     n record
  <date>     <dbl> <int>  <int>
1 2020-06-01    10     1      1
2 2020-06-02    20     1      1
3 2020-06-03    32     3      2
4 2020-06-04    40     1      1
5 2020-06-05    51     2      1

[[4]]
# A tibble: 5 x 4
  date         val     n record
  <date>     <dbl> <int>  <int>
1 2020-06-01    10     1      1
2 2020-06-02    20     1      1
3 2020-06-03    32     3      2
4 2020-06-04    40     1      1
5 2020-06-05    52     2      2

[[5]]
# A tibble: 5 x 4
  date         val     n record
  <date>     <dbl> <int>  <int>
1 2020-06-01    10     1      1
2 2020-06-02    20     1      1
3 2020-06-03    33     3      3
4 2020-06-04    40     1      1
5 2020-06-05    51     2      1

[[6]]
# A tibble: 5 x 4
  date         val     n record
  <date>     <dbl> <int>  <int>
1 2020-06-01    10     1      1
2 2020-06-02    20     1      1
3 2020-06-03    33     3      3
4 2020-06-04    40     1      1
5 2020-06-05    52     2      2
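
To make the intermediate step easier to follow, here is a minimal sketch of just the index grid that the pipeline builds before slicing. It uses base `split()` instead of `deframe()` purely for brevity, and the names `idx_by_date` and `idx_grid` are illustrative:

```r
library(tidyr)

# Same dates as in the example data
dates <- as.Date(c('2020-06-01','2020-06-02','2020-06-03','2020-06-03',
                   '2020-06-03','2020-06-04','2020-06-05','2020-06-05'))

# Row indices of the original data, grouped by date
idx_by_date <- split(seq_along(dates), dates)

# One row per combination, one column per date; each cell is a row index
idx_grid <- do.call(crossing, idx_by_date)
idx_grid
```

Each row of `idx_grid` is then passed to `slice()` to pull the matching rows out of `consumption`.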

Perhaps you can try the code below

with(
  consumption,
  apply(
    expand.grid(
      split(seq_along(date), date)
    ),
    1,
    function(k) consumption[k, ]
  )
)

which gives

[[1]]
# A tibble: 5 x 4
  date         val     n record
  <date>     <dbl> <int>  <int>
1 2020-06-01    10     1      1
2 2020-06-02    20     1      1
3 2020-06-03    31     3      1
4 2020-06-04    40     1      1
5 2020-06-05    51     2      1

[[2]]
# A tibble: 5 x 4
  date         val     n record
  <date>     <dbl> <int>  <int>
1 2020-06-01    10     1      1
2 2020-06-02    20     1      1
3 2020-06-03    32     3      2
4 2020-06-04    40     1      1
5 2020-06-05    51     2      1

[[3]]
# A tibble: 5 x 4
  date         val     n record
  <date>     <dbl> <int>  <int>
1 2020-06-01    10     1      1
2 2020-06-02    20     1      1
3 2020-06-03    33     3      3
4 2020-06-04    40     1      1
5 2020-06-05    51     2      1

[[4]]
# A tibble: 5 x 4
  date         val     n record
  <date>     <dbl> <int>  <int>
1 2020-06-01    10     1      1
2 2020-06-02    20     1      1
3 2020-06-03    31     3      1
4 2020-06-04    40     1      1
5 2020-06-05    52     2      2

[[5]]
# A tibble: 5 x 4
  date         val     n record
  <date>     <dbl> <int>  <int>
1 2020-06-01    10     1      1
2 2020-06-02    20     1      1
3 2020-06-03    32     3      2
4 2020-06-04    40     1      1
5 2020-06-05    52     2      2

[[6]]
# A tibble: 5 x 4
  date         val     n record
  <date>     <dbl> <int>  <int>
1 2020-06-01    10     1      1
2 2020-06-02    20     1      1
3 2020-06-03    33     3      3
4 2020-06-04    40     1      1
5 2020-06-05    52     2      2
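
If a single long data frame is more convenient than a list, the pieces can be stacked with a group id. This is a sketch under the assumption that dplyr is available; `grid`, `combos`, `long` and the `group` column are names chosen here for illustration:

```r
library(dplyr)

consumption <- data.frame(
  date = as.Date(c('2020-06-01','2020-06-02','2020-06-03','2020-06-03',
                   '2020-06-03','2020-06-04','2020-06-05','2020-06-05')),
  val = c(10,20,31,32,33,40,51,52)
)

# All one-row-per-date index combinations, as in the code above
grid <- expand.grid(split(seq_len(nrow(consumption)), consumption$date))

# One data frame per combination
combos <- lapply(seq_len(nrow(grid)),
                 function(i) consumption[unlist(grid[i, ]), ])

# Stack them, recording which combination each row belongs to
long <- bind_rows(combos, .id = "group")
head(long)
```

Using `lapply()` over row numbers instead of `apply()` avoids coercing the index grid to a matrix, which makes no difference here but is safer if the grid ever holds non-numeric columns.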

Here is an approach using some basic dplyr/tidyr functions.

First, complete the data for each date/record combination. Then fill the missing values with the preceding value and reshape wide.

library(tidyverse)
consumption %>%
   complete(date, record) %>%
   group_by(date) %>% fill(val) %>% ungroup() %>%
   # name id_cols explicitly rather than relying on argument position
   pivot_wider(id_cols = date, names_from = record, values_from = val)

# A tibble: 5 x 4
  date         `1`   `2`   `3`
  <date>     <dbl> <dbl> <dbl>
1 2020-06-01    10    10    10
2 2020-06-02    20    20    20
3 2020-06-03    31    32    33
4 2020-06-04    40    40    40
5 2020-06-05    51    52    52