Apply time series decomposition (and anomaly detection) over a sliding/tiled window

The anomaly detection methods that Twitter published but has since abandoned have been separately forked and maintained in the anomalize package and in the hrbrmstr/AnomalyDetection fork. Both have implemented 'tidy' features.

Working static versions

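library(dplyr)
library(anomalize)

# Keep a single package so the static example is a plain (ungrouped) series
tidyverse_cran_downloads %>% 
  filter(package == "tidyr") %>% 
  ungroup() %>% 
  select(-package) -> one_package_only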

one_package_only %>% 
  anomalize::time_decompose(count,
                 merge = TRUE,
                 method = "twitter",
                 frequency = "7 days") -> one_package_only_decomp

one_package_only_decomp %>%
  anomalize::anomalize(remainder, method = "iqr") %>%
  anomalize::time_recompose()


one_package_only_decomp %>% 
  select(date, remainder) %>%
  AnomalyDetection::ad_ts(max_anoms = 0.02,
        direction = 'both')

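These work as expected.

I would like to apply the Twitter anomaly detection process over a window to my data set, which is structurally similar to the anomalize::tidyverse_cran_downloads data set: a regular series of well over 100 observations, grouped by a categorical variable.

The tsibble package (which supersedes the older tibbletime) has a way to apply functions with purrr-like syntax via slide, tile and stretch. As in purrr, this can include returning a whole data-frame-like object inside another data-frame-like object. (What a sentence!)

I have gone through the window function vignette but haven't had much luck.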

Attempt 1: slide2

The anomalize::decompose_twitter function takes two arguments, data and target.

tidyverse_cran_downloads %>%
  mutate(
    Monthly_MA = slide2_dfr(
      .x = .,
      .y = count,
      ~ anomalize::decompose_twitter,
      .size = 5
    )
  )

Error: Element 1 has length 3, not 1 or 425. Call rlang::last_error() to see a backtrace

Maybe I have misunderstood how the .x / .y syntax works?
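One detail of the purrr-style formula syntax may be relevant here, although it is probably not the only problem with this attempt: a formula such as `~ f` returns the function `f` itself, while `~ f(.x, .y)` actually calls it on the paired elements. A minimal sketch using purrr directly (not tsibble), with `sum` standing in for the real function:

```r
library(purrr)

# `~ sum` never calls sum(); each element of the result is the function itself
returned <- map2(1:3, 4:6, ~ sum)

# `~ sum(.x, .y)` calls sum() on each pair of elements
called <- map2_dbl(1:3, 4:6, ~ sum(.x, .y))

is.function(returned[[1]])
called
```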

Attempt 2: pmap

my_diag <- function(...) {
  data <- tibble(...)
  fit <- anomalize::decompose_twitter(data = data, target = count)
}

tidyverse_cran_downloads %>%
  nest(-package) %>%
  filter(package %in% c("tidyr", "lubridate")) %>%  # just to make it quick
  mutate(diag = purrr::map(data, ~ pslide_dfr(., my_diag, .size = 7)))

Error in stats::stl(., s.window = "periodic", robust = TRUE) : series is not periodic or has less than two periods

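It seems to run, but the time interval between observations is somehow not being parsed?

Attempt 3: ad_ts

ad_ts takes only one argument, so setting aside the fact that we still haven't found a way to compute the remainder of a decomposition, I should be able to use it via slide. It also expects its x to be: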

Time series as a two column data frame where the first column consists of the timestamps and the second column consists of the observations.

So we shouldn't have to do much to the data once it is nested.

tidyverse_cran_downloads %>%
  nest(-package, .key = "my_data") %>%
  mutate(
    Daily_MA = slide_dfr(
      .f = AnomalyDetection::ad_ts,
      .x = my_data
    )
  )

Error in .f(.x[[i]], ...) : data must be a single data frame.

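So the function is at least being called, but it is being called with more than one data frame?

What I want:

The only thing different about my data set is that I have half-hourly observations over a period of several months, and I really only need to recalculate the anomalies once a day (i.e. every 48 observations), with the window looking back over the previous 30 days to decompose and detect them.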
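To make the intended windowing concrete, here is a minimal base R sketch of the index ranges involved. The helper `lookback_windows()` and its argument names are hypothetical (they are not part of tsibble or anomalize); it only shows which rows each daily re-fit would see, assuming 48 observations per day and a 30-day look-back.

```r
# Hypothetical helper: row indices for a window that advances one day at a time
# and always looks back over the previous `lookback_days` days.
lookback_windows <- function(n_obs, per_day = 48L, lookback_days = 30L) {
  width <- per_day * lookback_days
  if (n_obs < width) return(list())          # not enough history for one window
  ends <- seq(from = width, to = n_obs, by = per_day)
  lapply(ends, function(end) seq.int(end - width + 1, end))
}

# 35 days of half-hourly data: the first window covers days 1-30,
# and each later window is shifted forward by one day (48 rows).
w <- lookback_windows(n_obs = 48L * 35L)
length(w)
range(w[[1]])
range(w[[2]])
```

Each element of `w` could then be used to subset the series before handing it to the decomposition and detection step.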

(N.B. I would tag this tsibble and anomalize, but I don't have the rep to create those tags.)

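Attempt 2 should work as expected. The error message comes from stl(), which needs more than two full seasonal periods to produce an estimate. For daily data with weekly seasonality, that means more than 14 observations for stl() to run. Increasing the window size to .size = 7 * 3 works fine.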

my_decomp <- function(...) {
  data <- tibble(...)
  anomalize::decompose_twitter(data, count)
}

library(dplyr)
library(anomalize)
tidyverse_cran_downloads %>%
  group_by(package) %>% 
  tidyr::nest() %>% 
  mutate(diag = purrr::map(data, ~ tsibble::pslide_dfr(., my_decomp, .size = 7 * 3)))
#> # A tibble: 15 x 3
#>    package   data               diag                
#>    <chr>     <list>             <list>              
#>  1 tidyr     <tibble [425 × 2]> <tibble [8,506 × 5]>
#>  2 lubridate <tibble [425 × 2]> <tibble [8,506 × 5]>
#>  3 dplyr     <tibble [425 × 2]> <tibble [8,506 × 5]>
#>  4 broom     <tibble [425 × 2]> <tibble [8,506 × 5]>
#>  5 tidyquant <tibble [425 × 2]> <tibble [8,506 × 5]>
#>  6 tidytext  <tibble [425 × 2]> <tibble [8,506 × 5]>
#>  7 ggplot2   <tibble [425 × 2]> <tibble [8,506 × 5]>
#>  8 purrr     <tibble [425 × 2]> <tibble [8,506 × 5]>
#>  9 glue      <tibble [425 × 2]> <tibble [8,506 × 5]>
#> 10 stringr   <tibble [425 × 2]> <tibble [8,506 × 5]>
#> 11 forcats   <tibble [425 × 2]> <tibble [8,506 × 5]>
#> 12 knitr     <tibble [425 × 2]> <tibble [8,506 × 5]>
#> 13 readr     <tibble [425 × 2]> <tibble [8,506 × 5]>
#> 14 tibble    <tibble [425 × 2]> <tibble [8,506 × 5]>
#> 15 tidyverse <tibble [425 × 2]> <tibble [8,506 × 5]>
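The window-length requirement can be checked against stats::stl() directly, without tsibble or anomalize. This is a minimal sketch on a made-up daily series with a weekly pattern; `make_series()` and `stl_fits()` are hypothetical helpers, and the stl() arguments mirror the call shown in the error message above.

```r
set.seed(1)

# Made-up daily series with a weekly pattern (illustrative values only)
make_series <- function(n) {
  ts(rep(c(5, 6, 7, 8, 9, 3, 2), length.out = n) + rnorm(n, sd = 0.1),
     frequency = 7)
}

# TRUE if stl() can decompose a window of n daily observations, FALSE if it errors
stl_fits <- function(n) {
  tryCatch({
    stats::stl(make_series(n), s.window = "periodic", robust = TRUE)
    TRUE
  }, error = function(e) FALSE)
}

stl_fits(7)    # one week, same as .size = 7
stl_fits(14)   # exactly two weeks
stl_fits(21)   # three weeks, same as .size = 7 * 3
```

Only the last call succeeds: a window of exactly two periods is still rejected, which is why a size comfortably above 14 is the safer choice.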