dplyr-- 如何聚合 window 最后或未来的 N 项?
dplyr-- how to do aggregate window of last or future N items?
我有以下数据:
> str(eth)
'data.frame': 11029 obs. of 10 variables:
$ Date : Date, format: "2017-07-01" "2017-07-01" "2017-07-01" "2017-07-01" ...
$ Symbol : Factor w/ 1 level "ETHUSD": 1 1 1 1 1 1 1 1 1 1 ...
$ Open : num 264 257 258 258 260 ...
$ High : num 265 264 261 262 261 ...
$ Low : num 260 256 254 257 253 ...
$ Close : num 263 264 257 258 258 ...
$ Volume.From : num 7221 4975 10747 9118 15402 ...
$ Volume.To : num 1902503 1290128 2765561 2366698 3962669 ...
$ future_12h_high: num NA NA NA NA NA NA NA NA NA NA ...
$ past_12h_high : num NA NA NA NA NA NA NA NA NA NA ...
并且:
> head(eth)
Date Symbol Open High Low Close Volume.From Volume.To future_12h_high past_12h_high
1 2017-07-01 ETHUSD 263.84 264.97 260.31 263.12 7221.08 1902503 NA NA
2 2017-07-01 ETHUSD 257.13 264.36 256.03 263.84 4975.12 1290128 NA NA
3 2017-07-01 ETHUSD 258.17 260.56 254.15 257.13 10746.60 2765561 NA NA
4 2017-07-01 ETHUSD 258.49 262.00 257.12 258.17 9118.43 2366698 NA NA
5 2017-07-01 ETHUSD 259.50 260.88 253.23 258.49 15402.48 3962669 NA NA
6 2017-07-01 ETHUSD 263.51 266.73 255.27 259.50 20821.39 5396852 NA NA
并且:
> dput(head(eth,10))
structure(list(Date = structure(c(17348, 17348, 17348, 17348,
17348, 17348, 17348, 17348, 17348, 17348), class = "Date"), Symbol = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "ETHUSD", class = "factor"),
Open = c(263.84, 257.13, 258.17, 258.49, 259.5, 263.51, 268,
272.57, 265.74, 268.79), High = c(264.97, 264.36, 260.56,
262, 260.88, 266.73, 268.44, 272.57, 272.74, 269.9), Low = c(260.31,
256.03, 254.15, 257.12, 253.23, 255.27, 262.39, 267.6, 265,
265), Close = c(263.12, 263.84, 257.13, 258.17, 258.49, 259.5,
263.51, 268, 272.57, 265.74), Volume.From = c(7221.08, 4975.12,
10746.6, 9118.43, 15402.48, 20821.39, 7142.36, 4776.58, 5581.66,
6367.05), Volume.To = c(1902503.11, 1290127.76, 2765560.88,
2366698.5, 3962669, 5396852.35, 1894983.33, 1287300.75, 1500282.55,
1702536.85), future_12h_high = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), past_12h_high = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_)), row.names = c(NA, 10L), class = "data.frame")
我正在尝试计算过去 12 个数据点内的过去和未来高点。
eth <- read.csv("/Users/micahsmith/Downloads/Gdax_ETHUSD_1h.csv") %>%
mutate(Date = as.Date(Date, "%Y-%m-%d %I-%p")) %>%
arrange(Date) %>%
mutate(future_12h_high = max(lead(High,12)), past_12h_high = max(lag(High,12)))
上面的代码不正确 -- 它正在计算当前项目和 [item - 12] 索引项目的最大值。我想使用 all 中的最后 12 个和所有未来的 12 个项目。
总结
我如何 window 使用一系列未来和过去的项目,而不仅仅是一个值?
我已将 window 从 12 降低到 4,因为您的示例数据没有 12 个观察值,但这是使用 RcppRoll
:
的方法
# install.packages("RcppRoll")
library(RcppRoll)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
eth %>% mutate(Date = as.Date(Date, "%Y-%m-%d %I-%p")) %>%
arrange(Date) %>%
mutate(future_12h_high = roll_maxl(High, 4),
past_12h_high = roll_maxr(High, 4))
#> Date Symbol Open High Low Close Volume.From Volume.To
#> 1 2017-07-01 ETHUSD 263.84 264.97 260.31 263.12 7221.08 1902503
#> 2 2017-07-01 ETHUSD 257.13 264.36 256.03 263.84 4975.12 1290128
#> 3 2017-07-01 ETHUSD 258.17 260.56 254.15 257.13 10746.60 2765561
#> 4 2017-07-01 ETHUSD 258.49 262.00 257.12 258.17 9118.43 2366698
#> 5 2017-07-01 ETHUSD 259.50 260.88 253.23 258.49 15402.48 3962669
#> 6 2017-07-01 ETHUSD 263.51 266.73 255.27 259.50 20821.39 5396852
#> 7 2017-07-01 ETHUSD 268.00 268.44 262.39 263.51 7142.36 1894983
#> 8 2017-07-01 ETHUSD 272.57 272.57 267.60 268.00 4776.58 1287301
#> 9 2017-07-01 ETHUSD 265.74 272.74 265.00 272.57 5581.66 1500283
#> 10 2017-07-01 ETHUSD 268.79 269.90 265.00 265.74 6367.05 1702537
#> future_12h_high past_12h_high
#> 1 264.97 NA
#> 2 264.36 NA
#> 3 266.73 NA
#> 4 268.44 264.97
#> 5 272.57 264.36
#> 6 272.74 266.73
#> 7 272.74 268.44
#> 8 NA 272.57
#> 9 NA 272.74
#> 10 NA 272.74
如 dplyr
Window functions vignette、"Rolling aggregates operate in a fixed width window. You won’t find them in base R or in dplyr, but there are many implementations in other packages, such as RcppRoll."
中所述
roll_maxr(High, 4)
在右对齐的长度为 4 的滚动 window 中获得 High
的最大值(因此三个观察 before当前观察和当前观察)。 roll_maxl(High, 4)
做同样的事情,但是左对齐(所以三个观察在当前观察和当前观察之后)。
与@duckmayr post 类似,您还可以使用 zoo
库中的 rollapplyr
:
df %>%
mutate(Date = as.Date(Date, "%Y-%m-%d %I-%p")) %>%
arrange(Date) %>%
mutate(past_12h_high = rollapplyr(High, 4, max, align = "right", partial = FALSE, fill = NA),
future_12h_high = rollapplyr(High, 4, max, align = "left", partial = FALSE, fill = NA))
Date Symbol Open High Low Close Volume.From Volume.To
1 2017-07-01 ETHUSD 263.84 264.97 260.31 263.12 7221.08 1902503
2 2017-07-01 ETHUSD 257.13 264.36 256.03 263.84 4975.12 1290128
3 2017-07-01 ETHUSD 258.17 260.56 254.15 257.13 10746.60 2765561
4 2017-07-01 ETHUSD 258.49 262.00 257.12 258.17 9118.43 2366698
5 2017-07-01 ETHUSD 259.50 260.88 253.23 258.49 15402.48 3962669
6 2017-07-01 ETHUSD 263.51 266.73 255.27 259.50 20821.39 5396852
7 2017-07-01 ETHUSD 268.00 268.44 262.39 263.51 7142.36 1894983
8 2017-07-01 ETHUSD 272.57 272.57 267.60 268.00 4776.58 1287301
9 2017-07-01 ETHUSD 265.74 272.74 265.00 272.57 5581.66 1500283
10 2017-07-01 ETHUSD 268.79 269.90 265.00 265.74 6367.05 1702537
future_12h_high past_12h_high
1 264.97 NA
2 264.36 NA
3 266.73 NA
4 268.44 264.97
5 272.57 264.36
6 272.74 266.73
7 272.74 268.44
8 NA 272.57
9 NA 272.74
10 NA 272.74
此外,对于 partial = TRUE
,rollapplyr
将 return 一个值(而不是 NA(s)),即使 window 大小小于指定值:
df %>%
mutate(Date = as.Date(Date, "%Y-%m-%d %I-%p")) %>%
arrange(Date) %>%
mutate(past_12h_high = rollapplyr(High, 4, max, align = "right", partial = TRUE, fill = NA),
future_12h_high = rollapplyr(High, 4, max, align = "left", partial = TRUE, fill = NA))
Date Symbol Open High Low Close Volume.From Volume.To
1 2017-07-01 ETHUSD 263.84 264.97 260.31 263.12 7221.08 1902503
2 2017-07-01 ETHUSD 257.13 264.36 256.03 263.84 4975.12 1290128
3 2017-07-01 ETHUSD 258.17 260.56 254.15 257.13 10746.60 2765561
4 2017-07-01 ETHUSD 258.49 262.00 257.12 258.17 9118.43 2366698
5 2017-07-01 ETHUSD 259.50 260.88 253.23 258.49 15402.48 3962669
6 2017-07-01 ETHUSD 263.51 266.73 255.27 259.50 20821.39 5396852
7 2017-07-01 ETHUSD 268.00 268.44 262.39 263.51 7142.36 1894983
8 2017-07-01 ETHUSD 272.57 272.57 267.60 268.00 4776.58 1287301
9 2017-07-01 ETHUSD 265.74 272.74 265.00 272.57 5581.66 1500283
10 2017-07-01 ETHUSD 268.79 269.90 265.00 265.74 6367.05 1702537
future_12h_high past_12h_high
1 264.97 264.97
2 264.36 264.97
3 266.73 264.97
4 268.44 264.97
5 272.57 264.36
6 272.74 266.73
7 272.74 268.44
8 272.74 272.57
9 272.74 272.74
10 269.90 272.74
我有以下数据:
> str(eth)
'data.frame': 11029 obs. of 10 variables:
$ Date : Date, format: "2017-07-01" "2017-07-01" "2017-07-01" "2017-07-01" ...
$ Symbol : Factor w/ 1 level "ETHUSD": 1 1 1 1 1 1 1 1 1 1 ...
$ Open : num 264 257 258 258 260 ...
$ High : num 265 264 261 262 261 ...
$ Low : num 260 256 254 257 253 ...
$ Close : num 263 264 257 258 258 ...
$ Volume.From : num 7221 4975 10747 9118 15402 ...
$ Volume.To : num 1902503 1290128 2765561 2366698 3962669 ...
$ future_12h_high: num NA NA NA NA NA NA NA NA NA NA ...
$ past_12h_high : num NA NA NA NA NA NA NA NA NA NA ...
并且:
> head(eth)
Date Symbol Open High Low Close Volume.From Volume.To future_12h_high past_12h_high
1 2017-07-01 ETHUSD 263.84 264.97 260.31 263.12 7221.08 1902503 NA NA
2 2017-07-01 ETHUSD 257.13 264.36 256.03 263.84 4975.12 1290128 NA NA
3 2017-07-01 ETHUSD 258.17 260.56 254.15 257.13 10746.60 2765561 NA NA
4 2017-07-01 ETHUSD 258.49 262.00 257.12 258.17 9118.43 2366698 NA NA
5 2017-07-01 ETHUSD 259.50 260.88 253.23 258.49 15402.48 3962669 NA NA
6 2017-07-01 ETHUSD 263.51 266.73 255.27 259.50 20821.39 5396852 NA NA
并且:
> dput(head(eth,10))
structure(list(Date = structure(c(17348, 17348, 17348, 17348,
17348, 17348, 17348, 17348, 17348, 17348), class = "Date"), Symbol = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "ETHUSD", class = "factor"),
Open = c(263.84, 257.13, 258.17, 258.49, 259.5, 263.51, 268,
272.57, 265.74, 268.79), High = c(264.97, 264.36, 260.56,
262, 260.88, 266.73, 268.44, 272.57, 272.74, 269.9), Low = c(260.31,
256.03, 254.15, 257.12, 253.23, 255.27, 262.39, 267.6, 265,
265), Close = c(263.12, 263.84, 257.13, 258.17, 258.49, 259.5,
263.51, 268, 272.57, 265.74), Volume.From = c(7221.08, 4975.12,
10746.6, 9118.43, 15402.48, 20821.39, 7142.36, 4776.58, 5581.66,
6367.05), Volume.To = c(1902503.11, 1290127.76, 2765560.88,
2366698.5, 3962669, 5396852.35, 1894983.33, 1287300.75, 1500282.55,
1702536.85), future_12h_high = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), past_12h_high = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_)), row.names = c(NA, 10L), class = "data.frame")
我正在尝试计算过去 12 个数据点内的过去和未来高点。
eth <- read.csv("/Users/micahsmith/Downloads/Gdax_ETHUSD_1h.csv") %>%
mutate(Date = as.Date(Date, "%Y-%m-%d %I-%p")) %>%
arrange(Date) %>%
mutate(future_12h_high = max(lead(High,12)), past_12h_high = max(lag(High,12)))
上面的代码不正确 -- 它正在计算当前项目和 [item - 12] 索引项目的最大值。我想使用 all 中的最后 12 个和所有未来的 12 个项目。
总结
我如何 window 使用一系列未来和过去的项目,而不仅仅是一个值?
我已将 window 从 12 降低到 4,因为您的示例数据没有 12 个观察值,但这是使用 RcppRoll
:
# install.packages("RcppRoll")
library(RcppRoll)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
eth %>% mutate(Date = as.Date(Date, "%Y-%m-%d %I-%p")) %>%
arrange(Date) %>%
mutate(future_12h_high = roll_maxl(High, 4),
past_12h_high = roll_maxr(High, 4))
#> Date Symbol Open High Low Close Volume.From Volume.To
#> 1 2017-07-01 ETHUSD 263.84 264.97 260.31 263.12 7221.08 1902503
#> 2 2017-07-01 ETHUSD 257.13 264.36 256.03 263.84 4975.12 1290128
#> 3 2017-07-01 ETHUSD 258.17 260.56 254.15 257.13 10746.60 2765561
#> 4 2017-07-01 ETHUSD 258.49 262.00 257.12 258.17 9118.43 2366698
#> 5 2017-07-01 ETHUSD 259.50 260.88 253.23 258.49 15402.48 3962669
#> 6 2017-07-01 ETHUSD 263.51 266.73 255.27 259.50 20821.39 5396852
#> 7 2017-07-01 ETHUSD 268.00 268.44 262.39 263.51 7142.36 1894983
#> 8 2017-07-01 ETHUSD 272.57 272.57 267.60 268.00 4776.58 1287301
#> 9 2017-07-01 ETHUSD 265.74 272.74 265.00 272.57 5581.66 1500283
#> 10 2017-07-01 ETHUSD 268.79 269.90 265.00 265.74 6367.05 1702537
#> future_12h_high past_12h_high
#> 1 264.97 NA
#> 2 264.36 NA
#> 3 266.73 NA
#> 4 268.44 264.97
#> 5 272.57 264.36
#> 6 272.74 266.73
#> 7 272.74 268.44
#> 8 NA 272.57
#> 9 NA 272.74
#> 10 NA 272.74
如 dplyr
Window functions vignette、"Rolling aggregates operate in a fixed width window. You won’t find them in base R or in dplyr, but there are many implementations in other packages, such as RcppRoll."
roll_maxr(High, 4)
在右对齐的长度为 4 的滚动 window 中获得 High
的最大值(因此三个观察 before当前观察和当前观察)。 roll_maxl(High, 4)
做同样的事情,但是左对齐(所以三个观察在当前观察和当前观察之后)。
与@duckmayr post 类似,您还可以使用 zoo
库中的 rollapplyr
:
df %>%
mutate(Date = as.Date(Date, "%Y-%m-%d %I-%p")) %>%
arrange(Date) %>%
mutate(past_12h_high = rollapplyr(High, 4, max, align = "right", partial = FALSE, fill = NA),
future_12h_high = rollapplyr(High, 4, max, align = "left", partial = FALSE, fill = NA))
Date Symbol Open High Low Close Volume.From Volume.To
1 2017-07-01 ETHUSD 263.84 264.97 260.31 263.12 7221.08 1902503
2 2017-07-01 ETHUSD 257.13 264.36 256.03 263.84 4975.12 1290128
3 2017-07-01 ETHUSD 258.17 260.56 254.15 257.13 10746.60 2765561
4 2017-07-01 ETHUSD 258.49 262.00 257.12 258.17 9118.43 2366698
5 2017-07-01 ETHUSD 259.50 260.88 253.23 258.49 15402.48 3962669
6 2017-07-01 ETHUSD 263.51 266.73 255.27 259.50 20821.39 5396852
7 2017-07-01 ETHUSD 268.00 268.44 262.39 263.51 7142.36 1894983
8 2017-07-01 ETHUSD 272.57 272.57 267.60 268.00 4776.58 1287301
9 2017-07-01 ETHUSD 265.74 272.74 265.00 272.57 5581.66 1500283
10 2017-07-01 ETHUSD 268.79 269.90 265.00 265.74 6367.05 1702537
future_12h_high past_12h_high
1 264.97 NA
2 264.36 NA
3 266.73 NA
4 268.44 264.97
5 272.57 264.36
6 272.74 266.73
7 272.74 268.44
8 NA 272.57
9 NA 272.74
10 NA 272.74
此外,对于 partial = TRUE
,rollapplyr
将 return 一个值(而不是 NA(s)),即使 window 大小小于指定值:
df %>%
mutate(Date = as.Date(Date, "%Y-%m-%d %I-%p")) %>%
arrange(Date) %>%
mutate(past_12h_high = rollapplyr(High, 4, max, align = "right", partial = TRUE, fill = NA),
future_12h_high = rollapplyr(High, 4, max, align = "left", partial = TRUE, fill = NA))
Date Symbol Open High Low Close Volume.From Volume.To
1 2017-07-01 ETHUSD 263.84 264.97 260.31 263.12 7221.08 1902503
2 2017-07-01 ETHUSD 257.13 264.36 256.03 263.84 4975.12 1290128
3 2017-07-01 ETHUSD 258.17 260.56 254.15 257.13 10746.60 2765561
4 2017-07-01 ETHUSD 258.49 262.00 257.12 258.17 9118.43 2366698
5 2017-07-01 ETHUSD 259.50 260.88 253.23 258.49 15402.48 3962669
6 2017-07-01 ETHUSD 263.51 266.73 255.27 259.50 20821.39 5396852
7 2017-07-01 ETHUSD 268.00 268.44 262.39 263.51 7142.36 1894983
8 2017-07-01 ETHUSD 272.57 272.57 267.60 268.00 4776.58 1287301
9 2017-07-01 ETHUSD 265.74 272.74 265.00 272.57 5581.66 1500283
10 2017-07-01 ETHUSD 268.79 269.90 265.00 265.74 6367.05 1702537
future_12h_high past_12h_high
1 264.97 264.97
2 264.36 264.97
3 266.73 264.97
4 268.44 264.97
5 272.57 264.36
6 272.74 266.73
7 272.74 268.44
8 272.74 272.57
9 272.74 272.74
10 269.90 272.74