Rolling sum function in Rcpp

Question

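I am currently working with a large data frame and need to create rolling sums of several lengths for several variables. I have a working method using data.table, but it takes a long time to run for a single variable (about 50 minutes per variable).

I have spent some time refining the script to speed it up, but I have run out of ideas. I have no experience with C++, but thought the Rcpp package might be an option. I have looked into it myself but could not come up with anything usable.

Here is my data.table script for one variable: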

df_td <- setDT(df_1, key=c("Match","Name"))[,by=.(Match, Name), paste0("Period_", 1:10) 
                                        := mclapply((1:10)*600, function(x) rollsumr(Dist, x, fill = NA))][]

I have already used parallel::mclapply, which helped, but it still takes a long time to run.

> dput(head(df_1, 20))
structure(list(Match = c("Bath_A", "Bath_A", "Bath_A", "Bath_A", 
"Bath_A", "Bath_A", "Bath_A", "Bath_A", "Bath_A", "Bath_A", "Bath_A", 
"Bath_A", "Bath_A", "Bath_A", "Bath_A", "Bath_A", "Bath_A", "Bath_A", 
"Bath_A", "Bath_A"), Name = c("Jono Lance", "Jono Lance", "Jono     Lance", 
"Jono Lance", "Jono Lance", "Jono Lance", "Jono Lance", "Jono Lance", 
"Jono Lance", "Jono Lance", "Jono Lance", "Jono Lance", "Jono Lance", 
"Jono Lance", "Jono Lance", "Jono Lance", "Jono Lance", "Jono Lance", 
"Jono Lance", "Jono Lance"), Dist = c(0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), Dist_HS = c(0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), Dist_SD = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names =    c(NA, 
-20L), class = c("tbl_df", "tbl", "data.frame"))

> str(df_1)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   26533771 obs. of  5     variables:
$ Match  : chr  "Bath_A" "Bath_A" "Bath_A" "Bath_A" ...
$ Name   : chr  "Jono Lance" "Jono Lance" "Jono Lance" "Jono Lance"   ...
$ Dist   : num  0 0 0 0 0 0 0 0 0 0 ...
$ Dist_HS: num  0 0 0 0 0 0 0 0 0 0 ...
$ Dist_SD: num  0 0 0 0 0 0 0 0 0 0 ...

Any suggestions for speeding this up would be greatly appreciated.

Answer 1

I may have found a solution to my problem here, by adding the following function via Rcpp:

library(Rcpp)

cppFunction('
NumericVector run_sum_v2(NumericVector x, int n) {
    int sz = x.size();

    // start with every element set to NA; the first n-1 stay that way
    NumericVector res(sz, NA_REAL);

    // no complete window exists unless 1 <= n <= sz
    if (n < 1 || n > sz) return res;

    // sum the first n values of the vector
    res[n-1] = std::accumulate(x.begin(), x.begin() + n, 0.0);

    // slide the window: add the entering value, drop the leaving one
    for (int i = n; i < sz; i++) {
        res[i] = res[i-1] + x[i] - x[i-n];
    }

    return res;
}
')

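Using run_sum_v2 in place of zoo::rollsumr in my data.table line works and appears to be much faster (<1 minute). I still need to run some checks on the final data, but so far it looks promising.

This has cut a script that took over 2 hours down to under 20 seconds, so I am happy with this approach, unless there are any problems with it?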
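
One way to run the checks mentioned above is to compare the running-total version against a plain re-summing version on a small input. The following is only a sketch in standalone C++ (no Rcpp), assuming NaN stands in for NA; the names rolling_sum_fast, rolling_sum_naive and same_result are made up for illustration and are not part of Rcpp or zoo. Because the running total adds and subtracts as it goes, its floating-point rounding can differ slightly from re-summing each window, so the comparison uses a tolerance.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

// Right-aligned rolling sum using a running total (same idea as run_sum_v2).
// Positions without a complete window hold NaN, standing in for NA.
std::vector<double> rolling_sum_fast(const std::vector<double>& x, std::size_t n) {
    const double na = std::numeric_limits<double>::quiet_NaN();
    std::vector<double> res(x.size(), na);
    if (n == 0 || n > x.size()) return res;  // no complete window
    res[n - 1] = std::accumulate(x.begin(), x.begin() + n, 0.0);
    for (std::size_t i = n; i < x.size(); ++i) {
        res[i] = res[i - 1] + x[i] - x[i - n];  // add entering, drop leaving value
    }
    return res;
}

// Reference version: re-sum every window from scratch (slow but obvious).
std::vector<double> rolling_sum_naive(const std::vector<double>& x, std::size_t n) {
    const double na = std::numeric_limits<double>::quiet_NaN();
    std::vector<double> res(x.size(), na);
    if (n == 0 || n > x.size()) return res;
    for (std::size_t i = n - 1; i < x.size(); ++i) {
        res[i] = std::accumulate(x.begin() + (i + 1 - n), x.begin() + (i + 1), 0.0);
    }
    return res;
}

// True if both vectors agree within tol, treating NaN as equal to NaN.
bool same_result(const std::vector<double>& a, const std::vector<double>& b,
                 double tol = 1e-9) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const bool na_a = std::isnan(a[i]);
        const bool na_b = std::isnan(b[i]);
        if (na_a != na_b) return false;
        if (!na_a && std::fabs(a[i] - b[i]) > tol) return false;
    }
    return true;
}
```

Calling same_result(rolling_sum_fast(x, n), rolling_sum_naive(x, n)) on a slice of the real data would be one way to gain confidence before relying on the fast version.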

Answer 2

Because the sums overlap, you can reuse the sums from previous iterations. Here is a possible approach using shift:

library(data.table)
library(RcppRoll)
DT[, Period_1 := roll_sumr(Dist, 600L, fill=NA), by=.(ID)]
for (n in 2L:10L) {
    DT[, paste0("Period_", n) := {
            x <- get(paste0("Period_", n-1L))
            shift(x, 600L) + Period_1
        },
        by=.(ID)]
}

Replacing the loop with Reduce:

library(RcppRoll)
DT[, Period_1 := roll_sumr(Dist, 600L, fill=NA), by=.(ID)]
DT[, paste0("Period_", 1L:10L) :=
    Reduce(function(x, y) x + y, shift(Period_1, (1L:9L)*600L), Period_1, accumulate=TRUE),
    by=.(ID)]

Data:

library(data.table)
set.seed(0L)
nsampl <- 6003
nIDs <- 1
DT <- data.table(ID=rep(1:nIDs, each=nsampl), 
    Dist=rnorm(nIDs*nsampl, 1000, 100))
