Apply calculation referring to last 365 days data prior to current date

Question

I have a data.frame that looks like this:

ACCOUNT  POSTING_DT  WA  Amount
  10019   1/10/2006  19    99.1
  10019   6/18/2007  15   318.5
  10019    7/2/2007  12 23005.1
  10019   3/25/2008  15 16866.3
  10019   9/22/2008  -1 16902.3
  10121   4/18/2006   1 28029.9
  10121   5/28/2006   3   16528
  10121   3/20/2007   1 41730.1

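Each account has different posting dates, and the dates are not consecutive. For each row, I want to compute sum(WA*Amount)/sum(Amount) over the entries in the 365 days before that row's posting date.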

For example, for account 10019 and the 3/25/2008 entry, I want to apply the calculation to the 6/18/2007 and 7/2/2007 entries, which gives (15*318.5+12*23005.1)/(318.5+23005.1).
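As a quick check of that arithmetic: the expression is a weighted mean of WA with Amount as the weights. A minimal sketch with the two rows typed in by hand (the variable names here are illustrative only):

```r
# Values from the 6/18/2007 and 7/2/2007 rows of account 10019
wa     <- c(15, 12)
amount <- c(318.5, 23005.1)

# The formula exactly as written in the question
manual <- sum(wa * amount) / sum(amount)

# stats::weighted.mean() computes the same quantity
builtin <- weighted.mean(wa, amount)

print(manual)
print(all.equal(manual, builtin))
```

Both forms give roughly 12.04 for this pair of rows.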

Is there a function in R that does this?

Answer 1

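I don't know of a function that does exactly what you need; frankly, I'd be a little surprised if one existed, since this is a bit "niche", but it can be done.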

Your data:

txt <- 'ACCOUNT  POSTING_DT  WA  Amount
  10019   1/10/2006  19    99.1
  10019   6/18/2007  15   318.5
  10019    7/2/2007  12 23005.1
  10019   3/25/2008  15 16866.3
  10019   9/22/2008  -1 16902.3
  10121   4/18/2006   1 28029.9
  10121   5/28/2006   3   16528
  10121   3/20/2007   1 41730.1'
dat <- read.table(text=txt, header=TRUE, stringsAsFactors=FALSE)

The date column really should be a Date object rather than a string, so ...

dat$POSTING_DT <- as.Date(dat$POSTING_DT, format='%m/%d/%Y')
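Once the column is a Date, subtracting dates yields day counts, and those can be compared directly against 0 and 365. A minimal sketch with three dates from account 10019 (the names `current`, `earlier`, `gap`, and `in_window` are illustrative only):

```r
# The 3/25/2008 posting and three earlier postings for the same account
current <- as.Date("3/25/2008", format = "%m/%d/%Y")
earlier <- as.Date(c("6/18/2007", "7/2/2007", "1/10/2006"), format = "%m/%d/%Y")

# Date - Date gives a difftime measured in days
gap <- current - earlier
print(as.numeric(gap))

# TRUE for postings that fall within the preceding 365 days
in_window <- as.vector(gap >= 0 & gap <= 365)
print(in_window)
```

The first two postings fall inside the window and the 2006 one does not, which matches the example in the question.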

Here goes:

dat$NewAmount <- sapply(1:nrow(dat), function(r) {
    d <- (dat$POSTING_DT[r] - dat$POSTING_DT)
    ## I use both (d>=0) and idx[r] <- FALSE so that if there are multiple
    ## instances on a day, the other ones will still be included
    idx <- (dat$ACCOUNT == dat$ACCOUNT[r]) & (d >= 0) & (d <= 365)
    idx[r] <- FALSE
    ## the crux of this function ("with" is not required but it reads well)
    with(dat, sum(WA[idx] * Amount[idx]) / sum(Amount[idx]))
})
dat
##   ACCOUNT POSTING_DT WA  Amount NewAmount
## 1   10019 2006-01-10 19    99.1       NaN
## 2   10019 2007-06-18 15   318.5       NaN
## 3   10019 2007-07-02 12 23005.1 15.000000
## 4   10019 2008-03-25 15 16866.3 12.040967
## 5   10019 2008-09-22 -1 16902.3 15.000000
## 6   10121 2006-04-18  1 28029.9       NaN
## 7   10121 2006-05-28  3 16528.0  1.000000
## 8   10121 2007-03-20  1 41730.1  1.741866
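The NaN values belong to rows with nothing in the preceding 365 days: with an all-FALSE idx, both sums run over empty vectors and the result is 0/0. A small sketch of that case, using hand-typed values:

```r
# Three postings of account 10019; for the first row no earlier posting
# qualifies, so the selection index is FALSE everywhere
wa     <- c(19, 15, 12)
amount <- c(99.1, 318.5, 23005.1)
idx    <- c(FALSE, FALSE, FALSE)

# sum() over an empty vector is 0, so this evaluates 0 / 0
empty_result <- sum(wa[idx] * amount[idx]) / sum(amount[idx])

print(empty_result)
print(is.nan(empty_result))
```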

You didn't say what should happen with empty sets. If you need them set to zero, you can do this:

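## replace only the empty-window results; pmax(..., 0) would also zero out
## any legitimately negative average (WA can be negative, e.g. -1)
dat$NewAmount[is.nan(dat$NewAmount)] <- 0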
dat
##   ACCOUNT POSTING_DT WA  Amount NewAmount
## 1   10019 2006-01-10 19    99.1  0.000000
## 2   10019 2007-06-18 15   318.5  0.000000
## 3   10019 2007-07-02 12 23005.1 15.000000
## 4   10019 2008-03-25 15 16866.3 12.040967
## 5   10019 2008-09-22 -1 16902.3 15.000000
## 6   10121 2006-04-18  1 28029.9  0.000000
## 7   10121 2006-05-28  3 16528.0  1.000000
## 8   10121 2007-03-20  1 41730.1  1.741866

Edit: untested

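With a lot of rows (157k, as you said), and assuming each account has enough rows, you might gain from grouping first. This can be done with base functions (split?), but I'll demonstrate with dplyr: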

library(dplyr)
dat %>%
    group_by(ACCOUNT) %>%
    mutate(NewAmount = sapply(seq_len(n()), function(r) {
        ## inside mutate() the bare column names refer to the current
        ## group's rows only, so r must index those, not the full dat
        d <- (POSTING_DT[r] - POSTING_DT)
        idx <- (d >= 0) & (d <= 365)
        idx[r] <- FALSE
        sum(WA[idx] * Amount[idx]) / sum(Amount[idx])
    }))

There may be a more elegant dplyr-style way to do this, but doing things row-wise within a group doesn't exactly scream "easy efficiency" to me.

Edit 2:

Lacking dplyr (or even devtools!?!), try this base version:

do.call("rbind", lapply(split(dat, dat$ACCOUNT), function(x) {
    x$NewAmount <- sapply(1:nrow(x), function(r) {
        d <- (x$POSTING_DT[r] - x$POSTING_DT)
        idx <- (x$ACCOUNT == x$ACCOUNT[r]) & (d >= 0) & (d <= 365)
        idx[r] <- FALSE
        with(x, sum(WA[idx] * Amount[idx]) / sum(Amount[idx]))
    }, USE.NAMES=FALSE)
    x
}))

I haven't benchmarked it or tested it extensively, so caveat emptor.
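As a light sanity check rather than a benchmark, the sketch below recomputes the values per account and compares them with the whole-frame row-wise approach on the sample data. `trailing_wavg` is a hypothetical helper name introduced here, not an existing function, and splitting row indices instead of the data frame is just one possible design, chosen so that results land back in the original row order.

```r
# Sample data from the question
txt <- 'ACCOUNT  POSTING_DT  WA  Amount
  10019   1/10/2006  19    99.1
  10019   6/18/2007  15   318.5
  10019    7/2/2007  12 23005.1
  10019   3/25/2008  15 16866.3
  10019   9/22/2008  -1 16902.3
  10121   4/18/2006   1 28029.9
  10121   5/28/2006   3   16528
  10121   3/20/2007   1 41730.1'
dat <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)
dat$POSTING_DT <- as.Date(dat$POSTING_DT, format = "%m/%d/%Y")

# Hypothetical helper: for each posting of ONE account, the Amount-weighted
# mean of WA over the other postings in the preceding `window` days
trailing_wavg <- function(dates, wa, amount, window = 365) {
  vapply(seq_along(dates), function(r) {
    d <- as.numeric(dates[r] - dates)
    idx <- d >= 0 & d <= window
    idx[r] <- FALSE                      # exclude the current row itself
    sum(wa[idx] * amount[idx]) / sum(amount[idx])
  }, numeric(1))
}

# Reference: row-wise over the whole frame, filtering by account each time
whole <- sapply(seq_len(nrow(dat)), function(r) {
  d <- as.numeric(dat$POSTING_DT[r] - dat$POSTING_DT)
  idx <- dat$ACCOUNT == dat$ACCOUNT[r] & d >= 0 & d <= 365
  idx[r] <- FALSE
  sum(dat$WA[idx] * dat$Amount[idx]) / sum(dat$Amount[idx])
})

# Grouped: split the row indices by account, compute, write back in place
grouped <- numeric(nrow(dat))
for (rows in split(seq_len(nrow(dat)), dat$ACCOUNT)) {
  grouped[rows] <- trailing_wavg(dat$POSTING_DT[rows], dat$WA[rows],
                                 dat$Amount[rows])
}

print(all.equal(whole, grouped))
```

On real data, wrapping each variant in system.time() would give a rough idea of whether grouping first pays off.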

r

time-series