Using sapply on a vector of dates: function very slow. Why?

Question

I have a very simple function that takes a POSIXct date, extracts the year, and subtracts 1 if the date falls before June 1.

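library(lubridate)

# Return the water year (June 1 to May 31) for each date, or the
# matching lubridate interval when return.interval = TRUE.
DetermineWaterYear <- function(date, return.interval = FALSE) {
  wy <- year(date) + ifelse(month(date) >= 6, 0, -1)
  if (!return.interval) {
    return(wy)
  }
  # paste0() builds the date strings; cat() only prints and returns NULL
  interval(ymd(paste0(wy, '-06-01')), ymd(paste0(wy + 1, '-05-31')))
}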

When I run this function over a vector of about 190k dates with sapply(), it takes forever.

sapply(temp$date, DetermineWaterYear)

I also timed sapply on subsets of the vector, with lengths from 10,000 to 190,000, using the following code:

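tempdates <- rep(ymd('1956-01-01'), 190000)

index <- seq(10000, 190000, 10000)
times <- numeric(length(index))  # preallocate the vector of timings
for (i in seq_along(index)) {
  # element 3 of system.time() is the elapsed (wall-clock) time in seconds
  times[i] <- system.time(sapply(tempdates[1:index[i]], DetermineWaterYear))[3]
}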

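The crazy thing is that the processing time per record grows sharply as the date vector gets longer: processing 190k dates takes 238 times as long as processing 10k dates. I have plenty of free memory.

Why does this run so slowly, and how can I optimize it?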

Answer 1

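As pointed out in the comments, passing the date vector directly to the function is much faster. In addition, ifelse carries a lot of overhead, so replacing ifelse(month(date)>=6, 0, -1) with floor((x/5.6) - (x^2)*0.001) - 1L, where x is month(date), is faster still.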

DetermineWaterYearNew <- function(date, return.interval = FALSE) {
    x <- month(date)
    # evaluates to -1 for months 1-5 and 0 for months 6-12, without ifelse
    wy <- year(date) + floor((x/5.6) - (x^2)*0.001) - 1L
    if (!return.interval) {
        return(wy)
    }
    # paste0() builds the date strings; cat() only prints and returns NULL
    interval(ymd(paste0(wy, '-06-01')), ymd(paste0(wy + 1, '-05-31')))
}
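As a quick sanity check on the arithmetic replacement, the following base R sketch compares it with the ifelse version for every month value from 1 to 12. The third variant, via_logical, is not part of the answer; it is shown only as a possibly more readable equivalent and has not been benchmarked here.

```r
# Months 1..12 are the only values month() can return
x <- 1:12

via_ifelse  <- ifelse(x >= 6, 0, -1)
via_floor   <- floor((x / 5.6) - (x^2) * 0.001) - 1L
# Hypothetical alternative (not from the answer): TRUE/FALSE coerce to 1/0
via_logical <- (x >= 6) - 1L

# Both comparisons should be TRUE if the expressions agree for every month
all(via_ifelse == via_floor)
all(via_ifelse == via_logical)
```

Note that the floor expression is only meaningful for integer months in 1..12; outside that range it does not track the ifelse condition.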

Here are some benchmarks:

library(microbenchmark)
microbenchmark(NewVectorized=DetermineWaterYearNew(tempdates[1:1000]),
               OldVectorized=DetermineWaterYear(tempdates[1:1000]),
               NonVectorized=sapply(tempdates[1:1000],DetermineWaterYear))
Unit: microseconds
         expr       min         lq       mean     median         uq       max neval
NewVectorized   341.954   364.1215   418.7311   395.7300   460.7955   602.627   100
OldVectorized   417.077   437.3970   496.0585   462.8485   545.1555   802.954   100
NonVectorized 42601.719 45148.3070 46452.6843 45902.4100 47341.2415 62898.476   100

Comparing only the vectorized solutions over the full range of dates we have:

microbenchmark(NewVectorized=DetermineWaterYearNew(tempdates[1:190000]),
               OldVectorized=DetermineWaterYear(tempdates[1:190000]))
Unit: milliseconds
         expr      min       lq     mean   median       uq      max neval
NewVectorized 26.30660 27.26575 28.97715 27.84169 29.19391 102.1697   100
OldVectorized 38.98637 40.78153 44.07461 42.55287 43.77947 114.9616   100
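For readers who want the same whole-vector pattern without lubridate, here is a minimal base R sketch. The function name water_year_base is hypothetical and does not come from the question or the answer, and it was not part of the benchmarks above, so no speed claim is made for it. It assumes the input is a Date or POSIXct vector and that the time zone as.POSIXlt() picks by default is the one you want.

```r
# Hypothetical helper: water year (June 1 to May 31) for a whole vector at once
water_year_base <- function(date) {
  lt <- as.POSIXlt(date)
  # lt$year counts years since 1900; lt$mon is zero-based, so June is 5
  lt$year + 1900L - (lt$mon < 5L)
}

dates <- as.Date(c("1956-01-01", "1956-05-31", "1956-06-01"))
water_year_base(dates)
```

The key point is the same as in the answer: the function is called once on the full vector, so the date conversion happens once instead of once per element as it does under sapply.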
