Why are my functions on lubridate dates so slow?
I wrote this function that I use all the time:
# Give the previous day, or Friday if the previous day is Saturday or Sunday.
previous_business_date_if_weekend = function(my_date) {
if (length(my_date) == 1) {
if (weekdays(my_date) == "Sunday") { my_date = lubridate::as_date(my_date) - 2 }
if (weekdays(my_date) == "Saturday") { my_date = lubridate::as_date(my_date) - 1 }
return(lubridate::as_date(my_date))
} else if (length(my_date) > 1) {
my_date = lubridate::as_date(sapply(my_date, previous_business_date_if_weekend))
return(my_date)
}
}
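The problem comes when I apply it to a date column of a data frame with thousands of rows. It is far too slow. Any thoughts?
You are looping over every row, so it is no surprise that it is slow. You can essentially do a single lookup-and-subtract operation instead, taking a fixed difference from each date: 0 for Monday to Friday, -1 for Saturday, and -2 for Sunday.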
# 'big' sample data
x <- Sys.Date() + 0:100000
bizdays <- function(x) x - match(weekdays(x), c("Saturday","Sunday"), nomatch=0)
# since `weekdays()` is locale-specific, you could also be defensive and do:
bizdays <- function(x) x - match(format(x, "%w"), c("6","0"), nomatch=0)
system.time(bizdays(x))
# user system elapsed
# 0.36 0.00 0.35
system.time(previous_business_date_if_weekend(x))
# user system elapsed
# 45.45 0.00 45.57
identical(bizdays(x), previous_business_date_if_weekend(x))
#[1] TRUE
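To make the lookup step concrete, here is a small sketch on four fixed dates whose weekdays are known (2024-01-05 is a Friday). The variable names `d`, `wd`, `offset`, and `shifted` are only illustrative and do not appear in the answer above.

```r
# Fixed dates: Friday, Saturday, Sunday, Monday.
d <- as.Date(c("2024-01-05", "2024-01-06", "2024-01-07", "2024-01-08"))

# "%w" returns the weekday as a string from "0" (Sunday) to "6" (Saturday),
# independent of the locale.
wd <- format(d, "%w")

# The position in the lookup table is the number of days to go back.
# Weekdays are not in the table, so nomatch = 0 leaves them unchanged.
offset <- match(wd, c("6", "0"), nomatch = 0)

shifted <- d - offset
print(data.frame(date = d, wd = wd, offset = offset, shifted = shifted))
```

The whole vector is handled by one call to `format()`, one to `match()`, and one subtraction, which is where the speed-up over the element-wise `sapply()` comes from.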
In my experience, lubridate is a bit slow. I suggest working with data.table and IDate.
Something like this should be pretty robust:
library(data.table)
#Make data.table of dates in string format
x = data.table(date = format(Sys.Date() + 0:100000,format='%d/%m/%Y'))
#Convert to IDate (by reference)
set(x, j = "date", value = as.IDate(strptime(x[,date], "%d/%m/%Y")))
#Day zero was a Thursday
originDate = as.IDate(strptime("01/01/1970", "%d/%m/%Y"))
as.integer(originDate)
#[1] 0
weekdays(originDate)
#[1] "Thursday"
previous_business_date_if_weekend_dt = function(x) {
#Adjust dates so that Sat is 1, Sun is 2, and subtract by reference
x[,adjustedDate := date]
x[(as.integer(x[,date]-2) %% 7 + 1)<=2, adjustedDate := adjustedDate - (as.integer(date-2) %% 7 + 1)]
}
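#Plain Date vector for the base R comparison (x is a data.table here, so it cannot be passed to bizdays)
y = Sys.Date() + 0:100000
bizdays <- function(x) x - match(weekdays(x), c("Saturday","Sunday"), nomatch=0)
system.time(bizdays(y))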
# user system elapsed
# 0.22 0.00 0.22
system.time(previous_business_date_if_weekend_dt(x))
# user system elapsed
# 0 0 0
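Also note that the most time-consuming part of this solution is probably extracting the dates from strings. If that is a concern, you could reformat them into an integer format.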
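The modular arithmetic in the data.table answer can be checked in base R without any package. This is only a sketch of the same idea on plain `Date` values; the names `days`, `k`, and `adjusted` are made up for the illustration.

```r
# Day 0 (1970-01-01) was a Thursday, so day 2 was a Saturday.
d <- as.Date(c("1970-01-01", "2024-01-05", "2024-01-06", "2024-01-07", "2024-01-08"))
days <- as.integer(d)  # whole days since 1970-01-01

# Shift so that Saturday maps to 1, Sunday to 2, Monday to 3, ..., Friday to 7.
k <- (days - 2L) %% 7L + 1L

# Only weekend dates (k <= 2) are moved, and k is exactly the number of days to go back.
adjusted <- d
is_weekend <- k <= 2L
adjusted[is_weekend] <- d[is_weekend] - k[is_weekend]

print(data.frame(date = d, k = k, adjusted = adjusted))
```

Because this works on the integer day count alone, it needs neither `weekdays()` nor `format()`, which is a plausible reason for the near-zero timing reported above.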
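Just to add another possibility: there is a pure R implementation in the datetimeutils package (of which I am the author). The function previous_businessday converts to POSIXlt to extract the weekday.
(The code compares the function's results with those of the function bizdays suggested by thelatemail.)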
library("datetimeutils")
x <- Sys.Date() + 0:100000
system.time(bizdays(x))
## user system elapsed
## 0.25 0.00 0.25
system.time(previous_businessday(x, shift = 0))
## user system elapsed
## 0.03 0.00 0.03
identical(bizdays(x), previous_businessday(x, shift = 0))
## TRUE
A slightly simplified version of previous_businessday looks as follows; it assumes that x is of class Date.
previous_bd <- function(x) {
tmp <- as.POSIXlt(x)
tmpi <- tmp$wday == 6L
x[tmpi] <- x[tmpi] - 1L
tmpi <- tmp$wday == 0L
x[tmpi] <- x[tmpi] - 2L
x
}
system.time(previous_bd(x))
## user system elapsed
## 0.03 0.00 0.03
identical(bizdays(x), previous_bd(x))
## TRUE
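As a variation on the same POSIXlt idea, the two masked assignments can be folded into a single indexed lookup. This is a sketch, not code from datetimeutils; `previous_bd_lookup` and `back` are hypothetical names.

```r
previous_bd_lookup <- function(x) {
  # wday runs from 0 (Sunday) to 6 (Saturday); add 1 to use it as an R index.
  wday <- as.POSIXlt(x)$wday
  # Days to go back for Sunday, Monday, ..., Saturday.
  back <- c(2L, 0L, 0L, 0L, 0L, 0L, 1L)
  x - back[wday + 1L]
}

# Friday, Saturday, Sunday, Monday
d <- as.Date(c("2024-01-05", "2024-01-06", "2024-01-07", "2024-01-08"))
res <- previous_bd_lookup(d)
print(res)
```

Like the version above, it assumes that x is of class Date.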
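The OP's question and some generalizing statements in the answers (such as the remark that lubridate is a bit slow) suggest that particular packages might be the cause of the low performance.
I wanted to verify this with some benchmarks.
Penalty of using the double colon operator ::
Using the double colon operator :: to access exported variables or functions in a namespace incurs a penalty.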
# creating data
n <- 10^1L
fmt <- "%F"
chr_dates <- format(Sys.Date() + seq_len(n), "%F")
# loading lubridate into namespace
library(lubridate)
microbenchmark::microbenchmark(
base1 = r1 <- as.Date(chr_dates),
base2 = r2 <- base::as.Date(chr_dates),
lubr1 = r3 <- as_date(chr_dates),
lubr2 = r4 <- lubridate::as_date(chr_dates),
times = 100L
)
Unit: microseconds
expr min lq mean median uq max neval cld
base1 87.977 89.1100 92.03587 89.865 90.9980 128.756 100 a
base2 94.018 95.7175 100.64848 97.039 99.3045 179.351 100 b
lubr1 92.508 94.2070 98.21307 95.151 97.7940 175.954 100 b
lubr2 101.569 103.0800 109.98974 104.024 107.9885 258.643 100 c
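The penalty for using the double colon operator :: is about 10 microseconds per call.
This only matters if a function is called repeatedly (as happens in the OP's code, which uses sapply()). IMHO, the pain of debugging namespace conflicts or of maintaining code in which the origin of a function is unclear is much higher. Of course, your mileage may vary.
This can be verified with the timings for n = 100, where the overhead is still roughly 10 microseconds but is small relative to the total run time: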
Unit: microseconds
expr min lq mean median uq max neval cld
base1 556.933 561.0855 580.3382 562.9730 590.7250 812.176 100 a
base2 564.483 568.2600 588.5695 570.9030 596.2010 989.262 100 a
lubr1 562.596 565.9935 587.4443 568.4480 594.8790 1039.480 100 a
lubr2 572.036 575.9995 597.1557 578.4545 601.1085 1230.159 100 a
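If the per-call lookup ever matters inside a loop, one option is to resolve the function once and reuse the binding, or better, to make a single vectorized call. The sketch below uses only base R so that it runs without extra packages; `as_date_fn` is a hypothetical local name, and no timings are claimed here.

```r
chr_dates <- format(as.Date("2024-01-01") + 0:9, "%Y-%m-%d")

# `::` is evaluated on every iteration of the loop.
r_each <- sapply(chr_dates, function(s) format(base::as.Date(s)), USE.NAMES = FALSE)

# Resolve the function once, then reuse the local binding inside the loop.
as_date_fn <- base::as.Date
r_once <- sapply(chr_dates, function(s) format(as_date_fn(s)), USE.NAMES = FALSE)

# Vectorized: one lookup and one call for the whole vector.
r_vec <- format(base::as.Date(chr_dates))

print(identical(r_each, r_once) && identical(r_once, r_vec))
```

All three variants give the same result; they differ only in how often the namespace lookup and the function call happen.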
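Converting character dates to class Date
There are a number of packages that deal with converting character dates in different formats to class Date or POSIXct. Some of them aim at performance, others at convenience.
Here, base, lubridate, anytime, fasttime, and data.table (because it was mentioned in one of the answers) are compared.
The input is character dates in the standard unambiguous format YYYY-MM-DD. Time zones are ignored.
fasttime accepts only dates between 1970 and 2199, so the creation of the sample data had to be modified to produce a sample data set of 100,000 dates within that range.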
n <- 10^5L
fmt <- "%F"
set.seed(123L)
chr_dates <- format(
sample(
seq(as.Date("1970-01-01"), as.Date("2199-12-31"), by = 1L),
n, replace = TRUE),
"%F")
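Because guessing the format might add a penalty (as Frank suspected), the functions are called both with and without a given format where possible. All functions are called using the double colon operator ::.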
microbenchmark::microbenchmark(
base_ = r1 <- base::as.Date(chr_dates),
basef = r1 <- base::as.Date(chr_dates, fmt),
lub1_ = r2 <- lubridate::as_date(chr_dates),
lub1f = r2 <- lubridate::as_date(chr_dates, fmt),
lub2_ = r3 <- lubridate::ymd(chr_dates),
anyt_ = r4 <- anytime::anydate(chr_dates),
idat_ = r5 <- data.table::as.IDate(chr_dates),
idatf = r5 <- data.table::as.IDate(chr_dates, fmt),
fast_ = r6 <- fasttime::fastPOSIXct(chr_dates),
fastd = r6 <- as.Date(fasttime::fastPOSIXct(chr_dates)),
times = 5L
)
# check results
all.equal(r1, r2)
all.equal(r1, r3)
all.equal(r1, c(r4)) # remove tzone attribute
all.equal(r1, as.Date(r5)) # convert IDate to Date
all.equal(r1, as.Date(r6)) # convert POSIXct to Date
Unit: milliseconds
expr min lq mean median uq max neval cld
base_ 641.799082 645.008517 648.128466 648.791875 649.149444 655.893411 5 d
basef 69.377419 69.937371 73.888828 71.403139 76.022083 82.704127 5 b
lub1_ 644.199361 645.217696 680.542327 649.855896 652.887492 810.551189 5 d
lub1f 69.769726 69.947943 70.944605 70.795234 71.365759 72.844364 5 b
lub2_ 18.672495 27.025711 26.990218 28.180730 29.944409 31.127747 5 ab
anyt_ 381.870316 384.513758 386.211134 384.992152 385.159043 394.520400 5 c
idat_ 643.386808 644.312259 649.385356 648.204359 651.666396 659.356958 5 d
idatf 69.844109 71.188673 75.319481 77.142365 78.156923 80.265334 5 b
fast_ 4.994637 5.363533 5.748137 5.601031 5.760370 7.021112 5 a
fastd 5.230625 6.296157 6.686500 6.345998 6.538941 9.020780 5 a
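The timings show that
- Frank's suspicion was correct. Guessing the format is costly. Passing the format as a parameter to as.Date(), as_date(), and as.IDate() is roughly nine times faster than calling them without it (about 70 ms versus about 650 ms).
- fasttime::fastPOSIXct() is indeed the fastest. Even with the additional conversion from POSIXct to Date, it is about four times faster than the second fastest, lubridate::ymd().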
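In practice, the first finding means spelling out the format whenever it is known. A minimal base R sketch with made-up sample dates follows; it only shows that both calls return the same values, not how long they take.

```r
chr_dates <- c("2024-01-05", "1999-12-31", "2100-06-15")

# Without a format, as.Date() has to work out which format applies.
guessed <- as.Date(chr_dates)

# With an explicit format, the strings are parsed directly.
explicit <- as.Date(chr_dates, format = "%Y-%m-%d")

# An explicit format also handles layouts that cannot be guessed,
# such as the day/month/year strings used in the data.table answer.
dmy <- as.Date("05/01/2024", format = "%d/%m/%Y")

print(all(guessed == explicit))
print(dmy)
```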