查找数据集 1 和数据集 2 之间最接近的日期

Find the closest date between dataset1 and dataset2

我有两个数据集。一个大约每 5 天收集一次,另一个每天每 15 分钟收集一次。我想要一个最终列表,它匹配从频率较低的数据集到频率较高的数据集中的条目的最近日期。

例如:

satDat <- c('2015-04-16', '2015-04-21', '2012-04-26') # collected every 5 days

stationDat <- sort(rep(seq(as.Date("2015-04-01"), as.Date("2015-04-20"), by='day'),2)) 
#collected multiple times a day

 [1] "2015-04-01" "2015-04-01" "2015-04-02" "2015-04-02" "2015-04-03"
 [6] "2015-04-03" "2015-04-04" "2015-04-04" "2015-04-05" "2015-04-05"
[11] "2015-04-06" "2015-04-06" "2015-04-07" "2015-04-07" "2015-04-08"
[16] "2015-04-08" "2015-04-09" "2015-04-09" "2015-04-10" "2015-04-10"
[21] "2015-04-11" "2015-04-11" "2015-04-12" "2015-04-12" "2015-04-13"
[26] "2015-04-13" "2015-04-14" "2015-04-14" "2015-04-15" "2015-04-15"
[31] "2015-04-16" "2015-04-16" "2015-04-17" "2015-04-17" "2015-04-18"
[36] "2015-04-18" "2015-04-19" "2015-04-19" "2015-04-20" "2015-04-20"

我希望我的结果看起来像这样

[1] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
[6] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
[11] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
[16] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" 
[21] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
[26] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
[31] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
[36] "2015-04-16" "2015-04-21" "2015-04-21" "2015-04-21" "2015-04-21"

我想到了软件包 data.table 提供的滚动连接。

library(data.table)
DT1 <- data.table(date = as.Date(satDat), date1 = as.Date(satDat))
DT2 <- data.table(date = stationDat)

DT1[DT2, date1, roll = "nearest", on = .(date)]
# [1] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
# [7] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#[13] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#[19] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#[25] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#[31] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#[37] "2015-04-21" "2015-04-21" "2015-04-21" "2015-04-21"

无论您的实际任务是什么,它都可能也很有用,因为我怀疑它不止于此。

使用 outer 的选项:

satDat[apply(abs(outer(satDat, stationDat, difftime, units = 'days')), 2, which.min)]

#>  [1] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#>  [6] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [11] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [16] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [21] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [26] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [31] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [36] "2015-04-16" "2015-04-21" "2015-04-21" "2015-04-21" "2015-04-21"

工作原理:

  • outer 对两个向量中的每对元素应用 difftime,返回一个矩阵,
  • 其中 apply 遍历列 (MARGIN = 2),在每个列上调用 which.min,其中 returns 是最小的索引,
  • 用于子集satDat

请注意,outer 通过 length(stationDat) 分配维度为 length(satDat) 的矩阵,如果您的数据已经很大,这可能需要大量内存。