Interpolating time series data limited by another variable

I'm looking for a way to interpolate while merging two datasets.

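I have a data frame containing readings from many different data loggers, and a second data frame with field (hand) measurements. I need to match each hand measurement to the logger reading at the same time so that I can compare them and calculate an offset. Unfortunately, the logger readings are taken at regular intervals (hourly), whereas the hand measurements are not.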

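I would like to interpolate the logger values to get a logger value at the time of each hand measurement, but I'm struggling to find a way to make sure the interpolation uses the right site's readings.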

Example data

library(tidyverse)

start.date <- "2019-12-18 00:00:00"
end.date <- "2019-12-20 00:00:00"
set.seed(100)

# Hourly logger readings for four sites (49 timestamps per site)
loggers <- tibble(
      datetime = rep(seq.POSIXt(as.POSIXct(start.date), as.POSIXct(end.date), by = "1 hour"), 4),
      Site = rep(LETTERS[1:4], each = 49),
      reading = c(rnorm(49, mean = 10, sd = 3),
                  rnorm(49, mean = 15, sd = 3),
                  rnorm(49, mean = 20, sd = 3),
                  rnorm(49, mean = 25, sd = 3)
                  )
)

hand_meas <- tibble(
      Site = rep(LETTERS[1:4], each = 2),
      datetime = as.POSIXct(rep(c("2019-12-18 12:35:00", "2019-12-19 13:45:00", "2019-12-18 12:55:00", "2019-12-19 13:15:00" ),2)),
      meas = c(10, 11, 14, 16, 19, 19.2, 23, 24) 
)      

head(loggers)
# # A tibble: 6 x 3
# datetime            Site  reading
# <dttm>              <chr>   <dbl>
# 1 2019-12-18 00:00:00 A        7.65
# 2 2019-12-18 01:00:00 A        6.99
# 3 2019-12-18 02:00:00 A       13.8 
# 4 2019-12-18 03:00:00 A       12.3 
# 5 2019-12-18 04:00:00 A       11.6 
# 6 2019-12-18 05:00:00 A       14.3 

head(hand_meas)
# # A tibble: 6 x 3
# Site  datetime             meas
# <chr> <dttm>              <dbl>
# 1 A     2019-12-18 12:35:00  10  
# 2 A     2019-12-19 13:45:00  11  
# 3 B     2019-12-18 12:55:00  14  
# 4 B     2019-12-19 13:15:00  16  
# 5 C     2019-12-18 12:35:00  19  
# 6 C     2019-12-19 13:45:00  19.2


My typical approach would be to left_join() the logger data onto the hand measurements, or to interpolate with approx(), but neither works in this case.

## This fails because it needs exact matches
left_join(hand_meas, loggers, by = c("Site", "datetime"))

# # A tibble: 8 x 4
# Site  datetime             meas reading
# <chr> <dttm>              <dbl>   <dbl>
# 1 A     2019-12-18 12:35:00  10        NA
# 2 A     2019-12-19 13:45:00  11        NA
# 3 B     2019-12-18 12:55:00  14        NA
# 4 B     2019-12-19 13:15:00  16        NA
# 5 C     2019-12-18 12:35:00  19        NA
# 6 C     2019-12-19 13:45:00  19.2      NA
# 7 D     2019-12-18 12:55:00  23        NA
# 8 D     2019-12-19 13:15:00  24        NA

## Succeeds, but includes readings from all of the sites
approx(loggers$datetime, loggers$reading, hand_meas$datetime)
# $x
# [1] "2019-12-18 12:35:00 PST" "2019-12-19 13:45:00 PST" "2019-12-18 12:55:00 PST" "2019-12-19 13:15:00 PST"
# [5] "2019-12-18 12:35:00 PST" "2019-12-19 13:45:00 PST" "2019-12-18 12:55:00 PST" "2019-12-19 13:15:00 PST"
# 
# $y
# [1] 17.67616 19.19072 17.75920 18.91207 17.67616 19.19072 17.75920 18.91207
# 
# Warning message:
# In regularize.values(x, y, ties, missing(ties)) :
#   collapsing to unique 'x' values
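
The warning hints at why the pooled call is misleading: every timestamp appears once per site, and approx() collapses those ties (by default with mean) before interpolating. A minimal base R sketch on hypothetical toy values (the `toy`, `target`, `pooled` and `only_a` names are illustrative and not part of the data above) contrasts the pooled call with one restricted to a single site:

```r
# Toy data (hypothetical values): two sites sharing the same hourly timestamps
times <- as.POSIXct(c("2019-12-18 12:00:00", "2019-12-18 13:00:00"), tz = "UTC")
toy <- data.frame(
  Site     = rep(c("A", "B"), each = 2),
  datetime = rep(times, 2),
  reading  = c(10, 12, 20, 22)
)
target <- as.POSIXct("2019-12-18 12:30:00", tz = "UTC")

# Pooled: duplicated timestamps are collapsed with the default ties = mean,
# so the two sites are averaged before interpolation
pooled <- suppressWarnings(approx(toy$datetime, toy$reading, xout = target)$y)

# Restricted to one site: only that site's readings are interpolated
site_a <- toy[toy$Site == "A", ]
only_a <- approx(site_a$datetime, site_a$reading, xout = target)$y

pooled  # 16: halfway between the site-averaged values 15 and 17
only_a  # 11: halfway between site A's 10 and 12
```
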

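I can also use data.table to get the nearest logger value, but my real data fluctuate a lot over the course of a day, so I need to interpolate between the readings on either side of each hand measurement.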

# This is close: using data.table to join on the nearest timestamp (see question 31818444)
library(data.table)

setDT(hand_meas)[, logger_reading := setDT(loggers)[hand_meas, reading, on = c("Site", "datetime"), roll = "nearest"]]
head(hand_meas)
# Site            datetime meas logger_reading
# 1:    A 2019-12-18 12:35:00 10.0       12.21952
# 2:    A 2019-12-19 13:45:00 11.0       13.19621
# 3:    B 2019-12-18 12:55:00 14.0       13.86335
# 4:    B 2019-12-19 13:15:00 16.0       15.64910
# 5:    C 2019-12-18 12:35:00 19.0       20.76380
# 6:    C 2019-12-19 13:45:00 19.2       19.54722

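Can anyone suggest a way to do something like approx() while restricting the source data by site? Or a data.table approach that interpolates rather than matching to the nearest value?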
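
For the data.table side of the question, one possible shape is to assign by reference within each `Site` group and call approx() on only that site's logger rows. This is a sketch under assumptions, not a definitive implementation: the `loggers_dt` and `hand_dt` tables and the `logger_interp` column are hypothetical toy stand-ins with made-up values.

```r
library(data.table)

# Hypothetical toy tables standing in for the logger and hand-measurement data
times <- as.POSIXct(c("2019-12-18 12:00:00", "2019-12-18 13:00:00",
                      "2019-12-18 14:00:00"), tz = "UTC")
loggers_dt <- data.table(
  Site     = rep(c("A", "B"), each = 3),
  datetime = rep(times, 2),
  reading  = c(10, 12, 14, 20, 26, 20)
)
hand_dt <- data.table(
  Site     = c("A", "B"),
  datetime = as.POSIXct(c("2019-12-18 12:30:00", "2019-12-18 13:15:00"), tz = "UTC"),
  meas     = c(10.5, 23)
)

# For each Site group, select that site's logger rows once and linearly
# interpolate them at the group's hand-measurement times
hand_dt[, logger_interp := {
  site_rows <- loggers_dt$Site == .BY$Site
  approx(loggers_dt$datetime[site_rows],
         loggers_dt$reading[site_rows],
         xout = datetime)$y
}, by = Site]

hand_dt$logger_interp  # 11.0 for site A, 24.5 for site B
```

Hand measurements outside a site's logger time range come back as NA, because approx() defaults to `rule = 1`.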

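I realized I can combine the two approaches by writing a function that restricts the source data to one site and then interpolates at the times of the hand measurements.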

# Interpolate one site's logger readings at the requested time(s)
approx_by_site <- function(site_id, datetime_in) {
      dt <- loggers %>%
            filter(Site == site_id)
      out <- approx(dt$datetime, dt$reading, datetime_in)
      return(out$y)
}
with_readings <- hand_meas %>%
      rowwise() %>%  # passes one Site at a time; otherwise the whole Site column reaches filter()
      mutate(Logger = approx_by_site(Site, datetime)) %>%
      ungroup()

with_readings
# # A tibble: 8 x 4
# Site  datetime             meas Logger
# <chr> <dttm>              <dbl>  <dbl>
# 1 A     2019-12-18 12:35:00  10     11.0
# 2 A     2019-12-19 13:45:00  11     12.7
# 3 B     2019-12-18 12:55:00  14     13.9
# 4 B     2019-12-19 13:15:00  16     16.1
# 5 C     2019-12-18 12:35:00  19     20.7
# 6 C     2019-12-19 13:45:00  19.2   20.0
# 7 D     2019-12-18 12:55:00  23     24.4
# 8 D     2019-12-19 13:15:00  24     26.9

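This works. I think the rowwise() call is required because of the filter in my function.
I'm worried that this approach may bog down if the source (logger) data file gets large, because filter is called once for every hand measurement.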
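
One way to cut down the repeated filtering is to work per site rather than per row, so the logger table is subset once for each `Site` group. The sketch below uses dplyr's `group_modify()` on hypothetical toy tables (`loggers_toy` and `hand_toy` are illustrative names with made-up values), and assumes every site in the hand measurements also appears in the logger data:

```r
library(dplyr)

# Hypothetical toy tables standing in for loggers and hand_meas
times <- as.POSIXct(c("2019-12-18 12:00:00", "2019-12-18 13:00:00",
                      "2019-12-18 14:00:00"), tz = "UTC")
loggers_toy <- tibble(
  Site     = rep(c("A", "B"), each = 3),
  datetime = rep(times, 2),
  reading  = c(10, 12, 14, 20, 26, 20)
)
hand_toy <- tibble(
  Site     = c("A", "B"),
  datetime = as.POSIXct(c("2019-12-18 12:30:00", "2019-12-18 13:15:00"), tz = "UTC"),
  meas     = c(10.5, 23)
)

with_readings_toy <- hand_toy %>%
  group_by(Site) %>%
  group_modify(function(meas_site, key) {
    # Subset the logger data once per site, not once per hand measurement
    src <- loggers_toy[loggers_toy$Site == key$Site, ]
    meas_site$Logger <- approx(src$datetime, src$reading,
                               xout = meas_site$datetime)$y
    meas_site
  }) %>%
  ungroup()

with_readings_toy$Logger  # 11.0 for site A, 24.5 for site B
```

The number of subsets then scales with the number of sites instead of the number of hand measurements. Whether that is fast enough for a large logger file would need to be checked on the real data.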