Interpolating time series data limited by another variable
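I am looking for a way to interpolate while merging two datasets.
I have one data frame containing readings from a number of different data loggers, and a second data frame with field (hand) measurements. I need to match each hand measurement to the logger reading at the same point in time so that I can compare them and calculate an offset. Unfortunately, the logger readings are taken at regular intervals (hourly), while the hand measurements are not.
I would like to interpolate the logger values to get a logger value at the time of each hand measurement, but I am struggling to find a way to make sure the right measurements are used.
Sample data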
library(tidyverse)
Start.date <- "2019-12-18 00:00:00"
end.date <- "2019-12-20 00:00:00"
set.seed(100)
loggers <- tibble(
  datetime = rep(seq.POSIXt(as.POSIXct(Start.date), as.POSIXct(end.date), by = '1 hour'), 4),
  Site = rep(LETTERS[1:4], each = 49),
  reading = c(rnorm(49, mean = 10, sd = 3),
              rnorm(49, mean = 15, sd = 3),
              rnorm(49, mean = 20, sd = 3),
              rnorm(49, mean = 25, sd = 3)
  )
)
hand_meas <- tibble(
  Site = rep(LETTERS[1:4], each = 2),
  datetime = as.POSIXct(rep(c("2019-12-18 12:35:00", "2019-12-19 13:45:00", "2019-12-18 12:55:00", "2019-12-19 13:15:00"), 2)),
  meas = c(10, 11, 14, 16, 19, 19.2, 23, 24)
)
head(loggers)
# # A tibble: 6 x 3
# datetime Site reading
# <dttm> <chr> <dbl>
# 1 2019-12-18 00:00:00 A 7.65
# 2 2019-12-18 01:00:00 A 6.99
# 3 2019-12-18 02:00:00 A 13.8
# 4 2019-12-18 03:00:00 A 12.3
# 5 2019-12-18 04:00:00 A 11.6
# 6 2019-12-18 05:00:00 A 14.3
head(hand_meas)
# # A tibble: 6 x 3
# Site datetime meas
# <chr> <dttm> <dbl>
# 1 A 2019-12-18 12:35:00 10
# 2 A 2019-12-19 13:45:00 11
# 3 B 2019-12-18 12:55:00 14
# 4 B 2019-12-19 13:15:00 16
# 5 C 2019-12-18 12:35:00 19
# 6 C 2019-12-19 13:45:00 19.2
My typical approach would be to left_join() the logger data to the hand measurements, or to interpolate with approx(), but neither works in this case.
## This fails because it needs exact matches
left_join(hand_meas, loggers, by = c("Site", "datetime"))
# # A tibble: 8 x 4
# Site datetime meas reading
# <chr> <dttm> <dbl> <dbl>
# 1 A 2019-12-18 12:35:00 10 NA
# 2 A 2019-12-19 13:45:00 11 NA
# 3 B 2019-12-18 12:55:00 14 NA
# 4 B 2019-12-19 13:15:00 16 NA
# 5 C 2019-12-18 12:35:00 19 NA
# 6 C 2019-12-19 13:45:00 19.2 NA
# 7 D 2019-12-18 12:55:00 23 NA
# 8 D 2019-12-19 13:15:00 24 NA
## Succeeds, but includes readings from all of the sites
approx(loggers$datetime, loggers$reading, hand_meas$datetime)
# $x
# [1] "2019-12-18 12:35:00 PST" "2019-12-19 13:45:00 PST" "2019-12-18 12:55:00 PST" "2019-12-19 13:15:00 PST"
# [5] "2019-12-18 12:35:00 PST" "2019-12-19 13:45:00 PST" "2019-12-18 12:55:00 PST" "2019-12-19 13:15:00 PST"
#
# $y
# [1] 17.67616 19.19072 17.75920 18.91207 17.67616 19.19072 17.75920 18.91207
#
# Warning message:
# In regularize.values(x, y, ties, missing(ties)) :
# collapsing to unique 'x' values
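The warning accounts for the odd values: with its default `ties = mean`, approx() averages all readings that share a timestamp before interpolating, so the pooled call effectively interpolates the across-site mean. A minimal base R sketch of that behaviour follows; it rebuilds the sample logger data with `tz = "UTC"`, which is an assumption made here for reproducibility, and the names `pooled`, `from_means` and `site_a_only` are illustrative only.

```r
# Rebuild the sample logger data in base R (tz = "UTC" is an assumption
# made here so the example does not depend on the local time zone).
hours <- seq(as.POSIXct("2019-12-18 00:00:00", tz = "UTC"),
             as.POSIXct("2019-12-20 00:00:00", tz = "UTC"), by = "1 hour")
set.seed(100)
loggers <- data.frame(
  datetime = rep(hours, 4),
  Site = rep(LETTERS[1:4], each = 49),
  reading = c(rnorm(49, 10, 3), rnorm(49, 15, 3),
              rnorm(49, 20, 3), rnorm(49, 25, 3)),
  stringsAsFactors = FALSE
)
t_out <- as.numeric(as.POSIXct(c("2019-12-18 12:35:00", "2019-12-19 13:45:00"),
                               tz = "UTC"))

# Pooled call: every timestamp occurs four times (once per site), so
# approx() collapses the ties with mean() before interpolating.
pooled <- suppressWarnings(
  approx(as.numeric(loggers$datetime), loggers$reading, xout = t_out)$y
)

# The same numbers, obtained by averaging across sites explicitly.
# Each column of the matrix holds the 49 readings of one site.
mean_by_hour <- rowMeans(matrix(loggers$reading, nrow = 49))
from_means <- approx(as.numeric(hours), mean_by_hour, xout = t_out)$y

# What is actually wanted for site A: interpolate within that site only.
site_a <- loggers[loggers$Site == "A", ]
site_a_only <- approx(as.numeric(site_a$datetime), site_a$reading, xout = t_out)$y

print(rbind(pooled, from_means, site_a_only))
```

The first two rows should agree, while the site A row differs, which is why the source data has to be restricted by site before interpolating.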
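I could also use data.table to grab the nearest logger value, but my real data fluctuates a lot over the course of a day, so I need to interpolate between the measurements on either side.
# This is close: using data.table to join on the nearest timestamp (see question 31818444)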
library(data.table)
setDT(hand_meas)[, logger_reading := setDT(loggers)[hand_meas, reading, on = c("Site", "datetime"), roll = "nearest"]]
head(hand_meas)
# Site datetime meas logger_reading
# 1: A 2019-12-18 12:35:00 10.0 12.21952
# 2: A 2019-12-19 13:45:00 11.0 13.19621
# 3: B 2019-12-18 12:55:00 14.0 13.86335
# 4: B 2019-12-19 13:15:00 16.0 15.64910
# 5: C 2019-12-18 12:35:00 19.0 20.76380
# 6: C 2019-12-19 13:45:00 19.2 19.54722
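Can anyone suggest a way to do something like approx() while restricting the source data by site? Or a data.table approach that interpolates rather than matching strictly?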
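One possible shape for the data.table variant is two rolling joins, one to the logger reading at or before each hand measurement (`roll = Inf`) and one to the reading at or after it (`roll = -Inf`), followed by a manual linear interpolation between the two. This is a sketch under assumptions rather than a tested solution for real data: `tz = "UTC"` is chosen for reproducibility, and the column names `log_time` and `logger_interp` are hypothetical.

```r
library(data.table)

tz <- "UTC"  # assumption: fixed time zone so the example is reproducible
hours <- seq(as.POSIXct("2019-12-18 00:00:00", tz = tz),
             as.POSIXct("2019-12-20 00:00:00", tz = tz), by = "1 hour")
set.seed(100)
loggers <- data.table(
  Site = rep(LETTERS[1:4], each = 49),
  datetime = rep(hours, 4),
  reading = c(rnorm(49, 10, 3), rnorm(49, 15, 3),
              rnorm(49, 20, 3), rnorm(49, 25, 3))
)
hand_meas <- data.table(
  Site = rep(LETTERS[1:4], each = 2),
  datetime = as.POSIXct(rep(c("2019-12-18 12:35:00", "2019-12-19 13:45:00",
                              "2019-12-18 12:55:00", "2019-12-19 13:15:00"), 2),
                        tz = tz),
  meas = c(10, 11, 14, 16, 19, 19.2, 23, 24)
)

# Keep a copy of the logger timestamp: after a rolling join the join
# column `datetime` holds the hand-measurement time, not the logger time.
loggers[, log_time := datetime]

# One result row per hand measurement, in hand_meas row order.
before <- loggers[hand_meas, on = .(Site, datetime), roll = Inf]   # at or before
after  <- loggers[hand_meas, on = .(Site, datetime), roll = -Inf]  # at or after

# Linear interpolation between the two bracketing readings.
t  <- as.numeric(hand_meas$datetime)
t0 <- as.numeric(before$log_time)
t1 <- as.numeric(after$log_time)
w  <- ifelse(t1 > t0, (t - t0) / (t1 - t0), 0)  # weight 0 on an exact match
hand_meas[, logger_interp := before$reading + w * (after$reading - before$reading)]

print(hand_meas)
```

Hand measurements taken outside a site's logger record have no bracketing reading on one side, so they come back as NA here rather than being extrapolated.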
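I realized I can combine the two approaches by writing a function that restricts the source data by site and then interpolates to the time of the reading.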
approx_by_site = function(site_id, datetime_in) {
  dt = loggers %>%
    filter(Site == site_id)
  out = approx(dt$datetime, dt$reading, datetime_in)
  return(out$y)
}
with_readings = hand_meas %>%
  rowwise() %>% # required or returns bad data
  mutate(Logger = approx_by_site(Site, datetime)) %>%
  ungroup()
with_readings
# # A tibble: 8 x 4
# Site datetime meas Logger
# <chr> <dttm> <dbl> <dbl>
# 1 A 2019-12-18 12:35:00 10 11.0
# 2 A 2019-12-19 13:45:00 11 12.7
# 3 B 2019-12-18 12:55:00 14 13.9
# 4 B 2019-12-19 13:15:00 16 16.1
# 5 C 2019-12-18 12:35:00 19 20.7
# 6 C 2019-12-19 13:45:00 19.2 20.0
# 7 D 2019-12-18 12:55:00 23 24.4
# 8 D 2019-12-19 13:15:00 24 26.9
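This works. I think the rowwise() call is needed because of the filter inside my function.
I am concerned that this approach may bog down as the source (logger) data gets large, because filter is called once for every hand measurement.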
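One way to avoid the per-row filter might be to split the logger data by site once and then call approx() once per site on all of that site's hand measurements. The base R sketch below illustrates the idea; `interpolate_site()` and `loggers_by_site` are hypothetical names, and `tz = "UTC"` is an assumption made for reproducibility.

```r
tz <- "UTC"  # assumption: fixed time zone so the example is reproducible
hours <- seq(as.POSIXct("2019-12-18 00:00:00", tz = tz),
             as.POSIXct("2019-12-20 00:00:00", tz = tz), by = "1 hour")
set.seed(100)
loggers <- data.frame(
  datetime = rep(hours, 4),
  Site = rep(LETTERS[1:4], each = 49),
  reading = c(rnorm(49, 10, 3), rnorm(49, 15, 3),
              rnorm(49, 20, 3), rnorm(49, 25, 3)),
  stringsAsFactors = FALSE
)
hand_meas <- data.frame(
  Site = rep(LETTERS[1:4], each = 2),
  datetime = as.POSIXct(rep(c("2019-12-18 12:35:00", "2019-12-19 13:45:00",
                              "2019-12-18 12:55:00", "2019-12-19 13:15:00"), 2),
                        tz = tz),
  meas = c(10, 11, 14, 16, 19, 19.2, 23, 24),
  stringsAsFactors = FALSE
)

# Split the logger data once; each site's rows are then looked up by name
# instead of being re-filtered for every hand measurement.
loggers_by_site <- split(loggers, loggers$Site)

# Hypothetical helper: interpolate one site's readings at the given times.
interpolate_site <- function(site_id, times) {
  src <- loggers_by_site[[site_id]]
  approx(as.numeric(src$datetime), src$reading, xout = as.numeric(times))$y
}

# One vectorised approx() call per site rather than one per row.
hand_meas$Logger <- NA_real_
for (site_id in unique(hand_meas$Site)) {
  rows <- hand_meas$Site == site_id
  hand_meas$Logger[rows] <- interpolate_site(site_id, hand_meas$datetime[rows])
}

print(hand_meas)
```

The number of approx() calls then scales with the number of sites instead of the number of hand measurements; the same per-site call could presumably be placed inside a grouped mutate() in place of rowwise().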