Interpolate between observations (piecewise approximation) in R
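I am comparing some forecast data against actual values. The forecasts come from three different providers. However, the timestamps of the actual data and the forecast data differ. I would like to compare the error at each point where a forecast was made.
In the snapshots below, I would like to know the difference between each provider's forecast and the actual value. The circled points are forecasts for which no actual data is available, although we can see that there is a clear trend. I think I would be fine with a piecewise approximation, but I am not sure how to do it. I have looked at the answers posted in "Need a R package for piecewise linear regression?", but they did not help much.
10-day example:
1-day example showing the offset between the forecast instants and the actual data:
Sample data (1 day):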
> dput(dt)
structure(list(tme = structure(c(1516221000, 1516224600, 1516228200,
1516231800, 1516235400, 1516239000, 1516242600, 1516246200, 1516249800,
1516253400, 1516257000, 1516260600, 1516264200, 1516267800, 1516271400,
1516275000, 1516278600, 1516282200, 1516285800, 1516289400, 1516293000,
1516296600, 1516300200, 1516303800, 1516307400, 1516226400, 1516230000,
1516233600, 1516237200, 1516240800, 1516244400, 1516248000, 1516251600,
1516255200, 1516258800, 1516262400, 1516266000, 1516269600, 1516273200,
1516276800, 1516280400, 1516284000, 1516287600, 1516291200, 1516294800,
1516298400, 1516302000, 1516305600, 1516221000, 1516224600, 1516228200,
1516231800, 1516235400, 1516239000, 1516242600, 1516246200, 1516249800,
1516253400, 1516257000, 1516260600, 1516264200, 1516267800, 1516271400,
1516275000, 1516278600, 1516282200, 1516285800, 1516289400, 1516293000,
1516296600, 1516300200, 1516303800, 1516307400, 1516233600, 1516244400,
1516255200, 1516266000, 1516276800, 1516287600, 1516298400), tzone = "UTC", class = c("POSIXct",
"POSIXt")), degc = c(2.25, 1.69, 2.22, 2.22, 1.65, 1.12, 2.22,
1.1, 1.13, 2.82, 5.58, 7.8, 7.85, 8.43, 10.05, 10.06, 10.07,
10.03, 8.89, 6.17, 5.04, 5.01, 3.92, 2.29, 2.29, -1, -1, -1,
-1, -1, 0, 1, 2, 4, 6, 7, 8, 8, 9, 9, 9, 7, 6, 4, 3, 2, 2, 1,
-0.16, -1.13, -2.19, -2.98, -3.48, -3.86, -3.84, -2.96, -1.16,
0.91, 2.61, 3.92, 4.84, 5.59, 6.68, 7.41, 6.82, 5.08, 3.07, 1.56,
0.51, -0.36, -1.15, -1.86, -2.53, -0.2, -0.9, 4.1, 6.9, 8.1,
3.6, 2.6), rh = c(0.55, 0.6, 0.51, 0.51, 0.6, 0.52, 0.55, 0.57,
0.6, 0.49, 0.44, 0.41, 0.38, 0.36, 0.33, 0.33, 0.31, 0.33, 0.35,
0.39, 0.4, 0.4, 0.43, 0.49, 0.49, 73, 73, 75, 75, 75, 71, 67,
59, 52, 47, 42, 39, 37, 35, 34, 37, 43, 48, 51, 54, 58, 61, 62,
0.61, 0.64, 0.67, 0.7, 0.72, 0.74, 0.74, 0.71, 0.65, 0.58, 0.54,
0.52, 0.51, 0.5, 0.46, 0.44, 0.45, 0.5, 0.57, 0.61, 0.64, 0.65,
0.67, 0.69, 0.71, 59.1, 62.6, 43.9, 36.7, 33.2, 46.4, 50.1),
type = c("Actual", "Actual", "Actual", "Actual", "Actual",
"Actual", "Actual", "Actual", "Actual", "Actual", "Actual",
"Actual", "Actual", "Actual", "Actual", "Actual", "Actual",
"Actual", "Actual", "Actual", "Actual", "Actual", "Actual",
"Actual", "Actual", "Provider W", "Provider W", "Provider W",
"Provider W", "Provider W", "Provider W", "Provider W", "Provider W",
"Provider W", "Provider W", "Provider W", "Provider W", "Provider W",
"Provider W", "Provider W", "Provider W", "Provider W", "Provider W",
"Provider W", "Provider W", "Provider W", "Provider W", "Provider W",
"Provider D", "Provider D", "Provider D", "Provider D", "Provider D",
"Provider D", "Provider D", "Provider D", "Provider D", "Provider D",
"Provider D", "Provider D", "Provider D", "Provider D", "Provider D",
"Provider D", "Provider D", "Provider D", "Provider D", "Provider D",
"Provider D", "Provider D", "Provider D", "Provider D", "Provider D",
"Provider B", "Provider B", "Provider B", "Provider B", "Provider B",
"Provider B", "Provider B")), .Names = c("tme", "degc", "rh",
"type"), row.names = c(NA, -80L), class = c("data.table", "data.frame"
), .internal.selfref = <pointer: 0x0000000000120788>)
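I am really not sure how to proceed. I need to repeat this exercise for several data sets (a few hundred rows each) with up to 30 variables (the sample data has only two).
I think this is what you are after:
fAct = approxfun(dt$tme[dt$type=='Actual'], dt$degc[dt$type=='Actual'])
This gives a piecewise linear approximation of the actual values. You can then compare it with the values from the different providers. For example,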
> dt[35,]
                   tme degc rh       type
35 2018-01-18 07:00:00    6 47 Provider W
> fAct(dt[35,'tme'])
[1] 6.69
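So Provider W predicted that degc would be 6 at 2018-01-18 07:00:00. The (approximated) actual value is 6.69, so the forecast is off by 0.69 (forecast minus actual is -0.69).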
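The same comparison can be made at every forecast timestamp in one call, because the function returned by approxfun is vectorised. The following is a minimal sketch on a five-row excerpt of the actual data and four Provider W rows copied from the question; the names actual, providerW, actual_interp and error are chosen here for illustration and are not part of the original answer.

```r
# Small excerpt of the question's data (times are seconds since the epoch, UTC)
actual <- data.frame(
  tme  = as.POSIXct(c(1516249800, 1516253400, 1516257000, 1516260600, 1516264200),
                    origin = "1970-01-01", tz = "UTC"),
  degc = c(1.13, 2.82, 5.58, 7.80, 7.85)
)
providerW <- data.frame(
  tme  = as.POSIXct(c(1516251600, 1516255200, 1516258800, 1516262400),
                    origin = "1970-01-01", tz = "UTC"),
  degc = c(2, 4, 6, 7)
)

# Piecewise linear function through the actual observations
fAct <- approxfun(actual$tme, actual$degc)

# Evaluate it at every forecast timestamp, then take forecast minus actual
providerW$actual_interp <- fAct(providerW$tme)
providerW$error <- providerW$degc - providerW$actual_interp
print(providerW)
```

With the default settings, fAct returns NA for timestamps outside the range of the actual observations, so forecasts made before the first or after the last actual reading get no error value.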
Edit
As @RalfStubner pointed out, you can get a smoother (non-linear) approximation with
fAct2 = splinefun(dt$tme[dt$type=='Actual'], dt$degc[dt$type=='Actual'])
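To see how the two approximations differ, both can be evaluated at the same forecast time. This is a rough sketch on a five-point excerpt of the actual data, with the timestamps kept as plain seconds for brevity; f_lin, f_spline and t_forecast are illustrative names. The spline still passes through every observation, but between observations it can overshoot the neighbouring values, which may or may not be desirable for temperature data.

```r
# Excerpt of the question's actual observations (seconds since the epoch, UTC)
tme  <- c(1516249800, 1516253400, 1516257000, 1516260600, 1516264200)
degc <- c(1.13, 2.82, 5.58, 7.80, 7.85)

f_lin    <- approxfun(tme, degc)   # piecewise linear
f_spline <- splinefun(tme, degc)   # cubic spline, default method "fmm"

# Provider W's 07:00:00 forecast time from the example above
t_forecast <- 1516258800

lin_value    <- f_lin(t_forecast)     # midpoint of 5.58 and 7.80
spline_value <- f_spline(t_forecast)  # generally differs from the linear value

c(linear = lin_value, spline = spline_value)
```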
You can use approx like this to linearly interpolate the actual values at each of Provider W's time points:
interpolated <- approx(x = dt[dt$type == "Actual", ]$tme,
y = dt[dt$type == "Actual", ]$degc,
xout = dt[dt$type == "Provider W", ]$tme)
dt[dt$type == "Provider W", ]$degc - interpolated$y
# [1] -2.955 -3.220 -2.935 -2.385 -2.670 -1.660 -0.115 0.025 -0.200 -0.690 -0.825 -0.140 -1.240 -1.055 -1.065 -1.050 -2.460
# [18] -1.530 -1.605 -2.025 -2.465 -1.105 -1.290
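Since the question mentions several providers and up to 30 variables, the approx call can be wrapped in a small helper and applied to every provider/variable pair. The sketch below is one possible way to do that, not the answerer's code: forecast_error, combos, errors and mae are hypothetical names, the data is a short excerpt of the question's dt, and the division of Provider W's rh by 100 rests on the assumption that W reports relative humidity in percent while the actual data uses fractions.

```r
# Excerpt of the question's data in the same long format (tme, degc, rh, type)
sec <- function(s) as.POSIXct(s, origin = "1970-01-01", tz = "UTC")
dt <- data.frame(
  tme  = sec(c(1516249800, 1516253400, 1516257000, 1516260600, 1516264200,
               1516251600, 1516255200, 1516258800, 1516262400,
               1516249800, 1516253400, 1516257000, 1516260600, 1516264200)),
  degc = c(1.13, 2.82, 5.58, 7.80, 7.85,
           2, 4, 6, 7,
           -1.16, 0.91, 2.61, 3.92, 4.84),
  rh   = c(0.60, 0.49, 0.44, 0.41, 0.38,
           59, 52, 47, 42,
           0.65, 0.58, 0.54, 0.52, 0.51),
  type = rep(c("Actual", "Provider W", "Provider D"), times = c(5, 4, 5)),
  stringsAsFactors = FALSE
)

# Assumption: Provider W reports rh in percent while the others use fractions
is_w <- dt$type == "Provider W"
dt$rh[is_w] <- dt$rh[is_w] / 100

# Forecast-minus-actual errors for one provider and one variable
forecast_error <- function(data, provider, variable) {
  act <- data[data$type == "Actual", ]
  fc  <- data[data$type == provider, ]
  # rule = 1 (the default) yields NA outside the range of the actual data
  interp <- approx(x = act$tme, y = act[[variable]], xout = fc$tme)$y
  data.frame(type = provider, variable = variable, tme = fc$tme,
             forecast = fc[[variable]], actual = interp,
             error = fc[[variable]] - interp)
}

# Every provider/variable combination
providers <- setdiff(unique(dt$type), "Actual")
variables <- c("degc", "rh")
combos <- expand.grid(provider = providers, variable = variables,
                      stringsAsFactors = FALSE)

errors <- do.call(rbind, Map(function(p, v) forecast_error(dt, p, v),
                             combos$provider, combos$variable))
rownames(errors) <- NULL

# Mean absolute error per provider and variable
errors$abs_error <- abs(errors$error)
mae <- aggregate(abs_error ~ type + variable, data = errors, FUN = mean)
print(mae)
```

For the full data set, variables could be set to every column name other than tme and type, and the same helper would then cover all 30 variables without further changes.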