Interpolation of missing temperature data in Python

Question

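I have monthly temperature data for several stations in Eastern Siberia. However, one station that is essential for my work is missing a lot of data, while other nearby stations have good coverage. Is there a way to interpolate the missing data based on the behaviour of another dataset? I can't provide any code because I don't know where to start; the datasets look like this:

The red dots are the data from the station with missing values; the green line is the station with good coverage.

I would be grateful if someone could point me in the right direction.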

Answer 1

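There are ways to do this. For example, you could apply an FFT to the dataset with good coverage and, while dropping the high-frequency terms, see how well it fits the dataset with poor coverage.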
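A minimal sketch of that FFT idea, for illustration only. The data here are synthetic stand-ins (a seasonal sine plus noise), and the helper `lowpass_fft` and the cutoff `keep=15` are assumptions, not something taken from the question:

```python
import numpy as np

def lowpass_fft(values, keep):
    """Keep only the `keep` lowest-frequency FFT terms of a regularly sampled series."""
    spectrum = np.fft.rfft(values)
    spectrum[keep:] = 0  # drop the high-frequency terms
    return np.fft.irfft(spectrum, n=len(values))

# Synthetic stand-in data (assumption): 10 years of monthly temperatures.
rng = np.random.default_rng(0)
months = np.arange(120)
seasonal = -5 + 20 * np.sin(2 * np.pi * months / 12)
good = seasonal + rng.normal(0, 1.0, months.size)    # well-covered station
sparse = seasonal + rng.normal(0, 1.0, months.size)  # station with gaps
observed = rng.random(months.size) > 0.6             # only some months are observed
sparse_obs = np.where(observed, sparse, np.nan)

# Smooth the well-covered station, then compare both versions
# against the sparse station at the months where it has data.
smooth = lowpass_fft(good, keep=15)
rmse_smooth = np.sqrt(np.nanmean((smooth - sparse_obs) ** 2))
rmse_raw = np.sqrt(np.nanmean((good - sparse_obs) ** 2))
print(f"RMSE smoothed: {rmse_smooth:.2f}, RMSE raw: {rmse_raw:.2f}")
```

The FFT needs a regularly sampled series without gaps, which is why it is applied to the well-covered station only.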

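However, I very much doubt this would be useful: the high-coverage dataset already fits the low-coverage dataset almost perfectly. Whatever method you apply, the best function that resembles the high-coverage dataset while fitting the low-coverage one is the high-coverage dataset itself.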
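One way to check that claim on your own data is to compare the two stations at the timestamps where both have values, and then fill the gaps directly from the well-covered station. This is only a sketch on synthetic data; the series names `good` and `sparse` and the constant-offset correction `bias` are assumptions added for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): two stations sharing a seasonal cycle.
rng = np.random.default_rng(1)
idx = pd.date_range("2000-01-01", periods=120, freq="MS")
seasonal = -5 + 20 * np.sin(2 * np.pi * np.arange(120) / 12)
good = pd.Series(seasonal + rng.normal(0, 1, 120), index=idx, name="good")
sparse = pd.Series(seasonal + 1.5 + rng.normal(0, 1, 120), index=idx, name="sparse")
sparse[rng.random(120) > 0.4] = np.nan  # most months are missing

# Agreement on the overlapping months only.
both = pd.concat([good, sparse], axis=1).dropna()
corr = both["good"].corr(both["sparse"])
bias = (both["sparse"] - both["good"]).mean()
print(f"correlation: {corr:.3f}, mean offset: {bias:.2f}")

# Fill the gaps with the well-covered station, shifted by the mean offset.
filled = sparse.fillna(good + bias)
```

If the correlation is high and the offset is stable, this simple substitution may already be good enough.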

Answer 2

Let's create a trial dataset for your problem:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

t = np.linspace(0, 30*2*np.pi, 30*24*2)
# "30min" replaces the "30T" alias, which is deprecated in recent pandas versions:
td = pd.date_range("2020-01-01", freq='30min', periods=t.size)

T0 = np.sin(t)*8 - 15 + np.random.randn(t.size)*0.2
T1 = np.sin(t)*7 - 13 + np.random.randn(t.size)*0.1
T2 = np.sin(t)*9 - 10 + np.random.randn(t.size)*0.3
T3 = np.sin(t)*8.5 - 11 + np.random.randn(t.size)*0.5
T = np.vstack([T0, T1, T2, T3]).T

features = pd.DataFrame(T, columns=["s1", "s2", "s3", "s4"], index=td)

It looks like this:

axe = features[:"2020-01-04"].plot()
axe.legend()
axe.grid()

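Then, if your time series are well linearly correlated, you can simply predict the missing values by means of an ordinary least squares regression. SciKit-Learn provides a handy interface to perform this kind of computation: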

from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Remove target site from features:
target = features.pop("s4")

# Split dataset into train (actual data) and test (missing temperatures):
x_train, x_test, y_train, y_test = train_test_split(features, target, train_size=0.25, random_state=123)

# Create a Linear Regressor and train it:
reg = linear_model.LinearRegression()
reg.fit(x_train, y_train)

# Assess regression score with test data:
reg.score(x_test, y_test) # 0.9926150729585087

# Predict missing values:
ypred = reg.predict(x_test)
ypred = pd.DataFrame(ypred, index=x_test.index, columns=["s4p"])

The result looks like this:

axe = features[:"2020-01-04"].plot()
target[:"2020-01-04"].plot(ax=axe)
# The test index is shuffled, so sort it before slicing by date:
ypred.sort_index()[:"2020-01-04"].plot(ax=axe, linestyle='None', marker='.')
axe.legend()
axe.grid()

# Sort by time so the error is plotted in chronological order:
error = (y_test - ypred.squeeze()).sort_index()
axe = error.plot()
axe.legend(["Prediction Error"])
axe.grid()
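With real data there is no need for `train_test_split`: the rows where the target station has a value are the training set, and the rows where it is NaN are the ones to predict. Below is a minimal sketch of that workflow under stated assumptions: the data are synthetic, and the column names `s1`, `s2`, `s4` and the 75% missing fraction are chosen for illustration only:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption): s1 and s2 are complete, s4 has gaps.
rng = np.random.default_rng(42)
t = np.linspace(0, 10 * 2 * np.pi, 480)
index = pd.date_range("2020-01-01", freq="30min", periods=t.size)
data = pd.DataFrame({
    "s1": np.sin(t) * 8 - 15 + rng.normal(0, 0.2, t.size),
    "s2": np.sin(t) * 7 - 13 + rng.normal(0, 0.1, t.size),
    "s4": np.sin(t) * 8.5 - 11 + rng.normal(0, 0.5, t.size),
}, index=index)
truth = data["s4"].copy()  # kept only to check the result afterwards
data.loc[data.sample(frac=0.75, random_state=0).index, "s4"] = np.nan

# Train on rows where the target is known, predict where it is missing.
known = data["s4"].notna()
predictors = ["s1", "s2"]
reg = LinearRegression().fit(data.loc[known, predictors], data.loc[known, "s4"])

filled = data["s4"].copy()
filled.loc[~known] = reg.predict(data.loc[~known, predictors])

rmse = np.sqrt(((filled[~known] - truth[~known]) ** 2).mean())
print(f"RMSE on the filled values: {rmse:.2f}")
```

The predictor stations must have values at every timestamp where the target is missing; rows with gaps in the predictors would need to be dropped or handled separately.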
