数据帧中的线性外推

Question

我有一个数据集，其中包含 2009 年至 2019 年区域级别的家庭数量。数据集非常完整，但缺少一些数据。例如，我有这两个区域，IE01 和 IE04：

n2hn_df.loc['IE01']

    Out[2]: 
    2009    455300.0
    2010    460600.0
    2011    465500.0
    2012         NaN
    2013         NaN
    2014         NaN
    2015         NaN
    2016         NaN
    2017         NaN
    2018         NaN
    2019         NaN
    Name: IE01, dtype: float64



n2hn_df.loc['IE04']
Out[3]: 
2009         NaN
2010         NaN
2011         NaN
2012    320700.0
2013    315300.0
2014    310500.0
2015    307500.0
2016    315400.0
2017    323300.0
2018    329300.0
2019    339700.0
Name: IE04, dtype: float64

我想用 线性外推 来完成数据集（因为多年来家庭数量没有发生太大变化）。我知道插值很容易，但是

n2hn_df.interpolate(method='linear',axis=1,limit_direction='both',inplace=True)

只用在两个方向上找到的最接近的值填充数据集。我还没有找到一种简单的方法来推断数据框中的数据，所以我想就最好的方法征求您的意见。如果您能提供任何帮助，我将不胜感激。提前致谢！

编辑：我想从中推断数据的数据帧示例是：

Answer 1

我刚才也做过类似的事情。它不是超级漂亮，但也许你可以使用它。作为示例，我使用以下 DataFrame（第二个示例的修改版本）：

         value
year          
2009       NaN
2010       NaN
2011       NaN
2012  320700.0
2013  315300.0
2014  310500.0
2015  307500.0
2016  315400.0
2017       NaN
2018       NaN
2019       NaN

year 就是 index!

1。 step 正在填满 NaNs:

的结尾部分

increment = df.value.diff(1).mean()
idx_max_notna = df.value[df.value.notna()].index.array[-1]
idx = df.index[df.index >= idx_max_notna]
df.value[idx] = df.value[idx].fillna(increment).cumsum()

结果：

         value
year          
2009       NaN
2010       NaN
2011       NaN
2012  320700.0
2013  315300.0
2014  310500.0
2015  307500.0
2016  315400.0
2017  314075.0
2018  312750.0
2019  311425.0

作为increment我使用了现有diffs的mean。如果你想使用最后一个 diff 然后将其替换为：

increment = df.value.diff(1)[df.value.notna()].array[-1]

2。 step填充NaNs的开始部分大致相同，只是将列value反转，最后重新反转：

df.value = df.value.array[::-1]
increment = df.value.diff(1).mean()
idx_max_notna = df.value[df.value.notna()].index.array[-1]
idx = df.index[df.index >= idx_max_notna]
df.value[idx] = df.value[idx].fillna(increment).cumsum()
df.value = df.value.array[::-1]

结果：

         value
year          
2009  324675.0
2010  323350.0
2011  322025.0
2012  320700.0
2013  315300.0
2014  310500.0
2015  307500.0
2016  315400.0
2017  314075.0
2018  312750.0
2019  311425.0

重要提示：该方法假定指数中没有差距（缺失年份）。

正如我所说，不是很漂亮，但对我有用。

(PS：只是为了澄清上面'similar'的用法：这确实是线性外推。)

编辑

示例框架（截图中框架的前3行）：

n2hn_df = pd.DataFrame(
        {'2010': [134.024, np.NaN, 36.711], '2011': [134.949, np.NaN, 41.6533],
         '2012': [128.193, np.NaN, 33.4578], '2013': [125.131, np.NaN, 33.4578],
         '2014': [122.241, np.NaN, 33.6356], '2015': [115.301, np.NaN, 35.5919],
         '2016': [108.927, 520.38, 40.1008], '2017': [106.101, 523.389, 41.38],
         '2018': [96.1861, 526.139, 49.0906], '2019': [np.NaN, np.NaN, np.NaN]},
        index=pd.Index(data=['AT', 'BE', 'BG'], name='NUTS_ID')
    )

            2010      2011      2012  ...     2017      2018  2019
NUTS_ID                               ...                         
AT       134.024  134.9490  128.1930  ...  106.101   96.1861   NaN
BE           NaN       NaN       NaN  ...  523.389  526.1390   NaN
BG        36.711   41.6533   33.4578  ...   41.380   49.0906   NaN

外推法：

# Transposing frame
n2hn_df = n2hn_df.T
for col in n2hn_df.columns:
    # Extract column
    ser = n2hn_df[col].copy()

    # End piece
    increment = ser.diff(1).mean()
    idx_max_notna = ser[ser.notna()].index.array[-1]
    idx = ser.index[ser.index >= idx_max_notna]
    ser[idx] = ser[idx].fillna(increment).cumsum()

    # Start piece
    ser = pd.Series(ser.array[::-1])
    increment = ser.diff(1).mean()
    idx_max_notna = ser[ser.notna()].index.array[-1]
    idx = ser.index[ser.index >= idx_max_notna]
    ser[idx] = ser[idx].fillna(increment).cumsum()
    n2hn_df[col] = ser.array[::-1]

# Re-transposing frame
n2hn_df = n2hn_df.T

结果：

            2010      2011      2012  ...     2017      2018        2019
NUTS_ID                               ...                               
AT       134.024  134.9490  128.1930  ...  106.101   96.1861   91.456362
BE       503.103  505.9825  508.8620  ...  523.389  526.1390  529.018500
BG        36.711   41.6533   33.4578  ...   41.380   49.0906   50.638050

数据帧中的线性外推

Linear extrapolation in dataframes

python

dataset

scipy

dataframe

pandas

编辑