Backfill values by distributing values across prior NaNs in a timeseries with pandas

Question

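I have a time series in which each observation represents the total amount of something since the previous observation. If there is no observation at a given timestep, the value is reported as NaN. An example of the format: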

Timestep  Value
1          10
2          NaN
3          NaN
4          9
5          NaN
6          NaN
7          NaN
8          16
9          NaN
10         NaN

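What I want to do is distribute each observed value evenly across the NaNs that precede it and its own timestep. For example, a series like [5, NaN, NaN, 6] would become [5, 2, 2, 2]: the final observation of 6 is split over the 2 preceding NaNs and the observation's own position. Applied to the dataframe above, the desired output would be: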

Timestep  Value
1          10
2          3
3          3
4          3
5          4
6          4
7          4
8          4
9          NaN
10         NaN

I have tried several of the pandas backfill and interpolation methods for this, but have not found one that does exactly what I need.
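
For anyone who wants to reproduce the example, the frame can be built roughly as follows. This is a minimal sketch: the name `df` matches what the answers use, while `expected` is only an illustrative name for the desired output shown above.

```python
import numpy as np
import pandas as pd

# Example frame from the question; NaN marks timesteps without an observation.
df = pd.DataFrame({
    "Timestep": range(1, 11),
    "Value": [10, np.nan, np.nan, 9, np.nan, np.nan, np.nan, 16, np.nan, np.nan],
})

# Desired result (illustrative name): each observation is shared equally with
# the NaNs directly before it; trailing NaNs have no later observation.
expected = pd.Series([10, 3, 3, 3, 4, 4, 4, 4, np.nan, np.nan], name="Value")
```

Note that the total is preserved: both `df["Value"]` and `expected` sum to 35 when NaNs are skipped.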

Answer 1

`transform`

df.Value.bfill().div(
    df.groupby(df.Value.notna()[::-1].cumsum()).Value.transform('size')
)

0    10.0
1     3.0
2     3.0
3     3.0
4     4.0
5     4.0
6     4.0
7     4.0
8     NaN
9     NaN
Name: Value, dtype: float64

`np.bincount` and `pd.factorize`

a = df.Value.notna().values
f, u = pd.factorize(a[::-1].cumsum()[::-1])

df.Value.bfill().div(np.bincount(f)[f])

0    10.0
1     3.0
2     3.0
3     3.0
4     4.0
5     4.0
6     4.0
7     4.0
8     NaN
9     NaN
Name: Value, dtype: float64

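An alternative, shorter version. This works because the result of cumsum is already a set of small non-negative integer labels, which is exactly what factorize would have produced for me.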

a = df.Value.notna().values[::-1].cumsum()[::-1]
df.Value.bfill().div(np.bincount(a)[a])
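
To check that the two variants agree, the sketch below compares the group sizes obtained with and without `pd.factorize` on the question's data. The variable names (`values`, `labels`, `codes`, and so on) are illustrative and not part of the answer.

```python
import numpy as np
import pandas as pd

values = pd.Series(
    [10, np.nan, np.nan, 9, np.nan, np.nan, np.nan, 16, np.nan, np.nan],
    name="Value",
)

# Count observations from the end, then flip back to the original order.
# Every row in the same run (NaNs plus the observation that closes it)
# receives the same integer label.
labels = values.notna().to_numpy()[::-1].cumsum()[::-1]

# Variant 1: factorize the labels first, then count each code.
codes, _ = pd.factorize(labels)
sizes_factorized = np.bincount(codes)[codes]

# Variant 2: the labels are already non-negative integers, so bincount
# can consume them directly.
sizes_direct = np.bincount(labels)[labels]

result = values.bfill().div(sizes_direct)
```

Both variants give the same per-row group sizes, so dividing the backfilled series by either one yields the same result.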

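Details

In both options above, we need to identify which positions are not null and use cumsum on the reversed series to define the groups. In the transform option, I use groupby and size to count how many rows fall into each of those groups.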

The second option uses bin counting and indexing to arrive at the same series of group sizes.
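
The grouping can be traced step by step on the short series from the question, [5, NaN, NaN, 6]. This is only a sketch: `group_id` and `group_size` are illustrative names, and the `sort_index()` call is added purely to make the alignment visible (groupby aligns a Series key by index anyway).

```python
import numpy as np
import pandas as pd

s = pd.Series([5, np.nan, np.nan, 6], name="Value")

# Reverse, count observations cumulatively, then restore the original order.
# Rows that share an id belong to the same observation.
group_id = s.notna()[::-1].cumsum().sort_index()   # 2, 1, 1, 1

# Number of rows in each group, broadcast back to every row.
group_size = s.groupby(group_id).transform("size")  # 1, 3, 3, 3

# Backfill puts the observation on every row of its group; dividing by the
# group size splits it evenly.
result = s.bfill() / group_size                     # 5, 2, 2, 2
```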

Thanks to @ScottBoston for reminding me to mention the reversal `[::-1]`.

Answer 2

Compute the cumulative count of non-NA values, shift it by one row to get the group labels, and then apply `update`:

s=df.Value.notnull().cumsum().shift(1)
df.Value.update(df.Value.bfill()/s.groupby(s).transform('count'))
df
Out[885]: 
   Timestep  Value
0         1   10.0
1         2    3.0
2         3    3.0
3         4    3.0
4         5    4.0
5         6    4.0
6         7    4.0
7         8    4.0
8         9    NaN
9        10    NaN
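
A variation on the same idea, wrapped in a function that returns a new Series instead of updating in place. This is a sketch rather than the answer's exact code: the helper name `spread_backwards` is hypothetical, and it uses `shift(1, fill_value=0)` so that no group label is NaN. Returning a new Series also avoids depending on whether `df.Value.update(...)` writes through to `df`, which, as far as I understand, is not guaranteed when pandas' Copy-on-Write mode is enabled.

```python
import numpy as np
import pandas as pd

def spread_backwards(values: pd.Series) -> pd.Series:
    """Share each observation equally with the NaNs directly before it.

    Hypothetical helper, written as a sketch of the shifted-cumsum approach.
    """
    # Number of observations strictly before each row. Rows in the same run
    # (NaNs plus the observation that ends the run) get the same label.
    key = values.notna().cumsum().shift(1, fill_value=0)
    # Size of each run, broadcast to every row of that run.
    run_size = key.groupby(key).transform("size")
    # Trailing NaNs have nothing to backfill from, so they stay NaN.
    return values.bfill() / run_size

df = pd.DataFrame({
    "Timestep": range(1, 11),
    "Value": [10, np.nan, np.nan, 9, np.nan, np.nan, np.nan, 16, np.nan, np.nan],
})
df["Value"] = spread_backwards(df["Value"])
```

Assigning the result back with `df["Value"] = ...` keeps the operation explicit, and the function also handles a series that starts with NaNs, since those rows are grouped with the first observation.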
