Efficient solution for forward filling missing values in a pandas dataframe column?

Question

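I need to forward fill values in a dataframe column within groups. I should note that the first value in a group is never missing, by construction. I currently have the following solutions.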

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2, 2], 'b': [1, np.nan, 2, np.nan, np.nan]})

# desired output
a   b
1   1
1   1
2   2
2   2
2   2

Here are the three solutions I have tried so far.

# really slow solutions
df['b'] = df.groupby('a')['b'].transform(lambda x: x.fillna(method='ffill'))
df['b'] = df.groupby('a')['b'].fillna(method='ffill')

# much faster solution, but more memory intensive and ugly all around
tmp = df.drop_duplicates('a', keep='first')
df.drop('b', inplace=True, axis=1)
df = df.merge(tmp, on='a')

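All three produce my desired output, but the first two take a very long time on my data set, and the third uses more memory and feels rather clunky. Is there another way to forward fill the column?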

Answer 1

How about this?

df.groupby('a').b.transform('ffill')
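
As a minimal sketch of this suggestion, the snippet below rebuilds the question's sample frame and assigns the result back to column b. It also calls `GroupBy.ffill()`, which should give the same values; both pass the fill to pandas' built-in groupby machinery instead of a per-group Python lambda. The variable names `via_transform` and `via_ffill` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2, 2], 'b': [1, np.nan, 2, np.nan, np.nan]})

# Forward fill within each group of 'a' using the built-in groupby fill
via_transform = df.groupby('a')['b'].transform('ffill')

# Shorthand that should produce the same values
via_ffill = df.groupby('a')['b'].ffill()

# Assign back to the original column
df['b'] = via_transform
print(df)
```

Whether this is fast enough on a large data set depends on the pandas version and the number of groups, so it is worth timing on real data.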

Answer 2

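Using ffill() directly gives the best result. This ignores the groups entirely, so it is only safe because, as the question states, the first value in each group is never missing; it also assumes each group's rows are adjacent. Here is a comparison: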

%timeit df.b.ffill(inplace = True)
best of 3: 311 µs per loop

%timeit df['b'] = df.groupby('a')['b'].transform(lambda x: x.fillna(method='ffill'))
best of 3: 2.34 ms per loop

%timeit df['b'] = df.groupby('a')['b'].fillna(method='ffill')
best of 3: 4.41 ms per loop
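
The plain ffill() is fastest because it skips the grouping step, which is also why it can give a different answer from the groupwise fill. The sketch below (illustrative data and variable names, not part of the original answer) compares the two on the question's frame and on a frame where a group starts with NaN.

```python
import numpy as np
import pandas as pd

# Case 1: the question's data; rows of each group are adjacent and
# the first value of every group is present
ok = pd.DataFrame({'a': [1, 1, 2, 2, 2], 'b': [1, np.nan, 2, np.nan, np.nan]})
plain_ok = ok['b'].ffill()
grouped_ok = ok.groupby('a')['b'].ffill()

# Case 2: group 2 starts with NaN, so a plain ffill carries over
# the last value of group 1
bad = pd.DataFrame({'a': [1, 1, 2, 2, 2], 'b': [1, np.nan, np.nan, 2, np.nan]})
plain_bad = bad['b'].ffill()
grouped_bad = bad.groupby('a')['b'].ffill()

print(plain_ok.equals(grouped_ok))    # same result when the guarantee holds
print(plain_bad.equals(grouped_bad))  # results differ when it does not
```

In the second case the groupwise fill leaves the leading NaN untouched, while the plain fill replaces it with a value from another group.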

Answer 3

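To be robust, you need to sort by both columns: df.sort_values(['a', 'b']).ffill(). If an np.nan is left in the first position within a group, ffill will fill it with a value from the previous group. Because np.nan is placed at the end of any sort, sorting by a and b ensures that no group has an np.nan at the front. You can then use .loc or .reindex with the initial index to get back your original order.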

This will obviously be a bit slower than the other proposals... However, I argue it will be correct where the others are not.

Demonstration

Consider the dataframe df

df = pd.DataFrame({'a': [1,1,2,2,2], 'b': [1, np.nan, np.nan, 2, np.nan]})

print(df)

   a    b
0  1  1.0
1  1  NaN
2  2  NaN
3  2  2.0
4  2  NaN

Try

df.sort_values('a').ffill()

   a    b
0  1  1.0
1  1  1.0
2  2  1.0  # <--- this is incorrect
3  2  2.0
4  2  2.0

Instead, do

df.sort_values(['a', 'b']).ffill().loc[df.index]

   a    b
0  1  1.0
1  1  1.0
2  2  2.0
3  2  2.0
4  2  2.0

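Special note
This is still incorrect if every value in a group is missing: that group is then filled with the previous group's value.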
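
One possible way to handle the all-missing case, offered as a sketch and not as part of the original answer: after the sort-and-fill step, blank out the rows of any group that had no observed value to begin with. The names `leaked`, `has_value` and `guarded` are hypothetical.

```python
import numpy as np
import pandas as pd

# Group a == 2 has no observed value at all
df = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [1, np.nan, np.nan, np.nan]})

# The sort-and-fill approach carries group 1's value into group 2
leaked = df.sort_values(['a', 'b']).ffill().loc[df.index]

# Guard: count() ignores NaN, so it is 0 for groups with nothing observed
has_value = df.groupby('a')['b'].transform('count') > 0

# Keep filled values only where the group had at least one real value
guarded = leaked.copy()
guarded['b'] = leaked['b'].where(has_value)
print(guarded)
```

The extra groupby adds cost, so this guard only makes sense when all-missing groups can actually occur in the data.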

python

missing-data

pandas