How do I take the mean on either side of a value in a pandas DataFrame?

Question

I have a pandas DataFrame whose index is a datetime at every 12 minutes of a day (120 rows in total). I went ahead and resampled the data to every 30 minutes.

                 Time  Rain_Rate
1 2014-04-02 00:00:00       0.50
2 2014-04-02 00:30:00       1.10
3 2014-04-02 01:00:00       0.48
4 2014-04-02 01:30:00       2.30
5 2014-04-02 02:00:00       4.10
6 2014-04-02 02:30:00       5.00
7 2014-04-02 03:00:00       3.20

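I want to take 3-hour means centered on hours 00, 03, 06, 09, 12, 15, 18, and 21. Each mean should include the 1.5 hours before the center (for 03:00:00, from 01:30:00) and the 1.5 hours after it (for 03:00:00, up to 04:30:00). The 06:00:00 mean would overlap with the 03:00:00 mean (both would use 04:30:00). Is there a way to do this with pandas? I've tried a few things, but they haven't worked.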

Answer 1

Method 1

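I'd suggest changing your resample from the start so that it gives you the chunks you want. Here's some fake data resembling yours, before any resampling at all: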

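import numpy as np
import pandas as pd

# pandas >= 1.4 uses inclusive='left'; on older versions pass closed='left' instead
dr = pd.date_range('04-02-2014 00:00:00', '04-03-2014 00:00:00', freq='12min', inclusive='left')
data = np.random.rand(120)
df = pd.DataFrame(data, index=dr, columns=['Rain_Rate'])
df.index.name = 'Time'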

#df.head()
                     Rain_Rate
Time                          
2014-04-02 00:00:00   0.616588
2014-04-02 00:12:00   0.201390
2014-04-02 00:24:00   0.802754
2014-04-02 00:36:00   0.712743
2014-04-02 00:48:00   0.711766

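Taking the mean over 3-hour chunks from the start is essentially the same as first doing 30-minute chunks and then 3-hour chunks (the two agree exactly only when every 30-minute chunk holds the same number of points). You just need to adjust a couple of things to get the bins you want. First, you can add the bin you will start from (i.e. 10:30 pm on the previous day, even though there's no data there; the first bin runs from 10:30 pm to 1:30 am), and then resample starting from that point: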

before = df.index[0] - pd.Timedelta(minutes=90) #only if the first index is at midnight!!!
df.loc[before] = np.nan
df = df.sort_index()

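# The original call, df.resample('3H', base=22.5, loffset='90min').mean(), relies on
# `base` and `loffset`, which were deprecated in pandas 1.1 and removed in 2.0.
output = df.resample('3h', offset='90min').mean()       # bin edges at 22:30, 01:30, 04:30, ...
output.index = output.index + pd.Timedelta(minutes=90)  # label each bin by its center

In the original call, the base argument meant "start at hour 22.5" (10:30 pm) and loffset meant "push the bin labels 90 minutes later". Here offset='90min' puts the bin edges in the same places, and adding 90 minutes to the index does the relabelling. You get the following output: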

                     Rain_Rate
Time                          
2014-04-02 00:00:00   0.555515
2014-04-02 03:00:00   0.546571
2014-04-02 06:00:00   0.439953
2014-04-02 09:00:00   0.460898
2014-04-02 12:00:00   0.506690
2014-04-02 15:00:00   0.605775
2014-04-02 18:00:00   0.448838
2014-04-02 21:00:00   0.387380
2014-04-03 00:00:00   0.604204  #this is the bin at midnight on the following day
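
The binning described above can be checked on data whose means are easy to work out by hand. This is only a minimal sketch, assuming pandas 1.1 or later (where resample accepts `offset`, used here with an index shift in place of `base`/`loffset`). The names `centered_3h_mean` and `demo` are made up for this illustration.

```python
import numpy as np
import pandas as pd

def centered_3h_mean(frame):
    """Mean over 3-hour bins [c - 1.5h, c + 1.5h), labelled by the center c."""
    out = frame.resample('3h', offset='90min').mean()   # edges at 22:30, 01:30, ...
    out.index = out.index + pd.Timedelta(minutes=90)    # move labels from left edge to center
    return out

# 120 rows on a 12-minute grid; the value of each row is just its position
idx = pd.date_range('2014-04-02', periods=120, freq='12min')
demo = pd.DataFrame({'Rain_Rate': np.arange(120, dtype=float)}, index=idx)

result = centered_3h_mean(demo)
print(result)
```

The first bin only holds the rows from 00:00 to 01:24 (positions 0 to 7), so its mean is 3.5. The bin labelled 03:00 holds positions 8 to 22, so its mean is 15.0.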

You can also start from the 30-minute binned data with this method, and you should get the same answer.*

Method 2

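Another approach is to find the positions of the index values you want to build means around, and then compute the mean of the entries within the 3 hours surrounding each one: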

resampled = df.resample('30min').mean()  # like your data in the post; swap it in for df below to use it

centers = [0, 3, 6, 9, 12, 15, 18, 21]

# timestamps that fall exactly on one of the center hours
mask = df.index.hour.isin(centers) & (df.index.minute == 0)
df_centers = df.index[mask]

output = []
for center in df_centers:
    cond1 = df.index >= center - pd.Timedelta(hours=1.5)  # left edge, inclusive
    cond2 = df.index <= center + pd.Timedelta(hours=1.5)  # right edge, inclusive
    output.append(df.loc[cond1 & cond2, 'Rain_Rate'].mean())  # Series.mean skips NaN

The output here is the same, but the answer is in a list (and does not include the final "24-hour" point):

[0.5555146139562004,
 0.5465709237162698,
 0.43995277270996735,
 0.46089800625663596,
 0.5066902552121085,
 0.6057747262752732,
 0.44883794039466535,
 0.3873795731806939]
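
If a labelled result is more convenient than a bare list, the same loop can be wrapped in a small helper that returns a Series indexed by the center times. This is only a sketch of the idea above, not a pandas built-in; the name `window_means` and the 30-minute sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

def window_means(series, centers, half_width=pd.Timedelta(minutes=90)):
    """Mean of `series` within [c - half_width, c + half_width], both edges inclusive."""
    means = {}
    for c in centers:
        in_window = (series.index >= c - half_width) & (series.index <= c + half_width)
        means[c] = series[in_window].mean()
    return pd.Series(means, name=series.name)

# 30-minute data like the table in the question; each value is just its position
idx30 = pd.date_range('2014-04-02', periods=48, freq='30min')
rain = pd.Series(np.arange(48, dtype=float), index=idx30, name='Rain_Rate')

# center times: 00:00, 03:00, ..., 21:00
centers = idx30[(idx30.hour % 3 == 0) & (idx30.minute == 0)]
result = window_means(rain, centers)
print(result)
```

On this 30-minute grid the 04:30 row (position 9) is used by both the 03:00 window (positions 3 to 9, mean 6.0) and the 06:00 window (positions 9 to 15, mean 12.0), which is the overlap asked for in the question.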

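*You mentioned that you want some points on the bin edges to be included in both bins. resample doesn't do this (and I think most people generally wouldn't want it to), but the second method does so explicitly (by using >= and <= in cond1 and cond2). The two methods nevertheless give the same result here, presumably because using resample at different stages causes data points to land in different bins. One likely contributing factor is that in the 12-minute sample data no timestamp falls exactly on a bin edge (01:30, 04:30, ...), so inclusive and exclusive edges select the same rows. I have a hard time generalizing this, but you can do some manual binning to verify what is going on. The point is that I'd recommend spot-checking the output of these methods (or of any resample-based method) against your original data to make sure everything looks right. For these examples, I used Excel.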
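
The spot check can also be done in pandas itself. The sketch below assumes pandas 1.1 or later and uses made-up helper names (`resample_means`, `inclusive_means`). It compares the half-open resample bins with the inclusive windows, once on a 12-minute grid and once on the 30-minute means.

```python
import numpy as np
import pandas as pd

HALF = pd.Timedelta(minutes=90)

def resample_means(series):
    """Half-open bins [c - 1.5h, c + 1.5h), labelled by the center c."""
    out = series.resample('3h', offset='90min').mean()
    out.index = out.index + HALF
    return out

def inclusive_means(series, centers):
    """Windows [c - 1.5h, c + 1.5h] with both edges included."""
    return pd.Series({
        c: series[(series.index >= c - HALF) & (series.index <= c + HALF)].mean()
        for c in centers
    })

rng = np.random.default_rng(0)
raw = pd.Series(rng.random(120),
                index=pd.date_range('2014-04-02', periods=120, freq='12min'))
half_hourly = raw.resample('30min').mean()
centers = pd.date_range('2014-04-02', periods=8, freq='3h')  # 00:00 ... 21:00

# 12-minute grid: no row sits on an hh:30 edge, so both approaches pick the same rows
same_on_raw = np.allclose(resample_means(raw).loc[centers].to_numpy(),
                          inclusive_means(raw, centers).to_numpy())

# 30-minute grid: rows at 01:30, 04:30, ... are counted by the inclusive windows only
same_on_half_hourly = np.allclose(resample_means(half_hourly).loc[centers].to_numpy(),
                                  inclusive_means(half_hourly, centers).to_numpy())

print(same_on_raw, same_on_half_hourly)
```

If the two flags differ on your own data, the rows sitting exactly on a bin edge are the first place to look.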
