Keep column when resampling hourly to daily data in pandas

I have a dataset of hourly weather observations in this format:

import pandas as pd

df = pd.DataFrame({'date': ['2019-01-01 09:30:00', '2019-01-01 10:00:00', '2019-01-02 04:30:00', '2019-01-02 05:00:00', '2019-07-04 02:00:00'],
                   'windSpeedHigh': [155, 90, 35, 45, 15],
                   'windSpeedHigh_Dir': ['NE', 'NNW', 'SW', 'W', 'S']})

My goal is to find the highest wind speed for each day, together with the wind direction associated with that daily maximum.

Using resample, I managed to find the maximum wind speed for each day, but not the direction associated with it:

df['date'] = pd.to_datetime(df['date'])
df['windSpeedHigh'] = pd.to_numeric(df['windSpeedHigh'])
df_daily = df.resample('D', on='date')[['windSpeedHigh_Dir','windSpeedHigh']].max()
df_daily

Result:

windSpeedHigh_Dir   windSpeedHigh
date        
2019-01-01  NNW 155.0
2019-01-02  W   45.0
2019-01-03  NaN NaN
2019-01-04  NaN NaN
2019-01-05  NaN NaN
... ... ...
2019-06-30  NaN NaN
2019-07-01  NaN NaN
2019-07-02  NaN NaN
2019-07-03  NaN NaN
2019-07-04  S   15.0

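This is incorrect, because the resample also takes the max() of 'windSpeedHigh_Dir' on its own. For 2019-01-01, the direction associated with the maximum wind speed should be 'NE', not 'NNW', because the maximum wind speed occurred on the row where df['windSpeedHigh_Dir'] == 'NE'.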
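To make the mismatch concrete, here is a minimal sketch for the first day only. It rebuilds the sample data with every timestamp written out in full (my assumption, so that it parses the same way across pandas versions); the variable names first_day, independent_max_dir and dir_at_max_speed are made up for illustration.

```python
import pandas as pd

# Sample data from the question, with full HH:MM:SS timestamps (assumption).
df = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-01 09:30:00', '2019-01-01 10:00:00',
                            '2019-01-02 04:30:00', '2019-01-02 05:00:00',
                            '2019-07-04 02:00:00']),
    'windSpeedHigh': [155, 90, 35, 45, 15],
    'windSpeedHigh_Dir': ['NE', 'NNW', 'SW', 'W', 'S'],
})

# Rows that belong to 2019-01-01 only.
first_day = df[df['date'].dt.normalize() == pd.Timestamp('2019-01-01')]

# max() works on each column independently; for strings it returns the
# lexicographically largest value, so 'NNW' wins over 'NE'.
independent_max_dir = first_day['windSpeedHigh_Dir'].max()

# The direction recorded on the row that actually holds the highest speed.
dir_at_max_speed = first_day.loc[first_day['windSpeedHigh'].idxmax(),
                                 'windSpeedHigh_Dir']

print(independent_max_dir, dir_at_max_speed)
```

The two values differ because the string maximum has nothing to do with the row where the speed peaks.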

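So my question is: can I resample this dataset from half-hourly maximum wind speed to daily maximum wind speed while keeping the wind direction associated with that speed?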

First group by calendar date and use SeriesGroupBy.idxmax to get the index label of each day's maximum, then select those rows:

df_daily = df.loc[df.groupby(df['date'].dt.date)['windSpeedHigh'].idxmax()]
print (df_daily)
                 date  windSpeedHigh windSpeedHigh_Dir
0 2019-01-01 09:30:00            155                NE
3 2019-01-02 05:00:00             45                 W
4 2019-07-04 02:00:00             15                 S
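If you prefer not to use idxmax, one possible alternative (a different technique from the one above, shown only as a sketch) is to sort by speed in descending order and keep the first row per day with DataFrame.drop_duplicates. The helper column name day and the result name daily are assumptions for this example, and the sample timestamps are written out in full.

```python
import pandas as pd

# Sample data from the question, with full HH:MM:SS timestamps (assumption).
df = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-01 09:30:00', '2019-01-01 10:00:00',
                            '2019-01-02 04:30:00', '2019-01-02 05:00:00',
                            '2019-07-04 02:00:00']),
    'windSpeedHigh': [155, 90, 35, 45, 15],
    'windSpeedHigh_Dir': ['NE', 'NNW', 'SW', 'W', 'S'],
})

daily = (
    df.assign(day=df['date'].dt.normalize())         # hypothetical helper column
      .sort_values('windSpeedHigh', ascending=False)  # highest speeds first
      .drop_duplicates('day')                         # keep the top row per day
      .sort_values('day')                             # back to chronological order
      .set_index('day')
)

print(daily)
```

Like the idxmax version, this keeps whole rows, so the direction stays attached to the speed it was recorded with. Days without any observation do not appear in the result.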

Then add a DatetimeIndex using DataFrame.set_index with Series.dt.normalize, and fill in the missing days with DataFrame.asfreq:

df_daily = df_daily.set_index(df_daily['date'].dt.normalize().rename('day')).asfreq('D')
print (df_daily)
                          date  windSpeedHigh windSpeedHigh_Dir
day                                                            
2019-01-01 2019-01-01 09:30:00          155.0                NE
2019-01-02 2019-01-02 05:00:00           45.0                 W
2019-01-03                 NaT            NaN               NaN
2019-01-04                 NaT            NaN               NaN
2019-01-05                 NaT            NaN               NaN
                       ...            ...               ...
2019-06-30                 NaT            NaN               NaN
2019-07-01                 NaT            NaN               NaN
2019-07-02                 NaT            NaN               NaN
2019-07-03                 NaT            NaN               NaN
2019-07-04 2019-07-04 02:00:00           15.0                 S

[185 rows x 3 columns]
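The two steps can be wrapped into one reusable function. This is only a sketch: the function name daily_max_with_direction and its parameters are hypothetical, and it assumes the time column already has a datetime dtype.

```python
import pandas as pd

def daily_max_with_direction(df, time_col='date', value_col='windSpeedHigh'):
    """Return one row per calendar day: the row holding that day's maximum."""
    # Index label of the row with the largest value within each calendar day.
    idx = df.groupby(df[time_col].dt.normalize())[value_col].idxmax()
    winners = df.loc[idx]
    # Index by midnight of each day, then add empty rows for missing days.
    day_index = winners[time_col].dt.normalize().rename('day')
    return winners.set_index(day_index).asfreq('D')

# Sample data from the question, with full HH:MM:SS timestamps (assumption).
df = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-01 09:30:00', '2019-01-01 10:00:00',
                            '2019-01-02 04:30:00', '2019-01-02 05:00:00',
                            '2019-07-04 02:00:00']),
    'windSpeedHigh': [155, 90, 35, 45, 15],
    'windSpeedHigh_Dir': ['NE', 'NNW', 'SW', 'W', 'S'],
})

daily = daily_max_with_direction(df)
print(daily.dropna(subset=['windSpeedHigh']))
```

Calling dropna at the end is optional; it only hides the days that had no observations.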

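Your resample solution can also work, but it needs a custom function, because idxmax fails on the missing values (days with no observations); then use DataFrame.join to attach the matching rows: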

import numpy as np

# Return NaN for days with no observations, where idxmax would fail.
f = lambda x: x.idxmax() if len(x) > 0 else np.nan
df_daily = df.resample('D', on='date')['windSpeedHigh'].agg(f).to_frame('idx').join(df, on='idx')

print (df_daily)
            idx                date  windSpeedHigh windSpeedHigh_Dir
date                                                                
2019-01-01  0.0 2019-01-01 09:30:00          155.0                NE
2019-01-02  3.0 2019-01-02 05:00:00           45.0                 W
2019-01-03  NaN                 NaT            NaN               NaN
2019-01-04  NaN                 NaT            NaN               NaN
2019-01-05  NaN                 NaT            NaN               NaN
        ...                 ...            ...               ...
2019-06-30  NaN                 NaT            NaN               NaN
2019-07-01  NaN                 NaT            NaN               NaN
2019-07-02  NaN                 NaT            NaN               NaN
2019-07-03  NaN                 NaT            NaN               NaN
2019-07-04  4.0 2019-07-04 02:00:00           15.0                 S

[185 rows x 4 columns]
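The joined frame still carries the helper column idx, and the original timestamp column has the same name as the index. A possible cleanup, as a sketch only: the names first_max_label, tidy, observed_only and time_of_max are my own and not part of the solution above, and the sample timestamps are written out in full.

```python
import numpy as np
import pandas as pd

# Sample data from the question, with full HH:MM:SS timestamps (assumption).
df = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-01 09:30:00', '2019-01-01 10:00:00',
                            '2019-01-02 04:30:00', '2019-01-02 05:00:00',
                            '2019-07-04 02:00:00']),
    'windSpeedHigh': [155, 90, 35, 45, 15],
    'windSpeedHigh_Dir': ['NE', 'NNW', 'SW', 'W', 'S'],
})

def first_max_label(x):
    # Hypothetical named version of the lambda: NaN for empty days.
    return x.idxmax() if len(x) > 0 else np.nan

joined = (df.resample('D', on='date')['windSpeedHigh']
            .agg(first_max_label)
            .to_frame('idx')
            .join(df, on='idx'))

# Drop the helper column and give the timestamp column a distinct name.
tidy = joined.drop(columns='idx').rename(columns={'date': 'time_of_max'})

# Optionally keep only the days that actually had observations.
observed_only = tidy.dropna(subset=['windSpeedHigh'])
print(observed_only)
```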