Resampling boolean values in pandas

Question

I've run into a behavior that I find peculiar when resampling booleans in pandas. Here is some time series data:

import pandas as pd
import numpy as np

dr = pd.date_range('01-01-2020 5:00', periods=10, freq='H')
df = pd.DataFrame({'Bools':[True,True,False,False,False,True,True,np.nan,np.nan,False],
                   "Nums":range(10)},
                  index=dr)

So the data look like:

                     Bools  Nums
2020-01-01 05:00:00   True     0
2020-01-01 06:00:00   True     1
2020-01-01 07:00:00  False     2
2020-01-01 08:00:00  False     3
2020-01-01 09:00:00  False     4
2020-01-01 10:00:00   True     5
2020-01-01 11:00:00   True     6
2020-01-01 12:00:00    NaN     7
2020-01-01 13:00:00    NaN     8
2020-01-01 14:00:00  False     9

I had thought I could do simple operations (like a sum) on the boolean column when resampling, but as-is this fails:

>>> df.resample('5H').sum()

                    Nums
2020-01-01 05:00:00    10
2020-01-01 10:00:00    35

The 'Bools' column is dropped. My impression is that this happens because the dtype of the column is object. Changing the dtype remedies the issue:

>>> r = df.resample('5H')
>>> copy = df.copy() #just doing this to preserve df for the example
>>> copy['Bools'] = copy['Bools'].astype(float)
>>> copy.resample('5H').sum()

                     Bools  Nums
2020-01-01 05:00:00    2.0    10
2020-01-01 10:00:00    2.0    35

But (oddly enough) you can still sum the booleans by indexing the resample object, without changing the dtype:

>>> r = df.resample('5H')
>>> r['Bools'].sum()

2020-01-01 05:00:00    2
2020-01-01 10:00:00    2
Freq: 5H, Name: Bools, dtype: int64

And if the booleans are the only column, you can still resample (even though the column is still object):

>>> df.drop(['Nums'],axis=1).resample('5H').sum()

                    Bools
2020-01-01 05:00:00      2
2020-01-01 10:00:00      2

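What allows the latter two examples to work? I can see that they are maybe a little more explicit ("Please, I really want to resample this column!"), but I don't see why the original resample doesn't allow the operation if it can be done.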

Answer 1

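df.resample('5H').sum() does not work on the Bools column because the column holds mixed data types, which pandas stores as object. When sum() is called on a resample or groupby object, columns of object dtype are ignored, as the output in the question shows.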
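A minimal sketch of that explanation, reusing the question's data. It checks the column's dtype, shows that a numeric-only selection leaves out Bools, and repeats the float cast from the question. The variable names are mine, and `select_dtypes` is used only to illustrate which columns count as numeric; it is not claimed to be the call that resample makes internally. How non-numeric columns are handled by default differs between pandas versions, so the sketch does not rely on that default.

```python
import numpy as np
import pandas as pd

# Same data as in the question. Timedelta objects are used for the frequencies
# only because the spelling of the hourly alias ('H' vs 'h') depends on the
# pandas version.
dr = pd.date_range("2020-01-01 05:00", periods=10, freq=pd.Timedelta(hours=1))
df = pd.DataFrame(
    {
        "Bools": [True, True, False, False, False, True, True, np.nan, np.nan, False],
        "Nums": range(10),
    },
    index=dr,
)

# True/False mixed with NaN cannot be stored as plain bool, so the column is object.
bools_dtype = df["Bools"].dtype

# A numeric-only selection keeps "Nums" and leaves out "Bools".
numeric_columns = list(df.select_dtypes(include="number").columns)

# Casting to float (NaN stays NaN) makes the column numeric, so it takes part
# in the resampled sum; NaN entries are skipped by sum().
as_float = df.astype({"Bools": float})
totals = as_float.resample(pd.Timedelta(hours=5)).sum()

print(bools_dtype, numeric_columns)
print(totals)
```

The cast is done on a copy (`astype` returns a new frame), so `df` itself keeps its object column.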

Answer 2

Well, tracing the calls shows that:

df.resample('5H')['Bools'].sum == Groupby.sum (in pd.core.groupby.generic.SeriesGroupBy)

df.resample('5H').sum == sum (in pandas.core.resample.DatetimeIndexResampler)

and tracing groupby_function in groupby.py shows that it is equivalent to r.agg(lambda x: np.sum(x, axis=r.axis)) where r = df.resample('5H'), which outputs:

                     Bools  Nums  Nums2
2020-01-01 05:00:00      2    10     10
2020-01-01 10:00:00      2    35     35

Hmm, actually it should be r = df.resample('5H')['Bools'] (only for the case above).

And tracing the _downsample function in resample.py shows that it is equivalent to df.groupby(r.grouper, axis=r.axis).agg(np.sum), which outputs:

                     Nums  Nums2
2020-01-01 05:00:00    10     10
2020-01-01 10:00:00    35     35
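The column-selected path described above can be tried out roughly as follows. This is a sketch on the question's data, not a reproduction of pandas internals: it compares the built-in sum on the selected column with an explicit np.sum lambda, and it leaves out the `axis=r.axis` argument on the assumption that a single selected column is one-dimensional. The variable names are mine, and the result dtype may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Same data as in the question; Timedelta objects avoid depending on the
# version-specific spelling of the hourly frequency alias.
dr = pd.date_range("2020-01-01 05:00", periods=10, freq=pd.Timedelta(hours=1))
df = pd.DataFrame(
    {
        "Bools": [True, True, False, False, False, True, True, np.nan, np.nan, False],
        "Nums": range(10),
    },
    index=dr,
)

r = df.resample(pd.Timedelta(hours=5))

# Select the column first, then use the built-in sum.
selected_sum = r["Bools"].sum()

# Same selection, aggregated with an explicit function applied to each bin.
lambda_sum = r["Bools"].agg(lambda x: np.sum(x))

print(selected_sum)
print(lambda_sum)
```

Both calls count two True values in each 5-hour bin, which matches the r['Bools'].sum() output shown in the question.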
