如何根据过滤计算数据框列中的值

how to count value in a dataframe column based on filtering

鉴于此数据框:

DriverId    time                         SPEED
0           2021-04-16 21:40:00+00:00   58.500000
            2021-04-16 21:41:00+00:00   32.850000
            2021-04-16 21:42:00+00:00   89.633333
            2021-04-16 21:43:00+00:00   88.166667
            2021-04-16 21:44:00+00:00   118.016667
... ... ...
88          2021-04-27 07:30:00+00:00   79.566667
            2021-04-27 07:31:00+00:00   59.383333
            2021-04-27 07:32:00+00:00   89.133333
            2021-04-27 07:33:00+00:00   59.966667
            2021-04-27 07:34:00+00:00   25.72413

我想添加列来计算每个车手的速度低于 40 km/h 的数量,所以我试过了:

y[y.SPEED<40].count()

它显示了这个:

    SPEED    4721
    dtype: int64

这不是我想要的,expexted结果必须是这样的:

  DriverId        SPEED         count 
      0            15.20            2
                   32.850000 
                   89.633333
                  88.166667
                  118.016667
... ... ...
88              79.566667          1
                59.383333
                89.133333
                59.966667
                25.72413

我的数据框是一个系列,我将其转换为数据框

 y.info()
    <class 'pandas.core.frame.DataFrame'>
MultiIndex: 15082 entries, (0, Timestamp('2021-04-16 21:40:00+0000', tz='UTC')) to (88, Timestamp('2021-04-27 07:34:00+0000', tz='UTC'))
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   SPEED   15082 non-null  float64
dtypes: float64(1)
memory usage: 922.5 KB

首先,我会在每一行中都有 DriverId 而不是仅在组的第一行中,然后尝试以下操作:

y["Count of speed<40 for given driver"]=[sum((y.Driver==x) & (y["Speed"]<40)) for x in y.Driver]
df = pd.DataFrame([['0','2021-04-16 21:40:00+00:00',58.500000],
    ['0','2021-04-16 21:41:00+00:00', 32.850000],#FIRST ONE
    ['0','2021-04-16 21:42:00+00:00', 15.633333],#SECOND ONE
    ['0','2021-04-16 21:43:00+00:00', 88.166667],
    ['0','2021-04-16 21:44:00+00:00',118.016667],
    ['88','[2021-04-27 07:30:00+00:00',79.566667],
    ['88','2021-04-27 07:31:00+00:00',59.383333],
    ['88','2021-04-27 07:32:00+00:00',89.133333],
    ['88','2021-04-27 07:33:00+00:00',59.966667],
    ['88','2021-04-27 07:34:00+00:00',25.72413] # THIRD ONE
  ],columns=['driver_id','time','speed'])
df = df.set_index("driver_id")
counts = df[df['speed'] < 40].groupby(["driver_id",],as_index=False).agg(
    count_col=pd.NamedAgg(column="speed", aggfunc="count")
)
merged_Frame = pd.merge(df, counts, on = 'driver_id', how='inner')

输出

driver_id   time                   speed        count_col
0   0   2021-04-16 21:40:00+00:00   58.500000   2
1   0   2021-04-16 21:41:00+00:00   32.850000   2
2   0   2021-04-16 21:42:00+00:00   15.633333   2
3   0   2021-04-16 21:43:00+00:00   88.166667   2
4   0   2021-04-16 21:44:00+00:00   118.016667  2
5   88  [2021-04-27 07:30:00+00:00  79.566667   1
6   88  2021-04-27 07:31:00+00:00   59.383333   1
7   88  2021-04-27 07:32:00+00:00   89.133333   1
8   88  2021-04-27 07:33:00+00:00   59.966667   1
9   88  2021-04-27 07:34:00+00:00   25.724130   1

参考

  1. pd.NamedAgg

编辑

import pandas as pd

df = pd.DataFrame([['0','2021-04-16 21:40:00+00:00',58.500000],
    ['0','2021-04-16 21:41:00+00:00', 32.850000],#FIRST ONE
    ['0','2021-04-16 21:42:00+00:00', 15.633333],#SECOND ONE
    ['0','2021-04-16 21:43:00+00:00', 88.166667],
    ['0','2021-04-16 21:44:00+00:00',118.016667],
    ['88','[2021-04-27 07:30:00+00:00',79.566667],
    ['88','2021-04-27 07:31:00+00:00',59.383333],
    ['88','2021-04-27 07:32:00+00:00',89.133333],
    ['88','2021-04-27 07:33:00+00:00',59.966667],
    ['88','2021-04-27 07:34:00+00:00',25.72413] # THIRD ONE
  ],columns=['driver_id','time','speed'])
df = df.set_index(['driver_id', 'time'])
df['count'] = df[df['speed'] < 40].groupby('driver_id')['speed'].transform('count')

输出