How to count values in a DataFrame column based on a filter
Given this DataFrame:
DriverId time SPEED
0 2021-04-16 21:40:00+00:00 58.500000
2021-04-16 21:41:00+00:00 32.850000
2021-04-16 21:42:00+00:00 89.633333
2021-04-16 21:43:00+00:00 88.166667
2021-04-16 21:44:00+00:00 118.016667
... ... ...
88 2021-04-27 07:30:00+00:00 79.566667
2021-04-27 07:31:00+00:00 59.383333
2021-04-27 07:32:00+00:00 89.133333
2021-04-27 07:33:00+00:00 59.966667
2021-04-27 07:34:00+00:00 25.72413
I want to add a column that counts, for each driver, the number of speed readings below 40 km/h, so I tried:
y[y.SPEED<40].count()
It returned this:
SPEED 4721
dtype: int64
That is not what I want. The expected result should look like this:
DriverId SPEED count
0 15.20 2
32.850000
89.633333
88.166667
118.016667
... ... ...
88 79.566667 1
59.383333
89.133333
59.966667
25.72413
My DataFrame was originally a Series, which I converted to a DataFrame:
y.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 15082 entries, (0, Timestamp('2021-04-16 21:40:00+0000', tz='UTC')) to (88, Timestamp('2021-04-27 07:34:00+0000', tz='UTC'))
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SPEED 15082 non-null float64
dtypes: float64(1)
memory usage: 922.5 KB
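The `y.info()` output shows that DriverId is a level of a MultiIndex rather than a column, which is why filtering and calling `count()` gives one total for the whole frame. As a minimal sketch, assuming the index levels are named `DriverId` and `time` (the small frame below is made-up sample data, not the real 15082 rows), grouping on the index level gives one count per driver:

```python
import pandas as pd

# Hypothetical miniature of the frame in the question:
# a (DriverId, time) MultiIndex and a single SPEED column.
index = pd.MultiIndex.from_tuples(
    [
        (0, pd.Timestamp("2021-04-16 21:40", tz="UTC")),
        (0, pd.Timestamp("2021-04-16 21:41", tz="UTC")),
        (0, pd.Timestamp("2021-04-16 21:42", tz="UTC")),
        (88, pd.Timestamp("2021-04-27 07:33", tz="UTC")),
        (88, pd.Timestamp("2021-04-27 07:34", tz="UTC")),
    ],
    names=["DriverId", "time"],
)
y = pd.DataFrame({"SPEED": [58.5, 32.85, 15.2, 59.966667, 25.72413]}, index=index)

# The attempt from the question: one total for the whole frame.
total = y[y.SPEED < 40].count()

# Grouping the boolean mask by the DriverId index level and summing it
# counts the True values (readings below 40) separately for each driver.
per_driver = (y["SPEED"] < 40).groupby(level="DriverId").sum()
print(per_driver)
```

Summing the mask instead of filtering first also keeps drivers that have no readings below 40; they simply get a count of 0.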
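First, I would make sure DriverId appears in every row rather than only in the first row of each group, then try the following:
y = y.reset_index()  # turn the DriverId and time index levels into regular columns
y["Count of speed<40 for given driver"] = [sum((y.DriverId == x) & (y["SPEED"] < 40)) for x in y.DriverId]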
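The list comprehension rescans the whole frame once per row, which can get slow on about 15,000 rows. A possible vectorized variant is sketched below, assuming the index levels are named `DriverId` and `time`; the sample values and the `slow_count` column name are illustrative only.

```python
import pandas as pd

# Hypothetical sample with the same shape as the question's frame.
index = pd.MultiIndex.from_tuples(
    [
        (0, pd.Timestamp("2021-04-16 21:40", tz="UTC")),
        (0, pd.Timestamp("2021-04-16 21:41", tz="UTC")),
        (0, pd.Timestamp("2021-04-16 21:42", tz="UTC")),
        (88, pd.Timestamp("2021-04-27 07:33", tz="UTC")),
        (88, pd.Timestamp("2021-04-27 07:34", tz="UTC")),
    ],
    names=["DriverId", "time"],
)
y = pd.DataFrame({"SPEED": [58.5, 32.85, 15.2, 59.966667, 25.72413]}, index=index)

# Move the index levels into columns so DriverId is present on every row.
flat = y.reset_index()

# transform("sum") returns one value per original row, so each row
# receives the number of below-40 readings of its own driver.
is_slow = (flat["SPEED"] < 40).astype("int64")
flat["slow_count"] = is_slow.groupby(flat["DriverId"]).transform("sum")
print(flat)
```

This should give the same numbers as the comprehension while grouping the data only once.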
import pandas as pd

# Sample data; driver_id stays a regular column so it can serve as the groupby and merge key
df = pd.DataFrame([['0','2021-04-16 21:40:00+00:00',58.500000],
['0','2021-04-16 21:41:00+00:00', 32.850000],  # first value below 40
['0','2021-04-16 21:42:00+00:00', 15.633333],  # second value below 40
['0','2021-04-16 21:43:00+00:00', 88.166667],
['0','2021-04-16 21:44:00+00:00',118.016667],
['88','2021-04-27 07:30:00+00:00',79.566667],
['88','2021-04-27 07:31:00+00:00',59.383333],
['88','2021-04-27 07:32:00+00:00',89.133333],
['88','2021-04-27 07:33:00+00:00',59.966667],
['88','2021-04-27 07:34:00+00:00',25.72413]  # third value below 40
],columns=['driver_id','time','speed'])
# Count the rows below 40 km/h for each driver
counts = df[df['speed'] < 40].groupby(["driver_id"], as_index=False).agg(
    count_col=pd.NamedAgg(column="speed", aggfunc="count")
)
# Attach each driver's count to every one of that driver's rows
merged_Frame = pd.merge(df, counts, on='driver_id', how='inner')
Output
  driver_id                       time       speed  count_col
0         0  2021-04-16 21:40:00+00:00   58.500000          2
1         0  2021-04-16 21:41:00+00:00   32.850000          2
2         0  2021-04-16 21:42:00+00:00   15.633333          2
3         0  2021-04-16 21:43:00+00:00   88.166667          2
4         0  2021-04-16 21:44:00+00:00  118.016667          2
5        88  2021-04-27 07:30:00+00:00   79.566667          1
6        88  2021-04-27 07:31:00+00:00   59.383333          1
7        88  2021-04-27 07:32:00+00:00   89.133333          1
8        88  2021-04-27 07:33:00+00:00   59.966667          1
9        88  2021-04-27 07:34:00+00:00   25.724130          1
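One thing to watch with the merge approach: the counts table only contains drivers that have at least one reading below 40, so an inner merge drops every other driver's rows. If those rows should stay with a count of 0, a left merge followed by `fillna` is one option. The sketch below uses made-up data, and driver `'99'` is a hypothetical driver with no slow readings.

```python
import pandas as pd

# Made-up sample; driver '99' never goes below 40 km/h.
df = pd.DataFrame(
    [
        ["0", 58.5],
        ["0", 32.85],
        ["0", 15.633333],
        ["88", 79.566667],
        ["88", 25.72413],
        ["99", 61.0],
    ],
    columns=["driver_id", "speed"],
)

# Per-driver count of readings below 40; driver '99' is absent from this table.
counts = (
    df[df["speed"] < 40]
    .groupby("driver_id", as_index=False)
    .agg(count_col=("speed", "count"))
)

# how="left" keeps every row of df; drivers missing from counts get NaN,
# which is then replaced by 0 and cast back to an integer.
merged = df.merge(counts, on="driver_id", how="left")
merged["count_col"] = merged["count_col"].fillna(0).astype(int)
print(merged)
```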
Edit:
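import pandas as pd
df = pd.DataFrame([['0','2021-04-16 21:40:00+00:00',58.500000],
['0','2021-04-16 21:41:00+00:00', 32.850000],  # first value below 40
['0','2021-04-16 21:42:00+00:00', 15.633333],  # second value below 40
['0','2021-04-16 21:43:00+00:00', 88.166667],
['0','2021-04-16 21:44:00+00:00',118.016667],
['88','2021-04-27 07:30:00+00:00',79.566667],
['88','2021-04-27 07:31:00+00:00',59.383333],
['88','2021-04-27 07:32:00+00:00',89.133333],
['88','2021-04-27 07:33:00+00:00',59.966667],
['88','2021-04-27 07:34:00+00:00',25.72413]  # third value below 40
],columns=['driver_id','time','speed'])
# Rebuild the (driver_id, time) MultiIndex used in the question
df = df.set_index(['driver_id', 'time'])
# Sum the boolean mask per driver and broadcast the result to every row.
# Filtering the frame before transform would leave NaN on rows at or above 40.
df['count'] = (df['speed'] < 40).astype(int).groupby(level='driver_id').transform('sum')
Output: the count column is 2 on every row of driver_id '0' and 1 on every row of driver_id '88'.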
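One detail worth checking with this pattern: a `transform` computed on an already-filtered frame only returns values for the rows that passed the filter, so assigning it back by index leaves NaN everywhere else. The sketch below contrasts the two variants on made-up data; the column names `count_filtered` and `count_all_rows` are illustrative.

```python
import pandas as pd

# Made-up sample with a (driver_id, time) MultiIndex.
df = pd.DataFrame(
    {
        "driver_id": ["0", "0", "0", "88", "88"],
        "time": pd.date_range("2021-04-16 21:40", periods=5, freq="min", tz="UTC"),
        "speed": [58.5, 32.85, 15.633333, 59.966667, 25.72413],
    }
).set_index(["driver_id", "time"])

slow = df["speed"] < 40

# Variant 1: filter first, then transform. Only the slow rows get a value;
# the assignment aligns on the index and fills the other rows with NaN.
df["count_filtered"] = df[slow].groupby("driver_id")["speed"].transform("count")

# Variant 2: sum the mask per driver. Every row gets its driver's count.
df["count_all_rows"] = slow.astype("int64").groupby(level="driver_id").transform("sum")
print(df)
```

Which variant fits depends on whether the count is wanted on every row of a driver or only on the rows that are themselves below 40.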