Getting total count of another column using a specific day in a month?
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count PULocationID DOLocationID fare_amount
0 1.0 2020-01-01 00:28:15 2020-01-01 00:33:03 1.0 238 239 6.0
1 1.0 2020-01-01 00:35:39 2020-01-01 00:43:04 1.0 239 238 7.0
2 1.0 2020-01-01 00:47:41 2020-01-01 00:53:52 1.0 238 238 6.0
3 1.0 2020-01-01 00:55:23 2020-01-01 01:00:14 1.0 238 151 5.5
4 2.0 2020-01-01 00:01:58 2020-01-01 00:04:16 1.0 193 193 3.5
5 2.0 2020-01-01 00:09:44 2020-01-01 00:10:37 1.0 7 193 2.5
6 2.0 2020-01-01 00:39:25 2020-01-01 00:39:29 1.0 193 193 2.5
7 1.0 2020-01-01 00:29:01 2020-01-01 00:40:28 2.0 246 48 8.0
8 1.0 2020-01-01 00:55:11 2020-01-01 01:12:03 2.0 246 79 12.0
9 1.0 2020-01-01 00:37:15 2020-01-01 00:51:41 1.0 163 161 9.5
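I have data for January 2020 (it spans the whole month; this is just a snippet), and I want to answer questions like 'Saturday is the busiest day in terms of passenger pickups.'
How can I do that?
The columns labelled 'tpep_pickup_datetime' and 'tpep_dropoff_datetime' have the object dtype.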
To get a better sample, the dates in the tpep_pickup_datetime column were changed so that the rows fall on different days:
print (df)
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count \
0 1.0 2020-01-01 00:28:15 2020-01-01 00:33:03 1.0
1 1.0 2020-01-02 00:35:39 2020-01-01 00:43:04 1.0
2 1.0 2020-01-02 00:47:41 2020-01-01 00:53:52 1.0
3 1.0 2020-01-03 00:55:23 2020-01-01 01:00:14 1.0
4 2.0 2020-01-03 00:01:58 2020-01-01 00:04:16 1.0
5 2.0 2020-01-03 00:09:44 2020-01-01 00:10:37 1.0
6 2.0 2020-01-04 00:39:25 2020-01-01 00:39:29 1.0
7 1.0 2020-01-04 00:29:01 2020-01-01 00:40:28 2.0
8 1.0 2020-01-04 00:55:11 2020-01-01 01:12:03 2.0
9 1.0 2020-01-05 00:37:15 2020-01-01 00:51:41 1.0
PULocationID DOLocationID fare_amount
0 238 239 6.0
1 239 238 7.0
2 238 238 6.0
3 238 151 5.5
4 193 193 3.5
5 7 193 2.5
6 193 193 2.5
7 246 48 8.0
8 246 79 12.0
9 163 161 9.5
Convert the column to datetimes, get the day names with Series.dt.day_name, and aggregate with sum:
df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
df['day'] = df['tpep_pickup_datetime'].dt.day_name()
s = df.groupby('day')['passenger_count'].sum()
print (s)
day
Friday 3.0
Saturday 5.0
Sunday 1.0
Thursday 2.0
Wednesday 1.0
Name: passenger_count, dtype: float64
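For anyone who wants to reproduce the totals above without the full file, here is a minimal self-contained sketch. It rebuilds only the two columns that matter from the modified sample; constructing the frame by hand is an assumption made for illustration, since the real data would normally be read from a file.

```python
import pandas as pd

# Only the two columns used in the aggregation, copied from the modified sample.
df = pd.DataFrame({
    "tpep_pickup_datetime": [
        "2020-01-01 00:28:15", "2020-01-02 00:35:39", "2020-01-02 00:47:41",
        "2020-01-03 00:55:23", "2020-01-03 00:01:58", "2020-01-03 00:09:44",
        "2020-01-04 00:39:25", "2020-01-04 00:29:01", "2020-01-04 00:55:11",
        "2020-01-05 00:37:15",
    ],
    "passenger_count": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 1.0],
})

# Strings -> datetime64, then weekday names such as "Saturday".
df["tpep_pickup_datetime"] = pd.to_datetime(df["tpep_pickup_datetime"])
df["day"] = df["tpep_pickup_datetime"].dt.day_name()

# Total passengers picked up on each weekday.
s = df.groupby("day")["passenger_count"].sum()
print(s)
```

Note that groupby sorts the day names alphabetically, which is why Friday is listed first in the output.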
Then use Series.idxmax to get the index label of the maximum value, and max to get the maximum value itself:
print (s.idxmax())
Saturday
print (s.max())
5.0
If you need both, you can use Series.agg:
print (s.agg(['idxmax','max']))
idxmax Saturday
max 5
Name: passenger_count, dtype: object
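If this check is needed more than once, the steps can be wrapped in a small helper. This is only a sketch: the function name busiest_day and its parameters are hypothetical, not part of pandas. It groups by the day names directly instead of adding a 'day' column, so the original frame is left unchanged.

```python
import pandas as pd

def busiest_day(frame, when="tpep_pickup_datetime", what="passenger_count"):
    """Return (weekday name, total) for the weekday with the largest total.

    Hypothetical helper; the column names default to the ones in the question.
    """
    # Parse the timestamps on the fly so the caller's frame is not modified.
    days = pd.to_datetime(frame[when]).dt.day_name()
    totals = frame[what].groupby(days).sum()
    # idxmax returns the first label if several days tie for the maximum.
    return totals.idxmax(), totals.max()

# Three rows taken from the modified sample: one Friday, two Saturday pickups.
sample = pd.DataFrame({
    "tpep_pickup_datetime": [
        "2020-01-03 00:55:23", "2020-01-04 00:39:25", "2020-01-04 00:29:01",
    ],
    "passenger_count": [1.0, 1.0, 2.0],
})

day, total = busiest_day(sample)
print(day, total)
```

Passing a different column as what, for example fare_amount, would give the weekday with the highest total fares instead.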