Calculate a timedelta between rows as the difference between the maximum and the minimum time for each element of a second column
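I want to compute the distribution of items according to the time span between each item's first order and its last order.
To get there, however, I first have to obtain the time delta for each item.
My initial data frame has three columns: "Order_ID", "Order_DATE", and "Medium_ID", as in this example: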
import pandas as pd

df = pd.DataFrame({
    'Medium_ID': {0: '1359',
                  1: '1360',
                  2: '1359',
                  3: '1360',
                  4: '1360',
                  5: '1404',
                  6: '1381',
                  7: '1359',
                  8: '1419',
                  9: '1360'},
    'Order_ID': {0: '1',
                 1: '2',
                 2: '3',
                 3: '4',
                 4: '5',
                 5: '6',
                 6: '7',
                 7: '8',
                 8: '9',
                 9: '10'},
    'Order_DATE': {0: pd.Timestamp('2008-04-21 00:00:00'),
                   1: pd.Timestamp('2008-04-21 00:00:00'),
                   2: pd.Timestamp('2008-04-21 00:00:00'),
                   3: pd.Timestamp('2008-04-21 00:00:00'),
                   4: pd.Timestamp('2008-04-22 00:00:00'),
                   5: pd.Timestamp('2008-04-22 00:00:00'),
                   6: pd.Timestamp('2008-04-23 00:00:00'),
                   7: pd.Timestamp('2008-04-23 00:00:00'),
                   8: pd.Timestamp('2008-04-23 00:00:00'),
                   9: pd.Timestamp('2008-04-28 00:00:00')}})
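Because the same Medium_ID can have several Order_IDs, I first tried grouping by the "Medium_ID" column, but I don't know how to proceed from there.
I would like a new data frame with two columns, "Medium_ID" and "Days_between_the_last_and_the_first-order", and finally to display the distribution of the "Days_between_the_last_and_the_first-order" series.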
To get the number of days between the last and the first order date, you can try this.
# Sort by date within each Medium_ID so "first" and "last" are the
# earliest and latest order dates of the group.
grouped = (
    df.drop("Order_ID", axis=1)
    .sort_values(["Medium_ID", "Order_DATE"])
    .groupby("Medium_ID")
    .agg(["first", "last"])
)
# Flatten the (Order_DATE, first) / (Order_DATE, last) column MultiIndex.
grouped.columns = ["first_order_date", "last_order_date"]
grouped.reset_index(inplace=True)
grouped["days_between_last_and_first_order"] = (
    grouped["last_order_date"] - grouped["first_order_date"]
).dt.days
grouped = grouped[["Medium_ID", "days_between_last_and_first_order"]]
Alternatively, building on @Franco's solution:
grouped = (
    df.groupby("Medium_ID")["Order_DATE"]
    .apply(lambda x: x.max() - x.min())
    .to_frame()
    .reset_index()
    .rename({"Order_DATE": "days_between_last_and_first_order"}, axis=1)
)
grouped["days_between_last_and_first_order"] = grouped["days_between_last_and_first_order"].dt.days
To visualize the distribution:
grouped.hist(column="days_between_last_and_first_order")
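If a plot is not convenient, the same distribution could also be inspected as a table of counts. The following is only a sketch under some assumptions: it rebuilds the question's sample data without "Order_ID", and the names span_days and distribution are illustrative rather than taken from the answers above.

```python
import pandas as pd

# Sample data mirroring the question (Order_ID left out; it is not needed here).
df = pd.DataFrame({
    "Medium_ID": ["1359", "1360", "1359", "1360", "1360",
                  "1404", "1381", "1359", "1419", "1360"],
    "Order_DATE": pd.to_datetime([
        "2008-04-21", "2008-04-21", "2008-04-21", "2008-04-21", "2008-04-22",
        "2008-04-22", "2008-04-23", "2008-04-23", "2008-04-23", "2008-04-28",
    ]),
})

# Days between the latest and earliest order of each Medium_ID.
orders = df.groupby("Medium_ID")["Order_DATE"]
span_days = (orders.max() - orders.min()).dt.days

# How many items share each span, ordered by span length.
distribution = span_days.value_counts().sort_index()
print(distribution)
```

Here value_counts() plays the role of the histogram: each index entry is a span in days and each value is the number of items with that span.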
You can compute the number of days between the first and the last order of each item, for example:
df.groupby('Medium_ID').Order_DATE.apply(lambda x: x.max() - x.min())
This results in:
Medium_ID
1359 2 days
1360 7 days
1381 0 days
1404 0 days
1419 0 days
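To arrive at the two-column frame the question asks for without sorting first, one option might be pandas named aggregation with "min" and "max". This is a sketch, not the answerers' code: it assumes a pandas version that supports named aggregation (0.25 or later), and the names summary and result are made up for illustration.

```python
import pandas as pd

# Sample data mirroring the question (Order_ID left out; it is not needed here).
df = pd.DataFrame({
    "Medium_ID": ["1359", "1360", "1359", "1360", "1360",
                  "1404", "1381", "1359", "1419", "1360"],
    "Order_DATE": pd.to_datetime([
        "2008-04-21", "2008-04-21", "2008-04-21", "2008-04-21", "2008-04-22",
        "2008-04-22", "2008-04-23", "2008-04-23", "2008-04-23", "2008-04-28",
    ]),
})

# Earliest and latest order date per Medium_ID in a single pass.
summary = df.groupby("Medium_ID", as_index=False).agg(
    first_order_date=("Order_DATE", "min"),
    last_order_date=("Order_DATE", "max"),
)

# Convert the Timedelta difference to whole days.
summary["days_between_last_and_first_order"] = (
    summary["last_order_date"] - summary["first_order_date"]
).dt.days

result = summary[["Medium_ID", "days_between_last_and_first_order"]]
print(result)
```

Using "min" and "max" rather than "first" and "last" means the result does not depend on the row order of df.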