How to create a ranking column based on date value range in another column?
import pandas as pd

data = [
    ["Item_1", "2020-06-01"],
    ["Item_1", "2020-06-02"],
    ["Item_1", "2020-05-27"],
    ["Item_2", "2018-04-15"],
    ["Item_2", "2018-04-18"],
    ["Item_2", "2018-04-22"],
    ["Item_2", "2018-04-28"],
]
df = pd.DataFrame(data, columns=["Item_ID", "Dates"])
df
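I have a dataset with Item_ID and Dates columns. I want to assign an ordered "rank" in a new column, where the rank/order value increases if the next date is more than 3 days after the previous date; otherwise it stays the same.
So the desired output would look like this: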
Item_ID Dates Date Order
Item_1 2020-05-27 1
Item_1 2020-06-01 2
Item_1 2020-06-02 2
Item_2 2018-04-15 1
Item_2 2018-04-18 1
Item_2 2018-04-22 2
Item_2 2018-04-28 3
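We can use groupby apply to compute the difference between consecutive dates within each group, then use cumsum to count how many of those differences are greater than (gt) 3 days: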
# Convert to datetime (if not already)
df['Dates'] = pd.to_datetime(df['Dates'])
# Sort in correct order
df = df.sort_values(['Item_ID', 'Dates'], ignore_index=True)
# Calculate Ranking per Group
# group_keys=False keeps the original row index so the result aligns with df
# (pandas 2.x otherwise prepends the group key to the index)
df['Date Order'] = (
    df.groupby('Item_ID', group_keys=False)['Dates'].apply(
        lambda s: s.diff().gt(pd.Timedelta(days=3)).cumsum() + 1
    )
)
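As a variation on the apply approach, the same lambda can be passed to groupby transform, which always returns a result indexed like the input rows. This is only a sketch under the question's sample data, not a claim that it is the better option:

```python
import pandas as pd

data = [
    ["Item_1", "2020-06-01"],
    ["Item_1", "2020-06-02"],
    ["Item_1", "2020-05-27"],
    ["Item_2", "2018-04-15"],
    ["Item_2", "2018-04-18"],
    ["Item_2", "2018-04-22"],
    ["Item_2", "2018-04-28"],
]
df = pd.DataFrame(data, columns=["Item_ID", "Dates"])
df["Dates"] = pd.to_datetime(df["Dates"])
df = df.sort_values(["Item_ID", "Dates"], ignore_index=True)

# transform returns one value per input row, aligned to df's index
df["Date Order"] = df.groupby("Item_ID")["Dates"].transform(
    lambda s: s.diff().gt(pd.Timedelta(days=3)).cumsum() + 1
)
print(df)
```

Because the result shares df's index, it can be assigned straight to a new column.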
Alternatively, perform two groupby operations, using groupby diff and then groupby cumsum:
# Convert to datetime (if not already)
df['Dates'] = pd.to_datetime(df['Dates'])
# Sort in correct order
df = df.sort_values(['Item_ID', 'Dates'], ignore_index=True)
# Reuse same Grouper
g = df.groupby('Item_ID')
# Calculate Difference per group and compare (whole Series)
df['Date Order'] = g['Dates'].diff().gt(pd.Timedelta(days=3))
# Calculate cumsum per group
df['Date Order'] = g['Date Order'].cumsum() + 1
Both produce df:
Item_ID Dates Date Order
0 Item_1 2020-05-27 1
1 Item_1 2020-06-01 2
2 Item_1 2020-06-02 2
3 Item_2 2018-04-15 1
4 Item_2 2018-04-18 1
5 Item_2 2018-04-22 2
6 Item_2 2018-04-28 3
Here is a breakdown of the steps for a single group, shown as a DataFrame:
s = pd.Series([pd.Timestamp('2020-05-27 00:00:00'),
               pd.Timestamp('2020-06-01 00:00:00'),
               pd.Timestamp('2020-06-02 00:00:00')],
              name='Dates',
              index=pd.Series({0: 'Item_1', 1: 'Item_1', 2: 'Item_1'},
                              name='Item_ID'))
steps_per_group = pd.DataFrame({
    'diff': s.diff(),
    'gt': s.diff().gt(pd.Timedelta(days=3)),
    'cumsum': s.diff().gt(pd.Timedelta(days=3)).cumsum(),
    'cumsum 1 start': s.diff().gt(pd.Timedelta(days=3)).cumsum() + 1
})
diff gt cumsum cumsum 1 start
Item_ID
Item_1 NaT False 0 1
Item_1 5 days True 1 2
Item_1 1 days False 1 2
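The same breakdown can be repeated for the Item_2 group, which is a useful check because it contains a gap of exactly 3 days. A minimal sketch, where the names s2, gap, is_new_rank and steps_item_2 are made up for illustration:

```python
import pandas as pd

# Dates of the Item_2 group from the question, already sorted
s2 = pd.Series(
    pd.to_datetime(["2018-04-15", "2018-04-18", "2018-04-22", "2018-04-28"]),
    name="Dates",
)
gap = s2.diff()                              # NaT, 3 days, 4 days, 6 days
is_new_rank = gap.gt(pd.Timedelta(days=3))   # strictly greater than 3 days
steps_item_2 = pd.DataFrame({
    "diff": gap,
    "gt": is_new_rank,
    "cumsum": is_new_rank.cumsum(),
    "cumsum 1 start": is_new_rank.cumsum() + 1,
})
print(steps_item_2)
```

The 3-day gap between 2018-04-15 and 2018-04-18 does not raise the rank, since gt is a strict comparison; ge would be the choice if a gap of exactly 3 days should count.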
Starting from your DataFrame:
>>> import pandas as pd
>>> data = [
... ["Item_1", "2020-05-27"],
... ["Item_1", "2020-06-01"],
... ["Item_1", "2020-06-02"],
... ["Item_2", "2018-04-15"],
... ["Item_2", "2018-04-18"],
... ["Item_2", "2018-04-22"],
... ["Item_2", "2018-04-28"],
... ]
>>> df = pd.DataFrame(data, columns=["Item_ID", "Dates"])
>>> df['Dates'] = pd.to_datetime(df['Dates'], format="%Y-%m-%d")
>>> df
Item_ID Dates
0 Item_1 2020-05-27
1 Item_1 2020-06-01
2 Item_1 2020-06-02
3 Item_2 2018-04-15
4 Item_2 2018-04-18
5 Item_2 2018-04-22
6 Item_2 2018-04-28
We can compute the date diff grouped by Item_ID and flag the gaps larger than window_size days, as follows:
>>> window_size = 3
>>> df['diff'] = df.groupby('Item_ID')["Dates"].diff().dt.days.gt(window_size)
>>> df
Item_ID Dates diff
0 Item_1 2020-05-27 False
1 Item_1 2020-06-01 True
2 Item_1 2020-06-02 False
3 Item_2 2018-04-15 False
4 Item_2 2018-04-18 False
5 Item_2 2018-04-22 True
6 Item_2 2018-04-28 True
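One thing to be aware of: .dt.days keeps only the whole-day component of each difference, so it can disagree with a direct Timedelta comparison once the timestamps carry a time of day. The sample data has dates only, so the timestamps below are invented purely to illustrate the difference:

```python
import pandas as pd

# Hypothetical timestamps 3 days and 12 hours apart (not from the question's data)
s = pd.Series(pd.to_datetime(["2020-05-27 00:00", "2020-05-30 12:00"]))
delta = s.diff()

# Compares only the whole-day component (3), so the gap is not flagged
by_days = delta.dt.days.gt(3)
# Compares the full duration (3 days 12 hours), so the gap is flagged
by_timedelta = delta.gt(pd.Timedelta(days=3))

print(by_days.tolist())
print(by_timedelta.tolist())
```

For date-only columns like the one in the question, both comparisons give the same flags.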
Then, grouping by Item_ID again and applying cumsum gives the expected result:
>>> df['Date Order'] = df.groupby('Item_ID')["diff"].cumsum()+1
>>> df
Item_ID Dates diff Date Order
0 Item_1 2020-05-27 False 1
1 Item_1 2020-06-01 True 2
2 Item_1 2020-06-02 False 2
3 Item_2 2018-04-15 False 1
4 Item_2 2018-04-18 False 1
5 Item_2 2018-04-22 True 2
6 Item_2 2018-04-28 True 3
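If this is needed in more than one place, the steps above could be wrapped in a small helper. The function name add_date_order and its parameters are hypothetical, chosen only for this sketch; it sorts a copy of the input so the caller's frame is left untouched:

```python
import pandas as pd


def add_date_order(frame, id_col="Item_ID", date_col="Dates", window_days=3):
    """Return a sorted copy of frame with a 'Date Order' column (hypothetical helper)."""
    out = frame.copy()
    out[date_col] = pd.to_datetime(out[date_col])
    out = out.sort_values([id_col, date_col], ignore_index=True)
    # True where the gap to the previous date in the same group exceeds the window
    is_gap = out.groupby(id_col)[date_col].diff().gt(pd.Timedelta(days=window_days))
    # Running count of gaps per group, shifted so each group starts at 1
    out["Date Order"] = is_gap.groupby(out[id_col]).cumsum() + 1
    return out


example = pd.DataFrame(
    [
        ["Item_1", "2020-06-01"],
        ["Item_1", "2020-06-02"],
        ["Item_1", "2020-05-27"],
        ["Item_2", "2018-04-15"],
        ["Item_2", "2018-04-18"],
        ["Item_2", "2018-04-22"],
        ["Item_2", "2018-04-28"],
    ],
    columns=["Item_ID", "Dates"],
)
ranked = add_date_order(example)
print(ranked)
```

Passing a different window_days, for example add_date_order(example, window_days=5), changes the gap threshold without touching the rest of the logic.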