Find the closest date for each customerId and calculate the time span
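I have a dataframe with a customerId and a date fromDate. For each customer separately, I want to calculate the time until the next delivery. For example, the customer with customerId = 1 bought something on 2021-03-18. I want to find that customer's next date, here 2021-03-22, and output the distance in days, here 4 days. Put simply, I want to calculate the next date in the future - fromDate, or n - (n-1). If a date has no later date, the result should be None; for example, 2022-01-18 should be None.
My problem is that I get a lot of None values, and I also need to look at each customer separately. How can I do this?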
Math example
n - (n-1) = next_day_in_days
e.g.
2021-03-22 - 2021-03-18 = 4
[OUT]
customerId fromDate next_day_in_days
1 1 2021-03-18 4
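The subtraction in the math example can be checked directly in pandas. This is only a minimal sketch; the variable names `current`, `following`, and `gap_in_days` are illustrative and not part of the original code.

```python
import pandas as pd

# Subtracting two Timestamps gives a Timedelta; .days is its whole-day part.
current = pd.Timestamp("2021-03-18")
following = pd.Timestamp("2021-03-22")

gap_in_days = (following - current).days
print(gap_in_days)  # 4
```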
Dataframe
customerId fromDate
0 1 2021-02-22
1 1 2021-03-18
2 1 2021-03-22
3 1 2021-02-10
4 1 2021-09-07
5 1 None
6 1 2022-01-18
7 2 2021-05-17
8 3 2021-05-17
9 3 2021-07-17
10 3 2021-02-22
11 3 2021-02-22
Code
import pandas as pd
import datetime

d = {'customerId': [1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 3],
     'fromDate': ['2021-02-22', '2021-03-18', '2021-03-22',
                  '2021-02-10', '2021-09-07', None, '2022-01-18', '2021-05-17', '2021-05-17', '2021-07-17', '2021-02-22', '2021-02-22']
    }
df = pd.DataFrame(data=d)
print(df)

def nearest(items, pivot):
    try:
        return min(items, key=lambda x: abs(x - pivot))
    except:
        return None

df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce').dt.date
df["next_day_in_days"] = df['fromDate'].apply(lambda x: nearest(df['fromDate'], x))
Output
[OUT]
customerId fromDate next_in_days
0 1 2021-02-22 None
1 1 2021-03-18 None
2 1 2021-03-22 None
3 1 2021-02-10 None
4 1 2021-09-07 None
5 1 NaT None
6 1 2022-01-18 None
7 2 2021-05-17 None
8 3 2021-05-17 None
9 3 2021-07-17 None
10 3 2021-02-22 None
11 3 2021-02-22 None
Name: next_in_days, dtype: object
What I want
customerId fromDate next_day_in_days
0 1 2021-02-22 24
1 1 2021-03-18 4
2 1 2021-03-22 109
3 1 2021-02-10 12
4 1 2021-09-07 133
5 1 NaT None
6 1 2022-01-18 None
7 2 2021-05-17 None
8 3 2021-05-17 61
9 3 2021-07-17 None
10 3 2021-02-22 133
11 3 2021-02-22 133
First sort the rows by customerId and fromDate. Because duplicates are possible, drop them on the same two columns; after that you can use DataFrameGroupBy.diff with Series.dt.days:
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
df = df.sort_values(['customerId','fromDate'])
df['next_day_in_days'] = (df.drop_duplicates(['customerId','fromDate'])
                            .groupby('customerId')['fromDate']
                            .diff(-1)
                            .dt.days
                            .abs())
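To see why `.abs()` is needed after `.diff(-1)`, here is a small sketch on a single, already sorted Series. The names `dates`, `raw`, and `days` are made up for illustration; the three dates are taken from customer 1 in the question.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2021-02-10", "2021-02-22", "2021-03-18"]))

# diff(-1) computes row[i] - row[i + 1], so on ascending dates every gap is
# negative, and the last row (which has no successor) becomes NaT.
raw = dates.diff(-1)

# .dt.days turns the Timedeltas into numbers of days (NaT becomes NaN),
# and .abs() flips the sign so the gap reads as "days until the next date".
days = raw.dt.days.abs()
print(days.tolist())  # [12.0, 24.0, nan]
```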
If necessary, restore the original order of the index:
df = df.sort_index()
Finally, fill the rows that duplicate a ['customerId', 'fromDate'] pair by using GroupBy.ffill; here this fills the last row with 84.0:
df['next_day_in_days'] = df.groupby(['customerId', 'fromDate'])['next_day_in_days'].ffill()
print (df)
customerId fromDate next_day_in_days
0 1 2021-02-22 24.0
1 1 2021-03-18 4.0
2 1 2021-03-22 169.0
3 1 2021-02-10 12.0
4 1 2021-09-07 133.0
5 1 NaT NaN
6 1 2022-01-18 NaN
7 2 2021-05-17 NaN
8 3 2021-05-17 61.0
9 3 2021-07-17 NaN
10 3 2021-02-22 84.0
11 3 2021-02-22 84.0
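For comparison, the same steps can also be sketched as one function that uses `shift(-1)` and a merge instead of `diff(-1)` and `ffill`. This is an alternative under stated assumptions, not the method from the answer above: the function name `add_days_to_next` and the helper names `uniq` and `next_date` are hypothetical, and the merge assumes the frame has a default 0..n-1 index, because `merge` returns a fresh one.

```python
import pandas as pd

def add_days_to_next(df):
    """Return a copy of df with the gap in days to each customer's next later date."""
    out = df.copy()
    out["fromDate"] = pd.to_datetime(out["fromDate"], errors="coerce")

    # One row per distinct (customerId, fromDate), in date order, missing dates removed.
    uniq = (out.dropna(subset=["fromDate"])
               .drop_duplicates(["customerId", "fromDate"])
               .sort_values(["customerId", "fromDate"])
               [["customerId", "fromDate"]])

    # shift(-1) within each customer pulls the next later date onto the current row.
    next_date = uniq.groupby("customerId")["fromDate"].shift(-1)
    uniq["next_day_in_days"] = (next_date - uniq["fromDate"]).dt.days

    # A left merge keeps the original row order and gives duplicated dates the same gap.
    return out.merge(uniq, on=["customerId", "fromDate"], how="left")

d = {"customerId": [1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 3],
     "fromDate": ["2021-02-22", "2021-03-18", "2021-03-22", "2021-02-10",
                  "2021-09-07", None, "2022-01-18", "2021-05-17",
                  "2021-05-17", "2021-07-17", "2021-02-22", "2021-02-22"]}
df = pd.DataFrame(data=d)

result = add_days_to_next(df)
print(result)
```

On the sample data this should give the same next_day_in_days column as the output printed above. Rows with a missing date, or with no later date for that customer, stay NaN.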