Find the closest date for each customerId and calculate the time span

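I have a problem. I have a DataFrame with a customerId column and a date column, fromDate. For each customer separately, I want to calculate the time until the next delivery. For example, the customer with customerId = 1 bought something on 2021-03-18; I want to find the next date and output the distance in days, e.g. 2021-03-22, which is 4 days later. Put simply, I want to compute the next date in the future minus the current fromDate, i.e. n - (n-1). If a date has no next date, the result should be None; for example, 2022-01-18 should be None.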

My problem is that I get a lot of None values, and I also need to look at each customer separately. How can I do that?

Math example

n - (n-1) = next_day_in_days
e.g.
2021-03-22 - 2021-03-18 = 4
[OUT]
    customerId    fromDate next_day_in_days
1            1  2021-03-18         4
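
The subtraction in the math example can be checked directly with pandas timestamps. This is only a minimal sketch; the variable names `current`, `following`, and `gap_in_days` are made up for illustration and do not appear in the original code.

```python
import pandas as pd

# Two consecutive purchase dates of customerId = 1 from the example above
current = pd.Timestamp("2021-03-18")
following = pd.Timestamp("2021-03-22")

# Subtracting two timestamps yields a Timedelta; .days is its whole-day part
gap_in_days = (following - current).days
print(gap_in_days)  # 4
```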

DataFrame

    customerId    fromDate
0            1  2021-02-22
1            1  2021-03-18
2            1  2021-03-22
3            1  2021-02-10
4            1  2021-09-07
5            1        None
6            1  2022-01-18
7            2  2021-05-17
8            3  2021-05-17
9            3  2021-07-17
10           3  2021-02-22
11           3  2021-02-22

Code

import pandas as pd
import datetime

d = {'customerId': [1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 3],
     'fromDate': ['2021-02-22', '2021-03-18', '2021-03-22', 
'2021-02-10', '2021-09-07', None, '2022-01-18', '2021-05-17', '2021-05-17', '2021-07-17', '2021-02-22', '2021-02-22']
    }
df = pd.DataFrame(data=d)


print(df)
def nearest(items, pivot):
  try:
    return min(items, key=lambda x: abs(x - pivot))
  except:
    return None

df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce').dt.date
df["next_day_in_days"] = df['fromDate'].apply(lambda x: nearest(df['fromDate'], x)) 

Output

[OUT]
    customerId    fromDate next_in_days
0            1  2021-02-22         None
1            1  2021-03-18         None
2            1  2021-03-22         None
3            1  2021-02-10         None
4            1  2021-09-07         None
5            1         NaT         None
6            1  2022-01-18         None
7            2  2021-05-17         None
8            3  2021-05-17         None
9            3  2021-07-17         None
10           3  2021-02-22         None
11           3  2021-02-22         None
Name: next_in_days, dtype: object

What I want

    customerId    fromDate next_day_in_days
0            1  2021-02-22         24
1            1  2021-03-18         4
2            1  2021-03-22         169
3            1  2021-02-10         12
4            1  2021-09-07         133
5            1         NaT         None
6            1  2022-01-18         None
7            2  2021-05-17         None
8            3  2021-05-17         61
9            3  2021-07-17         None
10           3  2021-02-22         84
11           3  2021-02-22         84

First sort by the customerId and fromDate columns. Because duplicates are possible, drop them by the same columns, then use DataFrameGroupBy.diff with Series.dt.days:

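# Parse the strings as datetimes; missing or invalid values become NaT
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')

# Sort so that each customer's dates are in ascending order
df = df.sort_values(['customerId','fromDate'])
# Drop duplicate (customerId, fromDate) pairs, then subtract the next row
# within each customer; diff(-1) is negative here, hence abs()
df['next_day_in_days'] = (df.drop_duplicates(['customerId','fromDate'])
                            .groupby('customerId')['fromDate']
                            .diff(-1)
                            .dt.days
                            .abs())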
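
To see what diff(-1) does, here is a small sketch on a single sorted series of dates. The names `dates` and `gaps` are made up for illustration. Each row is subtracted from by the following row, so the raw differences are negative and the last row has no successor.

```python
import pandas as pd

# Three ascending dates, as they would appear for one customer after sorting
dates = pd.Series(pd.to_datetime(["2021-02-10", "2021-02-22", "2021-03-18"]))

# diff(-1) computes row[i] - row[i + 1]; the last row has nothing after it
gaps = dates.diff(-1).dt.days.abs()
print(gaps.tolist())  # [12.0, 24.0, nan]
```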

If necessary, restore the original ordering of the index:

df = df.sort_index()

Finally, fill the value for the duplicated rows in each ['customerId', 'fromDate'] group with GroupBy.ffill; here that is the last value, 84.0:

df['next_day_in_days'] = df.groupby(['customerId', 'fromDate'])['next_day_in_days'].ffill()
print(df)
    customerId   fromDate  next_day_in_days
0            1 2021-02-22              24.0
1            1 2021-03-18               4.0
2            1 2021-03-22             169.0
3            1 2021-02-10              12.0
4            1 2021-09-07             133.0
5            1        NaT               NaN
6            1 2022-01-18               NaN
7            2 2021-05-17               NaN
8            3 2021-05-17              61.0
9            3 2021-07-17               NaN
10           3 2021-02-22              84.0
11           3 2021-02-22              84.0
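
For comparison, the same result can probably also be reached without relying on row order for the forward fill. The sketch below is an alternative, not the method shown above: it computes the gap once per unique (customerId, fromDate) pair with groupby(...).shift(-1) and merges it back. The function name `add_next_day_in_days` and the helper `pairs` are hypothetical, and it assumes the default integer index, because merge returns a fresh RangeIndex.

```python
import pandas as pd

def add_next_day_in_days(df):
    """Hypothetical helper: days until each customer's next distinct date."""
    out = df.copy()
    out["fromDate"] = pd.to_datetime(out["fromDate"], errors="coerce")

    # Unique, valid (customerId, fromDate) pairs in ascending date order
    pairs = (out[["customerId", "fromDate"]]
             .dropna()
             .drop_duplicates()
             .sort_values(["customerId", "fromDate"]))

    # Next date within the same customer minus the current date
    next_date = pairs.groupby("customerId")["fromDate"].shift(-1)
    pairs["next_day_in_days"] = (next_date - pairs["fromDate"]).dt.days

    # A left merge keeps every original row, including duplicates and NaT
    return out.merge(pairs, on=["customerId", "fromDate"], how="left")

d = {"customerId": [1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 3],
     "fromDate": ["2021-02-22", "2021-03-18", "2021-03-22", "2021-02-10",
                  "2021-09-07", None, "2022-01-18", "2021-05-17",
                  "2021-05-17", "2021-07-17", "2021-02-22", "2021-02-22"]}
result = add_next_day_in_days(pd.DataFrame(data=d))
print(result)
```

Because the gap is attached by key rather than by position, both duplicate rows for customerId 3 on 2021-02-22 receive the same value without a separate ffill step.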