计算一些日期题
Calculate some date questions
我有问题。我想回答一些问题(见下文)。不幸的是我得到了一个错误 ValueError: Wrong number of items passed 0, placement implies 1
。我如何确定问题?
- 最后一次互动是多少天前(从今天算起)?
- 客户下了多少订单?
- 最快的交互是什么时候?
- 最短的互动时间是什么时候?
- 平均互动度是多少? (为此我计算了天数)
From 2021-02-10 to 2021-02-22 = 12
From 2021-02-22 to 2021-03-18 = 24
From 2021-03-18 to 2021-03-22 = 4
From 2021-03-22 to 2021-09-07 = 109
From 2021-09-07 to 2022-01-18 = 193
-------
68,4 (average days)
数据框
customerId fromDate
0 1 2021-02-22
1 1 2021-03-18
2 1 2021-03-22
3 1 2021-02-10
4 1 2021-09-07
5 1 None
6 1 2022-01-18
7 2 2021-05-17
代码
import pandas as pd
d = {'customerId': [1, 1, 1, 1, 1, 1, 1, 2],
'fromDate': ['2021-02-22', '2021-03-18', '2021-03-22',
'2021-02-10', '2021-09-07', None, '2022-01-18', '2021-05-17']
}
df = pd.DataFrame(data=d)
display(df)
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce').dt.date
#df = df['fromDate'].groupby('customerId', group_keys=False)
df = df.sort_values(['customerId','fromDate'],ascending=False)#.groupby('customerId')
df_new = pd.DataFrame()
df_new['average'] = df.groupby('customerId').mean()
[OUT] AttributeError: 'DataFrameGroupBy' object has no attribute 'groupby'
df_new = pd.DataFrame()
df_new['lastInteractivity'] = pd.to_datetime('today').normalize() - df['fromDate'].max()
[OUT] TypeError: '>=' not supported between instances of 'datetime.date' and 'float'
我想要的
customerId lastInteractivity howMany shortest Longest Average
2 371 1 None None None
1 125 5 4 193 68,4
# shortes, longest, average is None because customer with the Id 2 had only 1 date
使用:
#converting to datetimes
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
#for correct add missing dates is sorting ascending by both columns
df = df.sort_values(['customerId','fromDate'])
#new column per customerId
df['lastInteractivity'] = pd.to_datetime('today').normalize() - df['fromDate']
#added missing dates per customerId, also count removed missing rows with NaNs
df = (df.dropna(subset=['fromDate'])
.set_index('fromDate')
.groupby('customerId')['lastInteractivity']
.apply(lambda x: x.asfreq('d'))
.reset_index())
#count how many missing dates
m = df['lastInteractivity'].notna()
df1 = (df[~m].groupby(['customerId', m.cumsum()])['customerId']
.size()
.add(1)
.reset_index(name='Count'))
print (df1)
customerId lastInteractivity Count
0 1 1 12
1 1 2 24
2 1 3 4
3 1 4 169
4 1 5 133
df1 = df1.groupby('customerId').agg(howMany=('Count','size'),
shortest=('Count','min'),
Longest=('Count','max'),
Average=('Count','mean'))
#get last lastInteractivity and joined df1
df = (df.groupby('customerId')['lastInteractivity']
.last()
.dt.days
.sort_index(ascending=False)
.to_frame()
.join(df1)
.reset_index())
print (df)
customerId lastInteractivity howMany shortest Longest Average
0 2 371 NaN NaN NaN NaN
1 1 125 5.0 4.0 169.0 68.4
我有问题。我想回答一些问题(见下文)。不幸的是我得到了一个错误 ValueError: Wrong number of items passed 0, placement implies 1
。我如何确定问题?
- 最后一次互动是多少天前(从今天算起)?
- 客户下了多少订单?
- 最快的交互是什么时候?
- 最短的互动时间是什么时候?
- 平均互动度是多少? (为此我计算了天数)
From 2021-02-10 to 2021-02-22 = 12
From 2021-02-22 to 2021-03-18 = 24
From 2021-03-18 to 2021-03-22 = 4
From 2021-03-22 to 2021-09-07 = 109
From 2021-09-07 to 2022-01-18 = 193
-------
68,4 (average days)
数据框
customerId fromDate
0 1 2021-02-22
1 1 2021-03-18
2 1 2021-03-22
3 1 2021-02-10
4 1 2021-09-07
5 1 None
6 1 2022-01-18
7 2 2021-05-17
代码
import pandas as pd
d = {'customerId': [1, 1, 1, 1, 1, 1, 1, 2],
'fromDate': ['2021-02-22', '2021-03-18', '2021-03-22',
'2021-02-10', '2021-09-07', None, '2022-01-18', '2021-05-17']
}
df = pd.DataFrame(data=d)
display(df)
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce').dt.date
#df = df['fromDate'].groupby('customerId', group_keys=False)
df = df.sort_values(['customerId','fromDate'],ascending=False)#.groupby('customerId')
df_new = pd.DataFrame()
df_new['average'] = df.groupby('customerId').mean()
[OUT] AttributeError: 'DataFrameGroupBy' object has no attribute 'groupby'
df_new = pd.DataFrame()
df_new['lastInteractivity'] = pd.to_datetime('today').normalize() - df['fromDate'].max()
[OUT] TypeError: '>=' not supported between instances of 'datetime.date' and 'float'
我想要的
customerId lastInteractivity howMany shortest Longest Average
2 371 1 None None None
1 125 5 4 193 68,4
# shortes, longest, average is None because customer with the Id 2 had only 1 date
使用:
#converting to datetimes
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
#for correct add missing dates is sorting ascending by both columns
df = df.sort_values(['customerId','fromDate'])
#new column per customerId
df['lastInteractivity'] = pd.to_datetime('today').normalize() - df['fromDate']
#added missing dates per customerId, also count removed missing rows with NaNs
df = (df.dropna(subset=['fromDate'])
.set_index('fromDate')
.groupby('customerId')['lastInteractivity']
.apply(lambda x: x.asfreq('d'))
.reset_index())
#count how many missing dates
m = df['lastInteractivity'].notna()
df1 = (df[~m].groupby(['customerId', m.cumsum()])['customerId']
.size()
.add(1)
.reset_index(name='Count'))
print (df1)
customerId lastInteractivity Count
0 1 1 12
1 1 2 24
2 1 3 4
3 1 4 169
4 1 5 133
df1 = df1.groupby('customerId').agg(howMany=('Count','size'),
shortest=('Count','min'),
Longest=('Count','max'),
Average=('Count','mean'))
#get last lastInteractivity and joined df1
df = (df.groupby('customerId')['lastInteractivity']
.last()
.dt.days
.sort_index(ascending=False)
.to_frame()
.join(df1)
.reset_index())
print (df)
customerId lastInteractivity howMany shortest Longest Average
0 2 371 NaN NaN NaN NaN
1 1 125 5.0 4.0 169.0 68.4