Identify customers with at least one purchase per month and assign them to a category
I have a data frame containing sales data that looks like this:
customer_id date store_location amount_paid year month
442608921 2021-01-01 Austin 11968 2021 1
865639331 2021-01-01 San Antonio 41970 2021 1
442643778 2021-01-01 Denver 900 2021 1
442643777 2021-01-01 Denver 2258 2021 1
442643774 2021-01-01 Boston 866 2021 1
442643775 2021-01-01 Los Angeles 866 2021 1
442643776 2021-01-01 Austin 1194 2021 1
601469342 2021-01-01 Austin 5163 2021 1
333570465 2021-01-01 Denver 8000 2021 1
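The data runs from January 1, 2021 to April 30, 2022.
I want to identify the customers who made at least one purchase in every month of this period, and create a new column with the value 1 for those customers and 0 for less active or inactive customers. How can I do this in Python?
I tried the following, which gives me the number of purchases per year and month, but I haven't figured out how to assign the values 0 and 1.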
grpd=df.groupby(['customer_id','year','month']).size().to_frame('n_purchases').reset_index().sort_values(['customer_id', 'year', 'month'], ascending=[True, True, True])
grpd
Here is one way to do what you're asking:
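def foo(x):
    # Set of (year, month) pairs in which this customer made a purchase
    y = {(x.year[i], x.month[i]) for i in x.index}
    year, month = 2021, 1
    # Walk the 16 months from 2021-01 through 2022-04
    for i in range(16):
        if (year, month) not in y:
            return False
        month = month % 12 + 1
        year = year + (1 if month == 1 else 0)
    return True

# One boolean per customer_id, then attach it to every row of the original frame
df2 = df.groupby('customer_id').apply(foo).to_frame().rename(columns={0:'frequent_customer'})
df = df.join(df2, on='customer_id')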
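Explanation:
- Group by customer_id and use apply to check, for each group, whether every (year, month) tuple from 2021-01 through 2022-04 appears in the set of unique (year, month) pairs taken from that group's rows.
- Name the resulting boolean column frequent_customer in a new data frame with one row per customer_id.
- Use join to add the frequent_customer column to the original data frame.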
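The same check can also be written without a Python-level loop, by counting the distinct months each customer bought in. The following is a minimal sketch rather than part of the answer above: the helper name flag_monthly_customers and its first/last parameters are hypothetical, and it assumes the frame already has integer year and month columns as in the question. It writes 1 and 0 directly instead of True and False.

```python
import pandas as pd

def flag_monthly_customers(df, first=(2021, 1), last=(2022, 4)):
    """Hypothetical helper: add a 0/1 frequent_customer column to a copy of df."""
    # Map each (year, month) to one integer so consecutive months differ by 1.
    month_key = df["year"] * 12 + df["month"]
    first_key = first[0] * 12 + first[1]
    last_key = last[0] * 12 + last[1]
    n_required = last_key - first_key + 1  # 16 for 2021-01 through 2022-04

    # Keep only purchases inside the window, then count distinct months per customer.
    in_window = month_key.between(first_key, last_key)
    months_seen = (
        df.assign(month_key=month_key)[in_window]
        .groupby("customer_id")["month_key"]
        .nunique()
    )
    frequent_ids = months_seen.index[months_seen == n_required]

    out = df.copy()
    out["frequent_customer"] = out["customer_id"].isin(frequent_ids).astype(int)
    return out

# Tiny demo: customer 4 buys in all 16 months, customer 1 only in 2021-01.
months = [(2021, m) for m in range(1, 13)] + [(2022, m) for m in range(1, 5)]
demo = pd.DataFrame({
    "customer_id": [1] + [4] * 16,
    "year": [2021] + [y for y, _ in months],
    "month": [1] + [m for _, m in months],
})
result = flag_monthly_customers(demo)
print(result.groupby("customer_id")["frequent_customer"].first())
```

In the demo, customer 4 is flagged 1 and customer 1 is flagged 0. Purchases outside the window are ignored, so extra months before 2021-01 or after 2022-04 cannot make up for a missing month inside it.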
Full test code:
import pandas as pd

df = pd.DataFrame({
    'customer_id': [1,2,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5],
    'date':[
        '2021-01-01',
        '2021-01-01',
        '2021-01-01',
        '2021-01-01','2021-02-01','2021-03-01','2021-04-01','2021-05-01','2021-06-01','2021-07-01','2021-08-01','2021-09-01','2021-10-01','2021-11-01','2021-12-01','2022-01-01','2022-02-01','2022-03-01','2022-04-01',
        '2021-01-01'],
    'store_location':['Austin']*20,
    'amount_paid':[900]*20
})
df[['year','month']] = pd.DataFrame(df.date.str.split('-').str.slice(0, 2).tolist(), index = df.index).astype(int)
print(df)

def foo(x):
    y = {(x.year[i], x.month[i]) for i in x.index}
    year, month = 2021, 1
    for i in range(16):
        if (year, month) not in y:
            return False
        month = month % 12 + 1
        year = year + (1 if month == 1 else 0)
    return True

df2 = df.groupby('customer_id').apply(foo).to_frame().rename(columns={0:'frequent_customer'})
df = df.join(df2, on='customer_id')
print(df)
Input:
customer_id date store_location amount_paid year month
0 1 2021-01-01 Austin 900 2021 1
1 2 2021-01-01 Austin 900 2021 1
2 3 2021-01-01 Austin 900 2021 1
3 4 2021-01-01 Austin 900 2021 1
4 4 2021-02-01 Austin 900 2021 2
5 4 2021-03-01 Austin 900 2021 3
6 4 2021-04-01 Austin 900 2021 4
7 4 2021-05-01 Austin 900 2021 5
8 4 2021-06-01 Austin 900 2021 6
9 4 2021-07-01 Austin 900 2021 7
10 4 2021-08-01 Austin 900 2021 8
11 4 2021-09-01 Austin 900 2021 9
12 4 2021-10-01 Austin 900 2021 10
13 4 2021-11-01 Austin 900 2021 11
14 4 2021-12-01 Austin 900 2021 12
15 4 2022-01-01 Austin 900 2022 1
16 4 2022-02-01 Austin 900 2022 2
17 4 2022-03-01 Austin 900 2022 3
18 4 2022-04-01 Austin 900 2022 4
19 5 2021-01-01 Austin 900 2021 1
Output:
customer_id date store_location amount_paid year month frequent_customer
0 1 2021-01-01 Austin 900 2021 1 False
1 2 2021-01-01 Austin 900 2021 1 False
2 3 2021-01-01 Austin 900 2021 1 False
3 4 2021-01-01 Austin 900 2021 1 True
4 4 2021-02-01 Austin 900 2021 2 True
5 4 2021-03-01 Austin 900 2021 3 True
6 4 2021-04-01 Austin 900 2021 4 True
7 4 2021-05-01 Austin 900 2021 5 True
8 4 2021-06-01 Austin 900 2021 6 True
9 4 2021-07-01 Austin 900 2021 7 True
10 4 2021-08-01 Austin 900 2021 8 True
11 4 2021-09-01 Austin 900 2021 9 True
12 4 2021-10-01 Austin 900 2021 10 True
13 4 2021-11-01 Austin 900 2021 11 True
14 4 2021-12-01 Austin 900 2021 12 True
15 4 2022-01-01 Austin 900 2022 1 True
16 4 2022-02-01 Austin 900 2022 2 True
17 4 2022-03-01 Austin 900 2022 3 True
18 4 2022-04-01 Austin 900 2022 4 True
19 5 2021-01-01 Austin 900 2021 1 False
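The question asked for 1 and 0 rather than True and False. A boolean column converts directly with astype(int); the snippet below is a small sketch that uses a stand-in frame in place of the full joined result.

```python
import pandas as pd

# Small stand-in for the joined result: one boolean flag per row.
df = pd.DataFrame({
    "customer_id": [1, 4, 4, 5],
    "frequent_customer": [False, True, True, False],
})

# True becomes 1 and False becomes 0.
df["frequent_customer"] = df["frequent_customer"].astype(int)
print(df)
```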