Identify customers with at least one purchase per month and assign to category

I have a dataframe containing sales data that looks like this:

customer_id  date        store_location  amount_paid  year  month
442608921    2021-01-01  Austin          11968        2021  1
865639331    2021-01-01  San Antonio     41970        2021  1
442643778    2021-01-01  Denver          900          2021  1
442643777    2021-01-01  Denver          2258         2021  1
442643774    2021-01-01  Boston          866          2021  1
442643775    2021-01-01  Los Angeles     866          2021  1
442643776    2021-01-01  Austin          1194         2021  1
601469342    2021-01-01  Austin          5163         2021  1
333570465    2021-01-01  Denver          8000         2021  1

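The data runs from January 1, 2021 to April 30, 2022.

I want to identify the customers who made at least one purchase in every month of this period, and create a new column with the value 1 for those customers and 0 for customers who are less active or inactive. How can I do this in Python?

I tried the following, which gives me the number of purchases per year and month, but I haven't figured out how to assign the values 0 and 1.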

grpd=df.groupby(['customer_id','year','month']).size().to_frame('n_purchases').reset_index().sort_values(['customer_id', 'year', 'month'], ascending=[True, True, True])
grpd

Here is one way to do what your question asks:

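def foo(x):
    # Set of (year, month) pairs in which this customer made a purchase
    y = {(x.year[i], x.month[i]) for i in x.index}
    year, month = 2021, 1
    # Walk the 16 months from 2021-01 through 2022-04
    for i in range(16):
        if (year, month) not in y:
            return False
        # Advance one month, rolling over into the next year after December
        month = month % 12 + 1
        year = year + (1 if month == 1 else 0)
    return True

# One boolean per customer_id, then attach it to every row of the original frame
df2 = df.groupby('customer_id').apply(foo).to_frame().rename(columns={0:'frequent_customer'})
df = df.join(df2, on='customer_id')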

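Explanation:

  • Group by customer_id and use apply to check, for each group, whether every (year, month) tuple from 2021-01 through 2022-04 appears in the set of unique (year, month) pairs found in that group's rows.
  • Name the boolean column frequent_customer in the new dataframe, which has one row per customer_id.
  • Use join to add frequent_customer to the original dataframe.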
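
The same check can also be expressed without a Python-level loop by counting distinct months per customer. The following is only a sketch and not part of the answer's own code: the helper name flag_monthly_customers and its start/end parameters are hypothetical, and it assumes the date column can be parsed by pd.to_datetime and that the dataframe's index has no duplicate labels. It writes 1/0 directly, as the question asked.

```python
import pandas as pd

def flag_monthly_customers(df, start="2021-01-01", end="2022-04-30"):
    """Return a copy of df with a 0/1 column named frequent_customer.

    Hypothetical helper: a customer gets 1 only if they have at least
    one purchase in every calendar month between start and end.
    """
    dates = pd.to_datetime(df["date"])
    # Encode each purchase's calendar month as one integer (year * 12 + month)
    month_idx = dates.dt.year * 12 + dates.dt.month

    first, last = pd.Timestamp(start), pd.Timestamp(end)
    first_idx = first.year * 12 + first.month
    last_idx = last.year * 12 + last.month
    n_required = last_idx - first_idx + 1  # 16 months for the default window

    # Ignore purchases outside the window, then count distinct months per customer
    in_window = month_idx.between(first_idx, last_idx)
    n_months = (
        month_idx[in_window]
        .groupby(df.loc[in_window, "customer_id"])
        .nunique()
    )

    # Customers whose distinct-month count covers the whole window
    frequent_ids = n_months.index[n_months == n_required]

    out = df.copy()
    out["frequent_customer"] = out["customer_id"].isin(frequent_ids).astype(int)
    return out
```

Because the months are restricted to the window before counting, a customer can only reach the required count by appearing in every one of those months, which is the same condition the loop in foo tests.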

Full test code:

import pandas as pd
df = pd.DataFrame({
'customer_id': [1,2,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5],
'date':[
    '2021-01-01',
    '2021-01-01',
    '2021-01-01',
    '2021-01-01','2021-02-01','2021-03-01','2021-04-01','2021-05-01','2021-06-01','2021-07-01','2021-08-01','2021-09-01','2021-10-01','2021-11-01','2021-12-01','2022-01-01','2022-02-01','2022-03-01','2022-04-01',
    '2021-01-01'],
'store_location':['Austin','Austin','Austin','Austin','Austin','Austin','Austin','Austin','Austin','Austin','Austin','Austin','Austin','Austin','Austin','Austin','Austin','Austin','Austin','Austin'],
'amount_paid':[900]*20
})
df[['year','month']] = pd.DataFrame(df.date.str.split('-').str.slice(0, 2).tolist(), index = df.index).astype(int)
print(df)

def foo(x):
    y = {(x.year[i], x.month[i]) for i in x.index}
    year, month = 2021, 1
    for i in range(16):
        if (year, month) not in y:
            return False
        month = month % 12 + 1
        year = year + (1 if month == 1 else 0)
    return True
    
df2 = df.groupby('customer_id').apply(foo).to_frame().rename(columns={0:'frequent_customer'})
df = df.join(df2, on='customer_id')
print(df)

Input:

    customer_id        date store_location  amount_paid  year  month
0             1  2021-01-01         Austin          900  2021      1
1             2  2021-01-01         Austin          900  2021      1
2             3  2021-01-01         Austin          900  2021      1
3             4  2021-01-01         Austin          900  2021      1
4             4  2021-02-01         Austin          900  2021      2
5             4  2021-03-01         Austin          900  2021      3
6             4  2021-04-01         Austin          900  2021      4
7             4  2021-05-01         Austin          900  2021      5
8             4  2021-06-01         Austin          900  2021      6
9             4  2021-07-01         Austin          900  2021      7
10            4  2021-08-01         Austin          900  2021      8
11            4  2021-09-01         Austin          900  2021      9
12            4  2021-10-01         Austin          900  2021     10
13            4  2021-11-01         Austin          900  2021     11
14            4  2021-12-01         Austin          900  2021     12
15            4  2022-01-01         Austin          900  2022      1
16            4  2022-02-01         Austin          900  2022      2
17            4  2022-03-01         Austin          900  2022      3
18            4  2022-04-01         Austin          900  2022      4
19            5  2021-01-01         Austin          900  2021      1

Output:

    customer_id        date store_location  amount_paid  year  month  frequent_customer
0             1  2021-01-01         Austin          900  2021      1              False
1             2  2021-01-01         Austin          900  2021      1              False
2             3  2021-01-01         Austin          900  2021      1              False
3             4  2021-01-01         Austin          900  2021      1               True
4             4  2021-02-01         Austin          900  2021      2               True
5             4  2021-03-01         Austin          900  2021      3               True
6             4  2021-04-01         Austin          900  2021      4               True
7             4  2021-05-01         Austin          900  2021      5               True
8             4  2021-06-01         Austin          900  2021      6               True
9             4  2021-07-01         Austin          900  2021      7               True
10            4  2021-08-01         Austin          900  2021      8               True
11            4  2021-09-01         Austin          900  2021      9               True
12            4  2021-10-01         Austin          900  2021     10               True
13            4  2021-11-01         Austin          900  2021     11               True
14            4  2021-12-01         Austin          900  2021     12               True
15            4  2022-01-01         Austin          900  2022      1               True
16            4  2022-02-01         Austin          900  2022      2               True
17            4  2022-03-01         Austin          900  2022      3               True
18            4  2022-04-01         Austin          900  2022      4               True
19            5  2021-01-01         Austin          900  2021      1              False
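
The question asked for 1 and 0 rather than True and False. A boolean column can be converted with astype(int); the small frame below is made up for illustration and merely stands in for the joined result.

```python
import pandas as pd

# Toy frame standing in for the joined result; the values are illustrative only
result = pd.DataFrame({
    "customer_id": [1, 4, 4, 5],
    "frequent_customer": [False, True, True, False],
})

# True becomes 1, False becomes 0
result["frequent_customer"] = result["frequent_customer"].astype(int)
print(result)
```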