Calculating revenue churn using pandas

Suppose I have a pandas DataFrame, df:

cust | year | revenue
1 | 2013 | 100
1 | 2013 | 50
2 | 2013 | 70
2 | 2015 | 10
3 | 2016 | 10
3 | 2019 | 65
... 

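I would like to calculate the revenue lost from customers who stop doing business. For example, since customer #1 stopped doing business after 2013, we can say there was $50 of churn in 2014.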

I want to calculate the total churned (lost) revenue by year. The output would look something like:

YEAR
2013 0
2014 150
2015 0
2016 10
2017 0
2018 0
2019 0
2020 65

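My current logic is as follows: take the revenue from each customer's latest transaction (the one in its maximum year), then sum those values, grouped by year.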
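The logic described above can be sketched roughly as follows. This is only an illustration under a few assumptions: it uses just the six rows shown (the rows elided by `...` are unknown), it breaks ties within a year by row order because the data has no finer-grained date, and it books the lost revenue in the year after the customer's last transaction.

```python
import pandas as pd

# Only the six rows shown in the question; the elided rows ("...") are unknown.
df = pd.DataFrame({
    "cust": [1, 1, 2, 2, 3, 3],
    "year": [2013, 2013, 2013, 2015, 2016, 2019],
    "revenue": [100, 50, 70, 10, 10, 65],
})

# Latest transaction per customer; ties within a year fall back to row order
# (an assumption, since there is no finer-grained date column).
latest = df.sort_values("year", kind="stable").groupby("cust").tail(1)

# Book that revenue as churn in the year after the last transaction,
# and fill years without churn with zero.
churn = (
    latest.assign(year=latest["year"] + 1)
    .groupby("year")["revenue"]
    .sum()
    .reindex(range(df["year"].min(), df["year"].max() + 2), fill_value=0)
)
print(churn)
```

On these six rows this gives 50 for 2014, 10 for 2016 and 65 for 2020. To count a customer's whole final-year revenue instead of the last transaction, aggregate by `cust` and `year` before taking the last row.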

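It is tempting to treat "churn" as a binary condition: a customer either keeps doing business with your company or stops. But you still want to know how much money you lose when a customer stops doing business with you, which changes the output from a binary variable to a numeric one.

I calculated the year-on-year change in revenue for each customer and then summed those changes per year across the whole company, which I think solves your problem.

I used this synthetic data: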

import numpy as np
import pandas as pd

# Sets random seed
np.random.seed(42)

# Sample size
size = 10**3

# Creates customers series
unique_customers = np.arange(1, 51)

customers = [unique_customers[i] for i in
             np.random.randint(low=0, high=len(unique_customers), size=size)]

# Creates date series
unique_dates = pd.date_range(start="2013-01-01", end="2017-12-31", freq="D")

unique_years = unique_dates.year.unique()

dates = [unique_dates[i] for i in
         np.random.randint(low=0, high=len(unique_dates), size=size)]

# Creates revenues series
unique_revenues = [100, 50, 70, 10, 65]

revenue = [unique_revenues[i] for i in
           np.random.randint(low=0, high=len(unique_revenues), size=size)]

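# Creates Pandas DataFrame
# Revenue is cast to float so NaN can be assigned later without a dtype
# warning (or error) on recent pandas versions
data = (pd.DataFrame({'Customer': customers,
                      'Date': dates,
                      'Revenue': revenue})
        .astype({'Revenue': 'float64'})
        .set_index('Date')
        .sort_index())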

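# Randomly picks customers and sets their revenue to NaN from a random year onward
# (NaN is treated as zero when the revenue is summed later)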
rem_customers = set([unique_customers[i] for i in
                     np.random.randint(low=0, high=len(unique_customers), size=10)])

for cust in rem_customers:
    rem_year = unique_years[np.random.randint(low=2, high=len(unique_years), size=1)].values[0]
    data.loc[(data.index.year >= rem_year) & (data['Customer'] == cust), "Revenue"] = np.nan

The dataset looks like this:

            Customer  Revenue
Date                         
2013-01-01        36    100.0
2013-01-03         1     10.0
2013-01-03        47     50.0
2013-01-04        28     10.0
2013-01-04        25     65.0

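I deliberately blanked out some customers' revenue after a given date (set to NaN, which sums to zero) to illustrate the problem.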

For example, you can use groupby() to define the levels of aggregation, diff() to calculate the lost revenue at the given level of aggregation, and resample() to change the time-series frequency from daily to yearly.

# Calculates revenue by customer and year
# ('YS' = year start; it replaces the 'AS' alias, deprecated in pandas 2.2)
revenue_by_customer = data.groupby([pd.Grouper(freq='YS'), 'Customer']).sum()

# Calculates the change in revenue by customer year on year
diff_revenue_by_customer = (
    revenue_by_customer.groupby(['Customer'])
    .diff(1)
    .rename(columns={'Revenue': 'Revenue_change'})
)

# Calculates total change in revenue year on year
diff_revenue_per_year = diff_revenue_by_customer.droplevel(1).resample('YS').sum()

Customer 40 stopped doing business with the company after 2014; their records look like this:

revenue_by_customer.xs(40, level=1, drop_level=False)

>>                      Revenue
>> Date       Customer         
>> 2013-01-01 40          285.0
>> 2014-01-01 40          195.0
>> 2015-01-01 40            0.0
>> 2016-01-01 40            0.0
>> 2017-01-01 40            0.0

diff_revenue_by_customer.xs(40, level=1, drop_level=False)

>>                       Revenue_change
>> Date       Customer         
>> 2013-01-01 40            NaN
>> 2014-01-01 40          -90.0
>> 2015-01-01 40         -195.0
>> 2016-01-01 40            0.0
>> 2017-01-01 40            0.0

When we sum the revenue changes for each year, the result is the following table:

diff_revenue_per_year.head()

>>             Revenue_change
>> Date               
>> 2013-01-01      0.0
>> 2014-01-01  -2490.0
>> 2015-01-01   2255.0
>> 2016-01-01  -1545.0
>> 2017-01-01  -1305.0

You can also calculate only the revenue lost year on year, with the following code:

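# Keeps only the negative year-on-year changes and sums them per year
# ('YS' = year start; it replaces the deprecated 'AS' alias)
lost_revenue_per_year = (
    diff_revenue_by_customer
    .loc[diff_revenue_by_customer['Revenue_change'] < 0]
    .droplevel(1)
    .resample('YS')
    .sum()
    .rename(columns={'Revenue_change': 'Lost_revenue'})
)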

lost_revenue_per_year.head()

>>             Lost_revenue
>> Date                    
>> 2014-01-01       -4935.0
>> 2015-01-01       -2725.0
>> 2016-01-01       -4290.0
>> 2017-01-01       -3995.0
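
One caveat with the diff() approach: it only registers a drop when the customer still has a row in the later year. The synthetic data keeps such rows (with NaN revenue), but in data like the question's, a customer who stops simply has no later rows. A minimal sketch of one way around this, assuming the six rows and the column names (cust, year, revenue) from the question and a hypothetical 2013 to 2020 window, is to fill the missing customer/year cells with zero before differencing:

```python
import pandas as pd

# The six rows shown in the question (elided rows are unknown).
df = pd.DataFrame({
    "cust": [1, 1, 2, 2, 3, 3],
    "year": [2013, 2013, 2013, 2015, 2016, 2019],
    "revenue": [100, 50, 70, 10, 10, 65],
})

# Customer x year grid of total revenue; years without transactions become 0.
# The 2013-2020 window is an assumption chosen to match the question's output.
grid = (
    df.groupby(["cust", "year"])["revenue"]
    .sum()
    .unstack(fill_value=0)
    .reindex(columns=range(2013, 2021), fill_value=0)
)

# Year-on-year change per customer (difference across the year columns).
change = grid.diff(axis=1)

# Keep only the decreases and sum them per year.
lost = change.where(change < 0, 0).sum()
print(lost)
```

Note that this counts any year-on-year decrease as lost revenue, including temporary gaps (customer 2 has no revenue in 2014 but returns in 2015), so the numbers differ from a definition that only counts a customer's final departure.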