Calculating revenue churn using pandas
Suppose I have a pandas DataFrame, df:
cust | year | revenue
1 | 2013 | 100
1 | 2013 | 50
2 | 2013 | 70
2 | 2015 | 10
3 | 2016 | 10
3 | 2019 | 65
...
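I want to calculate the revenue lost from customers who stopped doing business. For example, since customer #1 stopped doing business after 2013, we can say there is $50 of churn in 2014.
I want the sum of all churned (lost) revenue by year. The output would look something like this: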
YEAR
2013 0
2014 150
2015 0
2016 10
2017 0
2018 0
2019 0
2020 65
My current logic is: take the revenue from each customer's latest (maximum-year) transaction, then sum those values grouped by year.
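A minimal sketch of that logic on the six sample rows above. It assumes that "latest transaction" means all revenue booked in a customer's final year, and that the churn is attributed to the following year, since that is what reproduces the expected output; the function name `churned_revenue_by_year` is only illustrative.

```python
import pandas as pd

# The six sample rows from the question
df = pd.DataFrame({
    "cust": [1, 1, 2, 2, 3, 3],
    "year": [2013, 2013, 2013, 2015, 2016, 2019],
    "revenue": [100, 50, 70, 10, 10, 65],
})

def churned_revenue_by_year(df):
    # Last active year of each customer, broadcast back onto every row
    last_year = df.groupby("cust")["year"].transform("max")
    # Keep only the rows from each customer's final year
    final_rows = df[df["year"] == last_year]
    # Assumption: churned revenue = everything booked in that final year
    final_rev = (final_rows.groupby(["cust", "year"])["revenue"]
                 .sum()
                 .reset_index())
    # Assumption: the loss is attributed to the year after the final year
    final_rev["churn_year"] = final_rev["year"] + 1
    churn = final_rev.groupby("churn_year")["revenue"].sum()
    # Fill in years without any churn so every year appears in the output
    first = int(df["year"].min())
    last = int(df["year"].max()) + 1
    return churn.reindex(list(range(first, last + 1)), fill_value=0).rename_axis("YEAR")

churn_by_year = churned_revenue_by_year(df)
print(churn_by_year)
```

Note that this treats every customer as churned after their last recorded year, including customers who may simply be active at the end of the data.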
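The question invites you to treat "churn" as a binary condition: a customer either keeps doing business with your company or stops.
But you still want to know how much money you lose when a customer stops doing business with you, so the output changes from a binary variable to a numeric one.
I calculated churn per customer per year, then summed all the changes in revenue across the whole company for each year, which I think solves your problem.
I used this synthetic data: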
import numpy as np
import pandas as pd
# Sets random seed
np.random.seed(42)
# Sample size
size = 10**3
# Creates customers series
unique_customers = np.arange(1, 51)
customers = [unique_customers[i] for i in
             np.random.randint(low=0, high=len(unique_customers), size=size)]
# Creates date series
unique_dates = pd.date_range(start="2013-01-01", end="2017-12-31", freq="D")
unique_years = unique_dates.year.unique()
dates = [unique_dates[i] for i in
         np.random.randint(low=0, high=len(unique_dates), size=size)]
# Creates revenues series
unique_revenues = [100, 50, 70, 10, 65]
revenue = [unique_revenues[i] for i in
           np.random.randint(low=0, high=len(unique_revenues), size=size)]
# Creates Pandas DataFrame
data = (pd.DataFrame({'Customer': customers,
                      'Date': dates,
                      'Revenue': revenue})
        .set_index('Date')
        .sort_index())
# Randomly picks customers and sets their revenue to NaN from a random year onward
# (NaN is summed as zero in the yearly aggregation)
rem_customers = set([unique_customers[i] for i in
                     np.random.randint(low=0, high=len(unique_customers), size=10)])
for cust in rem_customers:
    rem_year = unique_years[np.random.randint(low=2, high=len(unique_years), size=1)].values[0]
    data.loc[(data.index.year >= rem_year) & (data['Customer'] == cust), "Revenue"] = np.nan
The dataset looks like this:
Customer Revenue
Date
2013-01-01 36 100.0
2013-01-03 1 10.0
2013-01-03 47 50.0
2013-01-04 28 10.0
2013-01-04 25 65.0
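I deliberately set some customers' revenue to NaN (which sums to zero) after a given date to illustrate the problem.
For example, you can use groupby() to define the levels of aggregation, diff() to calculate the lost revenue at the given level of aggregation, and resample() to change the time series frequency from daily to yearly.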
# Calculates revenue by customer and year
# ('YS' is the year-start alias; 'AS' is deprecated in recent pandas)
revenue_by_customer = data.groupby([pd.Grouper(freq='YS'), 'Customer']).sum()
# Calculates the change in revenue by customer year on year
diff_revenue_by_customer = (
    revenue_by_customer.groupby(['Customer'])
    .diff(1)
    .rename(columns={'Revenue': 'Revenue_change'})
)
# Calculates total change in revenue year on year
diff_revenue_per_year = diff_revenue_by_customer.droplevel(1).resample('YS').sum()
Customer 40 stopped doing business with the company after 2014, and their records look like this:
revenue_by_customer.xs(40, level=1, drop_level=False)
>> Revenue
>> Date Customer
>> 2013-01-01 40 285.0
>> 2014-01-01 40 195.0
>> 2015-01-01 40 0.0
>> 2016-01-01 40 0.0
>> 2017-01-01 40 0.0
diff_revenue_by_customer.xs(40, level=1, drop_level=False)
>> Revenue_change
>> Date Customer
>> 2013-01-01 40 NaN
>> 2014-01-01 40 -90.0
>> 2015-01-01 40 -195.0
>> 2016-01-01 40 0.0
>> 2017-01-01 40 0.0
When we sum the changes in revenue for each year, the result is this table:
diff_revenue_per_year.head()
>> Revenue_change
>> Date
>> 2013-01-01 0.0
>> 2014-01-01 -2490.0
>> 2015-01-01 2255.0
>> 2016-01-01 -1545.0
>> 2017-01-01 -1305.0
You can also calculate only the revenue lost year on year with the following code:
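lost_revenue_per_year = (
    diff_revenue_by_customer
    # Keeps only the year-on-year decreases
    .loc[diff_revenue_by_customer['Revenue_change'] < 0]
    .droplevel(1)
    .resample('YS')  # 'YS' is the year-start alias; 'AS' is deprecated in recent pandas
    .sum()
    .rename(columns={'Revenue_change': 'Lost_revenue'})
)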
lost_revenue_per_year.head()
>> Lost_revenue
>> Date
>> 2014-01-01 -4935.0
>> 2015-01-01 -2725.0
>> 2016-01-01 -4290.0
>> 2017-01-01 -3995.0
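For comparison, here is a rough sketch of the same year-on-year idea applied to the six sample rows from the question, which have a year column rather than dates. It assumes that a year in which a customer has no rows counts as zero revenue, so the table is first expanded to a full year by customer grid; the names `yearly`, `change` and `lost` are only illustrative.

```python
import pandas as pd

# The six sample rows from the question
df = pd.DataFrame({
    "cust": [1, 1, 2, 2, 3, 3],
    "year": [2013, 2013, 2013, 2015, 2016, 2019],
    "revenue": [100, 50, 70, 10, 10, 65],
})

# Yearly revenue as a year x customer grid; missing combinations become 0
yearly = df.pivot_table(index="year", columns="cust", values="revenue",
                        aggfunc="sum", fill_value=0)
# Adds the calendar years in which nobody bought anything
full_years = list(range(int(df["year"].min()), int(df["year"].max()) + 1))
yearly = yearly.reindex(full_years, fill_value=0)

# Year-on-year change per customer (the first year has no predecessor, so NaN)
change = yearly.diff()
# Keeps only the decreases, flips them to positive amounts and sums per year
lost = change.clip(upper=0).abs().sum(axis=1)
print(lost)
```

Under this definition a gap year counts as a loss (customer 2's 70 in 2013 is lost in 2014 even though they return in 2015), so the totals differ from the expected output in the question, which only counts revenue from each customer's final year.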