Adding rows to a pandas dataframe with respect to existing entities within the dataframe

I have a dataframe populated with values like the ones below. I only have rows for the months in which an entity generated revenue, but an entity's lifetime may be longer than what their recorded revenue depicts.

entity revenue_generated first_purchase months_since_first_purchase
A 20 2022-01 0
A 60 2022-01 2
A 80 2022-01 3
A 15 2022-01 5
B 30 2022-03 0
B 10 2022-03 1
B 12 2022-03 2
G 25 2022-01 0
G 19 2022-01 1
G 90 2022-01 2
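
For reference, the sample data above can be reproduced with something like this (my own construction; first_purchase is kept as a plain string here):

import pandas as pd

df = pd.DataFrame({
    'entity': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'G', 'G', 'G'],
    'revenue_generated': [20, 60, 80, 15, 30, 10, 12, 25, 19, 90],
    'first_purchase': ['2022-01'] * 4 + ['2022-03'] * 3 + ['2022-01'] * 3,
    'months_since_first_purchase': [0, 2, 3, 5, 0, 1, 2, 0, 1, 2],
})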

To quickly explain lifetime in this context: entity A made their first purchase in January 2022. The $15 they generated 5 months after that first purchase is the maximum of their lifetime. In other words, since I am writing this in June 2022, we cannot yet see their July 2022 purchase history. So their maximum potential lifetime as a customer is 5 months (representing the 6 observable months of January, February, March, April, May and June, counting from 0).

For simplicity's sake, say B made their first purchase in March 2022, so their maximum would be 3. So A's maximum potential lifetime is represented in the dataset, but B's and G's are not.

G also made their first purchase in January 2022, so their maximum months_since_first_purchase value is also 5, but they generated no revenue in that month, so it is not represented for them.
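
The maximum potential lifetime is just the month difference between the observation month (June 2022 here) and first_purchase, counting from 0. As a quick check of that arithmetic (illustrative snippet only):

obs = pd.Period('2022-06', freq='M')
fp = pd.PeriodIndex(df['first_purchase'], freq='M')
max_lifetime = (obs.year - fp.year) * 12 + (obs.month - fp.month)
# A and G (2022-01) -> 5, B (2022-03) -> 3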

I want to change the dataset so that it contains every months_since_first_purchase for each entity, with the revenue for that month included as 0. So my target dataset (emphasis on the additions) is:

entity revenue_generated first_purchase months_since_first_purchase
A 20 2022-01 0
A 0 2022-01 1
A 60 2022-01 2
A 80 2022-01 3
A 0 2022-01 4
A 15 2022-01 5
B 30 2022-03 0
B 10 2022-03 1
B 12 2022-03 2
B 0 2022-03 3
G 25 2022-01 0
G 19 2022-01 1
G 90 2022-01 2
G 0 2022-01 3
G 0 2022-01 4
G 0 2022-01 5

I currently have this implemented in a for loop, where I iterate over the set of entities, build a new dataframe for each entity, and concatenate it onto a new master dataframe, but this is very slow. Is there a more pythonic way to solve this in pandas that doesn't involve iterating and rebuilding new dataframes?
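
A minimal sketch of the kind of loop described above (illustrative only, not the exact code; June 2022 is assumed as the observation month):

obs = pd.Period('2022-06', freq='M')
frames = []
for entity, grp in df.groupby('entity'):
    first = pd.Period(grp['first_purchase'].iloc[0], freq='M')
    max_months = (obs.year - first.year) * 12 + (obs.month - first.month)
    # one small frame per entity with every month from 0 to the max potential lifetime
    full = pd.DataFrame({
        'entity': entity,
        'first_purchase': grp['first_purchase'].iloc[0],
        'months_since_first_purchase': range(max_months + 1),
    })
    full = full.merge(grp[['months_since_first_purchase', 'revenue_generated']],
                      on='months_since_first_purchase', how='left')
    full['revenue_generated'] = full['revenue_generated'].fillna(0)
    frames.append(full)
result = pd.concat(frames, ignore_index=True)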

Definitely possible without a loop, but I think creating a new dataframe simplifies it a bit.

Using a simpler example...

  entity  revenue_generated  first_purchase  months_since_first_purchase
0      A                 15          202201                            0
1      A                 20          202201                            1
2      A                 40          202201                            4
3      B                 80          202203                            0
4      B                 60          202203                            2

# reference value to determine number of repeats for each entity
ref_val = 202206
df['repeats'] = ref_val - df['first_purchase']

# create second dataframe with shell of just entity and months since first purchase
df2 = df[['entity','repeats']].drop_duplicates().set_index('entity')
df2 = df2.loc[df2.index.repeat(df2['repeats'])]
df2['months_since_first_purchase'] = df2.groupby(level=-1).cumcount()

# merge back and fill in the rest of the data
df2 = df2.reset_index().drop(columns=['repeats']).merge(df.drop(columns=['repeats']), 'left', on=['entity','months_since_first_purchase'])
df2['revenue_generated'] = df2['revenue_generated'].fillna(0)
df2['first_purchase'] = df2.groupby('entity')['first_purchase'].ffill()

  entity  months_since_first_purchase  revenue_generated  first_purchase
0      A                            0               15.0        202201.0
1      A                            1               20.0        202201.0
2      A                            2                0.0        202201.0
3      A                            3                0.0        202201.0
4      A                            4               40.0        202201.0
5      B                            0               80.0        202203.0
6      B                            1                0.0        202203.0
7      B                            2               60.0        202203.0
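
One caveat with the integer subtraction above (my note, not part of the original answer): ref_val - df['first_purchase'] only counts months correctly while both values fall in the same calendar year. If the data can span a year boundary, the repeat count can be derived from parsed dates instead, using the same year/month arithmetic as the pd.date_range answer below:

# month difference that also works across year boundaries
fp = pd.to_datetime(df['first_purchase'].astype(str), format='%Y%m')
ref = pd.Timestamp('2022-06-01')
df['repeats'] = (ref.year - fp.dt.year) * 12 + (ref.month - fp.dt.month)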

One approach is to use the pd.date_range method to generate the required missing months and then merge them back with the original dataset.

  • First, use the pd.date_range function to generate the missing dates up to now_date
now_date = '2022-05-01'
g = df.groupby(['entity']).agg({'first_purchase': 'min'})
g.loc[:, 'all_months'] = g.apply(lambda row: pd.date_range(row['first_purchase'], end=pd.to_datetime(now_date), freq='MS'), axis=1)
  • Next, explode these dates into separate rows and compute months_since_first_purchase
g_stacked = g.explode('all_months')
g_stacked.loc[:, 'months_since_first_purchase'] = (g_stacked['all_months'].dt.year - pd.to_datetime(g_stacked['first_purchase']).dt.year)*12 + (g_stacked['all_months'].dt.month - pd.to_datetime(g_stacked['first_purchase']).dt.month)
  • Finally, merge with the original dataset and fill in the blanks
g_stacked = g_stacked.set_index('months_since_first_purchase', append=True)
g_stacked = g_stacked.drop('first_purchase', axis=1)
df = df.set_index(['entity', 'months_since_first_purchase'])
df_new = g_stacked.join(df, how='left')
df_new.loc[:, 'revenue_generated'] = df_new['revenue_generated'].fillna(0)
df_new.loc[:, 'first_purchase'] = df_new['first_purchase'].ffill()
df_new = df_new.reset_index()
df_new

This is what the output looks like:

   entity  months_since_first_purchase all_months  revenue_generated first_purchase
0       A                            0 2022-01-01               20.0        2022-01
1       A                            1 2022-02-01                0.0        2022-01
2       A                            2 2022-03-01               60.0        2022-01
3       A                            3 2022-04-01               80.0        2022-01
4       A                            4 2022-05-01                0.0        2022-01
5       B                            0 2022-03-01               30.0        2022-03
6       B                            1 2022-04-01               10.0        2022-03
7       B                            2 2022-05-01               12.0        2022-03
8       G                            0 2022-01-01               25.0        2022-01
9       G                            1 2022-02-01               19.0        2022-01
10      G                            2 2022-03-01               90.0        2022-01
11      G                            3 2022-04-01                0.0        2022-01
12      G                            4 2022-05-01                0.0        2022-01
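
If the helper all_months column is not wanted in the final result, it can be dropped afterwards (my addition, assuming the original column order should be restored):

df_new = df_new.drop(columns='all_months')[
    ['entity', 'revenue_generated', 'first_purchase', 'months_since_first_purchase']]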

This should handle all of your data dynamically:

#Finds the Maximum Month since last purchase
df['Count'] = df.groupby('entity')['months_since_first_purchase'].transform(max)

#Creates a new df which has only the max entity and the max month since first purchase
df_counter = df[['entity', 'Count']].drop_duplicates()

#Creates a list the length of the months since first purchase
df_counter['Count'] = df_counter['Count'].apply(lambda x : ([1] + [1] * x))

#explodes the count column to get all possible numbers
df_counter = df_counter.explode('Count')

#Changes to a count instead of just the number 1
df_counter['Count'] = df_counter.groupby('entity')['Count'].cumcount()

#Joins the df_counter df to the main df with a left join so all missing values are just np.nan
df_final = pd.merge(df_counter, df, how = 'left', left_on = ['entity', 'Count'], right_on = ['entity', 'months_since_first_purchase'])

#Limits the columns selected
df_final = df_final[['entity', 'Count_x', 'revenue_generated', 'first_purchase']]

#Changes the column names
df_final.columns = ['entity', 'months_since_first_purchase', 'revenue_generated', 'first_purchase']

#Fills the np.nan's in revenue_generated with 0
df_final['revenue_generated'] = df_final['revenue_generated'].fillna(0)

#Forward fills all the data in the first purchase column to replace the np.nan's
df_final['first_purchase'] = df_final['first_purchase'].ffill()
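
A slightly more compact variant of the same repeat-and-count idea, building the 0..max counter directly instead of exploding a list of 1s (my variation, not part of the answer above):

df_counter = df.groupby('entity')['months_since_first_purchase'].max().reset_index(name='Count')
df_counter = df_counter.loc[df_counter.index.repeat(df_counter['Count'] + 1)]
df_counter['Count'] = df_counter.groupby('entity').cumcount()
# then merge onto df exactly as above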

# First, make a datetime index from the information we have.
df.index = pd.to_datetime(df.first_purchase.str.split('-').str[0] + ' ' +
               (df.first_purchase.str.split('-').str[1].astype(int) 
                + df.months_since_first_purchase).astype(str))

# Groupby entity, pick the desired columns and reindex each group
df = df.groupby('entity')[['revenue_generated', 'first_purchase', 'months_since_first_purchase']].apply(lambda x: x.reindex(pd.date_range('2022-01', periods=6, freq='MS')))

# Reset the index back to what we want
df = df.reset_index(-2).reset_index(drop=True)

# Do the fills
df.revenue_generated = df.revenue_generated.fillna(0)
df.first_purchase = df.groupby('entity')['first_purchase'].ffill()

# Drop the excess data
df = df.dropna(subset='first_purchase')

# Fix the months_since column
df.loc[:,'months_since_first_purchase'] = df.groupby('entity').cumcount().tolist()

Output:

entity revenue_generated first_purchase months_since_first_purchase
A 20 2022-01 0
A 0 2022-01 1
A 60 2022-01 2
A 80 2022-01 3
A 0 2022-01 4
A 15 2022-01 5
B 30 2022-03 0
B 10 2022-03 1
B 12 2022-03 2
B 0 2022-03 3
G 25 2022-01 0
G 19 2022-01 1
G 90 2022-01 2
G 0 2022-01 3
G 0 2022-01 4
G 0 2022-01 5
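
Note that the reindex above hard-codes pd.date_range('2022-01', periods=6, freq='MS'), i.e. a fixed January-June 2022 window. It could instead be derived from the data, computed before df is overwritten by the groupby/reindex step (my sketch; June 2022 as the observation month is an assumption):

obs_month = pd.Timestamp('2022-06-01')
full_range = pd.date_range(df['first_purchase'].min() + '-01', obs_month, freq='MS')
# ...then reindex each group with full_range instead of the hard-coded range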