Aggregation with a lambda function for the last 30 days with Python
I'm trying to get a column called 'sales_30d_lag' containing, for each user_id, the total sales over the 30 days preceding that user's most recent 'Date'. When I run this code I get a result, but when I merge it with the original dataframe on user_id, the 'sales_30d_lag' column shows NaN values. What's wrong?
df_30d_lag = (df.groupby(['user_ID'])
              .apply(lambda g: g[g['Date'] >= g['Date'].max() - pd.to_timedelta(30, unit='d')]
                     .agg({'sales': 'sum'}))
              .rename(columns={'sales': 'sales_30d_lag'}))
Hard to guess without a data sample (and the merge code); the lambda itself looks fine. I tested it on this dataset:
import pandas as pd
from io import StringIO
data = """user_ID,Date,sales
1,2012-09-01 10:00:00,10.0
1,2012-09-02 11:00:00,10.0
1,2012-09-03 12:00:00,10.0
1,2012-10-01 13:00:00,10.0
1,2012-10-02 14:00:00,10.0
1,2012-10-03 15:00:00,10.0
1,2012-10-04 16:00:00,10.0
1,2012-11-01 17:00:00,10.0
2,2012-09-01 18:00:00,20.0
2,2012-09-02 19:00:00,20.0
2,2012-09-03 20:00:00,20.0
2,2012-09-04 21:00:00,20.0
2,2012-09-05 22:00:00,20.0
2,2012-09-06 23:00:00,
3,2012-09-06 23:00:00,30.0"""
df = pd.read_csv(StringIO(data), engine="python", parse_dates=["Date"])
The code gives the correct result:
df_30d_lag = (df.groupby(['user_ID'])
              .apply(lambda g: g[g['Date'] >= g['Date'].max() - pd.to_timedelta(30, unit='d')]
                     .agg({'sales': 'sum'}))
              .rename(columns={'sales': 'sales_30d_lag'}))
# sales_30d_lag
#user_ID
#1 30.0
#2 100.0
#3 30.0
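As a sanity check on the 30.0 for user 1: the cutoff is that group's max Date minus 30 days, so only rows at or after 2012-10-02 17:00 count (a quick check of just the cutoff arithmetic):

```python
import pandas as pd

# Cutoff for user 1: max Date (2012-11-01 17:00) minus 30 days
cutoff = pd.Timestamp("2012-11-01 17:00:00") - pd.to_timedelta(30, unit="d")
print(cutoff)  # 2012-10-02 17:00:00
```

Only three of user 1's rows (10-03, 10-04, 11-01) fall on or after that cutoff, hence 3 × 10.0 = 30.0; the 10-02 14:00 row misses it by three hours.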
Perhaps the merge itself is the problem: df_30d_lag is indexed by user_ID. To merge it, you must either reset the index and merge on the user_ID column, or do something like this:
df.merge(df_30d_lag, left_on='user_ID', right_index=True)
# user_ID Date sales sales_30d_lag
#0 1 2012-09-01 10:00:00 10.0 30.0
#1 1 2012-09-02 11:00:00 10.0 30.0
#2 1 2012-09-03 12:00:00 10.0 30.0
#3 1 2012-10-01 13:00:00 10.0 30.0
#4 1 2012-10-02 14:00:00 10.0 30.0
#5 1 2012-10-03 15:00:00 10.0 30.0
#6 1 2012-10-04 16:00:00 10.0 30.0
#7 1 2012-11-01 17:00:00 10.0 30.0
#8 2 2012-09-01 18:00:00 20.0 100.0
#9 2 2012-09-02 19:00:00 20.0 100.0
#10 2 2012-09-03 20:00:00 20.0 100.0
#11 2 2012-09-04 21:00:00 20.0 100.0
#12 2 2012-09-05 22:00:00 20.0 100.0
#13 2 2012-09-06 23:00:00 NaN 100.0
#14 3 2012-09-06 23:00:00 30.0 30.0
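The reset-index variant mentioned above would look like this (a sketch on a trimmed-down stand-in for the two frames, not the full sample data):

```python
import pandas as pd

# Minimal stand-ins for df and the aggregated df_30d_lag
df = pd.DataFrame({
    "user_ID": [1, 2],
    "sales": [10.0, 20.0],
})
df_30d_lag = pd.DataFrame(
    {"sales_30d_lag": [30.0, 100.0]},
    index=pd.Index([1, 2], name="user_ID"),
)

# Turn the index back into a regular column, then merge on it
merged = df.merge(df_30d_lag.reset_index(), on="user_ID")
print(merged)
```

Both spellings produce the same result; `right_index=True` just skips the intermediate `reset_index` step.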
If that's not it, please add a data sample so we can reproduce it better.