pandas groupby then filter by date to get mean
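Using a pandas DataFrame, I am trying to get, for each row, the mean number of purchases over the previous 90 days for the same CustId (excluding the current row itself), and then add it as a new column, "PurchaseMeanLast90Days".
Here is the code I tried, which does not work: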
group = df.groupby(['CustId'])
df['PurchaseMeanLast90Days'] = group.apply(lambda g: g[g['Date'] > (pd.DatetimeIndex(g['Date']) + pd.DateOffset(-90))])['Purchases'].mean()
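Here is my data:
Index | CustId | Date | Purchases |
---|---|---|---|
0 | 1 | 1/01/2021 | 5 |
1 | 1 | 1/12/2021 | 1 |
2 | 1 | 3/28/2021 | 2 |
3 | 1 | 4/01/2021 | 4 |
4 | 1 | 4/20/2021 | 2 |
5 | 1 | 5/01/2021 | 5 |
6 | 2 | 1/01/2021 | 1 |
7 | 2 | 2/01/2021 | 1 |
8 | 2 | 3/01/2021 | 2 |
9 | 2 | 4/01/2021 | 3 |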
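For example, the row with index 5 would include the following rows in its mean(), which gives (2 + 4 + 2) / 3 = 2.67:
Index | CustId | Date | Purchases |
---|---|---|---|
2 | 1 | 3/28/2021 | 2 |
3 | 1 | 4/01/2021 | 4 |
4 | 1 | 4/20/2021 | 2 |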
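The rule described above can be checked one row at a time with a plain boolean filter. The following is only a sketch of that definition, not an efficient solution: the helper name `mean_before` is made up for illustration, and the window is assumed to include a row dated exactly 90 days earlier.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame(
    {
        "CustId": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
        "Date": [
            "1/01/2021", "1/12/2021", "3/28/2021", "4/01/2021", "4/20/2021",
            "5/01/2021", "1/01/2021", "2/01/2021", "3/01/2021", "4/01/2021",
        ],
        "Purchases": [5, 1, 2, 4, 2, 5, 1, 1, 2, 3],
    }
)
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")


def mean_before(frame, idx, days=90):
    """Mean of Purchases for the same customer in the `days` days up to
    row `idx`, excluding row `idx` itself (hypothetical helper)."""
    row = frame.loc[idx]
    start = row["Date"] - pd.Timedelta(days=days)
    mask = (
        (frame["CustId"] == row["CustId"])
        & (frame["Date"] >= start)       # assumption: the 90th day is included
        & (frame["Date"] <= row["Date"])
        & (frame.index != idx)           # leave out the current row
    )
    return frame.loc[mask, "Purchases"].mean()


row5_mean = mean_before(df, 5)
print(round(row5_mean, 2))  # 2.67, the mean of rows 2, 3 and 4
```

For a row with no earlier purchases in the window, the filter selects nothing and `mean()` returns NaN, so a default such as 0 would have to be filled in separately.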
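The new DataFrame would look like this (I did not calculate the values for CustId=2):
Index | CustId | Date | Purchases | PurchaseMeanLast90Days |
---|---|---|---|---|
0 | 1 | 1/01/2021 | 5 | 0 |
1 | 1 | 1/12/2021 | 1 | 5 |
2 | 1 | 3/28/2021 | 2 | 3 |
3 | 1 | 4/01/2021 | 4 | 2.67 |
4 | 1 | 4/20/2021 | 2 | 3.0 |
5 | 1 | 5/01/2021 | 5 | 2.67 |
6 | 2 | 1/01/2021 | 1 | ... |
7 | 2 | 2/01/2021 | 1 | ... |
8 | 2 | 3/01/2021 | 2 | ... |
9 | 2 | 4/01/2021 | 3 | ... |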
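You can do a rolling calculation:
import pandas as pd

# Parse the month-first date strings so that a time-based window can be used.
df["Date"] = pd.to_datetime(df["Date"], dayfirst=False)
# For each customer, take the 90-day window ending at the current row
# (closed="both" also includes a row dated exactly 90 days earlier), then
# average every value in the window except the last one, the current row.
df["PurchaseMeanLast90Days"] = (
    (
        df.groupby("CustId")
        .rolling("90D", min_periods=1, on="Date", closed="both")["Purchases"]
        .apply(lambda x: x.shift(1).sum() / (len(x) - 1))
    )
    .fillna(0)  # a window holding only the current row gives 0 / 0 = NaN
    .values  # positional assignment: assumes df is already ordered by CustId
)
print(df)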
Prints:
   Index  CustId       Date  Purchases  PurchaseMeanLast90Days
0      0       1 2021-01-01          5                0.000000
1      1       1 2021-01-12          1                5.000000
2      2       1 2021-03-28          2                3.000000
3      3       1 2021-04-01          4                2.666667
4      4       1 2021-04-20          2                3.000000
5      5       1 2021-05-01          5                2.666667
6      6       2 2021-01-01          1                0.000000
7      7       2 2021-02-01          1                1.000000
8      8       2 2021-03-01          2                1.000000
9      9       2 2021-04-01          3                1.333333
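The same result can probably be reached without a Python-level `apply`, by taking the rolling sum and count of each window and then removing the current row's contribution. This is a sketch rather than a tested drop-in replacement: the names `rolled`, `prev_sum` and `prev_count` are illustrative, and it assumes the rows are sorted by CustId and Date so that the grouped result lines up positionally with `df`.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame(
    {
        "CustId": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
        "Date": [
            "1/01/2021", "1/12/2021", "3/28/2021", "4/01/2021", "4/20/2021",
            "5/01/2021", "1/01/2021", "2/01/2021", "3/01/2021", "4/01/2021",
        ],
        "Purchases": [5, 1, 2, 4, 2, 5, 1, 1, 2, 3],
    }
)
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# Assumption: sorting makes the group-ordered result match df row by row.
df = df.sort_values(["CustId", "Date"]).reset_index(drop=True)

# 90-day window per customer, both endpoints included, current row included.
rolled = df.groupby("CustId").rolling("90D", on="Date", closed="both")["Purchases"]
window_sum = rolled.sum().to_numpy()
window_count = rolled.count().to_numpy()

# Remove the current row from both the sum and the count.
prev_sum = pd.Series(window_sum - df["Purchases"].to_numpy(), index=df.index)
prev_count = pd.Series(window_count - 1, index=df.index)

# 0 / 0 becomes NaN when no earlier row is in the window; report that as 0.
df["PurchaseMeanLast90Days"] = (prev_sum / prev_count).fillna(0)
print(df)
```

Whether this is noticeably faster than the `apply` version depends on the size of the data; on ten rows the difference is irrelevant.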