Python/Pandas - Creating a new column showing the average of only the largest value for each group

Question

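I'm working with a dataset and trying to create a new column that shows, on every row, an average for that row's ID. The average should be based only on the last row of each ID group, which holds the group's largest number (that is, the last value divided by the number of rows in the group). An example follows.

My current dataset: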

    ID      Date        DaysInDuration
    NCA   11/19/2019        31                 
    NCA   12/19/2019        62              
    NCA   12/19/2019        92             
    NCA   1/19/2020         120 * Last Row
    DTT   11/19/2019        31                 
    DTT   12/19/2019        62              
    DTT   12/19/2019        92             
    DTT   1/19/2020         100 * Last Row

This is what I'm trying to create:

    ID      Date        DaysInDuration          AverageDurColumn * based only on the last row's number
    NCA   11/19/2019        31                        30
    NCA   12/19/2019        62                        30
    NCA   12/19/2019        92                        30
    NCA   1/19/2020         120 * Last Row            30
    DTT   11/19/2019        31                        25
    DTT   12/19/2019        62                        25
    DTT   12/19/2019        92                        25
    DTT   1/19/2020         100 * Last Row            25

Thanks to anyone who can help!

Answer 1

We can use GroupBy.transform with last and size here:

grp = df.groupby('ID')
last = grp['DaysInDuration'].transform('last')
n = grp['DaysInDuration'].transform('size')

df['AverageDurColumn'] = last / n

    ID        Date  DaysInDuration  AverageDurColumn
0  NCA  11/19/2019              31              30.0
1  NCA  12/19/2019              62              30.0
2  NCA  12/19/2019              92              30.0
3  NCA   1/19/2020             120              30.0
4  DTT  11/19/2019              31              25.0
5  DTT  12/19/2019              62              25.0
6  DTT  12/19/2019              92              25.0
7  DTT   1/19/2020             100              25.0
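
For anyone who wants to reproduce this, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the sample in the question, and it assumes the rows of each ID are already in order so that `last` picks up the largest value.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "ID": ["NCA"] * 4 + ["DTT"] * 4,
    "Date": ["11/19/2019", "12/19/2019", "12/19/2019", "1/19/2020"] * 2,
    "DaysInDuration": [31, 62, 92, 120, 31, 62, 92, 100],
})

grp = df.groupby("ID")["DaysInDuration"]
# Broadcast each group's last value and row count back onto every row
df["AverageDurColumn"] = grp.transform("last") / grp.transform("size")

print(df)
```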

Answer 2

You can use groupby, apply, and merge:

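new_df = df.merge(
  df
  .groupby('ID')['DaysInDuration']
  # per ID: largest value divided by the number of rows
  .apply(lambda x: x.max() / len(x))
  # name the result so the merged column is not called DaysInDuration
  .reset_index(name='AverageDurColumn'),
  how='outer',
  on='ID',
)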
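
To see what is being merged, it may help to look at the per-ID table on its own. This is a small sketch with the sample data rebuilt by hand; `per_id` is just an illustrative name, and it uses a left merge, which keeps the rows of `df` in their original order.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["NCA"] * 4 + ["DTT"] * 4,
    "DaysInDuration": [31, 62, 92, 120, 31, 62, 92, 100],
})

# One row per ID: the group's maximum divided by its number of rows
per_id = (
    df.groupby("ID")["DaysInDuration"]
    .agg(lambda s: s.max() / len(s))
    .reset_index(name="AverageDurColumn")
)
print(per_id)

# Attach the per-ID value to every row that has the same ID
new_df = df.merge(per_id, how="left", on="ID")
print(new_df)
```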

Answer 3

Try:

import numpy as np

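# Keep DaysInDuration on the last row of each consecutive ID run, 0 elsewhere
df["AverageDurColumn"]=np.where(df["ID"].ne(df["ID"].shift(-1)), df["DaysInDuration"], 0)

# The per-ID mean (last value / row count) aligns back onto the rows via the ID index
df=df.set_index("ID")
df["AverageDurColumn"]=df.groupby("ID")["AverageDurColumn"].mean()
df=df.reset_index()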

Output:

    ID        Date  DaysInDuration  AverageDurColumn
0  NCA  11/19/2019              31                30
1  NCA  12/19/2019              62                30
2  NCA  12/19/2019              92                30
3  NCA   1/19/2020             120                30
4  DTT  11/19/2019              31                25
5  DTT  12/19/2019              62                25
6  DTT  12/19/2019              92                25
7  DTT   1/19/2020             100                25
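
The first step relies on comparing each ID with the one on the next row, which flags the last row of every consecutive run of the same ID. Here is a minimal sketch of just that mask, assuming (as in the question) that rows with the same ID sit next to each other; `is_last`, `kept`, and `group_means` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["NCA"] * 4 + ["DTT"] * 4,
    "DaysInDuration": [31, 62, 92, 120, 31, 62, 92, 100],
})

# shift(-1) moves every ID up one row, so a mismatch marks the end of a run;
# the final row is compared with NaN and is therefore flagged as well
is_last = df["ID"].ne(df["ID"].shift(-1))
print(df[is_last])

# Keeping the last value and zero elsewhere makes the group mean equal last / n
kept = df["DaysInDuration"].where(is_last, 0)
group_means = kept.groupby(df["ID"]).mean()
print(group_means)
```

If the same ID can appear in several separate runs, the mask flags the end of each run, so the result would no longer be "last value divided by row count".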

Answer 4

Another solution:

df["AverageDurColumn"]=df.groupby("ID").DaysInDuration.transform(lambda s: s.iloc[-1]/s.size)
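
Note that `s.iloc[-1]` takes whatever value sits in the last row of the group, which equals the maximum only when each group is ordered by `DaysInDuration`. A small sketch with deliberately shuffled, made-up rows shows the difference; the column names `from_last` and `from_max` are illustrative.

```python
import pandas as pd

# Hypothetical data where the largest value is NOT in the last row
df = pd.DataFrame({
    "ID": ["NCA", "NCA", "NCA", "NCA"],
    "DaysInDuration": [120, 31, 92, 62],
})

by_id = df.groupby("ID")["DaysInDuration"]
df["from_last"] = by_id.transform(lambda s: s.iloc[-1] / s.size)  # 62 / 4
df["from_max"] = by_id.transform(lambda s: s.max() / s.size)      # 120 / 4
print(df)
```

If the row order is not guaranteed, sorting first or using `max` avoids the mismatch.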

Answer 5

Here is a simple answer:

df['answer'] = df.groupby('ID')['DaysInDuration'].transform(lambda x: x.max()/x.count())

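I simply restated your question as "How do I take the maximum value per ID and divide it by the number of records that ID has?"

1. First, group by ID.

2. Get the maximum value for each ID.

3. Divide it by the number of records for that ID.

4. Use transform to apply the result to every row.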

    ID        Date  DaysInDuration  answer
0  NCA  11/19/2019              31      30
1  NCA  12/19/2019              62      30
2  NCA  12/19/2019              92      30
3  NCA   1/19/2020             120      30
4  DTT  11/19/2019              31      25
5  DTT  12/19/2019              62      25
6  DTT  12/19/2019              92      25
7  DTT   1/19/2020             100      25
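
The four steps can also be written out one at a time with built-in aggregation names instead of a lambda. This is only a sketch on the question's sample data; `group_max` and `group_n` are illustrative names. Note that `count` skips missing values while `size` counts every row, so the two differ if `DaysInDuration` contains NaN.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["NCA"] * 4 + ["DTT"] * 4,
    "DaysInDuration": [31, 62, 92, 120, 31, 62, 92, 100],
})

grouped = df.groupby("ID")["DaysInDuration"]  # 1. group by ID
group_max = grouped.transform("max")          # 2. maximum per ID, repeated on each row
group_n = grouped.transform("count")          # 3. non-null record count per ID, per row
df["answer"] = group_max / group_n            # 4. row-aligned result
print(df)
```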
