Python/Pandas - Creating a new Column showing Average of only the Largest value for each group
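I'm working with a dataset and trying to create a new column that shows, on every row, an average for each ID label. The average should be based only on the last row of each ID group, which holds the largest number in that group. An example is below.
My current dataset: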
ID Date DaysInDuration
NCA 11/19/2019 31
NCA 12/19/2019 62
NCA 12/19/2019 92
NCA 1/19/2020 120 * Last Row
DTT 11/19/2019 31
DTT 12/19/2019 62
DTT 12/19/2019 92
DTT 1/19/2020 100 * Last Row
This is what I'm trying to create:
ID Date DaysInDuration AverageDurColumn *is only based off last row numb
NCA 11/19/2019 31 30
NCA 12/19/2019 62 30
NCA 12/19/2019 92 30
NCA 1/19/2020 120 * Last Row 30
DTT 11/19/2019 31 25
DTT 12/19/2019 62 25
DTT 12/19/2019 92 25
DTT 12/29/2020 100 * Last Row 25
Thanks to anyone who can help!
We can use GroupBy.transform with last and size here:
grp = df.groupby('ID')
last = grp['DaysInDuration'].transform('last')
n = grp['DaysInDuration'].transform('size')
df['AverageDurColumn'] = last / n
ID Date DaysInDuration AverageDurColumn
0 NCA 11/19/2019 31 30.0
1 NCA 12/19/2019 62 30.0
2 NCA 12/19/2019 92 30.0
3 NCA 1/19/2020 120 30.0
4 DTT 11/19/2019 31 25.0
5 DTT 12/19/2019 62 25.0
6 DTT 12/19/2019 92 25.0
7 DTT 1/19/2020 100 25.0
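For anyone who wants to run this end to end, here is a minimal self-contained sketch. It assumes the sample data from the question is typed in by hand as a DataFrame (the construction of `df` is my assumption; the question does not show how the data is loaded) and that the rows of each ID are already ordered so the last row holds the largest value.

```python
import pandas as pd

# Sample data from the question, entered by hand (assumed construction).
df = pd.DataFrame({
    "ID": ["NCA"] * 4 + ["DTT"] * 4,
    "Date": ["11/19/2019", "12/19/2019", "12/19/2019", "1/19/2020"] * 2,
    "DaysInDuration": [31, 62, 92, 120, 31, 62, 92, 100],
})

grp = df.groupby("ID")["DaysInDuration"]

# transform broadcasts the per-group result back onto every row of the group:
# last value of the group divided by the number of rows in the group.
df["AverageDurColumn"] = grp.transform("last") / grp.transform("size")

print(df)
```

Because both `transform` calls return a Series aligned with the original index, the division is row-wise and no merge is needed.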
You can use groupby, apply, and merge:
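# Compute one value per ID (group maximum / group row count), then merge it back.
new_df = df.merge(
    df
    .groupby('ID')['DaysInDuration']
    .apply(lambda s: s.max() / len(s))
    .reset_index(name='AverageDurColumn'),
    how='left',  # a left merge keeps the original row order
    on='ID',
)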
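Try:
import numpy as np

# Keep DaysInDuration only on the last row of each consecutive ID block; 0 elsewhere.
df["AverageDurColumn"] = np.where(df["ID"].ne(df["ID"].shift(-1)), df["DaysInDuration"], 0)
df = df.set_index("ID")
# The per-ID mean of that column equals last value / row count; the assignment aligns on the ID index.
df["AverageDurColumn"] = df.groupby("ID")["AverageDurColumn"].mean()
df = df.reset_index()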
Output:
ID Date DaysInDuration AverageDurColumn
0 NCA 11/19/2019 31 30
1 NCA 12/19/2019 62 30
2 NCA 12/19/2019 92 30
3 NCA 1/19/2020 120 30
4 DTT 11/19/2019 31 25
5 DTT 12/19/2019 62 25
6 DTT 12/19/2019 92 25
7 DTT 1/19/2020 100 25
Another solution:
df["AverageDurColumn"]=df.groupby("ID").DaysInDuration.transform(lambda s: s.iloc[-1]/s.size)
Here's a simple answer:
df['answer'] = df.groupby('ID')['DaysInDuration'].transform(lambda x: x.max()/x.count())
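I simply restated your question as "How do I take the maximum value per ID and divide it by the number of records that ID has?"
1. First, group by ID
2. Get the maximum value for each ID
3. Divide by the number of records for that ID
4. Use transform to apply the result to every row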
ID Date DaysInDuration answer
0 NCA 11/19/2019 31 30
1 NCA 12/19/2019 62 30
2 NCA 12/19/2019 92 30
3 NCA 1/19/2020 120 30
4 DTT 11/19/2019 31 25
5 DTT 12/19/2019 62 25
6 DTT 12/19/2019 92 25
7 DTT 1/19/2020 100 25
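The answers in this thread give the same result on the sample data, but they are not interchangeable in general: `iloc[-1]` / `'last'` take the last row while `max()` takes the largest value, and `size` counts every row while `count()` skips missing values. The sketch below illustrates the difference on hypothetical data (a single made-up group "A" that is unsorted and contains a NaN, not taken from the question).

```python
import numpy as np
import pandas as pd

# Hypothetical data: one group that is NOT sorted and has a missing value.
df = pd.DataFrame({
    "ID": ["A", "A", "A", "A"],
    "DaysInDuration": [120.0, 31.0, np.nan, 62.0],
})
s = df.groupby("ID")["DaysInDuration"]

# Last row divided by number of rows (NaN rows included): 62 / 4
last_over_size = s.transform(lambda x: x.iloc[-1] / x.size)

# Largest value divided by number of non-missing rows: 120 / 3
max_over_count = s.transform(lambda x: x.max() / x.count())

# Largest value divided by number of rows (NaN rows included): 120 / 4
max_over_size = s.transform("max") / s.transform("size")

print(last_over_size.iloc[0], max_over_count.iloc[0], max_over_size.iloc[0])
```

If the last row is guaranteed to hold the maximum and there are no missing values, any of the variants should do; otherwise pick the one that matches what "average" is meant to be.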