Is there a more efficient way to bring total .count() values of specific value on each row in Dataframe? (without merge, with lambda preferably)

Question

I'm new here and this is my first post, so please bear with me :).

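I want to add a new column to my df that holds, for each row, the total number of contract numbers in ["Child_contract"] (unique numbers) that fall under that row's ("Parent_contract") (repeated numbers).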

The statement below does the job, but it takes quite a long time to run on my current df.

df["Total_count"] = df.apply(lambda x: df.groupby("Parent_contract")["Child_contract"].count().to_frame().loc[x["Parent_contract"]],axis=1)

Any reply is much appreciated. To be clear, I want to modify the df, not filter it.

Answer 1

You can speed up your current solution by changing it to:

fr = df.groupby("Parent_contract")["Child_contract"].count().to_frame()
df["Total_count"] = df.apply(lambda x: fr.loc[x["Parent_contract"]],axis=1)

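This gives a speedup of roughly 3x on a 10-row frame, and considerably more on larger frames (see the timings below).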

Explanation

In your current solution, df.groupby("Parent_contract")["Child_contract"].count().to_frame() is recomputed for every row.
Computing it once up front and assigning it to fr avoids this repeated work.
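
As a minimal sketch of the same idea taken one step further (this is not part of the benchmark in this answer): once the per-parent counts have been computed a single time, each row can look its count up with Series.map, or groupby can broadcast the count back onto the rows with transform("count"). The small df and the column name Total_count_transform below are made up for illustration.

```python
import pandas as pd

# Made-up sample data: p1 has 2 children, p2 has 3.
df = pd.DataFrame({
    "Parent_contract": ["p1", "p1", "p2", "p2", "p2"],
    "Child_contract": ["c1", "c2", "c3", "c4", "c5"],
})

# Compute the per-parent counts once, outside any per-row function.
counts = df.groupby("Parent_contract")["Child_contract"].count()

# Option 1: look up each row's parent in the precomputed Series.
df["Total_count"] = df["Parent_contract"].map(counts)

# Option 2: let groupby broadcast the group count back onto every row.
df["Total_count_transform"] = (
    df.groupby("Parent_contract")["Child_contract"].transform("count")
)

print(df)
```

Both options modify df in place by adding a column and use neither merge nor a per-row apply, so they would be expected to scale better than a row-wise lambda, though that is not measured here.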

Performance

Test code

import pandas as pd
from random import randint

def generate_data(min_rows):
    ' Generate dataframe with parent and child columns '
    parent_idx, child_idx = 1, 1

    d = {'Parent_contract': [],
         'Child_contract': []}

    # Each parent gets between 1 and 5 children; stop once min_rows is reached.
    while len(d['Parent_contract']) < min_rows:
        add = randint(1, 5)
        for _ in range(add):
            d['Parent_contract'].append(f"p{parent_idx}")
        parent_idx += 1
        for _ in range(add):
            d['Child_contract'].append(f"c{child_idx}")
            child_idx += 1

    return pd.DataFrame(d)

def posted_method(df):
    ' Posted method '
    df["Total_count"] = df.apply(lambda x: df.groupby("Parent_contract")["Child_contract"].count().to_frame().loc[x["Parent_contract"]],axis=1)
    return df
                                 
def suggested_method(df):
    ' Suggested Method '
    fr = df.groupby("Parent_contract")["Child_contract"].count().to_frame()
    df["Total_count"] = df.apply(lambda x: fr.loc[x["Parent_contract"]],axis=1)
    return df

Timing (using timeit)

nRows    Posted Method   Suggested Method   Speed Up
10        18.1 ms            6.37 ms          2.8
1,000     2.05 s             289 ms           7.1
10,000    1 min 5 s          3.01 s           21.5
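
The exact timeit invocation behind these numbers is not shown, so the following is only a sketch of how a comparable measurement might be set up with the standard timeit module. The helper name make_df, the row count, and the repeat count are assumptions; absolute timings will differ by machine and pandas version.

```python
import timeit
from random import randint

import pandas as pd


def make_df(min_rows):
    """Build a frame shaped like the test data: 1-5 children per parent."""
    parents, children = [], []
    parent_idx = child_idx = 1
    while len(parents) < min_rows:
        for _ in range(randint(1, 5)):
            parents.append(f"p{parent_idx}")
            children.append(f"c{child_idx}")
            child_idx += 1
        parent_idx += 1
    return pd.DataFrame({"Parent_contract": parents,
                         "Child_contract": children})


def suggested_method(df):
    """Suggested method: group once, then look up per row."""
    fr = df.groupby("Parent_contract")["Child_contract"].count().to_frame()
    df["Total_count"] = df.apply(lambda x: fr.loc[x["Parent_contract"]], axis=1)
    return df


df = make_df(1_000)
runs = 3  # assumed repeat count, chosen only to keep the sketch quick
# Time on a copy so every run starts from the same, unmodified frame.
elapsed = timeit.timeit(lambda: suggested_method(df.copy()), number=runs) / runs
print(f"suggested_method: {elapsed * 1000:.1f} ms per call on {len(df)} rows")
```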
