Groupby and perform row-wise calculation using a custom function

Question

Following on from this question:

I have a pandas dataframe as follows:

col_1   col_2   col_3  col_4
a       X        5      1
a       Y        3      2
a       Z        6      4
b       X        7      8
b       Y        4      3
b       Z        6      5

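For each value in col_1, I want to apply a function to the values in col_3 and col_4 (and more columns) that correspond to X and Z in col_2, and use the results to create a new row. The output would look like this: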

col_1   col_2   col_3  col_4 
a       X        5      1
a       Y        3      2
a       Z        6      4
a       NEW      *      *
b       X        7      8
b       Y        4      3
b       Z        6      5
b       NEW      *      *

where * is the output of the function.

The original question (which only needed a simple addition) was answered with:

new = df[df.col_2.isin(['X', 'Z'])]\
  .groupby(['col_1'], as_index=False).sum()\
  .assign(col_2='NEW')

df = pd.concat([df, new]).sort_values('col_1')

I am now looking for a way to use a custom function, such as (X/Y) or ((X+Y)*2), instead of just X+Y. How can I modify this code to meet my new requirements?

Answer 1

I am not sure if this is what you are looking for, but here goes:

def f(x):
    y = x.values
    return y[0] / y[1] # replace with your function

And the change to new is:

new = (
    df[df.col_2.isin(['X', 'Z'])]
      .groupby(['col_1'], as_index=False)[['col_3', 'col_4']]
      .agg(f)
      .assign(col_2='NEW')
)

  col_1     col_3  col_4 col_2
0     a  0.833333   0.25   NEW
1     b  1.166667   1.60   NEW

df = pd.concat([df, new]).sort_values('col_1')

df
  col_1 col_2     col_3  col_4
0     a     X  5.000000   1.00
1     a     Y  3.000000   2.00
2     a     Z  6.000000   4.00
0     a   NEW  0.833333   0.25
3     b     X  7.000000   8.00
4     b     Y  4.000000   3.00
5     b     Z  6.000000   5.00
1     b   NEW  1.166667   1.60

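I am taking a leap of faith with f: it picks values by position, so it assumes the rows within each group are already sorted by col_2 (X before Z) when they reach the function. If that is not the case, an additional sort_values call:

df = df.sort_values(['col_1', 'col_2'])

should do the trick.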
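If relying on row order feels fragile, the X and Z rows can also be selected by label. The following is only a sketch of that idea, not the method from the answer above: it rebuilds the sample frame from the question, assumes each col_1 group has exactly one X row and one Z row, and uses X / Z as a stand-in for the custom function.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'col_1': list('aaabbb'),
    'col_2': list('XYZXYZ'),
    'col_3': [5, 3, 6, 7, 4, 6],
    'col_4': [1, 2, 4, 8, 3, 5],
})

# Index by (col_1, col_2) so rows are picked by label, not by position.
indexed = df.set_index(['col_1', 'col_2'])
x = indexed.xs('X', level='col_2')  # one row per col_1 value
z = indexed.xs('Z', level='col_2')

# Stand-in custom function: X / Z, aligned on col_1 for every value column.
new = (x / z).assign(col_2='NEW').reset_index()

# mergesort is stable, so each NEW row stays after its group's original rows.
result = (
    pd.concat([df, new], ignore_index=True)
      .sort_values('col_1', kind='mergesort')
      .reset_index(drop=True)
)
print(result)
```

Because x and z share the col_1 index, any element-wise expression such as (x + z) * 2 can replace x / z without touching the rest of the code.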

Answer 2

def foo(df):
    # Expand variables into dictionary.
    d = {v: df.loc[df['col_2'] == v, ['col_3', 'col_4']] for v in df['col_2'].unique()}

    # Example function: (X + Y ) * 2
    result = (d['X'].values + d['Y'].values) * 2

    # Convert result to a new dataframe row.
    result = result.tolist()[0]
    df_new = pd.DataFrame(
        {'col_1': [df['col_1'].iat[0]], 
         'col_2': ['NEW'], 
         'col_3': result[0],
         'col_4': result[1]})
    # Concatenate result with original dataframe for group and return.
    return pd.concat([df, df_new])

>>> df.groupby('col_1').apply(lambda x: foo(x)).reset_index(drop=True)
  col_1 col_2  col_3  col_4
0     a     X      5      1
1     a     Y      3      2
2     a     Z      6      4
3     a   NEW     16      6
4     b     X      7      8
5     b     Y      4      3
6     b     Z      6      5
7     b   NEW     22     22
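The same idea can be written so that the custom function is passed in as an argument. This is a hedged variation on foo, not part of the original answer: the helper name add_row and its parameters are hypothetical, and it assumes every group contains exactly one X row and one Y row. It iterates over the groups explicitly instead of calling apply.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'col_1': list('aaabbb'),
    'col_2': list('XYZXYZ'),
    'col_3': [5, 3, 6, 7, 4, 6],
    'col_4': [1, 2, 4, 8, 3, 5],
})

def add_row(group, func, value_cols=('col_3', 'col_4')):
    """Hypothetical helper: append one NEW row computed by func(X row, Y row)."""
    by_label = group.set_index('col_2')[list(value_cols)]
    # Each .loc lookup returns a Series indexed by the value column names.
    values = func(by_label.loc['X'], by_label.loc['Y'])
    new_row = {'col_1': group['col_1'].iat[0], 'col_2': 'NEW', **values.to_dict()}
    return pd.concat([group, pd.DataFrame([new_row])], ignore_index=True)

# Example function from the answer above: (X + Y) * 2
pieces = [add_row(g, lambda x, y: (x + y) * 2) for _, g in df.groupby('col_1')]
result = pd.concat(pieces, ignore_index=True)
print(result)
```

Swapping the lambda for another expression, for example lambda x, y: x / y, changes the calculation without touching the helper.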

Answer 3

A newer approach (which should offer performance benefits) is to use PyArrow and pandas_udf to support vectorized operations, as described for Spark 2.4 in: PySpark Usage Guide for Pandas with Apache Arrow
