Pandas apply() custom function using more than one column as "input"

Maybe this simple example will help you understand what I am trying to do:

import pandas as pd
df = pd.DataFrame({"A": [10,20,30,50,70,40], "B": [20,30,10,15,20,30]})


def _custom_function(X):    
    # whatever... just for the purpose of the example
    # but I need X to be the actual df and not a series

    Y = sum((X['A'] / X['B']) + (0.2 * X['B']))   
    return Y


df['C'] = df.rolling(2).apply(_custom_function, axis=0)

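When the custom function is called, X is a Series holding only the first column of df. Is it possible to pass the whole df through the apply function?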

Edit: rolling().apply() can be used, as here on a single column:

import pandas as pd
df = pd.DataFrame({"A": [10,20,30,50,70,40], "B": [20,30,10,15,20,30]})


def _custom_function(X):    
    # whatever... just for the purpose of the example
    Y = sum(0.2 * X)    
    return Y


df['C'] = df['A'].rolling(2).apply(_custom_function)

Second edit: the list comprehension over rolling does not behave as expected

for x in df.rolling(3):
    print(x)

As you can see in the example below, the two methods do not give the same output:

import pandas as pd
df = pd.DataFrame({"A": [10,20,30,50,70,40], "B": [20,30,10,15,20,30]})
df['C'] = 0.2


def _custom_function_df(X):    
    # whatever... just for the purpose of the example
    # but I need X to be the actual df and not a series
    Y = sum(X['C'] * X['B'])
    return Y

def _custom_function_series(X):    
    # whatever... just for the purpose of the example
    # but I need X to be the actual df and not a series
    Y = sum(0.2 * X)
    return Y


df['result'] = df['B'].rolling(3).apply(_custom_function_series)

df['result2'] = [x.pipe(_custom_function_df) for x in df.rolling(3, min_periods=3)]

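The list comprehension over rolling produces values for the first rows (without the expected NaN) and only starts rolling correctly once len(x) = 3, the size of the rolling window.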

Thanks in advance!

Pass the DataFrame to the function:

df['C'] = _custom_function(df)

Or use DataFrame.pipe:

df['C'] = df.pipe(_custom_function)

print (df)
    A   B         C
0  10  20  4.500000
1  20  30  6.666667
2  30  10  5.000000
3  50  15  6.333333
4  70  20  7.500000
5  40  30  7.333333
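
Note that the values printed above (4.5, 6.666667, ...) match the row-wise expression without the sum() call; with sum(), the function returns a single scalar that is broadcast to every row. A minimal sketch of the row-wise variant, where `_row_wise` is a name introduced here for illustration and is not part of the question:

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30, 50, 70, 40], "B": [20, 30, 10, 15, 20, 30]})


def _row_wise(X):
    # Same expression as in the question, but without sum(),
    # so the result is a Series with one value per row.
    return (X["A"] / X["B"]) + (0.2 * X["B"])


df["C"] = df.pipe(_row_wise)
print(df)
```

Because the arithmetic is vectorized over whole columns, no apply() is needed for this case.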

Edit: Rolling.apply processes each column separately, so it cannot be used here.
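
One way to check this is to record what the function actually receives. The sketch below uses a hypothetical probe function (`_probe` and `seen` are names introduced here) and assumes the default `raw=False`, under which each window is passed as a Series:

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30, 50, 70, 40], "B": [20, 30, 10, 15, 20, 30]})

seen = []  # type names of the objects handed to the function


def _probe(window):
    # Record the type of each window, then return a plain sum
    seen.append(type(window).__name__)
    return window.sum()


out = df.rolling(2).apply(_probe)

print(set(seen))          # the kinds of objects the function saw
print(list(out.columns))  # one result column per input column
```

The function never sees both columns at once, which is why an expression like X['A'] / X['B'] cannot work inside Rolling.apply.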

Possible solution:

df['C'] = [x.pipe(_custom_function) for x in df.rolling(2)]
print (df)
    A   B          C
0  10  20   4.500000
1  20  30  11.166667
2  30  10  11.666667
3  50  15  11.333333
4  70  20  13.833333
5  40  30  14.833333
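
For this particular function, which is just a sum of a row-wise expression, the same numbers can likely be obtained without iterating over windows: compute the per-row term once and take a rolling sum. This is a sketch and not the method used above; `row_term` is a name introduced here, and `min_periods=1` is chosen only to reproduce the partial first window in the output above.

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30, 50, 70, 40], "B": [20, 30, 10, 15, 20, 30]})

# Per-row term of the example function (hypothetical name, not from the question)
row_term = df["A"] / df["B"] + 0.2 * df["B"]

# Rolling sum over 2 rows; min_periods=1 keeps the partial first window
df["C"] = row_term.rolling(2, min_periods=1).sum()
print(df)
```

This only works when the custom function can be split into a per-row step followed by a standard rolling aggregation; a function that mixes columns across rows in other ways still needs the window-by-window approach.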

Edit: This looks like a bug: iterating over rolling behaves as if min_periods=1.
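
A quick way to see this is to print the length of each window produced by iteration, with and without `min_periods`. This is a sketch that assumes a pandas version where Rolling objects are iterable (1.1 or later, as far as I know); the variable names are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30, 50, 70, 40], "B": [20, 30, 10, 15, 20, 30]})

# Length of every window yielded by iterating over the Rolling object
lengths_default = [len(w) for w in df.rolling(3)]
lengths_min3 = [len(w) for w in df.rolling(3, min_periods=3)]

print(lengths_default)
print(lengths_min3)
```

If both lists start with windows shorter than 3, iteration is ignoring `min_periods`, and the short windows have to be filtered out by hand.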

Here is a solution (hack):

import numpy as np

df['result'] = df['B'].rolling(3).apply(_custom_function_series)

# Only evaluate full windows; shorter leading windows become NaN
df['result2'] = [x.pipe(_custom_function_df) if len(x) == 3 else np.nan for x in df.rolling(3)]

print (df)
    A   B    C  result  result2
0  10  20  0.2     NaN      NaN
1  20  30  0.2     NaN      NaN
2  30  10  0.2    12.0     12.0
3  50  15  0.2    11.0     11.0
4  70  20  0.2     9.0      9.0
5  40  30  0.2    13.0     13.0
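
If this pattern is needed in several places, the length check can be wrapped in a small helper. The sketch below is one possible way to do that; `rolling_frame_apply` is a hypothetical name and not a pandas function, and it assumes an integer window size.

```python
import numpy as np
import pandas as pd


def rolling_frame_apply(frame, window, func):
    # Call func on each full-size sub-DataFrame; shorter leading windows give NaN.
    # Hypothetical helper, not part of pandas.
    values = [func(w) if len(w) == window else np.nan for w in frame.rolling(window)]
    return pd.Series(values, index=frame.index)


df = pd.DataFrame({"A": [10, 20, 30, 50, 70, 40], "B": [20, 30, 10, 15, 20, 30]})
df["C"] = 0.2

res = rolling_frame_apply(df, 3, lambda w: (w["C"] * w["B"]).sum())
df["result2"] = res
print(df)
```

Returning a Series aligned to the original index makes the assignment safe even when the DataFrame does not have a default RangeIndex.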