Optimising applying an if function to a dataframe: am I doing it the slow way? (Python, Pandas)

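This is my first question in a long time, since I have only recently started working with Python. I've been cleaning and preparing some data with pandas, and I've found that applying a function to a smaller sample (500,000 rows) of the total data (~30,000,000 rows) takes a very long time: about 8 minutes for this particular block of my code. My thinking is that I've written something that works but isn't optimal for what I want to do, and that it will become a very lengthy process when applied to the whole dataset. I'm not entirely sure, but I think running this sort of thing in a program like Alteryx would be much faster, so I figure I must be doing something wrong. Any help or ideas for speeding this up would be much appreciated!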

Example dataframe:

import pandas as pd
po_data = pd.DataFrame({'Order Quantity Received Type':['Order Cancelled - None Received','Order Partially Fulfilled'],'Order Quantity Change Type':['Order Cancelled','Increased'],'Received Quantity':[0,3],'Current Order Quantity':[0,5]})

Function:

def order_quantity_received(df,output_col,cancelled,received_quant,ordered_quant):
    if (df[cancelled] == "Order Cancelled") & (df[received_quant] == 0):
        df[output_col] = "Order Cancelled - None Received"
    elif (df[cancelled] == "Order Cancelled") & (df[received_quant] > 0):
        df[output_col] = "Order Cancelled - Items Received"
    elif df[received_quant] > df[ordered_quant]:
        df[output_col] = "Order Over Fulfilled"
    elif (df[received_quant] < df[ordered_quant]) & (df[received_quant] > 0):
        df[output_col] = "Order Partially Fulfilled"
    elif df[received_quant] == df[ordered_quant]:
        df[output_col] = "Order Fully Fulfilled"
    elif (df[received_quant] == 0) & (df[ordered_quant] > 0):
        df[output_col] = "Order Not Fulfilled"
    else:
        df[output_col] = "Error"
    return df

Function call:

po_data = po_data.apply(lambda row: order_quantity_received(row,'Order Quantity Received Type','Order Quantity Change Type','Received Quantity','Current Order Quantity'),axis=1)

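The fastest approach with Pandas and NumPy is to vectorize your function. Running a function element by element along an array or Series, whether with a for loop, a list comprehension, or apply(), is bad practice.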
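To make the contrast concrete, here is a minimal sketch that evaluates a single condition both ways. The small `orders` frame and the names `row_wise` and `column_wise` are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Illustrative data only; the column names follow the question.
orders = pd.DataFrame({
    "Order Quantity Change Type": ["Order Cancelled", "Increased", "Order Cancelled"],
    "Received Quantity": [0, 3, 2],
})

# Row-wise: the lambda is called once per row in Python.
row_wise = orders.apply(
    lambda row: row["Order Quantity Change Type"] == "Order Cancelled"
    and row["Received Quantity"] == 0,
    axis=1,
)

# Column-wise: each comparison runs over the whole column at once
# and yields a boolean Series; & combines the two masks element-wise.
column_wise = (
    (orders["Order Quantity Change Type"] == "Order Cancelled")
    & (orders["Received Quantity"] == 0)
)
```

Both give the same boolean mask, but the column-wise version avoids one Python-level function call per row, which is likely where most of the time goes on millions of rows.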

I'll just give an example for the "Order Cancelled" case:

def order_cancelled(a, b):
    ## define your function logic however you want
    return a - b

Then vectorize your function:

df['output_col'] = np.vectorize(order_cancelled)(df['cancelled'], df['received_quant'])
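One caveat worth noting: the NumPy documentation describes `np.vectorize` as a convenience that is essentially a for loop, so it may not bring the hoped-for speedup. A possible alternative is to express the whole if/elif chain with boolean column masks and `np.select`, which picks the first matching condition per row. The sketch below is one way to do that, not necessarily what the answer had in mind. The function name `order_quantity_received_vectorized` is hypothetical, the labels use the spelling "Fulfilled" as in the sample dataframe, and the second branch is assumed to mean "cancelled with a positive received quantity".

```python
import numpy as np
import pandas as pd

def order_quantity_received_vectorized(df, output_col, cancelled, received_quant, ordered_quant):
    """Hypothetical column-wise counterpart of order_quantity_received."""
    is_cancelled = df[cancelled] == "Order Cancelled"
    received = df[received_quant]
    ordered = df[ordered_quant]

    # Conditions are listed in the same order as the if/elif chain;
    # np.select uses the first one that is True for each row.
    conditions = [
        is_cancelled & (received == 0),
        is_cancelled & (received > 0),  # assumption: "items received" means > 0
        received > ordered,
        (received < ordered) & (received > 0),
        received == ordered,
        (received == 0) & (ordered > 0),
    ]
    labels = [
        "Order Cancelled - None Received",
        "Order Cancelled - Items Received",
        "Order Over Fulfilled",
        "Order Partially Fulfilled",
        "Order Fully Fulfilled",
        "Order Not Fulfilled",
    ]

    out = df.copy()  # leave the caller's frame untouched
    out[output_col] = np.select(conditions, labels, default="Error")
    return out

# Made-up rows that exercise every branch, including the "Error" fallback.
sample = pd.DataFrame({
    "Order Quantity Change Type": ["Order Cancelled", "Order Cancelled", "Increased",
                                   "Increased", "Increased", "Increased", "Increased"],
    "Received Quantity": [0, 2, 7, 3, 5, 0, -1],
    "Current Order Quantity": [0, 0, 5, 5, 5, 5, 5],
})
result = order_quantity_received_vectorized(
    sample,
    "Order Quantity Received Type",
    "Order Quantity Change Type",
    "Received Quantity",
    "Current Order Quantity",
)
```

Because every condition is computed on whole columns, the work happens inside NumPy rather than in a per-row Python call, so this should scale far better to the full ~30,000,000 rows than `apply(..., axis=1)`.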