Optimising applying an if function to a dataframe: am I doing it the slow way? (Python, Pandas)
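First question in a long while, since I have been working in Python recently. I have been cleaning and preparing some data with pandas, and I found that when I apply a function to a smaller sample (500,000 rows) of the total data (~30,000,000 rows), that particular block of my code takes a very long time to run (~8 minutes). My thinking is that I have written something that works but is not optimal for what I want to do, and that it will become a very long process when applied to the full dataset. I am not entirely sure, but I think running this kind of thing in a program like Alteryx would be much faster, so I assume I must be doing something wrong. Any help or ideas to speed this up would be much appreciated!
Example dataframe: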
import pandas as pd

po_data = pd.DataFrame({'Order Quantity Received Type': ['Order Cancelled - None Received', 'Order Partially Fulfilled'], 'Order Quantity Change Type': ['Order Cancelled', 'Increased'], 'Received Quantity': [0, 3], 'Current Order Quantity': [0, 5]})
Function:
def order_quantity_received(df, output_col, cancelled, received_quant, ordered_quant):
    # Called once per row via apply(axis=1), so df is a single row here.
    if df[cancelled] == "Order Cancelled" and df[received_quant] == 0:
        df[output_col] = "Order Cancelled - None Received"
    elif df[cancelled] == "Order Cancelled" and df[received_quant] > 0:
        df[output_col] = "Order Cancelled - Items Received"
    elif df[received_quant] > df[ordered_quant]:
        df[output_col] = "Order Over Fulfilled"
    elif df[received_quant] < df[ordered_quant] and df[received_quant] > 0:
        df[output_col] = "Order Partially Fulfilled"
    elif df[received_quant] == df[ordered_quant]:
        df[output_col] = "Order Fully Fulfilled"
    elif df[received_quant] == 0 and df[ordered_quant] > 0:
        df[output_col] = "Order Not Fulfilled"
    else:
        df[output_col] = "Error"
    return df
Function call:
po_data = po_data.apply(lambda row: order_quantity_received(row, 'Order Quantity Received Type', 'Order Quantity Change Type', 'Received Quantity', 'Current Order Quantity'), axis=1)
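The fastest way to work with Pandas and NumPy is to vectorize your function. Running a function element by element along an array or Series with for loops, list comprehensions, or apply() is bad practice.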
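To make the difference concrete, here is a minimal sketch on a tiny made-up frame. The column names `received` and `ordered` are placeholders for illustration, not the columns from the question. Both versions produce the same boolean result, but the second one compares whole columns in a single operation instead of calling a Python function once per row.

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
df = pd.DataFrame({"received": [0, 3, 5], "ordered": [0, 5, 5]})

# Row by row: the lambda is called once for every row.
slow = df.apply(lambda row: row["received"] == row["ordered"], axis=1)

# Column-wise: one comparison over the two whole columns.
fast = df["received"] == df["ordered"]

print(slow.equals(fast))
```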
I will just give an example for the "Order Cancelled" case:
def order_cancelled(a, b):
    # define your function logic however you want
    return a - b
Then vectorize your function:
df['output_col'] = np.vectorize(order_cancelled)(df['cancelled'], df['received_quant'])
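One caveat worth knowing: the NumPy documentation describes np.vectorize as a convenience wrapper that is essentially a for loop, so it may not give the speed-up you are hoping for. An alternative that works on whole columns is np.select. The following is a sketch of how the full if/elif chain from the question could be written that way, not a drop-in replacement. It assumes the second branch was meant to test for a received quantity greater than zero, it spells the labels "Fulfilled" as in the example dataframe, and the four sample rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample rows using the column names from the question.
po_data = pd.DataFrame({
    "Order Quantity Change Type": ["Order Cancelled", "Increased", "Order Cancelled", "Increased"],
    "Received Quantity": [0, 3, 2, 0],
    "Current Order Quantity": [0, 5, 0, 4],
})

received = po_data["Received Quantity"]
ordered = po_data["Current Order Quantity"]
cancelled = po_data["Order Quantity Change Type"] == "Order Cancelled"

# Conditions are evaluated over whole columns; np.select picks the first
# one that is True for each row, which mirrors the if/elif order.
conditions = [
    cancelled & (received == 0),
    cancelled & (received > 0),  # assumption: "Items Received" means > 0
    received > ordered,
    (received < ordered) & (received > 0),
    received == ordered,
    (received == 0) & (ordered > 0),
]
choices = [
    "Order Cancelled - None Received",
    "Order Cancelled - Items Received",
    "Order Over Fulfilled",
    "Order Partially Fulfilled",
    "Order Fully Fulfilled",
    "Order Not Fulfilled",
]

# Rows matching no condition fall through to the default, like the else branch.
po_data["Order Quantity Received Type"] = np.select(conditions, choices, default="Error")
print(po_data["Order Quantity Received Type"].tolist())
```

Because every condition is a single boolean Series, the work is done column by column rather than row by row. It would be worth timing this against the apply() version on the 500,000-row sample before running it on the full dataset.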