How to improve calculation speed in Python?

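I am building a calculation to add a new column to my DataFrame. Here is my data:

I need to create a new column "mob". The value of "mob" is computed as follows:

  1. Check whether the row's "LoanId" is the same as the previous row's "LoanId", for example whether loan['LoanId'][0] == loan['LoanId'][1];
  2. If so, check whether the previous row's "mob" is > 0. If it is, this row's "mob" is the previous row's value plus 1; if it is not, check whether this row's loan['repay_lbl'] is 1 or 2, and if so, this row's "mob" is 1.

My code is as follows: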

for i in range(1, len(loan['LoanId'])):
    if loan['LoanId'][i-1] == loan['LoanId'][i]:
        if loan['mob'][i-1] > 0:
            loan['mob'][i] = loan['mob'][i-1] + 1
        elif loan['repay_lbl'][i] == 1 or loan['repay_lbl'][i] == 2:
            loan['mob'][i] = 1

This code costs O(n). Is there any way to improve the algorithm and speed it up? I am just a beginner in Python. Thank you very much for your help.

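Since the value of the mob column in each row depends on the value in the previous row, it depends on all previous rows. That means you cannot run this in parallel; you are basically stuck with O(n).

So I don't think numpy array operations would be of much use here.

Otherwise, there are some common tricks for speeding up Python code:

I'm not sure whether the first two apply to numpy/pandas. In those cases you may have to use plain Python lists for your data.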
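As a rough illustration of the "plain Python lists" idea, the sketch below pulls the three columns out as lists, runs the same rule on them, and writes the result back once. The function name `add_mob_with_lists` and the sample data are made up for illustration, and it assumes the "mob" column already exists with its starting values.

```python
import pandas as pd

def add_mob_with_lists(loan):
    """Apply the 'mob' rule on plain lists, then write the column back once."""
    loan_id = loan['LoanId'].tolist()
    repay_lbl = loan['repay_lbl'].tolist()
    mob = loan['mob'].tolist()  # assumption: 'mob' already holds starting values

    for i in range(1, len(loan_id)):
        if loan_id[i - 1] == loan_id[i]:
            if mob[i - 1] > 0:
                mob[i] = mob[i - 1] + 1
            elif repay_lbl[i] in (1, 2):
                mob[i] = 1

    result = loan.copy()   # leave the caller's DataFrame untouched
    result['mob'] = mob
    return result

# Hypothetical sample data, for illustration only
loan = pd.DataFrame({
    'LoanId':    [7, 7, 7, 7, 9, 9],
    'repay_lbl': [1, 0, 2, 0, 1, 1],
    'mob':       [0, 0, 0, 0, 0, 0],
})
result = add_mob_with_lists(loan)
print(result['mob'].tolist())  # [0, 0, 1, 2, 0, 1]
```

The loop is still O(n); the only change is that each element access is a list lookup rather than a pandas lookup.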

Of course, before you dig into any of these, you should consider whether your dataset is large enough to justify the effort.

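Improving the run time by changing the looping method

Based on improving loop time:
  • All methods iterate over all N rows without broadcasting, so the complexity is O(N)
  • Although every method is of order N, different ways of looping have different constant scaling factors
  • Those differing scaling factors make some methods much faster than others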
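To make the "same order, different scaling factor" point concrete, here is a small sketch that times two O(N) loops over the same data with the standard-library timeit module. The DataFrame, the column names `a` and `b`, and both function names are invented for this example; the absolute timings depend on the machine and the pandas version.

```python
import timeit
import pandas as pd

N = 2_000
df = pd.DataFrame({'a': range(N), 'b': range(N)})  # hypothetical data

def by_label_lookup():
    """O(N): one pandas lookup per element, as in the question's loop."""
    total = 0
    for i in range(len(df)):
        total += df['a'][i] + df['b'][i]
    return total

def by_zip():
    """O(N): iterate the two columns in lockstep."""
    total = 0
    for a, b in zip(df['a'], df['b']):
        total += a + b
    return total

t_lookup = timeit.timeit(by_label_lookup, number=3)
t_zip = timeit.timeit(by_zip, number=3)
print(f'lookup: {t_lookup:.4f}s  zip: {t_zip:.4f}s')
```

Both functions do the same amount of work per row in the big-O sense; any gap between the two timings comes from the per-element overhead.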

Inspired by: Different ways to iterate over rows in a Pandas Dataframe — performance comparison

Methods

  1. For loop (the original post)
  2. iterrows
  3. itertuples
  4. zip

Summary

For 100,000 rows, the zip method is 93 times faster than the for loop (i.e. the OP's method).

Test code

import pandas as pd
import numpy as np
from random import randint

def create_input(N):
    ' Creates a loan DataFrame with N rows '
    LoanId = [randint(0, N //4) for _ in range(N)]  # though random, N//4 ensures
                                                    # high likelihood some rows repeat
                                                    # LoanID
    repay_lbl = [randint(0, 2) for _ in range(N)]

    data = {'LoanId':LoanId, 'repay_lbl': repay_lbl, 'mob':[0]*N}
    return pd.DataFrame(data)

def m_itertuples(loan):
    ' Iterating using itertuples, set single values using at '
    loan = loan.copy()  # copy since timing calls the function multiple times,
                        # so don't want to modify input
                        # not necessary in general
    prev_loanID, prev_mob = None, None
    for row in loan.itertuples():  # iterate over rows with itertuples()
        if prev_loanID is not None:
            if prev_loanID == row.LoanId:
                if prev_mob > 0:
                    loan.at[row.Index, 'mob'] = prev_mob + 1
                elif row.repay_lbl == 1 or row.repay_lbl == 2:
                    loan.at[row.Index, 'mob'] = 1

        # Query the DataFrame (not the row tuple) for the latest values
        prev_loanID, prev_mob = loan.at[row.Index, 'LoanId'], loan.at[row.Index, 'mob']

    return loan

    
def m_for_loop(loan):
    ' For loop over the data frame '
    loan = loan.copy()  # copy since timing calls the function multiple times,
                        # so don't want to modify input
                        # not necessary in general

    # Chained indexing, kept as in the question; newer pandas versions may warn
    # about it, and with copy-on-write enabled it does not update `loan`
    for i in range(1, len(loan['LoanId'])):
        if loan['LoanId'][i-1] == loan['LoanId'][i]:
            if loan['mob'][i-1] > 0:
                loan['mob'][i] = loan['mob'][i-1] + 1
            elif loan['repay_lbl'][i] == 1 or loan['repay_lbl'][i] == 2:
                loan['mob'][i] = 1
    return loan

def m_iterrows(loan):
    ' Iterating using iterrows, set single values using at '
    loan = loan.copy()  # copy since timing calls the function multiple times,
                        # so don't want to modify input
                        # not necessary in general
    prev_loanID, prev_mob = None, None
    for index, row in loan.iterrows():  # iterate over rows with iterrows()
        if prev_loanID is not None:
            if prev_loanID == row['LoanId']:
                if prev_mob > 0:
                    loan.at[index, 'mob'] = prev_mob + 1
                elif row['repay_lbl'] == 1 or row['repay_lbl'] == 2:
                    loan.at[index, 'mob'] = 1

        # Query the DataFrame (not the row) for the latest values
        prev_loanID, prev_mob = loan.at[index, 'LoanId'], loan.at[index, 'mob']

    return loan

def m_zip(loan):
    ' Iterating using zip, set single values using at '
    loan = loan.copy()  # copy since timing calls the function multiple times,
                        # so don't want to modify input
                        # not necessary in general
    prev_loanID, prev_mob = None, None
    # enumerate() positions are used as labels, which assumes the default RangeIndex
    for index, (loanID, mob, repay_lbl) in enumerate(zip(loan['LoanId'], loan['mob'], loan['repay_lbl'])):
        if prev_loanID is not None:
            if prev_loanID == loanID:
                if prev_mob > 0:
                    mob = loan.at[index, 'mob'] = prev_mob + 1
                elif repay_lbl == 1 or repay_lbl == 2:
                    mob = loan.at[index, 'mob'] = 1

        # Update to latest values
        prev_loanID, prev_mob = loanID, mob

    return loan

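Note: the iterator-based code queries the DataFrame for the updated values rather than reading them from the iterator, because of this warning: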

You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.

The resulting DataFrames were also compared using assert df1.equals(df2) to verify that the different methods produce the same result.
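A check of this kind could look roughly like the sketch below. The helper `assert_same_result` and the two stand-in functions are hypothetical and not part of the benchmark above; in practice you would pass in functions such as m_for_loop and m_zip instead.

```python
import pandas as pd

def assert_same_result(funcs, df):
    """Run every function on its own copy of df and check the outputs match."""
    reference = funcs[0](df.copy())
    for func in funcs[1:]:
        result = func(df.copy())
        # DataFrame.equals also compares dtypes, so it is a strict check
        assert reference.equals(result), (
            f'{func.__name__} differs from {funcs[0].__name__}')
    return reference

# Hypothetical stand-ins for the benchmarked functions
def running_count_loop(df):
    df = df.copy()
    for i in range(len(df)):
        df.at[i, 'n'] = i + 1
    return df

def running_count_zip(df):
    df = df.copy()
    df['n'] = [i + 1 for i, _ in enumerate(df['x'])]
    return df

sample = pd.DataFrame({'x': [5, 6, 7], 'n': [0, 0, 0]})
checked = assert_same_result([running_count_loop, running_count_zip], sample)
print(checked['n'].tolist())  # [1, 2, 3]
```

Each function receives its own copy of the input so that one method cannot affect the input seen by the next.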

Timing code

Using benchit

import benchit

inputs = [create_input(i) for i in 10**np.arange(6)]  # 1 to 10^5 rows
funcs = [m_for_loop, m_iterrows, m_itertuples, m_zip]

t = benchit.timings(funcs, inputs)

Results

Run time (in seconds)

Functions  m_for_loop  m_iterrows  m_itertuples     m_zip
Len                                                      
1            0.000217    0.000493      0.000781  0.000327
10           0.001070    0.002002      0.001008  0.000353
100          0.007100    0.016501      0.003062  0.000498
1000         0.056940    0.162423      0.021396  0.001057
10000        0.565809    1.625043      0.210858  0.006938
100000       5.890920   16.658842      2.179602  0.062953
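The speedup quoted in the summary can be recomputed from the last row of the table. The snippet below simply divides the for-loop time by each method's time; the dictionary name `times_100k` is only a label chosen for this example.

```python
# Timings in seconds for 100,000 rows, copied from the table above
times_100k = {
    'm_for_loop': 5.890920,
    'm_iterrows': 16.658842,
    'm_itertuples': 2.179602,
    'm_zip': 0.062953,
}

baseline = times_100k['m_for_loop']
speedup = {name: baseline / seconds for name, seconds in times_100k.items()}

for name, factor in speedup.items():
    print(f'{name:>13}: {factor:5.1f}x relative to the for loop')
```

A factor below 1 means the method was slower than the original for loop, which is the case for iterrows in this run.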