How to improve calculation speed in Python?
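I am building a calculation to add a new column to my DataFrame. Here is my data:
I need to create a new column "mob". "mob" is calculated as follows:
- Check whether a row's "LoanId" is the same as the previous row's "LoanId", e.g. whether loan['LoanId'][0] == loan['LoanId'][1];
- If so, check whether the previous row's "mob" > 0. If it is, this row's "mob" is the previous row's value plus 1; if it is not, check whether this row's loan['repay_lbl'] is 1 or 2, and if so this row's "mob" is 1.
My code is as follows: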
for i in range(1, len(loan['LoanId'])):
    if loan['LoanId'][i-1] == loan['LoanId'][i]:
        if loan['mob'][i-1] > 0:
            loan['mob'][i] = loan['mob'][i-1] + 1
        elif loan['repay_lbl'][i] == 1 or loan['repay_lbl'][i] == 2:
            loan['mob'][i] = 1
This code costs O(n). Is there any way to improve the algorithm and speed it up?
I am just a beginner in Python. Thank you very much for your help.
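Since the value of the mob column in each row depends on the value in the previous row, it depends on all previous rows. That means you cannot run this in parallel, and you are basically stuck with O(n).
So I don't think numpy array operations will be of much use here.
Otherwise, there are some common tricks for speeding up Python code;
I'm not sure whether the first two work with numpy/pandas. In those cases you might have to use plain Python lists for your data.
Of course, before you dive into any of these, you should consider whether your dataset is large enough to warrant the effort.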
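As a rough illustration of the plain-list approach mentioned above, the same rule can be written against ordinary Python lists. This is only a sketch: the function name compute_mob, the sample values, and the assumption that "mob" starts at zero everywhere are mine, not the asker's.

```python
def compute_mob(loan_ids, repay_lbls, mob=None):
    """Return the 'mob' column as a plain Python list.

    loan_ids and repay_lbls are equal-length lists. mob holds the starting
    values; it defaults to all zeros, which is an assumption of this sketch.
    """
    n = len(loan_ids)
    mob = [0] * n if mob is None else list(mob)
    for i in range(1, n):
        if loan_ids[i - 1] == loan_ids[i]:
            if mob[i - 1] > 0:
                # Previous row already counting: continue the count.
                mob[i] = mob[i - 1] + 1
            elif repay_lbls[i] in (1, 2):
                # Previous row not counting yet: start the count at 1.
                mob[i] = 1
    return mob


# Hypothetical sample data, not taken from the question.
loan_ids = [7, 7, 7, 7, 9, 9]
repay_lbls = [1, 0, 2, 0, 0, 1]
result = compute_mob(loan_ids, repay_lbls)
print(result)
```

With a DataFrame, the columns could be pulled out with loan['LoanId'].tolist() and loan['repay_lbl'].tolist(), and the returned list assigned back with loan['mob'] = result. The loop is still O(n); it only avoids per-element pandas indexing.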
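Improving Time by Changing the Looping Method
Basis
Improving loop time
- All N rows are iterated without broadcasting, so the complexity is O(N)
- Although every method is of order N, different looping methods have different scale factors
- The different scale factors make some methods much faster than others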
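To make the scale-factor point concrete, here is a small sketch that times two O(N) loops doing the same work. It uses plain Python lists rather than a DataFrame, and the function names are hypothetical; the absolute numbers will vary by machine, so none are quoted here.

```python
import timeit

N = 100_000
data = list(range(N))


def sum_by_index(values):
    # O(N): one index lookup per element.
    total = 0
    for i in range(len(values)):
        total += values[i]
    return total


def sum_by_iteration(values):
    # Also O(N): iterates directly, with no index lookups.
    total = 0
    for v in values:
        total += v
    return total


t_index = timeit.timeit(lambda: sum_by_index(data), number=20)
t_iter = timeit.timeit(lambda: sum_by_iteration(data), number=20)
print(f"index loop: {t_index:.4f}s, direct iteration: {t_iter:.4f}s")
```

Both functions grow linearly with N; any difference in the printed times comes only from the per-element overhead, which is the same effect the pandas comparison below measures.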
Inspired by - Different ways to iterate over rows in a Pandas Dataframe — performance comparison
Methods
- For loop -- the original post
- iterrows
- itertuples
- zip
Summary
For 100K rows, the zip method is about 93x faster than the for loop (i.e. the OP's method)
Test code
import pandas as pd
import numpy as np
from random import randint

def create_input(N):
    ' Creates a loan DataFrame with N rows '
    # Though random, N // 4 ensures a high likelihood that some rows repeat a LoanId
    LoanId = [randint(0, N // 4) for _ in range(N)]
    repay_lbl = [randint(0, 2) for _ in range(N)]
    data = {'LoanId': LoanId, 'repay_lbl': repay_lbl, 'mob': [0] * N}
    return pd.DataFrame(data)

def m_itertuples(loan):
    ' Iterating using itertuples, set single values using at '
    loan = loan.copy()  # copy because the timing harness calls the function several
                        # times, so the input must not be modified;
                        # not necessary in general
    prev_loanID, prev_mob = None, None
    for row in loan.itertuples():  # iterate over rows with itertuples()
        if prev_loanID is not None:
            if prev_loanID == row.LoanId:
                if prev_mob > 0:
                    loan.at[row.Index, 'mob'] = prev_mob + 1
                elif row.repay_lbl == 1 or row.repay_lbl == 2:
                    loan.at[row.Index, 'mob'] = 1
        # Query the DataFrame for the latest values, using the row's own index label
        prev_loanID, prev_mob = loan.at[row.Index, 'LoanId'], loan.at[row.Index, 'mob']
    return loan

def m_for_loop(loan):
    ' For loop over the data frame '
    loan = loan.copy()  # copy because the timing harness calls the function several
                        # times, so the input must not be modified;
                        # not necessary in general
    # Kept as in the original post. loan['mob'][i] = ... is chained assignment:
    # pandas warns about it, and with Copy-on-Write enabled it does not update loan.
    for i in range(1, len(loan['LoanId'])):
        if loan['LoanId'][i-1] == loan['LoanId'][i]:
            if loan['mob'][i-1] > 0:
                loan['mob'][i] = loan['mob'][i-1] + 1
            elif loan['repay_lbl'][i] == 1 or loan['repay_lbl'][i] == 2:
                loan['mob'][i] = 1
    return loan

def m_iterrows(loan):
    ' Iterating using iterrows, set single values using at '
    loan = loan.copy()  # copy because the timing harness calls the function several
                        # times, so the input must not be modified;
                        # not necessary in general
    prev_loanID, prev_mob = None, None
    for index, row in loan.iterrows():  # iterate over rows with iterrows()
        if prev_loanID is not None:
            if prev_loanID == row['LoanId']:
                if prev_mob > 0:
                    loan.at[index, 'mob'] = prev_mob + 1
                elif row['repay_lbl'] == 1 or row['repay_lbl'] == 2:
                    loan.at[index, 'mob'] = 1
        # Query the DataFrame for the latest values
        prev_loanID, prev_mob = loan.at[index, 'LoanId'], loan.at[index, 'mob']
    return loan

def m_zip(loan):
    ' Iterating using zip, set single values using at '
    loan = loan.copy()  # copy because the timing harness calls the function several
                        # times, so the input must not be modified;
                        # not necessary in general
    prev_loanID, prev_mob = None, None
    for index, (loanID, mob, repay_lbl) in enumerate(zip(loan['LoanId'], loan['mob'], loan['repay_lbl'])):
        if prev_loanID is not None:
            if prev_loanID == loanID:
                if prev_mob > 0:
                    mob = loan.at[index, 'mob'] = prev_mob + 1
                elif repay_lbl == 1 or repay_lbl == 2:
                    mob = loan.at[index, 'mob'] = 1
        # Update to the latest values
        prev_loanID, prev_mob = loanID, mob
    return loan
Note: the iterator code queries the DataFrame for updated data rather than reading it from the iterator, because of this warning from the pandas documentation:
You should never modify something you are iterating over. This is not
guaranteed to work in all cases. Depending on the data types, the
iterator returns a copy and not a view, and writing to it will have no
effect.
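The warning can be seen in a small sketch. The two-row frame and its column "tag" are hypothetical; mixed column types are used because that is a case where iterrows hands back a copy of each row.

```python
import pandas as pd

# Hypothetical frame with mixed dtypes (int and str).
df = pd.DataFrame({'mob': [0, 0], 'tag': ['a', 'b']})

for index, row in df.iterrows():
    row['mob'] = 99            # writes to the row object only

unchanged = df['mob'].tolist()

for index, row in df.iterrows():
    df.at[index, 'mob'] = 99   # writes to the DataFrame itself

changed = df['mob'].tolist()
print(unchanged, changed)
```

The first loop leaves the DataFrame untouched, while the second loop updates it. This is why the functions above write through loan.at and read the latest value back from the DataFrame.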
The resulting DataFrames were also compared using assert df1.equals(df2) to verify that the different methods produce the same result.
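A check of that kind might look like the sketch below. The two functions are simplified stand-ins written for this example (not the benchmarked functions), and the six-row frame is hypothetical.

```python
import pandas as pd


def mob_with_at(loan):
    # Index-based loop that reads and writes single cells through .at.
    out = loan.copy()
    for i in range(1, len(out)):
        if out.at[i - 1, 'LoanId'] == out.at[i, 'LoanId']:
            if out.at[i - 1, 'mob'] > 0:
                out.at[i, 'mob'] = out.at[i - 1, 'mob'] + 1
            elif out.at[i, 'repay_lbl'] in (1, 2):
                out.at[i, 'mob'] = 1
    return out


def mob_with_zip(loan):
    # zip-based loop that builds the column as a list and assigns it once.
    out = loan.copy()
    mob = out['mob'].tolist()
    prev_id, prev_mob = None, 0
    for i, (loan_id, repay_lbl) in enumerate(zip(out['LoanId'], out['repay_lbl'])):
        if i > 0 and loan_id == prev_id:
            if prev_mob > 0:
                mob[i] = prev_mob + 1
            elif repay_lbl in (1, 2):
                mob[i] = 1
        prev_id, prev_mob = loan_id, mob[i]
    out['mob'] = mob
    return out


# Hypothetical sample data with a default integer index.
loan = pd.DataFrame({'LoanId': [7, 7, 7, 7, 9, 9],
                     'repay_lbl': [1, 0, 2, 0, 0, 1],
                     'mob': [0] * 6})
df1 = mob_with_at(loan)
df2 = mob_with_zip(loan)
assert df1.equals(df2)  # same values and dtypes in every column
```

DataFrame.equals also compares dtypes, so a method that silently turns "mob" into floats would fail the check even if the numbers matched.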
Timing code
Using benchit
import benchit

inputs = [create_input(i) for i in 10**np.arange(6)]  # 1 to 10^5 rows
funcs = [m_for_loop, m_iterrows, m_itertuples, m_zip]
t = benchit.timings(funcs, inputs)
Results
Run time (in seconds)
Functions  m_for_loop  m_iterrows  m_itertuples     m_zip
Len
1            0.000217    0.000493      0.000781  0.000327
10           0.001070    0.002002      0.001008  0.000353
100          0.007100    0.016501      0.003062  0.000498
1000         0.056940    0.162423      0.021396  0.001057
10000        0.565809    1.625043      0.210858  0.006938
100000       5.890920   16.658842      2.179602  0.062953
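The speedup quoted in the summary follows from the last row of the table. The short sketch below recomputes the ratios relative to the for loop; the dictionary simply copies the 100,000-row timings from the table above.

```python
# Run times in seconds for 100,000 rows, copied from the results table.
timings_100k = {
    'm_for_loop': 5.890920,
    'm_iterrows': 16.658842,
    'm_itertuples': 2.179602,
    'm_zip': 0.062953,
}

baseline = timings_100k['m_for_loop']

# Speedup relative to the for loop: values above 1 are faster, below 1 slower.
speedups = {name: baseline / seconds for name, seconds in timings_100k.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```

On these numbers zip is the only method that gains more than an order of magnitude, and iterrows is slower than the original for loop.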