Optimizing/Parallel Computing a simple but big loop based on Pandas
I have this simple loop working on a big dataset.
for i in range(len(nbi.LONG_017)):
    StoredOCC = []
    for j in range(len(Final.X)):
        r = haversine(nbi.LONG_017[i], nbi.LAT_016[i], Final.X[j], Final.Y[j])
        if r < 0.03048:
            SWw = Final.CUM_OCC[j]
            StoredOCC.append(SWw)
    if len(StoredOCC) != 0:
        nbi.loc[i, 'ADTT_02_2019'] = np.max(StoredOCC)
len(nbi.LONG_017) is 3000 and len(Final.X) is 6 million data points. I am wondering if there is an efficient way to implement this code, or a way to make it faster using parallel computing?
I used the code provided here: Haversine Formula in Python (Bearing and Distance between two GPS points). My haversine function:
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    r = 6372.8  # Radius of earth in kilometers. Use 3956 for miles
    return r * c
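For reference, a quick sanity check of the function (the city coordinates below are approximate and only for illustration): London to Paris comes out at roughly 344 km, which matches the known great-circle distance.

london = (-0.1278, 51.5074)  # (lon, lat), approximate
paris = (2.3522, 48.8566)
print(haversine(london[0], london[1], paris[0], paris[1]))  # ~344 km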
Before talking about parallelization, you can work on optimizing the loop. A first approach is to iterate over the data itself instead of incrementing an index over the length and then accessing the data at each step:
import numpy as np
import pandas as pd

# toy sample
np.random.seed(1)
size_nbi = 20
size_Final = 100
nbi = pd.DataFrame({'LONG_017': np.random.random(size=size_nbi)/100 + 73,
                    'LAT_016': np.random.random(size=size_nbi)/100 + 73})
Final = pd.DataFrame({'X': np.random.random(size=size_Final)/100 + 73,
                      'Y': np.random.random(size=size_Final)/100 + 73,
                      'CUM_OCC': np.random.randint(size_Final, size=size_Final)})
With your method, dataframes of these sizes take about 75 ms:
%%timeit
for i in range(len(nbi.LONG_017)):
    StoredOCC = []
    for j in range(len(Final.X)):
        r = haversine(nbi.LONG_017[i], nbi.LAT_016[i], Final.X[j], Final.Y[j])
        if r < 0.03048:
            SWw = Final.CUM_OCC[j]
            StoredOCC.append(SWw)
    if len(StoredOCC) != 0:
        nbi.loc[i, 'ADTT_02_2019'] = np.max(StoredOCC)
# 75.6 ms ± 4.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Now, if you change the loop a little, iterating over the data itself, collecting the results in a list and adding that list as a column outside of the loop, you get down to about 5 ms, so roughly 15x faster just from this optimization (and you may find better ones):
%%timeit
res_ADIT = []
for lon1, lat1 in zip(nbi.LONG_017.to_numpy(),
                      nbi.LAT_016.to_numpy()):
    StoredOCC = []
    for lon2, lat2, SWw in zip(Final.X.to_numpy(),
                               Final.Y.to_numpy(),
                               Final.CUM_OCC.to_numpy()):
        r = haversine(lon1, lat1, lon2, lat2)
        if r < 0.03048:
            StoredOCC.append(SWw)
    if len(StoredOCC) != 0:
        res_ADIT.append(np.max(StoredOCC))
    else:
        res_ADIT.append(np.nan)
nbi['ADIT_v2'] = res_ADIT
# 5.23 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Now you can go even further, using numpy to vectorize the second loop, and passing values already converted to radians instead of map-ing them on every iteration:
# empty list for the results
res_ADIT = []
# work with arrays in radians
arr_lon2 = np.radians(Final.X.to_numpy())
arr_lat2 = np.radians(Final.Y.to_numpy())
arr_OCC = Final.CUM_OCC.to_numpy()
for lon1, lat1 in zip(np.radians(nbi.LONG_017),  # pass the radians directly
                      np.radians(nbi.LAT_016)):
    # do all the subtractions in a vectorized way
    arr_dlon = arr_lon2 - lon1
    arr_dlat = arr_lat2 - lat1
    # same here, using numpy functions
    arr_dist = np.sin(arr_dlat/2)**2 + np.cos(lat1) * np.cos(arr_lat2) * np.sin(arr_dlon/2)**2
    arr_dist = 2 * np.arcsin(np.sqrt(arr_dist))
    arr_dist *= 6372.8
    # extract the values of CUM_OCC that meet the criterion
    r = arr_OCC[arr_dist < 0.03048]
    # check that there is at least one element
    if r.size > 0:
        res_ADIT.append(max(r))
    else:
        res_ADIT.append(np.nan)
nbi['AUDIT_np'] = res_ADIT
If you timeit this, you get down to 819 µs ± 15.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each), so about 90x faster than your original solution on the small dataframes, and all the results are the same:
print(nbi)
LONG_017 LAT_016 ADTT_02_2019 ADIT_v2 AUDIT_np
0 73.004170 73.008007 NaN NaN NaN
1 73.007203 73.009683 30.0 30.0 30.0
2 73.000001 73.003134 14.0 14.0 14.0
3 73.003023 73.006923 82.0 82.0 82.0
4 73.001468 73.008764 NaN NaN NaN
5 73.000923 73.008946 NaN NaN NaN
6 73.001863 73.000850 NaN NaN NaN
7 73.003456 73.000391 NaN NaN NaN
8 73.003968 73.001698 21.0 21.0 21.0
9 73.005388 73.008781 NaN NaN NaN
10 73.004192 73.000983 93.0 93.0 93.0
11 73.006852 73.004211 NaN NaN NaN
You can play with the code and increase the size of each toy dataframe; you will see how the gains get more interesting as the data grows, especially the Final dataframe. For example, with size_nbi = 20 and size_Final = 1000, the vectorized solution is about 400x faster, and so on. For your full data size (3K * 6M) it will still take some time: on my computer I estimate around 25 minutes, versus hundreds of hours for your original solution. If that is still not enough, you can then think about parallelization, or about using numba, as in the sketch below.
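A minimal sketch of that numba idea, assuming numba is installed; max_occ_within_radius is a hypothetical helper name, and the kernel simply reimplements the haversine filter above with a parallel prange over the nbi points:

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def max_occ_within_radius(lon1s, lat1s, lon2s, lat2s, occ, cutoff_km=0.03048):
    # all coordinate arrays are expected in radians, occ as float64
    n = lon1s.shape[0]
    out = np.full(n, np.nan)
    for i in prange(n):  # parallel loop over the nbi points
        best = np.nan
        for j in range(lon2s.shape[0]):
            dlon = lon2s[j] - lon1s[i]
            dlat = lat2s[j] - lat1s[i]
            a = np.sin(dlat/2)**2 + np.cos(lat1s[i]) * np.cos(lat2s[j]) * np.sin(dlon/2)**2
            d = 2 * 6372.8 * np.arcsin(np.sqrt(a))
            if d < cutoff_km and (np.isnan(best) or occ[j] > best):
                best = occ[j]
        out[i] = best
    return out

# usage, reusing the radian arrays from the vectorized version:
# nbi['ADIT_numba'] = max_occ_within_radius(
#     np.radians(nbi.LONG_017.to_numpy()), np.radians(nbi.LAT_016.to_numpy()),
#     arr_lon2, arr_lat2, arr_OCC.astype(np.float64))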
This approach using a BallTree takes about a minute on my machine (depending on the radius: the smaller the radius, the faster it is) for the sizes you mention (7000 and 6M):
import numpy as np
import sklearn
import pandas as pd
I generate the data (thanks Ben, I used your code):
# toy sample
np.random.seed(1)
size_nbi = 7000
size_Final = 6000000
nbi = pd.DataFrame({'LONG_017': np.random.random(size=size_nbi)/10 + 73,
                    'LAT_016': np.random.random(size=size_nbi)/10 + 73})
Final = pd.DataFrame({'X': np.random.random(size=size_Final)/10 + 73,
                      'Y': np.random.random(size=size_Final)/10 + 73,
                      'CUM_OCC': np.random.randint(size_Final, size=size_Final)})

nbi_gps = nbi[["LAT_016", "LONG_017"]].values
final_gps = Final[["Y", "X"]].values
Create the BallTree:
%%time
from sklearn.neighbors import BallTree
import numpy as np

# use separate names for the radian arrays so the nbi DataFrame is not shadowed
nbi_rad = np.radians(nbi_gps)
final_rad = np.radians(final_gps)

tree = BallTree(final_rad, leaf_size=12, metric='haversine')
which gives Wall time: 23.8 s.
Query it:
%%time
radius = 0.0000003
StoredOCC_indici = tree.query_radius(nbi_rad, r=radius, return_distance=False, count_only=False)
(under a second).
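A note on the radius: BallTree's haversine metric measures distances in radians on the unit sphere, so if you want the query to match the original 0.03048 km cutoff (an assumption about your intent), convert kilometers by dividing by the Earth radius in kilometers:

# sketch: convert a cutoff in km to the radians the haversine metric expects
earth_radius_km = 6371  # mean Earth radius; pick a value consistent with your haversine
radius = 0.03048 / earth_radius_km  # ~4.8e-6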
And take the maximum of the values you are interested in:
StoredOCC = [np.max(Final.CUM_OCC[i]) for i in StoredOCC_indici]
NaNs are generated automatically for the empty lists, which is nice. This took Wall time: 3.64 s for this radius. The whole computation on the 6 million points took under a minute on my machine.
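If you want to store the result back on the dataframe, a one-line sketch (this assumes the row order is unchanged, which holds here since nbi_gps was taken straight from nbi):

# the i-th entry of StoredOCC corresponds to the i-th row of nbi
nbi['ADTT_02_2019'] = StoredOCC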