Speeding up a nested for loop through two Pandas DataFrames

Question

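I have latitudes and longitudes stored in a pandas DataFrame (df), with placeholder columns stop_id, stoplat and stoplon filled with NaN. A second DataFrame (areadf) contains more lats/lons and an arbitrary id; this is the information to be filled into df.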

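I am trying to link the two so that the stop columns in df contain the information for the stop closest to each lat/lon point, or stay NaN if there is no stop within a radius R of that point.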

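My current code is below, but it takes a very long time (more than 40 minutes for the run I have going now, and that was before I converted area to a DataFrame and switched to itertuples; I am not sure how much difference that will make). Each data set has thousands of lat/lon points and stops, and this is a problem because I need to run it on multiple files. I am looking for suggestions to make it run faster. I have already made some very minor improvements (moving to a DataFrame, using itertuples instead of iterrows, defining lats and lons outside the loop so they are not retrieved from df on every iteration), but I am out of ideas for speeding it up. getDistance uses the Haversine formula, as defined, to get the distance between a stop and the given lat/lon point.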

import pandas as pd
from math import cos, asin, sqrt

# Defined before the loop so the name exists when the loop runs.
def getDistance(lat1, lon1, lat2, lon2):
    p = 0.017453292519943295     # Pi/180, converts degrees to radians
    a = (0.5 - cos((lat2 - lat1) * p)/2 + cos(lat1 * p) *
         cos(lat2 * p) * (1 - cos((lon2 - lon1) * p)) / 2)
    # 12742 km is the Earth's diameter; the factor 100 gives units of 10 m
    return 12742 * asin(sqrt(a)) * 100

R = 5  # search radius, in the units returned by getDistance
lats = df['lat']
lons = df['lon']
for stop in areadf.itertuples():
    for index in df.index:
        if getDistance(lats[index], lons[index],
                       stop[1], stop[2]) < R:
            df.at[index, 'stop_id'] = stop[0]  # id
            df.at[index, 'stoplat'] = stop[1]  # lat
            df.at[index, 'stoplon'] = stop[2]  # lon
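As a side note on the formula: the expression inside getDistance is the usual haversine formula rewritten with the identity sin²(x/2) = (1 − cos x)/2. A sketch of the equivalence, writing φ for latitude and λ for longitude in radians (these symbols are introduced here for illustration and do not appear in the code):

```latex
\begin{aligned}
a &= \tfrac{1}{2} - \tfrac{1}{2}\cos(\varphi_2 - \varphi_1)
     + \cos\varphi_1 \cos\varphi_2 \cdot \tfrac{1}{2}\bigl(1 - \cos(\lambda_2 - \lambda_1)\bigr) \\
  &= \sin^2\!\left(\tfrac{\varphi_2 - \varphi_1}{2}\right)
     + \cos\varphi_1 \cos\varphi_2 \,\sin^2\!\left(\tfrac{\lambda_2 - \lambda_1}{2}\right), \\
d &= 2r \arcsin\sqrt{a}, \qquad 2r = 12742\ \text{km}.
\end{aligned}
```

The function multiplies d by 100 before returning it, so the result is in units of 10 m and R = 5 corresponds to 50 m.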

Sample data:

df
lat        lon         stop_id    stoplat    stoplon
43.657676  -79.380146  NaN        NaN        NaN
43.694324  -79.334555  NaN        NaN        NaN

areadf
stop_id    stoplat    stoplon
0          43.657675  -79.380145
1          45.435143  -90.543253

Expected:

df
lat        lon         stop_id    stoplat    stoplon
43.657676  -79.380146  0          43.657675  -79.380145
43.694324  -79.334555  NaN        NaN        NaN

Answer 1

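One approach is to use the numpy haversine function from here, modified slightly so that it takes the desired radius into account.

Then simply iterate through df with apply and find the closest value within the given radius.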

import numpy as np

def haversine_np(lon1, lat1, lon2, lat2, R):
    """
    Calculate the great circle distance between a point and an array of
    points on the earth (specified in decimal degrees).
    Return the position of the nearest array point if it lies within
    R km, otherwise -1.
    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c  # Earth radius in km times the central angle
    if km.min() <= R:
        return km.argmin()
    else:
        return -1

# Look up lon/lat by label; the last argument is the radius in km.
df['dex'] = df[['lat','lon']].apply(
    lambda row: haversine_np(row['lon'], row['lat'],
                             areadf.stoplon.values, areadf.stoplat.values, 1),
    axis=1)

Then merge the two DataFrames.

df.merge(areadf,how='left',left_on='dex',right_index=True).drop('dex',axis=1)

         lat        lon  stop_id    stoplat    stoplon
0  43.657676 -79.380146      0.0  43.657675 -79.380145
1  43.694324 -79.334555      NaN        NaN        NaN
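One detail worth checking before merging: the output above has a single set of stop columns, which suggests df held only lat and lon at that point. If df still carries the NaN placeholder columns from the question, pandas will suffix the clashing names instead. The sketch below illustrates this on the sample data. The dex values are hard-coded here as an assumption standing in for the result of the apply step, and the names placeholders, clashing and result are made up for the example.

```python
import numpy as np
import pandas as pd

# Sample data from the question; df still has its NaN placeholder columns.
df = pd.DataFrame({
    "lat": [43.657676, 43.694324],
    "lon": [-79.380146, -79.334555],
    "stop_id": np.nan, "stoplat": np.nan, "stoplon": np.nan,
})
areadf = pd.DataFrame({
    "stop_id": [0, 1],
    "stoplat": [43.657675, 45.435143],
    "stoplon": [-79.380145, -90.543253],
})

# Assumed result of the apply step: row 0 matched stop 0, row 1 matched nothing.
df["dex"] = [0, -1]

# Merging as-is keeps both sets of columns, suffixed _x (df) and _y (areadf).
clashing = df.merge(areadf, how="left", left_on="dex", right_index=True)
print(clashing.columns.tolist())

# Dropping the placeholders first leaves one set of stop columns.
# -1 is not a label in areadf's index, so that row stays NaN.
placeholders = ["stop_id", "stoplat", "stoplon"]
result = (df.drop(columns=placeholders)
            .merge(areadf, how="left", left_on="dex", right_index=True)
            .drop(columns="dex"))
print(result)
```

The -1 sentinel only works as "no match" while -1 is not itself a label in areadf's index.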

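Note: if you choose to follow this approach, you must make sure that both DataFrame indexes have been reset, or that they run sequentially from 0 up to the total length of the DataFrame. So be sure to reset the indexes before running this.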

df.reset_index(drop=True,inplace=True)
areadf.reset_index(drop=True,inplace=True)
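If the per-row apply is still too slow, one possible variant (not what the answer above does) is to compute every point-to-stop distance at once with NumPy broadcasting. The sketch below is a minimal illustration under a few assumptions: areadf is non-empty, the full n_points × n_stops distance matrix fits in memory, and the helper name nearest_stops is made up for this example. It reuses the same haversine expression and the 6367 km radius from haversine_np.

```python
import numpy as np
import pandas as pd

def nearest_stops(df, areadf, radius_km):
    """Position in areadf of the nearest stop per df row, or -1 if none within radius_km."""
    lat1 = np.radians(df["lat"].to_numpy())[:, None]          # shape (n_points, 1)
    lon1 = np.radians(df["lon"].to_numpy())[:, None]
    lat2 = np.radians(areadf["stoplat"].to_numpy())[None, :]  # shape (1, n_stops)
    lon2 = np.radians(areadf["stoplon"].to_numpy())[None, :]
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    km = 6367 * 2 * np.arcsin(np.sqrt(a))                     # shape (n_points, n_stops)
    nearest = km.argmin(axis=1)                               # closest stop per point
    return np.where(km.min(axis=1) <= radius_km, nearest, -1)

# Sample data from the question.
df = pd.DataFrame({
    "lat": [43.657676, 43.694324],
    "lon": [-79.380146, -79.334555],
    "stop_id": np.nan, "stoplat": np.nan, "stoplon": np.nan,
})
areadf = pd.DataFrame({
    "stop_id": [0, 1],
    "stoplat": [43.657675, 45.435143],
    "stoplon": [-79.380145, -90.543253],
})

idx = nearest_stops(df, areadf, radius_km=1)
found = idx >= 0
# Fill the placeholder columns by position; unmatched rows keep their NaN.
for col in ["stop_id", "stoplat", "stoplon"]:
    df.loc[found, col] = areadf[col].to_numpy()[idx[found]]
print(df)
```

Because the lookup here goes through positions in the underlying arrays rather than index labels, this variant does not go through merge at all. For very large inputs the distance matrix could be built in chunks of df rows to limit memory use.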
