Rewrite iteration into an .apply function with several inputs in pandas

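I have two DataFrames. One contains several power plants with their longitude and latitude, each in its own column. The other contains several substations, also with longitude and latitude. What I want to do is assign each power plant to its nearest substation.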

df1 = pd.DataFrame({'ID_pp': ['p1', 'p2', 'p3', 'p4'], 'x': [12.644881, 11.563269, 12.644881, 8.153184], 'y': [48.099206, 48.020081, 48.099206, 49.153766]})
df2 = pd.DataFrame({'ID_ss': ['s1', 's2', 's3', 's4'], 'x': [9.269, 9.390, 9.317, 10.061], 'y': [55.037, 54.940, 54.716, 54.349]})

I found this solution, which does essentially what I need:

import pandas as pd
import geopy.distance

for i, row in df1.iterrows():  # A: each power plant
    a = row.y, row.x  # geopy expects (latitude, longitude)
    distances = []
    for j, row2 in df2.iterrows():  # B: each substation
        b = row2.y, row2.x
        distances.append(geopy.distance.geodesic(a, b).km)

    min_distance = min(distances)
    min_index = distances.index(min_distance)

    # store the result for this row only, not for the whole column
    df1.loc[i, 'assigned_to'] = min_index

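It works, but it is very slow and my dataset is very large. I think an approach using .apply would be faster, but I can't figure out how to use it here. Does anyone know how to rewrite the loop above with .apply so that it needs no explicit iteration?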

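Here is a solution that is several times faster (inspired by this topic). Note that it ranks substations by squared Euclidean distance on the raw x/y coordinates rather than by geodesic distance, so the nearest match can differ from the geopy result in some cases: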

import numpy as np

def closest_node(node, nodes):
    # index of the row in `nodes` closest to `node` (squared Euclidean distance)
    nodes = np.asarray(nodes)
    deltas = nodes - node
    dist_2 = np.einsum('ij,ij->i', deltas, deltas)
    return np.argmin(dist_2)

df2_array = df2[["x", "y"]].values

df1['assigned_to'] = df1.apply(lambda row: closest_node(np.array([row.x, row.y]), df2_array), axis=1)
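Since .apply with axis=1 still calls a Python function once per row, it may be worth dropping it entirely. The following is a minimal sketch, not part of the original answer, that uses NumPy broadcasting to compute all squared Euclidean distances at once. The `assigned_ID_ss` column is a hypothetical addition for illustration, and the sample frames are repeated so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID_pp': ['p1', 'p2', 'p3', 'p4'],
                    'x': [12.644881, 11.563269, 12.644881, 8.153184],
                    'y': [48.099206, 48.020081, 48.099206, 49.153766]})
df2 = pd.DataFrame({'ID_ss': ['s1', 's2', 's3', 's4'],
                    'x': [9.269, 9.390, 9.317, 10.061],
                    'y': [55.037, 54.940, 54.716, 54.349]})

plants = df1[['x', 'y']].to_numpy()        # shape (n_plants, 2)
substations = df2[['x', 'y']].to_numpy()   # shape (n_substations, 2)

# Broadcasting: (n_plants, 1, 2) - (1, n_substations, 2)
# gives every plant/substation coordinate difference in one array.
deltas = plants[:, None, :] - substations[None, :, :]

# Squared Euclidean distance for every pair, shape (n_plants, n_substations).
dist_2 = np.einsum('ijk,ijk->ij', deltas, deltas)

# Position of the nearest substation for each plant.
nearest = dist_2.argmin(axis=1)
df1['assigned_to'] = nearest

# Hypothetical extra column: the substation ID instead of its position.
df1['assigned_ID_ss'] = df2['ID_ss'].to_numpy()[nearest]
```

The intermediate array holds n_plants × n_substations × 2 floats, so for very large inputs this can run out of memory. In that case, processing df1 in chunks or using a spatial index such as scipy.spatial.cKDTree is probably the better choice.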

Speed comparison:

Before:

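%%timeit
for i, row in df1.iterrows():  # A: each power plant
    a = row.y, row.x  # geopy expects (latitude, longitude)
    distances = []
    for j, row2 in df2.iterrows():  # B: each substation
        b = row2.y, row2.x
        distances.append(geopy.distance.geodesic(a, b).km)

    min_distance = min(distances)
    min_index = distances.index(min_distance)

    # store the result for this row only, not for the whole column
    df1.loc[i, 'assigned_to'] = min_index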

gives 8.75 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

After:

%%timeit
df2_array = df2[["x", "y"]].values

df1['assigned_to'] = df1.apply(lambda row: closest_node(np.array([row.x, row.y]), df2_array), axis=1)

gives 1.7 ms ± 31.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
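The two timings compare different metrics: geodesic distance in the loop, and plain Euclidean distance on degrees in the faster version. If the distance on the Earth's surface matters, one possible middle ground is a vectorized haversine (great-circle) distance. This is a sketch under the assumption that a spherical Earth is accurate enough; it is a swapped-in technique, not geopy's ellipsoidal geodesic, and `haversine_km` and `nearest_by_haversine` are hypothetical helper names.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation (assumption)

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between points given in degrees (broadcasts)."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    # clip guards against tiny floating-point overshoot outside [0, 1]
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

def nearest_by_haversine(lon_a, lat_a, lon_b, lat_b):
    """For each point in A, return (position of nearest point in B, distance in km)."""
    lon_a = np.asarray(lon_a, dtype=float)[:, None]   # column vector
    lat_a = np.asarray(lat_a, dtype=float)[:, None]
    lon_b = np.asarray(lon_b, dtype=float)[None, :]   # row vector
    lat_b = np.asarray(lat_b, dtype=float)[None, :]
    dist = haversine_km(lon_a, lat_a, lon_b, lat_b)   # shape (len(A), len(B))
    return dist.argmin(axis=1), dist.min(axis=1)

# Sample coordinates from the question: x is longitude, y is latitude.
plant_lon = [12.644881, 11.563269, 12.644881, 8.153184]
plant_lat = [48.099206, 48.020081, 48.099206, 49.153766]
sub_lon = [9.269, 9.390, 9.317, 10.061]
sub_lat = [55.037, 54.940, 54.716, 54.349]

idx, km = nearest_by_haversine(plant_lon, plant_lat, sub_lon, sub_lat)
```

With the DataFrames from the question, the result could be stored as `df1['assigned_to'] = idx`. Like any all-pairs approach, this builds a full plants × substations distance matrix, so very large inputs may need to be processed in chunks.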