Rewrite an iteration as a pandas .apply function with several inputs
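I have two DataFrames. One contains several power plants with their longitude and latitude, each in its own column. The other contains several substations, also with longitude and latitude. I want to assign each power plant to its nearest substation.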
df1 = pd.DataFrame({'ID_pp': ['p1', 'p2', 'p3', 'p4'], 'x': [12.644881, 11.563269, 12.644881, 8.153184], 'y': [48.099206, 48.020081, 48.099206, 49.153766]})
df2 = pd.DataFrame({'ID_ss': ['s1', 's2', 's3', 's4'], 'x': [9.269, 9.390, 9.317, 10.061], 'y': [55.037, 54.940, 54.716, 54.349]})
I found this solution, which is basically what I need:
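import pandas as pd
import geopy.distance

for i, row in df1.iterrows():  # A: every power plant
    a = row.y, row.x  # geopy expects (latitude, longitude); x is longitude here
    distances = []
    for j, row2 in df2.iterrows():  # B: every substation
        b = row2.y, row2.x
        distances.append(geopy.distance.geodesic(a, b).km)
    min_distance = min(distances)
    min_index = distances.index(min_distance)
    # write the result for this row only, not for the whole column
    df1.loc[i, 'assigned_to'] = min_index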
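It works, but it is very slow, and my dataset is very large. I think an approach using .apply would be faster, but I can't figure out how to use it here. Does anyone know how to rewrite the loop above as an .apply call that needs no explicit iteration?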
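Here is a solution that is about 8 times faster (inspired by this topic). Note that it compares squared Euclidean distances on the raw coordinates rather than geodesic distances: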
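import numpy as np

def closest_node(node, nodes):
    # return the position of the row in `nodes` that is closest to `node`
    nodes = np.asarray(nodes)
    deltas = nodes - node
    # squared Euclidean distance from `node` to every row of `nodes`
    dist_2 = np.einsum('ij,ij->i', deltas, deltas)
    return np.argmin(dist_2)

df2_array = df2[["x", "y"]].values
df1['assigned_to'] = df1.apply(lambda x: closest_node(np.array([x.x, x.y]), df2_array), axis=1)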
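The per-row .apply call can also be replaced by a single broadcasted NumPy computation. What follows is only a minimal sketch and not part of the original answer: the function name closest_nodes and the extra column assigned_id are made up for illustration. It uses the same squared Euclidean metric as closest_node and builds a full (plants x substations) distance matrix, so it assumes that matrix fits in memory.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID_pp': ['p1', 'p2', 'p3', 'p4'],
                    'x': [12.644881, 11.563269, 12.644881, 8.153184],
                    'y': [48.099206, 48.020081, 48.099206, 49.153766]})
df2 = pd.DataFrame({'ID_ss': ['s1', 's2', 's3', 's4'],
                    'x': [9.269, 9.390, 9.317, 10.061],
                    'y': [55.037, 54.940, 54.716, 54.349]})

def closest_nodes(points, nodes):
    """For every row of `points`, return the position of the nearest row of `nodes`.

    Hypothetical helper: same squared Euclidean metric as closest_node,
    but evaluated for all points at once.
    """
    # Broadcasting gives an array of shape (len(points), len(nodes), 2)
    deltas = points[:, None, :] - nodes[None, :, :]
    # Sum of squared coordinate differences for every (point, node) pair
    dist_2 = np.einsum('ijk,ijk->ij', deltas, deltas)
    return dist_2.argmin(axis=1)

nearest = closest_nodes(df1[['x', 'y']].to_numpy(), df2[['x', 'y']].to_numpy())
df1['assigned_to'] = nearest                            # positional index into df2
df1['assigned_id'] = df2['ID_ss'].to_numpy()[nearest]   # illustrative: substation ID
```

On the four sample plants this should pick the same substation positions as the .apply version, since only the way the distances are evaluated changes.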
Speed comparison:
Before:
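%%timeit
for i, row in df1.iterrows():  # A: every power plant
    a = row.y, row.x  # geopy expects (latitude, longitude); x is longitude here
    distances = []
    for j, row2 in df2.iterrows():  # B: every substation
        b = row2.y, row2.x
        distances.append(geopy.distance.geodesic(a, b).km)
    min_distance = min(distances)
    min_index = distances.index(min_distance)
    # write the result for this row only, not for the whole column
    df1.loc[i, 'assigned_to'] = min_index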
gives 8.75 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
After:
%%timeit
df2_array = df2[["x", "y"]].values
df1['assigned_to'] = df1.apply(lambda x: closest_node(np.array([x.x, x.y]), df2_array), axis=1)
gives 1.7 ms ± 31.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
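One caveat on the comparison: the loop measures geodesic distances in kilometres, while the faster version compares squared differences in degrees, and the two can disagree about which substation is nearest, especially far from the equator. If distances on the Earth's surface matter, a vectorized haversine formula is one possible middle ground. This is a sketch under stated assumptions, not the original answer's method: it treats the Earth as a sphere of radius 6371 km, takes x as longitude and y as latitude, and the name nearest_by_haversine is hypothetical.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius, spherical model

def nearest_by_haversine(lon1, lat1, lon2, lat2):
    """Return (index, distance_km) of the nearest point 2 for every point 1.

    Hypothetical helper: great-circle distance on a sphere, which only
    approximates the ellipsoidal geodesic used by geopy.
    """
    lon1, lat1, lon2, lat2 = (np.radians(np.asarray(a, dtype=float))
                              for a in (lon1, lat1, lon2, lat2))
    # Pairwise differences, shape (len(points 1), len(points 2))
    dlon = lon1[:, None] - lon2[None, :]
    dlat = lat1[:, None] - lat2[None, :]
    h = (np.sin(dlat / 2) ** 2
         + np.cos(lat1)[:, None] * np.cos(lat2)[None, :] * np.sin(dlon / 2) ** 2)
    # Clip guards against rounding pushing h marginally outside [0, 1]
    dist_km = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(h, 0.0, 1.0)))
    idx = dist_km.argmin(axis=1)
    return idx, dist_km[np.arange(len(idx)), idx]

# Sample coordinates from the question (x = longitude, y = latitude)
plant_lon = [12.644881, 11.563269, 12.644881, 8.153184]
plant_lat = [48.099206, 48.020081, 48.099206, 49.153766]
sub_lon = [9.269, 9.390, 9.317, 10.061]
sub_lat = [55.037, 54.940, 54.716, 54.349]

idx, dist_km = nearest_by_haversine(plant_lon, plant_lat, sub_lon, sub_lat)
```

Here idx holds positions into the substation list and dist_km the corresponding distances, so the result could be written back with df1['assigned_to'] = idx.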