K-nearest points from two dataframes with GeoPandas
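GeoPandas uses shapely under the hood. To get the nearest neighbor, I have seen nearest_points from shapely being used. However, that approach does not cover the k-nearest points.
I need to compute the distance from the points in one GeoDataFrame to the nearest points in another, and insert the distance into the GeoDataFrame that holds the "from this point" data.
This is my approach using GeoSeries.distance(), without any other package or library. Note that when k == 1 the returned value is simply the distance to the nearest point.
This works well for my data, but I wonder whether there is a better or faster approach, or some other benefit to using shapely or sklearn.neighbors?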
import pandas as pd
import geopandas as gp

# gdf1: GeoDataFrame with a point geometry column (distance FROM these points)
# gdf2: GeoDataFrame with a point geometry column (distance TO these points)

def knearest(from_point, to_points, k):
    # Distance from a single point to every point in to_points
    distlist = to_points.distance(from_point)
    distlist.sort_values(ascending=True, inplace=True)  # To have the closest ones first
    return distlist[:k].mean()

# Loop over the numbers of nearest points to average
for Ks in [1, 2, 3, 4, 5, 10]:
    name = 'dist_to_closest_' + str(Ks)  # to set column name
    gdf1[name] = gdf1.geometry.apply(knearest, args=(gdf2, Ks))
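Yes, but first I must give credit to the University of Helsinki's Automating GIS-processes course; here's the source code. The approach is as follows.
First, read the data, for example to find the nearest bus stop for each building.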
import geopandas as gpd

# Filepaths
stops = gpd.read_file('data/pt_stops_helsinki.gpkg')
# read_gdf_from_zip is a helper that is not defined in this answer
buildings = read_gdf_from_zip('data/building_points_helsinki.zip')
Define the functions; k_neighbors can be adjusted here.
from sklearn.neighbors import BallTree
import numpy as np

def get_nearest(src_points, candidates, k_neighbors=1):
    """Find nearest neighbors for all source points from a set of candidate points"""
    # Create tree from the candidate points
    tree = BallTree(candidates, leaf_size=15, metric='haversine')
    # Find closest points and distances
    distances, indices = tree.query(src_points, k=k_neighbors)
    # Transpose to get distances and indices into arrays
    distances = distances.transpose()
    indices = indices.transpose()
    # Get closest indices and distances (i.e. array at index 0)
    # note: for the second closest points, you would take index 1, etc.
    closest = indices[0]
    closest_dist = distances[0]
    # Return indices and distances
    return (closest, closest_dist)

def nearest_neighbor(left_gdf, right_gdf, return_dist=False):
    """
    For each point in left_gdf, find closest point in right GeoDataFrame and return them.
    NOTICE: Assumes that the input Points are in WGS84 projection (lat/lon).
    """
    left_geom_col = left_gdf.geometry.name
    right_geom_col = right_gdf.geometry.name
    # Ensure that index in right gdf is formed of sequential numbers
    right = right_gdf.copy().reset_index(drop=True)
    # Parse coordinates from points and insert them into a numpy array as RADIANS
    left_radians = np.array(left_gdf[left_geom_col].apply(lambda geom: (geom.x * np.pi / 180, geom.y * np.pi / 180)).to_list())
    right_radians = np.array(right[right_geom_col].apply(lambda geom: (geom.x * np.pi / 180, geom.y * np.pi / 180)).to_list())
    # Find the nearest points
    # -----------------------
    # closest ==> index in right_gdf that corresponds to the closest point
    # dist ==> distance between the nearest neighbors (in radians; converted to meters below)
    closest, dist = get_nearest(src_points=left_radians, candidates=right_radians)
    # Return points from right GeoDataFrame that are closest to points in left GeoDataFrame
    closest_points = right.loc[closest]
    # Ensure that the index corresponds to the one in left_gdf
    closest_points = closest_points.reset_index(drop=True)
    # Add distance if requested
    if return_dist:
        # Convert to meters from radians
        earth_radius = 6371000  # meters
        closest_points['distance'] = dist * earth_radius
    return closest_points
Run the nearest-neighbor analysis
# Find closest public transport stop for each building and get also the distance based on haversine distance
# Note: haversine distance which is implemented here is a bit slower than using e.g. 'euclidean' metric
# but useful as we get the distance between points in meters
closest_stops = nearest_neighbor(buildings, stops, return_dist=True)
Now join the origin and destination data frames
# Rename the geometry of closest stops gdf so that we can easily identify it
closest_stops = closest_stops.rename(columns={'geometry': 'closest_stop_geom'})
# Merge the datasets by index (for this, it is good to use '.join()' -function)
buildings = buildings.join(closest_stops)
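The answer above based on Automating GIS-processes is very good, but it contains an error where the points are converted to a numpy array in RADIANS: latitude and longitude are reversed.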
left_radians = np.array(left_gdf[left_geom_col].apply(lambda geom: (geom.y * np.pi / 180, geom.x * np.pi / 180)).to_list())
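Points are indeed given as (lat, lon), and that is the order the haversine metric expects, but longitude corresponds to the x axis of the plane or sphere and latitude to the y axis, so each pair has to be built as (geom.y, geom.x). The same correction applies to right_radians.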
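To illustrate why the order matters, here is a minimal pure-Python sketch of the haversine formula written with the (lat, lon) convention. The function name haversine_m and the sample coordinates are assumptions made for illustration only; they do not come from the course material.

```python
import math

EARTH_RADIUS_M = 6371000  # same mean Earth radius as in the answer above

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# Hypothetical points at 60 degrees north, one degree of longitude apart
lat, lon_a, lon_b = 60.0, 24.0, 25.0

correct = haversine_m(lat, lon_a, lat, lon_b)  # arguments in (lat, lon) order
swapped = haversine_m(lon_a, lat, lon_b, lat)  # (x, y) order: longitude treated as latitude

print(round(correct), round(swapped))
```

With the coordinates swapped, the one-degree step is measured along a meridian instead of along the 60th parallel, so the result comes out roughly twice as large for these points.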
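If your data are in grid coordinates, the approach is a bit leaner, but with one key gotcha.
Building on sutan's answer and streamlining the block from Uni Helsinki...
To get multiple neighbors, you edit the k_neighbors argument... and you must also hard-code variables in the body of the function (see my additions below 'closest' and 'closest_dist') and add them to the return statement.
So if you want the 2 closest points, it looks like: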
from sklearn.neighbors import BallTree
import numpy as np

def get_nearest(src_points, candidates, k_neighbors=2):
    """
    Find nearest neighbors for all source points from a set of candidate points
    modified from: https://automating-gis-processes.github.io/site/notebooks/L3/nearest-neighbor-faster.html
    """
    # Create tree from the candidate points
    tree = BallTree(candidates, leaf_size=15, metric='euclidean')
    # Find closest points and distances
    distances, indices = tree.query(src_points, k=k_neighbors)
    # Transpose to get distances and indices into arrays
    distances = distances.transpose()
    indices = indices.transpose()
    # Get closest indices and distances (i.e. array at index 0)
    # note: for the second closest points, you would take index 1, etc.
    closest = indices[0]
    closest_dist = distances[0]
    closest_second = indices[1]  # *manually add per comment above*
    closest_second_dist = distances[1]  # *manually add per comment above*
    # Return indices and distances
    return (closest, closest_dist, closest_second, closest_second_dist)
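The hard-coding can probably be avoided by returning the full arrays that BallTree.query produces, which already have shape (number of source points, k). The following is a minimal sketch under that assumption; get_nearest_k and the sample coordinates are hypothetical, and the points are assumed to be in a projected (grid) coordinate system so that the 'euclidean' metric is meaningful.

```python
import numpy as np
from sklearn.neighbors import BallTree

def get_nearest_k(src_points, candidates, k_neighbors=2):
    """Return (indices, distances), each of shape (len(src_points), k_neighbors).

    Column 0 holds the nearest candidate, column 1 the second nearest, and so on.
    """
    tree = BallTree(np.asarray(candidates), leaf_size=15, metric='euclidean')
    # query returns neighbors sorted by distance for each source point
    distances, indices = tree.query(np.asarray(src_points), k=k_neighbors)
    return indices, distances

# Hypothetical grid coordinates, for illustration only
src = [(0.0, 0.0), (10.0, 0.0)]
cand = [(1.0, 0.0), (3.0, 0.0), (9.0, 0.0), (20.0, 0.0)]

idx, dist = get_nearest_k(src, cand, k_neighbors=3)
print(idx)   # row i lists candidate positions ordered by distance from src[i]
print(dist)  # matching distances, in the units of the coordinates
```

The values in idx are positions in the candidate list, so with a GeoDataFrame they pair with positional lookups such as gdf2.iloc[idx[:, 0]].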
The inputs are lists of (x,y) tuples. So, since (per the question title) your data are in a GeoDataFrame:
# easier to read
in_pts = [(row.geometry.x, row.geometry.y) for idx, row in gdf1.iterrows()]
qry_pts = [(row.geometry.x, row.geometry.y) for idx, row in gdf2.iterrows()]
# faster (by about 7X)
in_pts = [(x,y) for x,y in zip(gdf1.geometry.x , gdf1.geometry.y)]
qry_pts = [(x,y) for x,y in zip(gdf2.geometry.x , gdf2.geometry.y)]
I am not interested in the distances, so rather than removing them from the function, I run:
idx_nearest, _, idx_2ndnearest, _ = get_nearest(in_pts, qry_pts)
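and get two arrays of the same length as in_pts that contain, respectively, the index values of the closest and second-closest points in the original GeoDataFrame behind qry_pts.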
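To connect this back to the question, which stores the mean distance to the k closest points for several values of k, the distances only need to be sorted once per source point. The sketch below assumes projected coordinates and uses a brute-force NumPy distance matrix, a simplification chosen for brevity that only suits moderately sized inputs; mean_knn_distances and the sample arrays are hypothetical.

```python
import numpy as np

def mean_knn_distances(from_xy, to_xy, ks):
    """Mean distance from each source point to its k nearest targets, for each k in ks.

    Returns a dict mapping k to an array of length len(from_xy).
    Every k must be between 1 and len(to_xy).
    """
    from_xy = np.asarray(from_xy, dtype=float)
    to_xy = np.asarray(to_xy, dtype=float)
    # Pairwise Euclidean distances, shape (n_from, n_to)
    diff = from_xy[:, None, :] - to_xy[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    dist.sort(axis=1)                      # closest first, sorted once for all k
    running_sum = np.cumsum(dist, axis=1)  # column j = sum of the j+1 nearest distances
    return {k: running_sum[:, k - 1] / k for k in ks}

# Hypothetical grid coordinates, for illustration only
from_xy = [(0.0, 0.0), (10.0, 0.0)]
to_xy = [(1.0, 0.0), (3.0, 0.0), (9.0, 0.0), (20.0, 0.0)]

result = mean_knn_distances(from_xy, to_xy, ks=[1, 2, 3])
for k, values in result.items():
    print('dist_to_closest_' + str(k), values)
```

With GeoDataFrames, the inputs could be built from zip(gdf1.geometry.x, gdf1.geometry.y) as shown above, and each returned array assigned to gdf1['dist_to_closest_' + str(k)]. For large inputs, the sorted distances from a BallTree query with k set to the largest value could replace the brute-force matrix.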