How to sjoin in geopandas using a common key column as well as location
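Suppose I have a dataframe A with two columns: geometry (points) and hour.
Dataframe B also consists of a geometry (shapes) and an hour.
I am familiar with the standard sjoin. What I want is for sjoin to link rows from the two tables only when the hours are the same. In traditional join terms, the key is geometry and hour. How can I achieve this?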
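I have reviewed two logical approaches:
- spatial join followed by a filter on hour
- shard (filter) the data frames on hour first, sjoin each pair of shards, then concatenate the results from the sharded data frames
I then tested that both give equal results and ran some timings.
Conclusions
- the difference in timings on this test data set is modest: in the run shown further down (84,000 points, 379 polygons) sharding took 4.05 s against 6.48 s for the simple version
- simple is quicker if the number of points is small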
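To see the two approaches side by side without any network access, here is a minimal sketch on a toy data set. The squares, points, column values and the function names `join_then_filter` and `shard_then_join` are all made up for illustration and are not part of the benchmark. It relies only on `gpd.sjoin` adding its default `_left` / `_right` suffixes to the shared `hour` column.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point, box

# Hypothetical toy data: two unit squares and four points, each with an hour.
polys = gpd.GeoDataFrame(
    {"name": ["west", "east"], "hour": [1, 2]},
    geometry=[box(0, 0, 1, 1), box(2, 0, 3, 1)],
)
points = gpd.GeoDataFrame(
    {"hour": [1, 2, 2, 1]},
    geometry=[Point(0.5, 0.5), Point(0.5, 0.6), Point(2.5, 0.5), Point(5, 5)],
)


def join_then_filter(left, right):
    """Spatial join first, then keep only rows whose hours agree."""
    joined = gpd.sjoin(left, right)  # inner join, default "intersects" predicate
    return joined.loc[joined["hour_left"] == joined["hour_right"]]


def shard_then_join(left, right):
    """Split both frames by hour, sjoin matching shards, concatenate."""
    # Only hours present on both sides can produce matches.
    hours = sorted(set(left["hour"]) & set(right["hour"]))
    parts = [
        gpd.sjoin(left.loc[left["hour"] == h], right.loc[right["hour"] == h])
        for h in hours
    ]
    return pd.concat(parts)


a = join_then_filter(points, polys).sort_index()
b = shard_then_join(points, polys).sort_index()
print(a.index.tolist(), b.index.tolist())
```

Point 1 falls inside the "west" square but has a different hour, and point 3 lies outside both squares, so both functions should keep only points 0 and 2.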
import pandas as pd
import numpy as np
import geopandas as gpd
import shapely.geometry
import requests

# source some points and polygons
# fmt: off
dfp = pd.read_html("https://www.latlong.net/category/cities-235-15.html")[0]
dfp = gpd.GeoDataFrame(dfp, geometry=dfp.loc[:,["Longitude", "Latitude",]].apply(shapely.geometry.Point, axis=1))
res = requests.get("https://opendata.arcgis.com/datasets/69dc11c7386943b4ad8893c45648b1e1_0.geojson")
df_poly = gpd.GeoDataFrame.from_features(res.json())
# fmt: on

# bulk up number of points
dfp = pd.concat([dfp for _ in range(1000)]).reset_index()
HOURS = 24
# assign a random hour to every point and every polygon
dfp["hour"] = np.random.randint(0, HOURS, len(dfp))
df_poly["hour"] = np.random.randint(0, HOURS, len(df_poly))

# approach 1: spatial join, then keep rows where the hours match
def simple():
    return gpd.sjoin(dfp, df_poly).loc[lambda d: d["hour_left"] == d["hour_right"]]

# approach 2: filter both frames to one hour at a time, sjoin, concatenate
def shard():
    return pd.concat(
        [
            gpd.sjoin(*[d.loc[d["hour"].eq(h)] for d in [dfp, df_poly]])
            for h in range(HOURS)
        ]
    )

print(f"""length test: {len(simple()) == len(shard())} {len(simple())}
dataframe test: {simple().sort_index().equals(shard().sort_index())}
points: {len(dfp)}
polygons: {len(df_poly)}""")

# %timeit is an IPython / Jupyter magic, so run these two lines in a notebook
%timeit simple()
%timeit shard()
Output
length test: True 3480
dataframe test: True
points: 84000
polygons: 379
6.48 s ± 311 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.05 s ± 34.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
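The timings above come from the IPython `%timeit` magic. If you run this as a plain script instead, a rough equivalent can be put together with the standard-library `timeit` module. This is only a sketch: `time_call` and `workload` are hypothetical names, and the stand-in workload exists just to keep the snippet self-contained. In practice you would pass `simple` and `shard` instead.

```python
import timeit


def time_call(fn, repeat=7, number=1):
    """Return (mean, best) seconds per call over `repeat` timing runs."""
    runs = timeit.repeat(fn, repeat=repeat, number=number)
    per_call = [r / number for r in runs]
    return sum(per_call) / len(per_call), min(per_call)


# Stand-in workload (hypothetical); replace with simple / shard.
def workload():
    return sum(range(10_000))


mean_s, best_s = time_call(workload)
print(f"workload: mean {mean_s:.6f} s, best {best_s:.6f} s")
```

The defaults of 7 repeats with 1 loop each mirror the "7 runs, 1 loop each" reported in the output above.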