How can I accelerate a geopandas spatial join?
I have two geopandas dataframes. For each row in the left frame, I want to find which rows in the right frame spatially overlap it, and by how much. Once I have that information, I will be able to perform a spatial join weighted by the degree of overlap.
Unfortunately, I am doing this on a large number of polygons: all of the US census tracts for a state (Texas has 5,265 of them) and a large number of similarly sized (but not coincident) polygons, the US census blocks (Texas has ~914,231 of those).
I am looking for a way to do this faster. My code so far is below.
The datasets used can be obtained from the US Census: blocks data, tracts data.
#!/usr/bin/env python3
import geopandas as gpd
import time
import os
import pickle
import sys
os.environ["GDAL_DATA"] = "/usr/share/gdal"
TRACT_FILE = "./data/tracts/tl_2010_{fips}_tract10.shp"
BLOCK_FILE = "./data/blocks/tabblock2010_{fips}_pophu.shp"
PROJECTION = '+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs'
print("Reading data...")
start_time = time.time()
tracts = gpd.read_file(TRACT_FILE.format(fips=48))
blocks = gpd.read_file(BLOCK_FILE.format(fips=48))
print('Time: ', time.time()-start_time )
print("Converting coordinate reference systems...")
start_time = time.time()
tracts = tracts.to_crs(PROJECTION)
blocks = blocks.to_crs(PROJECTION)
print('Time: ', time.time()-start_time )
print("Performing spatial join...")
start_time = time.time()
joined = gpd.sjoin(tracts, blocks, how='left')
print('Time: ', time.time()-start_time )
print("Calculating areas of intersection...")
start_time = time.time()
joined['area_of_intersect'] = [row['geometry'].intersection(blocks.loc[row['index_right']]['geometry']).area for i,row in joined.iterrows()]
print('Time: ', time.time()-start_time )
There are several optimizations that can make this operation faster: doing all of the work in C++ without involving Python, using a spatial index to quickly identify candidates for intersection, using prepared geometries to quickly check those candidates, and parallelizing the whole operation across the available cores.
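For reference, here is what the index-plus-prepared-geometries idea looks like in pure Python, using geopandas' built-in spatial index (sindex) and shapely.prepared. This is a single-threaded sketch for illustration, not the module built below; the tracts/blocks variable names follow the question's code:

from shapely.prepared import prep

rows = []                                    # (tract_pos, block_pos, area) triples
block_index = blocks.sindex                  # R-tree over the block bounding boxes
for t_pos, tract_geom in enumerate(tracts.geometry):
    ptract = prep(tract_geom)                # prepared geometry: fast predicate checks
    # Candidate blocks whose bounding boxes overlap this tract's bounding box
    for b_pos in block_index.intersection(tract_geom.bounds):
        block_geom = blocks.geometry.iloc[b_pos]
        if ptract.contains(block_geom):      # block entirely inside the tract
            rows.append((t_pos, b_pos, block_geom.area))
        elif ptract.intersects(block_geom):  # partial overlap: compute the intersection
            rows.append((t_pos, b_pos, tract_geom.intersection(block_geom).area))

The contains check avoids computing an intersection geometry in the common case where a block sits entirely inside one tract.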
All of this could be done in Python, but with the downside of some overhead; moreover, in my testing, using the multiprocessing module on your multi-gigabyte dataset caused Python to run out of available memory. On Windows that is probably unavoidable; on Linux, copy-on-write should prevent it. It could perhaps be done with careful programming, but the whole point of using Python is not having to worry about such details. So I chose to move the computation to C++.
To do this, I used pybind11 to build a new Python module that accepts lists of geometries from geopandas and produces three lists as output: (1) the row indices of the left-hand geometries; (2) the row indices of the right-hand geometries; (3) the area of overlap between the two (only if it is > 0).
For example, for an input with geometries left=[A,B,C,D] and right=[E,F,G,H],
let:
- E be entirely within A
- F overlap A and B
- G and H overlap nothing
Then the return looks like:
List1 List2 List3
A     E     Area(E)
A     F     AreaIntersection(A,F)
B     F     AreaIntersection(B,F)
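Because the three lists come back index-aligned, they can be stitched onto the original dataframes with plain pandas. A minimal sketch, assuming lefts, rights and areas are the lists returned by fast_sjoin below, and that the Census shapefiles carry GEOID10/BLOCKID10 identifier columns (an assumption; adjust to your data):

import pandas as pd

# Hypothetical post-processing of fast_sjoin's output: lefts/rights hold
# positional row indices into tracts/blocks; areas are in the projection's
# units (square metres for the projection used here).
overlap = pd.DataFrame({
    "tract_pos": lefts,
    "block_pos": rights,
    "area_of_intersect": areas,
})
# Pull identifying columns across by position (column names assumed).
overlap["tract_geoid"] = tracts["GEOID10"].to_numpy()[overlap["tract_pos"]]
overlap["block_id"]    = blocks["BLOCKID10"].to_numpy()[overlap["block_pos"]]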
On my machine, the sjoin operation took 73 seconds and computing the intersections took 1,066 seconds, for a total of 1,139 seconds (19 minutes).
On my 12-core machine, the code below does all of this in 50 seconds.
So, for a spatial join that does not need the areas of intersection, this saves only a little time. For a spatial join that does need the areas of intersection, however, it saves a great deal of time. Put another way: computing all of those intersections is a lot of work!
In further testing, computing the areas of intersection without the prepared-geometry acceleration took 287 seconds on 12 cores. So parallelizing the intersections alone gives roughly a 4x speedup, while parallelizing with prepared geometries gives roughly a 23x speedup.
Makefile
all:
$(CXX) -O3 -g -shared -std=c++11 -I include `python3-config --cflags --ldflags --libs` quick_join.cpp -o geopandas_fast_sjoin.so -fPIC -Wall -lgeos -fopenmp
quick_join.cpp
#define GEOS_USE_ONLY_R_API 1
#include <geos/geom/Geometry.h>
#include <geos/geom/prep/PreparedGeometry.h>
#include <geos/geom/prep/PreparedGeometryFactory.h>
#include <geos/index/strtree/STRtree.h>
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>
#include <memory>
#ifdef _OPENMP
#include <omp.h>
#else
#define omp_get_max_threads() 1
#define omp_get_thread_num() 0
#endif
///Fast Spatial Joins
///
///@params gp_left List of GEOS geometry pointers from the left data frame.
/// The code works best if gp_left is comprised of relatively
/// fewer and relatively larger geometries.
///@params gp_right List of GEOS geometry pointers from the right data frame
/// The code works best if gp_right is comprised of relatively
/// more numerous and relatively smaller geometries.
///
///The list of GEOS geometry pointers can be acquired with
/// geos_pointers = [x._geom for x in df['geometry']]
///
///A common task in GeoPandas is taking two dataframes and combining their
///contents based on how much the contents' geometries overlap. However, this
///operation is slow in GeoPandas because most of it is performed in Python.
///Here, we offload the entire computation to C++ and use a number of techniques
///to achieve good performance.
///
///Namely, we create a spatial index from the left-hand geometries. For each
///geometry from the right-hand side, this allows us to very quickly find which
///geometries on the left-hand side it might overlap with. For each geometry on
///the left-hand side, we create a "prepared geometry"; this accelerates simple
///spatial queries, such as checking for containment or disjointedness, by an
///order of magnitude. Finally, we parallelize the entire computation across all
///of the computer's threads.
///
///@return Three lists: (1) Row indices of left-hand geometries; (2) Row indices
/// of right-hand geometries; (3) Area of overlap between the two (only
/// if >0).
///
///For example, for an input with geometries left=[A,B,C,D] and right=[E,F,G,H],
///let:
/// * E be entirely in A
/// * F overlap with A and B
/// * G and H overlap nothing
///
///Then the return looks like:
///List1 List2 List3
///A E Area(E)
///A F AreaIntersection(A,F)
///B F AreaIntersection(B,F)
pybind11::tuple fast_sjoin(pybind11::list gp_left, pybind11::list gp_right) {
typedef geos::geom::Geometry Geometry;
typedef geos::geom::prep::PreparedGeometry PGeometry;
//These are our return values
std::vector<size_t> lefts; //Indices of geometries from the left
std::vector<size_t> rights; //Indices of the overlapping geometries from the right
std::vector<double> areas; //Area of the overlap (if it is >0)
//If either list is empty, there is nothing to do
if(gp_left.size()==0 || gp_right.size()==0)
return pybind11::make_tuple(lefts, rights, areas);
//Used to cache pointers to geometries so we don't have to constantly be doing
//conversions
std::vector<const Geometry* > lgeoms;
std::vector<const Geometry* > rgeoms;
//Prepared geometries can massively accelerate computation by providing quick
//predicate checks on whether one geometry contains another. Here, we stash
//the prepared geometries. Unfortunately, GEOS 3.6.2's prepared geometries are
//not reentrant. So we make a private set of prepared geometries for each
//thread
std::vector<std::vector<const PGeometry*>> lpgeoms(omp_get_max_threads());
//Used for creating prepared geometries
geos::geom::prep::PreparedGeometryFactory preparer;
//For each geometry on the left, convert the input (a Python integer) into a
//geometry pointer, then create a prepared geometry
for(size_t i=0;i<gp_left.size();i++){
const size_t ptr_add = gp_left[i].cast<size_t>(); //Item is an integer
const Geometry* geom = reinterpret_cast<Geometry*>(ptr_add);
for(int t=0;t<omp_get_max_threads();t++)
lpgeoms.at(t).push_back(preparer.prepare(geom));
lgeoms.push_back(geom);
}
//For each geometry on the right, convert the input (a Python integer) into a
//geometry pointer
for(size_t i=0;i<gp_right.size();i++){
const size_t ptr_add = gp_right[i].cast<size_t>(); //Item is an integer
const Geometry* geom = reinterpret_cast<Geometry*>(ptr_add);
rgeoms.push_back(geom);
}
//The STRtree spatial index stores rectangles and allows us to quickly find
//all the rectangles that overlap with a query rectangle. We leverage this
//here by inserting the bounding boxes of all of the left geometries into a
//spatial index. The spatial index also allows us to store a pointer to a data
//structure; this pointer is returned if a query finds a hit or hits. We abuse
//this capability by using the pointer to store the array index containing the
//left geometry. This allows us to quickly find both the left geometry and its
//associated prepared geometry.
geos::index::strtree::STRtree index;
for(size_t i=0;i<lgeoms.size();i++)
index.insert(lgeoms[i]->getEnvelopeInternal(), reinterpret_cast<void*>(i));
//Once all of the geometries are inserted into the spatial index, the index
//must be built. This must be done in serial since GEOS does not have
//protection against multiple threads trying it. (This is logical since it
//eliminates a lock that would otherwise slow queries, but a better design
//would probably have been to throw an exception.) The GEOS spatial tree also
//lacks an explicit command for building the tree (wtf), so here we perform a
//meaningless, single-threaded query to ensure the tree gets built.
{
std::vector<void *> results;
index.query( lgeoms[0]->getEnvelopeInternal(), results );
}
//Each query to the spatial index populates a predefined vector with the
//results of the query. We define this vector here, outside of the loop, to
//avoid the memory cost of reallocating on each iteration of the loop.
std::vector<void *> results;
//These are custom OpenMP reduction operators for combining vectors together
//following the parallel section. We leverage them to have each thread build
//its own private result vectors which are afterwards combined into a single
//result.
#pragma omp declare reduction(merge : std::vector<size_t> : omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))
#pragma omp declare reduction(merge : std::vector<double> : omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))
//Now we loop through all of the geometries on the right-hand side. We do this
//in parallel because we assume there are many of them.
#pragma omp parallel for default(none) shared(rgeoms,index,lgeoms,lpgeoms) private(results) reduction(merge:lefts) reduction(merge:rights) reduction(merge:areas)
for(unsigned int r=0;r<rgeoms.size();r++){
const Geometry *const rgeom = rgeoms.at(r);
index.query( rgeom->getEnvelopeInternal(), results );
for(const auto q: results){
//results is a list of "pointers". But we abused the pointers by using
//them to stash array indices. Let's unpack the "pointers" into indices
//now.
const size_t lnum = reinterpret_cast<size_t>(q);
const Geometry *const lgeom = lgeoms.at(lnum);
const PGeometry *const lpgeom = lpgeoms.at(omp_get_thread_num()).at(lnum);
if(lpgeom->contains(rgeom)){
//The right-hand geometry is entirely inside the left-hand geometry
lefts.push_back(lnum);
rights.push_back(r);
areas.push_back(rgeom->getArea());
} else if(lpgeom->disjoint(rgeom)){
//The right-hand geometry is entirely outside the left-hand geometry
} else {
//The right-hand geometry is partially inside and partially outside the
//left-hand geometry, so we get the area of intersection of the two.
std::unique_ptr<Geometry> igeom(rgeom->intersection(lgeom));
const auto iarea = igeom->getArea();
lefts.push_back(lnum);
rights.push_back(r);
areas.push_back(iarea);
}
}
//Clear the results vector so we're ready for the next iteration. Note that
//clearing it does not release its memory, so after the first few iterations
//we should no longer be performing allocations.
results.clear();
}
return pybind11::make_tuple(lefts, rights, areas);
}
PYBIND11_MODULE(geopandas_fast_sjoin,m){
m.doc() = "Fast spatial joins";
m.def("fast_sjoin", &fast_sjoin, "Performs a fast spatial join");
}
test.py
#!/usr/bin/env python3
import geopandas as gpd
import geopandas_fast_sjoin as gpfsj
import time
import os
import pickle
import sys
os.environ["GDAL_DATA"] = "/usr/share/gdal"
DATA_DIR = "./data/"
TRACT_FILE = "./data/tracts/tl_2010_{fips}_tract10.shp"
BLOCK_FILE = "./data/blocks/tabblock2010_{fips}_pophu.shp"
PRECINCT_FILE = "./data/precincts/precincts2008/USA_precincts.shp"
STATES_FILE = "./data/states/tl_2010_us_state10.shp"
PROJECTION = '+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs'
print("Reading data...")
start_time = time.time()
tracts = gpd.read_file(TRACT_FILE.format(fips=48))
blocks = gpd.read_file(BLOCK_FILE.format(fips=48))
print('Time: ', time.time()-start_time )
print("Converting coordinate reference systems...")
start_time = time.time()
tracts = tracts.to_crs(PROJECTION)
blocks = blocks.to_crs(PROJECTION)
print('Time: ', time.time()-start_time )
print("Performing spatial join...")
start_time = time.time()
joined = gpd.sjoin(tracts, blocks, how='left')
joined['area_of_intersect'] = [row['geometry'].intersection(blocks.loc[row['index_right']]['geometry']).area for i,row in joined.iterrows()]
print('Time: ', time.time()-start_time )
# pickle.dump( (blocks,tracts), open( "save.p", "wb" ) )
# sys.exit(-1)
# blocks, tracts = pickle.load( open( "save.p", "rb" ) )
print("Getting geometries...")
start_time = time.time()
tgeoms = [x._geom for x in tracts['geometry']]
bgeoms = [x._geom for x in blocks['geometry']]
print('Time: ', time.time()-start_time )
print("Running example")
start_time = time.time()
lefts, rights, areas = gpfsj.fast_sjoin(tgeoms,bgeoms)
print('Time: ', time.time()-start_time )