What is the most efficient way to compute "colocation" in BigQuery?

Question

Suppose you have a table of the form:

 vehicle_id | timestamp | lat | lon

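What is the most efficient way to write a query that computes "colocation"? Colocation means that two vehicles are at nearly the same location at the same time.

What I'm doing now is first creating a cell_id from a grid (for example, by rounding lat/lon to the fourth decimal place) and then grouping by cell_id (and time). Is there a more efficient approach?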

Answer 1

I would suggest using GeoHash. Here it is demonstrated on the NYC taxi data, with time grouped by hour:

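-- Bucket pickups by hour and by geohash cell, then keep the 10 busiest buckets.
WITH top_pickup_locations AS (
SELECT
  TIMESTAMP_TRUNC(pickup_datetime, HOUR) AS hour,
  -- ST_GeogPoint takes longitude first, then latitude.
  ST_GeoHash( ST_GeogPoint(pickup_longitude, pickup_latitude), 15 ) AS geohash,
  COUNT(*) AS num_pickups
FROM `bigquery-public-data.new_york.tlc_green_trips_2013`
GROUP BY hour, geohash
ORDER BY num_pickups DESC
LIMIT 10
)
SELECT
  hour,
  -- Decode the geohash back to a point (the center of the cell).
  ST_GeogPointFromGeoHash(geohash) AS location,
  num_pickups
FROM top_pickup_locations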

To learn more about GeoHash, see https://en.wikipedia.org/wiki/Geohash. A longer geohash means smaller cells, so increase the number of characters (I used 15) to get finer precision.
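Applied to the vehicle table from the question, the same idea could be sketched as a self-join on the time bucket and the geohash. This is only an illustration under stated assumptions: the inline `positions` rows are made-up sample data standing in for the real table, the timestamp column is called `ts` here, and the geohash length of 8 and the hourly bucket are arbitrary choices that you would tune to your own definition of "nearly the same place and time".

```sql
-- Hypothetical sample rows standing in for the question's table
-- (vehicle_id, timestamp, lat, lon); replace this CTE with your own table.
WITH positions AS (
  SELECT * FROM UNNEST([
    STRUCT('A' AS vehicle_id, TIMESTAMP '2020-01-01 10:05:00' AS ts,
           40.7128 AS lat, -74.0060 AS lon),
    STRUCT('B', TIMESTAMP '2020-01-01 10:20:00', 40.7128, -74.0060),
    STRUCT('C', TIMESTAMP '2020-01-01 10:30:00', 40.8000, -73.9000)
  ])
),
-- One row per vehicle per (time bucket, geohash cell).
cells AS (
  SELECT DISTINCT
    vehicle_id,
    TIMESTAMP_TRUNC(ts, HOUR) AS hour_bucket,
    -- Geohash length 8 is an assumed precision, not a recommendation.
    ST_GEOHASH(ST_GEOGPOINT(lon, lat), 8) AS geohash
  FROM positions
)
-- Pair up distinct vehicles that share both the time bucket and the cell.
SELECT
  a.hour_bucket,
  a.geohash,
  a.vehicle_id AS vehicle_a,
  b.vehicle_id AS vehicle_b
FROM cells AS a
JOIN cells AS b
  ON a.hour_bucket = b.hour_bucket
  AND a.geohash = b.geohash
  AND a.vehicle_id < b.vehicle_id  -- report each pair once, skip self-pairs
```

With the sample rows, only vehicles A and B share a cell and an hour. Note that any grid-based approach, including this one, misses two vehicles that are close together but sit on opposite sides of a cell boundary or a bucket boundary.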

Another option is to use ST_SnapToGrid() instead of the geohash:

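-- Same aggregation, but the cell comes from snapping points to a lat/lon grid.
WITH top_pickup_locations AS (
SELECT
  TIMESTAMP_TRUNC(pickup_datetime, HOUR) AS hour,
  -- Snap to a 0.0001-degree grid, then serialize to GeoJSON text
  -- so the cell can serve as a grouping key.
  ST_ASGeoJson(ST_SnapToGrid( ST_GeogPoint(pickup_longitude, pickup_latitude), 0.0001)) AS cellid,
  COUNT(*) AS num_pickups
FROM `bigquery-public-data.new_york.tlc_green_trips_2013`
GROUP BY hour, cellid
ORDER BY num_pickups DESC
LIMIT 10
)
SELECT
  hour,
  -- Parse the GeoJSON text back into a GEOGRAPHY point.
  ST_GeogFromGeoJson(cellid) AS location,
  num_pickups
FROM top_pickup_locations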

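When I ran these, the geohash approach took 11 seconds of slot time, while the snap-to-grid approach took 57 seconds of slot time. The 15 characters of geohash and the 4 decimal digits of lat/lon are roughly similar in terms of the number of groups.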
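If you want to check how the two cell definitions compare on your own data, one possible sketch is to count the distinct cells each produces. It reuses the table and parameters from the queries above; the output column names `geohash_cells` and `grid_cells` are made up for this illustration, and no result is shown here because it was not run as part of this answer.

```sql
-- Count the distinct cells produced by each method over the same table.
SELECT
  COUNT(DISTINCT
    ST_GEOHASH(ST_GEOGPOINT(pickup_longitude, pickup_latitude), 15)
  ) AS geohash_cells,
  COUNT(DISTINCT
    ST_ASGEOJSON(
      ST_SNAPTOGRID(ST_GEOGPOINT(pickup_longitude, pickup_latitude), 0.0001))
  ) AS grid_cells
FROM `bigquery-public-data.new_york.tlc_green_trips_2013`
```

Comparing the two counts alongside the slot time reported in the query's execution details gives a rough basis for choosing the geohash length or grid size.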

sql

gis

google-bigquery