Cloud Datastore: avoid exploding indexes on a very simple table

Question

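I'm trying to use Google Cloud Datastore to store METAR observations (airport weather observations), but I'm running into what I think are exploding indexes. The index for station_id (a 4-character string) is 20 times larger than the actual data itself. The database will grow by about 250,000 entities per day, so index size will become a problem.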

Table

 - observation_time (Date / Time) - indexed
 - raw_text (String) (which is ~200 characters) - unindexed
 - station_id (String) (which is always 4 characters) - indexed

Composite index:

  - station_id (ASC), observation_time (ASC)

Query

The only query I ever run is:

query.add_filter('station_id', '=', station_icao)
query.add_filter('observation_time', '>=', before)
query.add_filter('observation_time', '<=', after)

where before and after are datetime values.
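For readers unfamiliar with how a composite index on (station_id ASC, observation_time ASC) can serve this query, here is a minimal in-memory sketch. It is only an analogy, not how Datastore is implemented: the index is modeled as a sorted list of (station_id, observation_time) keys, so an equality filter on the station plus a range on the time becomes one contiguous slice. The names `build_index` and `range_scan`, the `start`/`end` parameters, and the sample stations are all made up for illustration.

```python
import bisect
from datetime import datetime


def build_index(entities):
    """Return parallel lists: sorted (station_id, observation_time) keys and entity ids."""
    rows = sorted(
        (e["station_id"], e["observation_time"], entity_id)
        for entity_id, e in entities.items()
    )
    keys = [(station, when) for station, when, _ in rows]
    ids = [entity_id for _, _, entity_id in rows]
    return keys, ids


def range_scan(keys, ids, station_id, start, end):
    """Ids for one station with start <= observation_time <= end (inclusive)."""
    lo = bisect.bisect_left(keys, (station_id, start))
    hi = bisect.bisect_right(keys, (station_id, end))
    return ids[lo:hi]


if __name__ == "__main__":
    # Hypothetical sample data; the ids and timestamps are invented.
    sample = {
        1: {"station_id": "EGLL", "observation_time": datetime(2020, 1, 1, 0, 0)},
        2: {"station_id": "EGLL", "observation_time": datetime(2020, 1, 1, 12, 0)},
        3: {"station_id": "KJFK", "observation_time": datetime(2020, 1, 1, 6, 0)},
    }
    keys, ids = build_index(sample)
    print(range_scan(keys, ids, "EGLL", datetime(2020, 1, 1), datetime(2020, 1, 2)))
```

Because the keys are sorted first by station and then by time, all matching entries sit next to each other, which is why the query needs one index entry per entity and nothing more.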

Index sizes

name               type         count         size      index size
observation_time   Date/Time    1,096,184     26.14MB   313.62MB    
station_id         String       1,096,184     16.73MB   294.8MB

Datastore reports:

Resource           Count        Size
Entities           1,096,184    244.62MB
Built-in-indexes   5,488,986    740.63MB
Composite indexes  1,096,184    137.99MB

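Help

I guess my first question is: what am I missing? I assume I'm doing something unoptimized, but I can't figure out what. Query time is not a pressing concern here, as long as lookups stay under ~2 seconds.

Can I simply remove the built-in indexes, and will the composite index keep working?

I've read through Google and Stack Overflow, but I can't seem to wrap my head around this. The only reason I haven't just tried removing all the built-in indexes is that it takes a fairly long time to download, un-index, and put all the data, after which I have to wait 48 hours for the dashboard summary to update. In other words, it would take days before I see a result.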

Answer 1

As +Jeffrey Rennie pointed out, "exploding indexes" is a very specific term that does not apply here.

You can see how storage size is calculated in our documentation here, and then apply that to your example to see where the sizes add up.
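Before working through the documented formula, it can help to reduce the numbers reported in the question to per-entity figures. The sketch below uses only values copied from the question; the growth estimate additionally assumes that sizes scale linearly with entity count, which is an assumption rather than something the dashboard states.

```python
# Figures copied from the question (sizes as reported, in MB).
ENTITY_COUNT = 1_096_184
ENTITY_MB = 244.62
BUILTIN_ENTRIES = 5_488_986
BUILTIN_MB = 740.63
COMPOSITE_MB = 137.99
STATION_ID_DATA_MB = 16.73
STATION_ID_INDEX_MB = 294.8
ENTITIES_PER_DAY = 250_000  # growth rate stated in the question


def summarize():
    """Derive per-entity ratios and a daily growth estimate from the report."""
    total_mb = ENTITY_MB + BUILTIN_MB + COMPOSITE_MB
    return {
        "builtin_entries_per_entity": BUILTIN_ENTRIES / ENTITY_COUNT,
        "station_id_index_to_data_ratio": STATION_ID_INDEX_MB / STATION_ID_DATA_MB,
        "index_to_entity_ratio": (BUILTIN_MB + COMPOSITE_MB) / ENTITY_MB,
        # Assumes storage grows linearly with the number of entities.
        "daily_growth_mb": total_mb / ENTITY_COUNT * ENTITIES_PER_DAY,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.2f}")
```

With these figures there are about five built-in index entries per entity, the station_id index is roughly 17.6 times the size of the station_id data, and total storage would grow by something on the order of 256 MB per day.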

TL;DR: You can save space by using slightly more concise (but still readable!) property names. For example, rename observation_time to observation, and so on.

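Key things to keep in mind:

 - To have a composite index, the individual properties must themselves be indexed, so don't remove the built-in indexes or the composite index will stop working.
 - Built-in indexes are written twice: once for ascending order and once for descending order.
 - Kind names and property names are strings that are stored in every index entry, so the longer they are, the larger the index.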
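To get a feel for how the last two points interact, here is a rough sketch of the effect of a rename. It uses a deliberately simplified model that is an assumption, not the documented formula: each built-in index entry stores the kind name, the property name, the value, and some fixed overhead, and every entity contributes two such entries (ascending and descending). The kind name "Observation", the 8-byte timestamp size, and the function names are hypothetical.

```python
ENTITY_COUNT = 1_096_184  # entity count reported in the question


def utf8_len(text: str) -> int:
    """Number of bytes needed to store `text` as UTF-8."""
    return len(text.encode("utf-8"))


def builtin_index_bytes(kind, prop, value_bytes, entity_count, overhead_bytes=0):
    """Simplified size model (an assumption, not the documented formula).

    Each entry holds the kind name, the property name, the value, and a fixed
    overhead; there are two entries per entity (ascending + descending).
    """
    entry = utf8_len(kind) + utf8_len(prop) + value_bytes + overhead_bytes
    return 2 * entity_count * entry


def rename_saving_bytes(kind, old_prop, new_prop, value_bytes, entity_count):
    """Bytes saved in the built-in indexes by renaming a single property."""
    before = builtin_index_bytes(kind, old_prop, value_bytes, entity_count)
    after = builtin_index_bytes(kind, new_prop, value_bytes, entity_count)
    return before - after


if __name__ == "__main__":
    # "Observation" and the 8-byte value size are hypothetical placeholders.
    saved = rename_saving_bytes(
        "Observation", "observation_time", "observation", 8, ENTITY_COUNT
    )
    print(f"saved: {saved} bytes")
```

Under this model the saving depends only on the difference in name length (5 bytes here), doubled for the two sort orders and multiplied by the entity count. The real figure should be worked out from the storage size documentation.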
