One-to-Many Relationship in BigQuery

Suppose I have entities A, B, and C, with one-to-many relationships between them, that I want to store in BigQuery.

A -- (one to many) --> B -- (one to many) --> C

For a "regular" SQL database, I would create tables A, B, and C with primary keys, plus foreign keys in A and B based on the primary keys of B and C.

Does this approach work for BigQuery? Or would it be better to denormalize the structure and store A, B, and C in a single table?

Suppose each kind of produce can be grown on several farms, and each farm has many employees.

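In BigQuery, there is nothing wrong with having 3 tables and relationships between them, but you may also want to take advantage of BigQuery's support for nested and repeated columns.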
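For comparison, here is a minimal sketch of the three-table layout, with the relationships expressed as shared ID columns and resolved by joins at query time. The table names (`produce`, `farm`, `employee`), the `produce_id` column, and the inline sample rows are assumptions made for illustration; they are not part of the original question.

```sql
-- Hypothetical normalized layout: one table per entity, linked by ID columns.
-- Inline CTEs stand in for real tables so the query is self-contained.
WITH produce AS (
  SELECT 1 AS produce_id, 'tomato' AS name
  UNION ALL SELECT 2, 'lettuce'
),
farm AS (
  SELECT 'farm1' AS farm_id, 1 AS produce_id
  UNION ALL SELECT 'farm4', 2
),
employee AS (
  SELECT 'employee1' AS name, 'farm1' AS farm_id
  UNION ALL SELECT 'employee2', 'farm1'
  UNION ALL SELECT 'employee7', 'farm4'
)
-- Every query that needs all three levels has to repeat both joins.
SELECT
  p.name AS produce,
  f.farm_id AS farm_id,
  e.name AS employee
FROM produce AS p
JOIN farm AS f ON f.produce_id = p.produce_id
JOIN employee AS e ON e.farm_id = f.farm_id;
```

This works, but the joins run each time the data is read, which is the cost the nested layout tries to avoid.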

For this made-up example, we could model it as follows:

SELECT 'tomato' produce, STRUCT<farm ARRAY<STRUCT<farm_id string, employee ARRAY<STRUCT<name string>>>>>(
  [
    STRUCT('farm1' AS farm_id, [STRUCT('employee1' AS name), STRUCT('employee2')] AS employee ) 
     , ('farm2', [STRUCT('employee3' AS name), STRUCT('employee4')])
     , ('farm3', [STRUCT('employee5' AS name), STRUCT('employee6')])
  ]) AS farms
UNION ALL
SELECT 'lettuce', STRUCT<ARRAY<STRUCT<farm_id string, employee ARRAY<STRUCT<name string>>>>>(
  [
    STRUCT('farm4' AS farm_id, [STRUCT('employee7' AS name), STRUCT('employee8')] AS employee ) 
     , ('farm5', [STRUCT('employee9' AS name)])
  ]) AS farms
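To read such a table back as flat rows, the arrays have to be expanded with `UNNEST`. The sketch below assumes the nested rows live in a table called `produce_farms` (a hypothetical name, built inline here from a shortened copy of the tomato row) with the same `farms.farm[].employee[]` shape as the query above.

```sql
-- Hypothetical table name; the CTE rebuilds a shortened version of the
-- 'tomato' row so the query can be run on its own.
WITH produce_farms AS (
  SELECT
    'tomato' AS produce,
    STRUCT(
      [
        STRUCT('farm1' AS farm_id,
               [STRUCT('employee1' AS name), STRUCT('employee2' AS name)] AS employee),
        STRUCT('farm2' AS farm_id,
               [STRUCT('employee3' AS name), STRUCT('employee4' AS name)] AS employee)
      ] AS farm
    ) AS farms
)
-- Each UNNEST is correlated with the row (or struct) to its left,
-- so no join keys are needed to walk from produce to farm to employee.
SELECT
  t.produce,
  f.farm_id,
  e.name AS employee
FROM produce_farms AS t,
  UNNEST(t.farms.farm) AS f,
  UNNEST(f.employee) AS e;
```

The result has one row per (produce, farm, employee) combination, which is the same shape a three-table join would return.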

Q: Does it make sense to model it this way?

A: It depends.

As Lloyd puts it:

Nested records have a couple of advantages when scanning over a distributed dataset. First, they do not require joins. This means that computations can be faster and scan much less data than if you had to rejoin the extra data each time you use it.

Nested structures are essentially pre-joined tables. And, because data is stored columnarly, if you don't reference the nested column, there is no added expense to the query. If you do reference the nested column, the logic is identical to a colocated join.

The other advantage that nested structures bring is that they avoid repeating data that would have to be repeated in a wide, denormalized table. In other words, for a person who's lived in five cities, a wide denormalized table would contain all of their info in five rows (one for each of the cities they'd lived in). In a nested structure, the repeated information only takes one row, since the array of five cities can be contained in a single row and only unnested when needed.
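The five-cities example from the quote can be sketched as follows. The `people` table, its column names, and the placeholder values are assumptions for illustration only. The first statement works on the packed array without expanding it; the second expands it only because the query needs one row per city.

```sql
-- Statement 1: the person occupies a single row; the array is never expanded.
WITH people AS (
  SELECT
    'person1' AS name,
    ['city1', 'city2', 'city3', 'city4', 'city5'] AS cities
)
SELECT name, ARRAY_LENGTH(cities) AS city_count
FROM people;

-- Statement 2: unnest only when per-city rows are actually needed.
-- This yields five rows, like the wide denormalized table would store.
WITH people AS (
  SELECT
    'person1' AS name,
    ['city1', 'city2', 'city3', 'city4', 'city5'] AS cities
)
SELECT name, city
FROM people, UNNEST(cities) AS city;
```

In the nested form the person's other attributes are stored once, while the wide table would repeat them on each of the five rows.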

At the same time, queries become harder to write for users and tools that are not used to working with nested data.