BigQuery: create a repeated record field from a query

Question

Is it possible to create a repeated record column in BigQuery? For example, given the following data:

| a | b | c |
-------------
| 1 | 5 | 2 |
-------------
| 1 | 3 | 1 |
-------------
| 2 | 2 | 1 |

Is the following possible?

Select a, NEST(b, c) as d from *table* group by a

producing the following result:

| a | d.b | d.c |
-----------------
| 1 |  5  |  2  |
-----------------
|   |  3  |  1  |
-----------------
| 2 |  2  |  1  |

Answer 1

BigQuery automatically flattens query results, so if you use the NEST function on the top level query, the results won't contain repeated fields. Use the NEST function when using a subselect that produces intermediate results for immediate use by the same query.

See more about NEST() at https://cloud.google.com/bigquery/query-reference#aggfunctions

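One more thing to keep in mind: you can nest only one field. NEST(b) works, but NEST(b, c) does not.

That said, you can produce a result similar to what you asked for, but you need to write it to a table.

In my experience, users run into this more often when loading data into BigQuery, where they can use newline-delimited JSON with a schema as complex as they need. Within BigQuery itself, users are usually more interested in analysis and aggregation, so this kind of question comes up less often. I don't think the current BigQuery syntax is friendly or flexible enough to generate data with a "complex" hierarchical/nested schema and insert it into a table entirely within BigQuery. Workarounds are possible, though, depending on the specific use case.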

Answer 2

One way to get around the NEST() limitation of "nesting" only one field is to use BigQuery User-Defined Functions. They are extremely powerful, yet still have some limits and limitations to be aware of. Most important to keep in mind, from my perspective: they are strong candidates for being classified as expensive High-Compute queries.

Complex queries can consume extraordinarily large computing resources relative to the number of bytes processed. Typically, such queries contain a very large number of JOIN or CROSS JOIN clauses or complex User-defined Functions.

So, below is an example that "mimics" NEST(b, c) for the example in the question:

SELECT a, d.b, d.c FROM 
JS((      // input table
  SELECT a, NEST(CONCAT(STRING(b), ',', STRING(c))) AS d
  FROM (
    SELECT * FROM 
    (SELECT 1 AS a, 5 AS b, 2 AS c),
    (SELECT 1 AS a, 3 AS b, 1 AS c),
    (SELECT 2 AS a, 2 AS b, 1 AS c)
  ) GROUP BY a),
  a, d,     // input columns
  "[{'name': 'a', 'type': 'INTEGER'},    // output schema
    {'name': 'd', 'type': 'RECORD',
     'mode': 'REPEATED',
     'fields': [
       {'name': 'b', 'type': 'STRING'},
       {'name': 'c', 'type': 'STRING'}
     ]    
    }
  ]",
  "function(row, emit){    // function 
    var c = [];
    for (var i = 0; i < row.d.length; i++) {
      var x = row.d[i].toString().split(',');
      var t = {b: x[0], c: x[1]};
      c.push(t);
    };
    emit({a: row.a, d: c});  
  }"
)
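
If you want to sanity-check the function part outside BigQuery, here is a minimal sketch you can run locally (for example in Node). The names nestPairs, inputRows and outputRows are made up for this illustration, the emit callback is only a stand-in for the one BigQuery passes in, and the input shape (d arriving as an array of "b,c" strings) is my assumption about what the inner GROUP BY hands to the function.

```javascript
// Hypothetical local harness: nestPairs mirrors the UDF body above,
// and the emit callback below is a stand-in for BigQuery's emitter.
function nestPairs(row, emit) {
  var d = [];
  for (var i = 0; i < row.d.length; i++) {
    // Each element of row.d is a "b,c" string built by CONCAT in the SQL.
    var parts = row.d[i].toString().split(',');
    d.push({ b: parts[0], c: parts[1] });
  }
  emit({ a: row.a, d: d });
}

// Assumed shape of the grouped input rows (one row per value of a).
var inputRows = [
  { a: 1, d: ['5,2', '3,1'] },
  { a: 2, d: ['2,1'] }
];

// Collect whatever the function emits, one output row per input row.
var outputRows = [];
inputRows.forEach(function (row) {
  nestPairs(row, function (out) { outputRows.push(out); });
});

console.log(JSON.stringify(outputRows, null, 2));
```

Note that b and c come out as strings because they are produced by split(), which is why the output schema in the query declares them as STRING.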

It's fairly simple. I hope you can walk through it and get the idea.

Still, remember:

No matter how you create record with nested/repeated fields - BigQuery automatically flattens query results, so visible results won't contain repeated fields. So you should use it as a subselect that produces intermediate results for immediate use by the same query.

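FYI, you can verify for yourself that the query above returns only two records (not the three it appears to return when flattened) by running the query below: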

SELECT COUNT(1) AS rows FROM (
  <above query here>
)

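Another important note:
It is a known issue that NEST() is not compatible with unflattened results output; it is mostly used for intermediate results in subqueries.
By contrast, the solution above can easily be saved directly to a table (with Flatten Results unchecked).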

Answer 3

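With the introduction of BigQuery Standard SQL, we have an easy way to deal with records.
Try the query below, and don't forget to uncheck the Use Legacy SQL checkbox under Show Options.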

WITH YourTable AS (
  SELECT 1 AS a, 5 AS b, 2 AS c UNION ALL
  SELECT 1 AS a, 3 AS b, 1 AS c UNION ALL
  SELECT 2 AS a, 2 AS b, 1 AS c
)
SELECT a, ARRAY_AGG(STRUCT(b, c)) AS d
FROM YourTable 
GROUP BY a
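
To make the shape of the result concrete, here is a rough JavaScript model of what the grouping above produces. This is only an illustration, not how BigQuery executes the query: groupIntoRepeatedRecord is a hypothetical helper, and it keeps rows in input order, whereas as far as I know ARRAY_AGG does not promise any element order unless you add ORDER BY.

```javascript
// Same three rows as YourTable in the query above.
var yourTable = [
  { a: 1, b: 5, c: 2 },
  { a: 1, b: 3, c: 1 },
  { a: 2, b: 2, c: 1 }
];

// Hypothetical helper modelling GROUP BY a with ARRAY_AGG(STRUCT(b, c)) AS d.
function groupIntoRepeatedRecord(rows) {
  var groups = new Map(); // a Map keeps the first-seen order of a
  rows.forEach(function (row) {
    if (!groups.has(row.a)) {
      groups.set(row.a, { a: row.a, d: [] });
    }
    // Each (b, c) pair becomes one element of the repeated record d.
    groups.get(row.a).d.push({ b: row.b, c: row.c });
  });
  return Array.from(groups.values());
}

var nested = groupIntoRepeatedRecord(yourTable);
console.log(JSON.stringify(nested));
```

The point is that there are two output rows, one per value of a, and d holds one {b, c} record for every source row in that group.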

google-bigquery