BigQuery: create a repeated record field from a query
Is it possible to create a repeated record column in BigQuery? For example, given the following data:
| a | b | c |
-------------
| 1 | 5 | 2 |
-------------
| 1 | 3 | 1 |
-------------
| 2 | 2 | 1 |
Is the following possible?
Select a, NEST(b, c) as d from *table* group by a
producing the following result:
| a | d.b | d.c |
-----------------
| 1 | 5 | 2 |
-----------------
| | 3 | 1 |
-----------------
| 2 | 2 | 1 |
BigQuery automatically flattens query results, so if you use the NEST
function on the top level query, the results won't contain repeated
fields. Use the NEST function when using a subselect that produces
intermediate results for immediate use by the same query.
See https://cloud.google.com/bigquery/query-reference#aggfunctions for more about NEST().
One more thing to keep in mind: you can nest only a single field, NEST(b), but not NEST(b, c).
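That said, you can produce a result like the one you asked for, but you need to write it to a table.
In my experience, users run into this issue more often when loading data into BigQuery, where they can use newline-delimited JSON with as complex a schema as they need.
Within BigQuery itself, users are usually more interested in analysis and aggregation, so this kind of question comes up less often. I don't think the current BigQuery syntax is friendly or flexible enough to generate data with a "complex" hierarchical/nested schema and insert it into a table entirely within BigQuery. Workarounds are possible, I think, but they depend on the specific use case.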
One way to get around the NEST() limitation of "nesting" only one field is to use BigQuery User-Defined Functions. They are extremely powerful, yet they still have some Limits and Limitations to be aware of. Most important to keep in mind, from my perspective: they are strong candidates for being classified as expensive High-Compute queries:
Complex queries can consume extraordinarily large computing resources
relative to the number of bytes processed. Typically, such queries
contain a very large number of JOIN or CROSS JOIN clauses or complex
User-defined Functions.
So, below is an example that "mimics" NEST(b, c) for the data from the question:
SELECT a, d.b, d.c FROM
JS((  // input table
    SELECT a, NEST(CONCAT(STRING(b), ',', STRING(c))) AS d
    FROM (
      SELECT * FROM
        (SELECT 1 AS a, 5 AS b, 2 AS c),
        (SELECT 1 AS a, 3 AS b, 1 AS c),
        (SELECT 2 AS a, 2 AS b, 1 AS c)
    ) GROUP BY a),
  a, d,  // input columns
  // output schema
  "[{'name': 'a', 'type': 'INTEGER'},
    {'name': 'd', 'type': 'RECORD',
     'mode': 'REPEATED',
     'fields': [
       {'name': 'b', 'type': 'STRING'},
       {'name': 'c', 'type': 'STRING'}
     ]
    }
  ]",
  // function: split each 'b,c' string back into a {b, c} record
  "function(row, emit) {
    var c = [];
    for (var i = 0; i < row.d.length; i++) {
      var x = row.d[i].toString().split(',');
      var t = {b: x[0], c: x[1]};
      c.push(t);
    }
    emit({a: row.a, d: c});
  }"
)
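To see what the JavaScript function inside the UDF does to each row, it can be exercised outside BigQuery. The sketch below is only an illustration: `splitPairs`, `inputRows`, and the mock `emit` callback are hypothetical names, and the input rows are hand-written to resemble what the inner `NEST(CONCAT(...))` query would pass in, not real query output.

```javascript
// Same logic as the UDF body in the query above, given a name so it can be called locally.
function splitPairs(row, emit) {
  var c = [];
  for (var i = 0; i < row.d.length; i++) {
    // Each element of d is a 'b,c' string built by CONCAT in the inner query.
    var x = row.d[i].toString().split(',');
    c.push({b: x[0], c: x[1]});
  }
  emit({a: row.a, d: c});
}

// Assumption: hand-written stand-ins for the grouped rows the UDF would receive.
var inputRows = [
  {a: 1, d: ['5,2', '3,1']},
  {a: 2, d: ['2,1']}
];

// Mock emit: collect every emitted record in an array.
var emitted = [];
inputRows.forEach(function (row) {
  splitPairs(row, function (out) { emitted.push(out); });
});

console.log(JSON.stringify(emitted, null, 2));
```

Each input row yields exactly one emitted record, with `d` holding one `{b, c}` object per original pair. Note that `b` and `c` come out as strings, which matches the STRING types declared in the output schema.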
It's fairly straightforward. I hope you can walk through it and get the idea.
Still, remember:
No matter how you create record with nested/repeated fields - BigQuery
automatically flattens query results, so visible results won't contain
repeated fields. So you should use it as a subselect that produces
intermediate results for immediate use by the same query.
FYI, you can verify for yourself that the above returns only two records (not the three it appears to return when flattened) by running the query below:
SELECT COUNT(1) AS rows FROM (
<above query here>
)
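The difference between two nested records and three flattened rows can be illustrated outside BigQuery. The following is a rough sketch, not BigQuery's actual flattening implementation: `nested` is hand-written sample data shaped like the expected result, and `flatten` is a hypothetical helper that emits one row per element of the repeated field `d`.

```javascript
// Assumption: hand-written data shaped like the nested result, one record per value of a.
var nested = [
  {a: 1, d: [{b: '5', c: '2'}, {b: '3', c: '1'}]},
  {a: 2, d: [{b: '2', c: '1'}]}
];

// Hypothetical helper: produce one flat row per element of the repeated field d.
function flatten(rows) {
  var out = [];
  rows.forEach(function (row) {
    row.d.forEach(function (rec) {
      out.push({a: row.a, 'd.b': rec.b, 'd.c': rec.c});
    });
  });
  return out;
}

var flat = flatten(nested);
console.log(nested.length); // 2 records before flattening
console.log(flat.length);   // 3 rows after flattening
```

Counting inside a subselect sees the first number; the displayed, flattened result shows the second.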
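Another important note:
It is known that NEST() is incompatible with unflattened results output and is mostly used for intermediate results in subqueries.
In contrast, the solution above can easily be saved directly to a table (with Flatten Results unchecked).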
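With the introduction of BigQuery Standard SQL, we have an easy way to deal with records.
Try the query below, and don't forget to uncheck the Use Legacy SQL checkbox under Show Options: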
WITH YourTable AS (
SELECT 1 AS a, 5 AS b, 2 AS c UNION ALL
SELECT 1 AS a, 3 AS b, 1 AS c UNION ALL
SELECT 2 AS a, 2 AS b, 1 AS c
)
SELECT a, ARRAY_AGG(STRUCT(b, c)) AS d
FROM YourTable
GROUP BY a