BigQuery - Filtering on Repeated Fields Using SemiJoins
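I am trying to select records from one table based on whether an item in a repeated field appears in a column of another table. I can do this when I list the items I'm testing for explicitly in my query, but not when I select them from another table. Let me demonstrate using the trigrams dataset:
Suppose I want to select all records that appear in particular years. But I don't just want the data for those years; I want all of the data associated with those records. If I only wanted a couple of years, I could do something like this (which works):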
SELECT ngram, first, second, third, fourth, fifth, cell.value, cell.volume_count,
SOME(cell.value in ('1800', '1801')) WITHIN RECORD AS valid
FROM [publicdata:samples.trigrams]
HAVING valid
However, rather than hard-coding '1800' and '1801' into my query, I have a table years containing the set of years I'm interested in. I would like this to work:
SELECT ngram, first, second, third, fourth, fifth, cell.value, cell.volume_count,
SOME(cell.value in (SELECT year_as_str FROM [mydataset.years])) WITHIN RECORD AS valid
FROM [publicdata:samples.trigrams]
HAVING valid
This doesn't work, because BigQuery requires a semijoin to be part of a WHERE or HAVING clause.
So I tried rearranging (going back to the first query):
SELECT ngram, first, second, third, fourth, fifth, cell.value, cell.volume_count
FROM [publicdata:samples.trigrams]
HAVING SOME(cell.value in ('1801', '1802')) WITHIN RECORD
This results in the error Encountered " "WITHIN" "WITHIN "" ... Was expecting <EOF>
So now without WITHIN RECORD:
SELECT ngram, first, second, third, fourth, fifth, cell.value, cell.volume_count
FROM [publicdata:samples.trigrams]
HAVING SOME(cell.value in ('1801', '1802'))
This results in the error SELECT clause has mix of aggregations '...' and fields '...' without GROUP BY clause
But I'm not aggregating! So now I move the filter to WHERE:
SELECT ngram, first, second, third, fourth, fifth, cell.value, cell.volume_count
FROM [publicdata:samples.trigrams]
WHERE SOME(cell.value in ('1801', '1802'))
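This tells me Invalid function name: SOME. What?!
Is there a way to get the behavior I'm looking for in BigQuery?
The query below solves your example, and I hope you will be able to extend it to your real use case (if you want to):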
SELECT
  ngram, cell.value, cell.volume_count,
  cell.volume_fraction, cell.page_count, cell.match_count
FROM [publicdata:samples.trigrams] AS trigrams
JOIN (
  SELECT ngram AS qualified
  FROM (
    FLATTEN((SELECT ngram, cell.value AS value
      FROM (FLATTEN([publicdata:samples.trigrams], cell.value))), value)
  ) AS t
  JOIN [mydataset.years] AS y
  ON y.year_as_str = t.value
  GROUP BY 1
) AS valid
ON valid.qualified = trigrams.ngram
Note that in [publicdata:samples.trigrams] the field cell.value is a REPEATED STRING, which is why you see the "extra" FLATTEN calls.
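To make the shape of that query easier to follow, here is a rough plain-Python model of its three steps: flatten, join against the years, then join back to the original table. It is only a sketch of the logic, not BigQuery code. The records and the years set are made-up toy data, and like the query it assumes that ngram identifies a record.

```python
# Toy records shaped loosely like samples.trigrams (made-up data, not real rows).
records = [
    {"ngram": "a b c", "cell": [{"value": ["1800", "1805"]}, {"value": ["1900"]}]},
    {"ngram": "d e f", "cell": [{"value": ["1950"]}]},
]
years = {"1800", "1801"}  # hypothetical stand-in for [mydataset.years]

# Step 1 (the FLATTEN calls): one (ngram, value) pair per repeated value.
flat = [(r["ngram"], v) for r in records for c in r["cell"] for v in c["value"]]

# Step 2 (JOIN years ... GROUP BY 1): the distinct ngrams with a matching year.
qualified = {ngram for ngram, value in flat if value in years}

# Step 3 (outer JOIN ON valid.qualified = trigrams.ngram): keep whole records,
# including the cells whose values are not in the years set.
result = [r for r in records if r["ngram"] in qualified]
```

The point of step 3 is that the filter is decided on the flattened data but applied to the unflattened records, so every cell of a qualifying record survives.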
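You can use the OMIT RECORD IF clause for this. It may require a double negation, because you need to omit the records that satisfy some condition. The following query should work: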
SELECT ngram, first, second, third, fourth, fifth, cell.value, cell.volume_count
FROM [publicdata:samples.trigrams]
OMIT RECORD IF EVERY(cell.value NOT IN ('1801', '1802'))
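The double negation is just De Morgan's law: keeping a record when some value is in the list is the same as omitting it when every value is not in the list. The small plain-Python check below illustrates this on made-up value lists. The helper names are hypothetical, and the sketch ignores how BigQuery treats NULL values.

```python
def some_in(values, wanted):
    """Model of SOME(value IN wanted) over one record's values."""
    return any(v in wanted for v in values)


def every_not_in(values, wanted):
    """Model of EVERY(value NOT IN wanted) over one record's values."""
    return all(v not in wanted for v in values)


wanted = {"1801", "1802"}
# Made-up records: each inner list holds one record's repeated values.
samples = [["1801", "1900"], ["1900", "1950"], ["1802"], []]

# A record is kept by the SOME filter exactly when it is not omitted
# by the EVERY(... NOT IN ...) condition.
kept_by_some = [some_in(values, wanted) for values in samples]
kept_by_omit = [not every_not_in(values, wanted) for values in samples]
```

Both lists agree element by element, which is why the OMIT RECORD IF form selects the same records as the SOME ... WITHIN RECORD form in the question.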