Structuring BigQuery on trigrams with multiple inputs
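Thanks to help from an earlier answer, I can now query a single word and get a list of its most popular follow-on words. For example, using the word "great", I can get a list of up to 10 words in the following format: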
SELECT second, SUM(cell.page_count) total
FROM [publicdata:samples.trigrams]
WHERE first = "great"
group by 1
order by 2 desc
limit 10
Output:
second total
------------------
deal 3048832
and 1689911
, 1576341
a 1019511
number 984993
many 875974
importance 805215
part 739409
. 700694
as 628978
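The problem I have now is how to query multiple words automatically (rather than running a separate query for each word), so that I could get output like this: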
"great" total "new_word_1" new_total_1 ... "new_word_N" new_total_N
-----------------------------------------------------------------------------------------
deal 3048832 "new_follow_on_word1" 123456 ... "follow_on_N1" 234567
and 1689911 "new_follow_on_word2" 12345 ... "follow_on_N2" 123456
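Basically, I would like to pass N words in a single query (for example, new_word_1 is a completely different word such as "baseball", with no relation to "great") and get the total counts associated with each word in separate columns.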
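Also, after reading up on BigQuery's pricing, I am struggling to work out how to limit the total amount of data queried as much as possible. I can think of using only recent data (say, 2010 onward) and 2-letter alphanumeric outputs per word, but I am probably missing more obvious limiters.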
Any help with this is greatly appreciated - thanks!
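You can put multiple first words in the same query, but you need to compute the top 10 follow-on words separately for each one and then join the results together. Here is an example for "great" and "baseball":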
SELECT word1, total1, word2, total2 FROM
(SELECT ROW_NUMBER() OVER() rowid1, word1, total1 FROM (
SELECT second as word1, SUM(cell.page_count) total1
FROM [publicdata:samples.trigrams]
WHERE first = "great"
group by 1
order by 2 desc
limit 10)) a1
JOIN
(SELECT ROW_NUMBER() OVER() rowid2, word2, total2 FROM (
SELECT second as word2, SUM(cell.page_count) total2
FROM [publicdata:samples.trigrams]
WHERE first = "baseball"
group by 1
order by 2 desc
limit 10)) a2
ON a1.rowid1 = a2.rowid2
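To avoid writing one subquery by hand for each word, the same pattern can be generated from a list of words. The following is only a minimal sketch in Python: the function name `build_followon_query` and its parameters are hypothetical, it only builds the query text (it does not call BigQuery), and it assumes that backslash escaping is enough to embed a word in a double-quoted legacy SQL string literal.

```python
def build_followon_query(words, limit=10, table="[publicdata:samples.trigrams]"):
    """Build a legacy-SQL query with one (wordN, totalN) column pair per first word.

    Hypothetical helper: it mirrors the two-word query above, numbering the rows
    of each top-N subquery and joining every subquery to the first on that number.
    """
    if not words:
        raise ValueError("words must contain at least one first word")

    columns = []
    subqueries = []
    for i, word in enumerate(words, start=1):
        # Assumption: escaping backslashes and double quotes yields a valid literal.
        literal = word.replace("\\", "\\\\").replace('"', '\\"')
        columns.append(f"word{i}, total{i}")
        subquery = (
            f"(SELECT ROW_NUMBER() OVER() rowid{i}, word{i}, total{i} FROM (\n"
            f"SELECT second AS word{i}, SUM(cell.page_count) total{i}\n"
            f"FROM {table}\n"
            f'WHERE first = "{literal}"\n'
            f"GROUP BY 1\n"
            f"ORDER BY 2 DESC\n"
            f"LIMIT {int(limit)})) a{i}"
        )
        if i > 1:
            # Every later subquery is matched to the first one by row number.
            subquery += f"\nON a1.rowid1 = a{i}.rowid{i}"
        subqueries.append(subquery)

    return "SELECT " + ", ".join(columns) + " FROM\n" + "\nJOIN\n".join(subqueries)


if __name__ == "__main__":
    # Prints a query equivalent to the "great" / "baseball" example above.
    print(build_followon_query(["great", "baseball"]))
```

Because the subqueries are combined with an inner JOIN on the row number, the result has only as many rows as the word with the fewest follow-on words, so a rare first word will shorten the output for all of the others.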