How to count occurrences of words from one table in the comments of another table
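I'm trying to accomplish a task in Google's BigQuery that may require logic I'm not sure SQL can handle natively.
I have 2 tables:
- The first table has a single column, where each row is a single lowercase word
- The second table is a database of comments (with data such as who made the comment, the comment itself, the timestamp, etc.)
I want to sort the comments in the second table by the number of words from the first table that appear in each comment.
Here's a basic example of what I'm looking to do, using Python and letters instead of words... but you get the idea: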
words = ['a','b','c','d','e']
comments = ['this is the first sentence', 'this is another comment', 'look another sentence, which is also a comment', 'nope', 'no', 'run']
wordcount = {}
for comment in comments:
    for word in words:
        if word in comment:
            if comment in wordcount:
                wordcount[comment] += 1
            else:
                wordcount[comment] = 1
print(sorted(wordcount.items(), key = lambda k: k[1], reverse=True))
Output:
[('look another sentence, which is also a comment', 3), ('this is another comment', 3), ('this is the first sentence', 2), ('nope', 1)]
The best approach I've seen so far for generating the SQL query is to do something like this:
SELECT
COUNT(*)
FROM
table
WHERE
comment_col like '%word1%'
OR comment_col like '%word2%'
OR ...
But there are over 2,000 words... it just feels wrong. Any suggestions?
If I understand correctly, I think you need a query like this:
select comment, count(*) cnt
from comments
join words
on comment like '% ' + word + ' %' --this checks for `... word ..`; a word between spaces
or comment like word + ' %' --this checks for `word ..`; a word at the start of comment
or comment like '% ' + word --this checks for `.. word`; a word at the end of comment
or comment = word --this checks for `word`; whole comment is the word
group by comment
order by count(*) desc
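As a rough illustration of what the four LIKE conditions above accept, here is a minimal Python sketch. The function name `like_whole_word` is made up for this example, and it assumes the word contains no LIKE wildcards (`%` or `_`). It shows that a word only counts when it is delimited by spaces or by the start or end of the comment, so a word followed by punctuation is not matched.

```python
def like_whole_word(comment: str, word: str) -> bool:
    """Hypothetical helper mirroring the four LIKE conditions of the query."""
    return (
        f" {word} " in comment              # comment like '% word %'
        or comment.startswith(f"{word} ")   # comment like 'word %'
        or comment.endswith(f" {word}")     # comment like '% word'
        or comment == word                  # comment = word
    )


sample = "look another sentence, which is also a comment"
print(like_whole_word(sample, "a"))          # "a" is surrounded by spaces
print(like_whole_word(sample, "sentence"))   # followed by a comma, so no match
print(like_whole_word("nope", "no"))         # substring only, so no match
```

If punctuation next to a word should still count as a match, a regular-expression word boundary is probably a better fit than space-delimited LIKE patterns.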
The following is for BigQuery Standard SQL:
#standardSQL
SELECT comment, COUNT(word) AS cnt
FROM comments
JOIN words
ON STRPOS(comment, word) > 0
GROUP BY comment
-- ORDER BY cnt DESC
As an option, you can use a regular expression if needed:
#standardSQL
SELECT comment, COUNT(word) AS cnt
FROM comments
JOIN words
ON REGEXP_CONTAINS(comment, word)
GROUP BY comment
-- ORDER BY cnt DESC
You can test / play with the above using the dummy example from your question:
#standardSQL
WITH words AS (
SELECT word
FROM UNNEST(['a','b','c','d','e']) word
),
comments AS (
SELECT comment
FROM UNNEST(['this is the first sentence', 'this is another comment', 'look another sentence, which is also a comment', 'nope', 'no', 'run']) comment
)
SELECT comment, COUNT(word) AS cnt
FROM comments
JOIN words
ON STRPOS(comment, word) > 0
GROUP BY comment
ORDER BY cnt DESC
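One difference between the STRPOS and REGEXP_CONTAINS variants above is that the regex version interprets each word as a pattern rather than as literal text. For plain lowercase words the two should agree, but a word containing a regex metacharacter would behave differently. The sketch below uses Python's `re` module as a stand-in (BigQuery uses RE2, so this is only an analogy), and both helper names are made up for the illustration.

```python
import re


def contains_substring(comment: str, word: str) -> bool:
    # Rough analogue of STRPOS(comment, word) > 0: literal substring test.
    return word in comment


def contains_regex(comment: str, word: str) -> bool:
    # Rough analogue of REGEXP_CONTAINS(comment, word): word is a pattern.
    return re.search(word, comment) is not None


# A plain word behaves the same either way.
print(contains_substring("nope", "no"), contains_regex("nope", "no"))

# A word with a metacharacter does not: '.' matches any character in a regex.
print(contains_substring("run", "r.n"), contains_regex("run", "r.n"))
```

With a word list that is known to contain only letters, either join condition should give the same counts.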
Update:
Do you have any quick suggestions to do full string match only?
#standardSQL
WITH words AS (
SELECT word
FROM UNNEST(['a','no','is','d','e']) word
),
comments AS (
SELECT comment
FROM UNNEST(['this is the first sentence', 'this is another comment', 'look another sentence, which is also a comment', 'nope', 'no', 'run']) comment
)
SELECT comment, COUNT(word) AS cnt
FROM comments
JOIN words
ON REGEXP_CONTAINS(comment, CONCAT(r'\b', word, r'\b'))
GROUP BY comment
ORDER BY cnt DESC
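To check what the word-boundary query above should return for the dummy data, here is a small Python sketch of the same logic. It is an approximation, not the query itself: the function name `whole_word_counts` is made up, `re.escape` is an extra precaution that the SQL does not apply, and comment texts are assumed to be distinct.

```python
import re

words = ['a', 'no', 'is', 'd', 'e']
comments = ['this is the first sentence', 'this is another comment',
            'look another sentence, which is also a comment', 'nope', 'no', 'run']


def whole_word_counts(comments, words):
    """Count, per comment, how many words occur as whole words (\\b on both sides)."""
    counts = {}
    for comment in comments:
        n = sum(
            1 for word in words
            if re.search(r'\b' + re.escape(word) + r'\b', comment)
        )
        if n:  # like the inner join, drop comments that match no word
            counts[comment] = n
    # Highest count first; ties keep their original order.
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)


for comment, cnt in whole_word_counts(comments, words):
    print(cnt, comment)
```

Here 'nope' no longer matches 'no', while the standalone comment 'no' does, which is the full-word behavior asked for.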