How do I split a string column into multiple rows of single words and word pairs in BigQuery SQL?
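I am trying (unsuccessfully) to split a string column in Google BigQuery into rows containing all single words and all word pairs (adjacent to each other and in order). I also need to keep the ID field from IndataTable for each word. Both recordsets have 2 columns.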
IndataTable as IDT
ID WordString
1 apple banana pear
2 carrot
3 blue red green yellow
OutdataTable as ODT
ID WordString
1 apple
1 banana
1 pear
1 apple banana
1 banana pear
2 carrot
3 blue
3 red
3 green
3 yellow
3 blue red
3 red green
3 green yellow (only adjacent words are paired)
Is this possible in BigQuery SQL?
Edit/Added:
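This is what I have so far, which splits the string into single words. I am really struggling to work out how to extend this to word pairs. I don't know whether it can be modified or whether I need a new approach.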
SELECT ID, SPLIT(WordString, ' ') AS Words
FROM (
  SELECT *
  FROM
    (SELECT ID, WordString FROM IndataTable)
)
The following is for BigQuery Standard SQL:
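#standardSQL
WITH IndataTable AS (
  -- Sample data standing in for the real table
  SELECT 1 id, 'apple banana pear' WordString UNION ALL
  SELECT 2, 'carrot' UNION ALL
  SELECT 3, 'blue red green yellow'
), words AS (
  -- One row per word; pos is the word's zero-based position in the string
  SELECT id, word, pos
  FROM IndataTable, UNNEST(SPLIT(WordString, ' ')) AS word WITH OFFSET pos
), pairs AS (
  -- Concatenate each word with the next word of the same id;
  -- LEAD returns NULL for the last word, so CONCAT returns NULL there
  SELECT id, CONCAT(word, ' ', LEAD(word) OVER(PARTITION BY id ORDER BY pos)) pair
  FROM words
)
SELECT id, word AS WordString FROM words UNION ALL
SELECT id, pair AS WordString FROM pairs
WHERE pair IS NOT NULL
ORDER BY id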
The result is as expected:
Row id WordString
1 1 apple
2 1 banana
3 1 pear
4 1 apple banana
5 1 banana pear
6 2 carrot
7 3 blue
8 3 red
9 3 green
10 3 yellow
11 3 blue red
12 3 red green
13 3 green yellow
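Note that the query above sorts only by id, so the order of rows within each id is not guaranteed to match the listing. If that order matters, one possible variant is sketched below. It assumes you want single words first and then pairs, each in their original position. The helper column n is a hypothetical name introduced here for sorting and is not part of the original query.

```sql
#standardSQL
WITH IndataTable AS (
  -- Same sample data as in the query above
  SELECT 1 AS id, 'apple banana pear' AS WordString UNION ALL
  SELECT 2, 'carrot' UNION ALL
  SELECT 3, 'blue red green yellow'
), words AS (
  -- One row per word, with its zero-based position
  SELECT id, word, pos
  FROM IndataTable, UNNEST(SPLIT(WordString, ' ')) AS word WITH OFFSET pos
), pairs AS (
  -- Keep pos so that pairs can be sorted by where they start
  SELECT id, pos,
         CONCAT(word, ' ', LEAD(word) OVER (PARTITION BY id ORDER BY pos)) AS pair
  FROM words
)
SELECT id, WordString
FROM (
  -- n = 1 marks single words, n = 2 marks pairs (hypothetical sort key)
  SELECT id, 1 AS n, pos, word AS WordString FROM words
  UNION ALL
  SELECT id, 2 AS n, pos, pair AS WordString FROM pairs WHERE pair IS NOT NULL
)
ORDER BY id, n, pos
```

Sorting by id, then n, then pos lists each id's single words in their original order, followed by its pairs in the order they start.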