Presto 中的压缩数组
Condensing arrays in Presto
我有一个使用 array_agg()
函数
生成数组字符串的查询
SELECT
array_agg(message) as sequence
from mytable
group by id
生成如下所示的 table:
sequence
1 foo foo bar baz bar baz
2 foo bar bar bar baz
3 foo foo foo bar bar baz
但我的目标是压缩字符串数组,以便 none 可以连续重复多次,例如,所需的输出如下所示:
sequence
1 foo bar baz bar baz
2 foo bar baz
3 foo bar baz
如何使用 Presto SQL 来做到这一点?
您可以通过以下两种方式之一执行此操作:
- 使用
array_distinct
函数从生成的数组中删除重复项:
WITH mytable(id, message) AS (VALUES
(1, 'foo'), (1, 'foo'), (1, 'bar'), (1, 'bar'), (1, 'baz'), (1, 'baz'),
(2, 'foo'), (2, 'bar'), (2, 'bar'), (2, 'bar'), (2, 'baz'),
(3, 'foo'), (3, 'foo'), (3, 'foo'), (3, 'bar'), (3, 'bar'), (3, 'baz')
)
SELECT array_distinct(array_agg(message)) AS sequence
FROM mytable
GROUP BY id
- 在聚合中使用
DISTINCT
限定符删除重复值,然后再将它们传递到 array_agg。
WITH mytable(id, message) AS (VALUES
(1, 'foo'), (1, 'foo'), (1, 'bar'), (1, 'bar'), (1, 'baz'), (1, 'baz'),
(2, 'foo'), (2, 'bar'), (2, 'bar'), (2, 'bar'), (2, 'baz'), (3, 'foo'),
(3, 'foo'), (3, 'foo'), (3, 'bar'), (3, 'bar'), (3, 'baz')
)
SELECT array_agg(DISTINCT message) AS sequence
FROM mytable
GROUP BY id
两种选择产生相同的结果:
sequence
-----------------
[foo, bar, baz]
[foo, bar, baz]
[foo, bar, baz]
(3 rows)
更新:您可以使用最近引入的 MATCH_RECOGNIZE
功能删除重复的元素序列:
WITH mytable(id, message) AS (VALUES
(1, 'foo'), (1, 'foo'), (1, 'bar'), (1, 'baz'), (1, 'bar'), (1, 'baz'),
(2, 'foo'), (2, 'bar'), (2, 'bar'), (2, 'bar'), (2, 'baz'),
(3, 'foo'), (3, 'foo'), (3, 'foo'), (3, 'bar'), (3, 'bar'), (3, 'baz')
)
SELECT array_agg(value) AS sequence
FROM mytable
MATCH_RECOGNIZE(
PARTITION BY id
MEASURES A.message AS value
PATTERN (A B*)
DEFINE B AS message = PREV(message)
)
GROUP BY id
我有一个使用 array_agg()
函数
SELECT
array_agg(message) as sequence
from mytable
group by id
生成如下所示的 table:
sequence
1 foo foo bar baz bar baz
2 foo bar bar bar baz
3 foo foo foo bar bar baz
但我的目标是压缩字符串数组,以便 none 可以连续重复多次,例如,所需的输出如下所示:
sequence
1 foo bar baz bar baz
2 foo bar baz
3 foo bar baz
如何使用 Presto SQL 来做到这一点?
您可以通过以下两种方式之一执行此操作:
- 使用
array_distinct
函数从生成的数组中删除重复项:
WITH mytable(id, message) AS (VALUES
(1, 'foo'), (1, 'foo'), (1, 'bar'), (1, 'bar'), (1, 'baz'), (1, 'baz'),
(2, 'foo'), (2, 'bar'), (2, 'bar'), (2, 'bar'), (2, 'baz'),
(3, 'foo'), (3, 'foo'), (3, 'foo'), (3, 'bar'), (3, 'bar'), (3, 'baz')
)
SELECT array_distinct(array_agg(message)) AS sequence
FROM mytable
GROUP BY id
- 在聚合中使用
DISTINCT
限定符删除重复值,然后再将它们传递到 array_agg。
WITH mytable(id, message) AS (VALUES
(1, 'foo'), (1, 'foo'), (1, 'bar'), (1, 'bar'), (1, 'baz'), (1, 'baz'),
(2, 'foo'), (2, 'bar'), (2, 'bar'), (2, 'bar'), (2, 'baz'), (3, 'foo'),
(3, 'foo'), (3, 'foo'), (3, 'bar'), (3, 'bar'), (3, 'baz')
)
SELECT array_agg(DISTINCT message) AS sequence
FROM mytable
GROUP BY id
两种选择产生相同的结果:
sequence
-----------------
[foo, bar, baz]
[foo, bar, baz]
[foo, bar, baz]
(3 rows)
更新:您可以使用最近引入的 MATCH_RECOGNIZE
功能删除重复的元素序列:
WITH mytable(id, message) AS (VALUES
(1, 'foo'), (1, 'foo'), (1, 'bar'), (1, 'baz'), (1, 'bar'), (1, 'baz'),
(2, 'foo'), (2, 'bar'), (2, 'bar'), (2, 'bar'), (2, 'baz'),
(3, 'foo'), (3, 'foo'), (3, 'foo'), (3, 'bar'), (3, 'bar'), (3, 'baz')
)
SELECT array_agg(value) AS sequence
FROM mytable
MATCH_RECOGNIZE(
PARTITION BY id
MEASURES A.message AS value
PATTERN (A B*)
DEFINE B AS message = PREV(message)
)
GROUP BY id