Optimize large IN condition for Redshift query
I have a ~2TB fully vacuumed Redshift table with a distkey phash (high cardinality, hundreds of millions of values) and a compound sort key (phash, last_seen).
When I run a query like:
SELECT
DISTINCT ret_field
FROM
table
WHERE
phash IN (
'5c8615fa967576019f846b55f11b6e41',
'8719c8caa9740bec10f914fc2434ccfd',
'9b657c9f6bf7c5bbd04b5baf94e61dae'
)
AND
last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
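It returns very quickly. However, when I increase the number of hashes beyond 10, Redshift converts the IN condition from a series of ORs to an array, per http://docs.aws.amazon.com/redshift/latest/dg/r_in_condition.html#r_in_condition-optimization-for-large-in-lists
The problem is that when I have a couple dozen phash values, the "optimized" query goes from a response time of less than a second to over half an hour. In other words, it stops using the sort key and does a full table scan.
Any idea how I can prevent this behavior and keep using the sort key so the query stays fast?
Here is the EXPLAIN difference between <10 hashes and >10 hashes:
Fewer than 10 (0.4 seconds):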
XN Unique (cost=0.00..157253450.20 rows=43 width=27)
-> XN Seq Scan on table (cost=0.00..157253393.92 rows=22510 width=27)
Filter: ((((phash)::text = '394e9a527f93377912cbdcf6789787f1'::text) OR ((phash)::text = '4534f9f8f68cc937f66b50760790c795'::text) OR ((phash)::text = '5c8615fa967576019f846b55f11b6e61'::text) OR ((phash)::text = '5d5743a86b5ff3d60b133c6475e7dce0'::text) OR ((phash)::text = '8719c8caa9740bec10f914fc2434cced'::text) OR ((phash)::text = '9b657c9f6bf7c5bbd04b5baf94e61d9e'::text) OR ((phash)::text = 'd7337d324be519abf6dbfd3612aad0c0'::text) OR ((phash)::text = 'ea43b04ac2f84710dd1f775efcd5ab40'::text)) AND (last_seen >= '2015-10-01 00:00:00'::timestamp without time zone) AND (last_seen <= '2015-10-31 23:59:59'::timestamp without time zone))
More than 10 (45-60 minutes):
XN Unique (cost=0.00..181985241.25 rows=1717530 width=27)
-> XN Seq Scan on table (cost=0.00..179718164.48 rows=906830708 width=27)
Filter: ((last_seen >= '2015-10-01 00:00:00'::timestamp without time zone) AND (last_seen <= '2015-10-31 23:59:59'::timestamp without time zone) AND ((phash)::text = ANY ('{33b84c5775b6862df965a0e00478840e,394e9a527f93377912cbdcf6789787f1,3d27b96948b6905ffae503d48d75f3d1,4534f9f8f68cc937f66b50760790c795,5a63cd6686f7c7ed07a614e245da60c2,5c8615fa967576019f846b55f11b6e61,5d5743a86b5ff3d60b133c6475e7dce0,8719c8caa9740bec10f914fc2434cced,9b657c9f6bf7c5bbd04b5baf94e61d9e,d7337d324be519abf6dbfd3612aad0c0,dbf4c743832c72e9c8c3cc3b17bfae5f,ea43b04ac2f84710dd1f775efcd5ab40,fb4b83121cad6d23e6da6c7b14d2724c}'::text[])))
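Both plans above show an XN Seq Scan, so EXPLAIN alone does not reveal whether the sort key was actually used to skip blocks. One way to check, sketched here on the assumption that it is run in the same session immediately after the query of interest, is to look at the scan steps recorded in the STL_SCAN system table:

```sql
-- Run right after the query under test, in the same session.
-- is_rrscan = 't' means the scan was range-restricted (blocks were skipped
-- using sort-key metadata); rows_pre_filter vs. rows shows how many rows
-- were read compared with how many survived the filter.
SELECT segment,
       step,
       TRIM(perm_table_name) AS table_name,
       is_rrscan,
       SUM(rows_pre_filter)  AS rows_pre_filter,
       SUM(rows)             AS rows_returned
FROM stl_scan
WHERE query = pg_last_query_id()
GROUP BY segment, step, perm_table_name, is_rrscan
ORDER BY segment, step;
```

If the fast query shows a range-restricted scan and the slow one does not, that would support the guess that the array form of the predicate is no longer being matched against the sort key.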
You can try creating a temporary table/subquery:
SELECT DISTINCT t.ret_field
FROM table t
JOIN (
SELECT '5c8615fa967576019f846b55f11b6e41' AS phash
UNION ALL
SELECT '8719c8caa9740bec10f914fc2434ccfd' AS phash
UNION ALL
SELECT '9b657c9f6bf7c5bbd04b5baf94e61dae' AS phash
-- UNION ALL
) AS sub
ON t.phash = sub.phash
WHERE t.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59';
Alternatively, run the search in chunks (if the query optimizer merges them into one, use an auxiliary table to store intermediate results):
SELECT ret_field
FROM table
WHERE phash IN (
'5c8615fa967576019f846b55f11b6e41',
'8719c8caa9740bec10f914fc2434ccfd',
'9b657c9f6bf7c5bbd04b5baf94e61dae')
AND last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
UNION
SELECT ret_field
FROM table
WHERE phash IN ( /* more hashes */ )
AND last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
UNION
-- ...
If the query optimizer merges them into one, you can try using a temporary table to hold the intermediate results.
EDIT:
SELECT DISTINCT t.ret_field
FROM table t
JOIN (SELECT ... AS phash
FROM ...
) AS sub
ON t.phash = sub.phash
WHERE t.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59';
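The temporary-table variant mentioned above could look roughly like the following. This is only a sketch: tmp_hashes and my_table are placeholder names (my_table stands for the main table), and VARCHAR(32) is an assumption that should be replaced by the actual type of the phash column.

```sql
-- Hypothetical staging table; the column type is assumed, match it to my_table.phash.
-- Using the same distribution key as the main table keeps the join local to each slice.
CREATE TEMP TABLE tmp_hashes (
    phash VARCHAR(32) NOT NULL
)
DISTKEY (phash)
SORTKEY (phash);

-- Load the hashes to search for (multi-row INSERT is fine for a few dozen values).
INSERT INTO tmp_hashes (phash) VALUES
    ('5c8615fa967576019f846b55f11b6e41'),
    ('8719c8caa9740bec10f914fc2434ccfd'),
    ('9b657c9f6bf7c5bbd04b5baf94e61dae');

-- Refresh statistics so the planner knows the table is tiny.
ANALYZE tmp_hashes;

SELECT DISTINCT t.ret_field
FROM my_table t
JOIN tmp_hashes h ON t.phash = h.phash
WHERE t.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59';
```

Whether the join lets Redshift skip blocks the way the short IN list does is not guaranteed, so it is worth timing this against the original query before relying on it.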
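Do you really need DISTINCT? This operator can be expensive.
I would try a LATERAL JOIN. In the query below, the table Hashes has a single column phash; this is your large batch of hashes. It could be a temp table, a (sub)query, anything.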
SELECT DISTINCT T.ret_field
FROM
Hashes
INNER JOIN LATERAL
(
SELECT table.ret_field
FROM table
WHERE
table.phash = Hashes.phash
AND table.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
) AS T ON true
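The optimizer will most likely implement the LATERAL JOIN as a nested loop: it loops over all rows in Hashes and, for each row, runs the inner SELECT FROM table. The inner SELECT should use your index on (phash, last_seen). To be safe, also include ret_field in the index to make it a covering index: (phash, last_seen, ret_field).
There is a very valid point in @Diego's answer: instead of putting the constant phash values into the query, put them into a temporary or permanent table.
I would like to extend @Diego's answer and add that it is important for this table of hashes to have an index, and a unique one.
So, create a table Hashes with one column phash that has exactly the same type as phash in your main table. It is important that the types match. Make that column a primary key with a unique clustered index. Dump your dozens of phash values into the Hashes table.
The query then becomes a simple INNER JOIN, not a lateral one: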
SELECT DISTINCT table.ret_field
FROM
Hashes
INNER JOIN table ON table.phash = Hashes.phash
WHERE
table.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
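It is still important for table to have an index on (phash, last_seen, ret_field).
The optimizer should be able to take advantage of the fact that both joined tables are sorted by the phash column and that phash is unique in the Hashes table.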
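To get the types to match, it helps to look up how phash is actually declared. Redshift has no conventional indexes, so the nearest thing to check alongside the type is the sort key and distribution key. A minimal sketch, assuming the main table is called my_table (a placeholder) and its schema is on the current search_path:

```sql
-- "column" must be quoted because it is a reserved word.
-- sortkey gives the column's position in the sort key (0 if it is not part of it);
-- distkey is true for the distribution key column.
SELECT "column",
       type,
       distkey,
       sortkey
FROM pg_table_def
WHERE tablename = 'my_table'
  AND "column" IN ('phash', 'last_seen', 'ret_field');
```

The type reported for phash is the one to reuse when creating the Hashes table, so the join does not need an implicit cast.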
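It is worth trying the sort key (last_seen, phash), with last_seen first.
The slowness may be because the leading column of the sort key is phash, which looks like a random string.
As the AWS Redshift developer documentation says, if a timestamp column is used in the WHERE condition, it should be the leading column of the sort key.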
If recent data is queried most frequently, specify the timestamp
column as the leading column for the sort key.
- Choose the Best Sort Key - Amazon Redshift
With this sort key order, all rows are sorted by last_seen and then by phash. (What does it mean to have multiple sortkey columns?)
Note that you have to recreate the table to change the sort key. This will help you do that.
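One common way to recreate the table with a different sort key is a deep copy via CREATE TABLE AS. The sketch below assumes the main table is called my_table; my_table_new and my_table_old are made-up names for illustration.

```sql
-- Build a copy with last_seen as the leading sort key column.
-- The distribution key is kept on phash, as in the original table.
CREATE TABLE my_table_new
DISTKEY (phash)
COMPOUND SORTKEY (last_seen, phash)
AS
SELECT * FROM my_table;

-- Swap the tables once the copy has been verified.
ALTER TABLE my_table RENAME TO my_table_old;
ALTER TABLE my_table_new RENAME TO my_table;
```

A table created this way does not inherit constraints, default values, or identity columns from the source, so if those matter it may be safer to create the new table with explicit DDL and load it with INSERT INTO ... SELECT. On a ~2TB table the copy also needs enough free disk space to hold both versions for a while.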
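You can get rid of the "ORs" by inserting the data you want into a temp table and joining it with your actual table.
Here is an example (I am using a CTE because the tool I use has a hard time capturing the plan when there are multiple SQL statements, but go with a temp table if you can):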
select *
from <my_table>
where checksum in
(
'd7360f1b600ae9e895e8b38262cee47936fb6ced',
'd1606f795152c73558513909cd59a8bc3ad865a8',
'bb3f6bb3d1a98d35a0f952a53d738ddec5c72c84',
'b2cad5a92575ed3868ac6e405647c2213eea74a5'
)
versus
with foo as
(
select 'd7360f1b600ae9e895e8b38262cee47936fb6ced' as my_key union
select 'd1606f795152c73558513909cd59a8bc3ad865a8' union
select 'bb3f6bb3d1a98d35a0f952a53d738ddec5c72c84' union
select 'b2cad5a92575ed3868ac6e405647c2213eea74a5'
)
select *
from <my_table> r
join foo f on r.checksum = F.my_key
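The plan for this version looks more complex, but that is because of the CTE; it would not look like that with a temp table.
Have you tried using a UNION over all the phash values?
Like this: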
SELECT ret_field
FROM table
WHERE phash = '5c8615fa967576019f846b55f11b6e41' -- 1st phash value
and last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
UNION
SELECT ret_field
FROM table
WHERE phash = '8719c8caa9740bec10f914fc2434ccfd' -- 2nd phash value
and last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
UNION
SELECT ret_field
FROM table
WHERE phash = '9b657c9f6bf7c5bbd04b5baf94e61dae' -- 3rd phash value
and last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
-- and so on...
UNION
SELECT ret_field
FROM table
WHERE phash = 'nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn' -- Nth phash value
and last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'