ClickHouse join with condition
I found something strange. The query:
SELECT *
FROM progress as pp
ALL LEFT JOIN links as ll USING (viewId)
WHERE viewId = 'a776a2f2-16ad-448a-858d-891e68bec9a8'
Result: 0 rows in set. Elapsed: 5.267 sec. Processed 8.62 million rows, 484.94 MB (1.64 million rows/s., 92.08 MB/s.)
Here is the modified query:
SELECT *
FROM
(SELECT *
FROM progress
WHERE viewId = 'a776a2f2-16ad-448a-858d-891e68bec9a8') AS p ALL
LEFT JOIN
(SELECT *
FROM links
WHERE viewId = toUUID('a776a2f2-16ad-448a-858d-891e68bec9a8')) AS l ON p.viewId = l.viewId;
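Result: 0 rows in set. Elapsed: 0.076 sec. Processed 4.48 million rows, 161.35 MB (58.69 million rows/s., 2.12 GB/s.)
But it looks dirty.
Isn't ClickHouse supposed to optimize the query based on the WHERE condition?
What is the right way to write this query? And what if the condition is a WHERE ... IN?
Then I tried adding another join: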
SELECT *
FROM
(SELECT videoUuid AS contentUuid,
viewId
FROM
(SELECT *
FROM progress
WHERE viewId = 'a776a2f2-16ad-448a-858d-891e68bec9a8') p ALL
LEFT JOIN
(SELECT *
FROM links
WHERE viewId = toUUID('a776a2f2-16ad-448a-858d-891e68bec9a8')) USING `viewId`) ALL
LEFT JOIN `metaInfo` USING `viewId`,
`contentUuid`;
The result is slow again, considering that I only want to join 3 tables with a condition that selects a single row:
0 rows in set. Elapsed: 1.747 sec. Processed 9.13 million rows, 726.55 MB (5.22 million rows/s., 415.85 MB/s.)
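At this time CH does not handle multi-join queries (DB star schema) very well, and the query optimizer is not good enough to rely on completely.
So you need to state explicitly how to 'execute' the query, by using subqueries instead of joins.
Consider this test query: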
SELECT table_01.number AS r
FROM numbers(87654321) AS table_01
INNER JOIN numbers(7654321) AS table_02 ON (table_01.number = table_02.number)
INNER JOIN numbers(654321) AS table_03 ON (table_02.number = table_03.number)
INNER JOIN numbers(54321) AS table_04 ON (table_03.number = table_04.number)
WHERE r = 54320
/*
┌─────r─┐
│ 54320 │
└───────┘
1 rows in set. Elapsed: 6.261 sec. Processed 96.06 million rows, 768.52 MB (15.34 million rows/s., 122.74 MB/s.)
*/
Let's rewrite it using subqueries to speed it up significantly.
SELECT number AS r
FROM numbers(87654321)
WHERE r = 54320 AND number IN (
SELECT number AS r
FROM numbers(7654321)
WHERE r = 54320 AND number IN (
SELECT number AS r
FROM numbers(654321)
WHERE r = 54320 AND number IN (
SELECT number AS r
FROM numbers(54321)
WHERE r = 54320
)
)
)
/*
┌─────r─┐
│ 54320 │
└───────┘
1 rows in set. Elapsed: 0.481 sec. Processed 96.06 million rows, 768.52 MB (199.69 million rows/s., 1.60 GB/s.)
*/
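Applied to the tables from the question, the same idea might look like the sketch below: every side of every join is narrowed to the single viewId before it is joined, including metaInfo, which the three-table query in the question joined unfiltered. This assumes metaInfo.viewId is a UUID column like links.viewId (the question does not show the schema); the aliases p, l, pl and m are only illustrative.

```sql
-- Sketch only: column types are assumed, adjust the literals to your schema.
SELECT *
FROM
(
    SELECT videoUuid AS contentUuid,
           viewId
    FROM
    (
        -- left side reduced to one viewId before joining
        SELECT *
        FROM progress
        WHERE viewId = 'a776a2f2-16ad-448a-858d-891e68bec9a8'
    ) AS p
    ALL LEFT JOIN
    (
        -- right side reduced the same way
        SELECT *
        FROM links
        WHERE viewId = toUUID('a776a2f2-16ad-448a-858d-891e68bec9a8')
    ) AS l USING (viewId)
) AS pl
ALL LEFT JOIN
(
    -- assumption: metaInfo.viewId is a UUID; filter it too instead of joining the whole table
    SELECT *
    FROM metaInfo
    WHERE viewId = toUUID('a776a2f2-16ad-448a-858d-891e68bec9a8')
) AS m USING (viewId, contentUuid);
```

It is no prettier than the two-table version, but the right-hand table of each join is then built from a handful of rows rather than the full table.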
There are other ways to optimize a JOIN:
use an External dictionary to get rid of the join on a 'small' table
use the Join table engine
use ANY strictness
use specific settings such as join_algorithm, partial_merge_join_optimizations, etc.
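As a rough illustration of the Join table engine option, the sketch below loads a small lookup into a Join table once and then reads from it with joinGet instead of a JOIN clause. The table small_lookup and its label column are hypothetical names made up for this example; numbers() stands in for real tables, as in the test queries above.

```sql
-- Hypothetical lookup table; joinGet needs the ANY, LEFT variant of the Join engine.
CREATE TABLE small_lookup
(
    number UInt64,
    label  String
)
ENGINE = Join(ANY, LEFT, number);

-- Fill it once from the 'small' side of the join.
INSERT INTO small_lookup
SELECT number,
       concat('label_', toString(number))
FROM numbers(54321);

-- Look the value up by key instead of joining.
SELECT number,
       joinGet('small_lookup', 'label', number) AS label
FROM numbers(87654321)
WHERE number = 54320;
```

The data of a Join table is kept in memory, so this only fits the case where the right-hand table really is small.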
Some useful references:
Altinity webinar: Tips and tricks every ClickHouse user should know
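For the ANY strictness and settings items in the list above, a minimal sketch on the same numbers() test data could look like this. The choice of 'partial_merge' is only an example value for join_algorithm, not a recommendation; which algorithm helps depends on the data and the ClickHouse version.

```sql
-- ANY keeps at most one matching right-hand row per key,
-- which is enough when the right side is unique on the join key.
SELECT t1.number AS r
FROM numbers(87654321) AS t1
ANY INNER JOIN numbers(54321) AS t2 ON t1.number = t2.number
WHERE r = 54320
SETTINGS join_algorithm = 'partial_merge';
```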
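Isn't it supposed to optimize the query considering where condition?
Such an optimization is not implemented yet.
This is expected behavior.
According to the CH doc https://clickhouse.tech/docs/en/sql-reference/statements/select/join/#performance: "When running a JOIN, there is no optimization of the order of execution in relation to other stages of the query. The join (a search in the right table) is run before filtering in WHERE and before aggregation."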
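Given that ordering, the filter has to be moved into the joined subqueries by hand. The sketch below shows the two shapes side by side on numbers(), with a WHERE ... IN list since the question asks about that case; the values 42 and 43 and the aliases t1 and t2 are arbitrary.

```sql
-- Filter written after the join: per the doc quote, the join runs before WHERE.
SELECT t1.number
FROM numbers(1000000) AS t1
INNER JOIN numbers(1000000) AS t2 ON t1.number = t2.number
WHERE t1.number IN (42, 43);

-- Same result, with the IN filter pushed into both sides manually,
-- so each side holds only the matching rows when the join runs.
SELECT t1.number
FROM
(
    SELECT number
    FROM numbers(1000000)
    WHERE number IN (42, 43)
) AS t1
INNER JOIN
(
    SELECT number
    FROM numbers(1000000)
    WHERE number IN (42, 43)
) AS t2 ON t1.number = t2.number;
```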