Optimizing a large query to use fewer JOINs, with COUNT DISTINCT involved
On Google BigQuery we have a report with about 10 columns, such as:
+----------------+-----------------+---------------+-------------+
| uniquesent | uniquedelivered | uniquebounced | uniqueopens |
+----------------+-----------------+---------------+-------------+
We have a long query that uses a lot of joins to compute these values. The big query is organized roughly like this:
select
...report_columns...,
sent.uniquesent,
delivered.uniquedelivered,
from [main table]
left join (
select
language,
exact_count_distinct(e.user_id) as uniquesent
from emailevent e
where country=1 and event='sent'
group by 1
) as sent
left join (
select
language,
exact_count_distinct(e.user_id) as uniquedelivered
from emailevent e
where country=1 and event='delivered'
group by 1
) as delivered
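and 10 other similar items in this list of JOINs, all in the same style. Now imagine that the query also has group by day/week/month parts, and it becomes very complex even to read. For some of them we also get the error message: Resources exceeded.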
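We would like to rewrite and optimize the query so that it returns the same numbers but more efficiently. Let me know if you have further questions, but mainly we want to get rid of the joins somehow and make the query compact and better performing.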
We have already compacted the query somewhat using the following syntax:
sum(if(p.country_id=1 AND event = "userblocked" AND JSON_EXTRACT_SCALAR(e.meta,'$.reason') contains 'drop_status',1,0)) as bounced,
sum(if(p.country_id=1 AND event = "userblocked" AND JSON_EXTRACT_SCALAR(e.meta,'$.reason') contains 'spam_report',1,0)) as spam_reported
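But that syntax does not work for distinct counts, because sum(if(...)) counts matching rows rather than distinct users.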
For the block you posted, you could do something like this to reduce the number of joins.
select
...report_columns...,
SUM(IF(event='sent', uniqueevent, 0)) as uniquesent,
SUM(IF(event='delivered', uniqueevent, 0)) as uniquedelivered
from [main table]
left join (
select
event,
language,
exact_count_distinct(e.user_id) as uniqueevent
from emailevent e
where country=1 and event in ('sent', 'delivered')
group by event, language
) as sent
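To check the idea on a small scale, here is a minimal sketch that runs the same "group once by event and language, then pivot with a conditional sum" pattern. It uses Python's built-in sqlite3 as a stand-in for BigQuery, so COUNT(DISTINCT ...) and CASE replace exact_count_distinct and IF. The sample rows and the `events` alias are made up for illustration; only the table and column names come from the question.

```python
import sqlite3

# Hypothetical sample rows standing in for the emailevent table.
rows = [
    # (user_id, language, country, event)
    (1, "en", 1, "sent"),
    (1, "en", 1, "sent"),        # same user sent twice: must count once
    (2, "en", 1, "sent"),
    (1, "en", 1, "delivered"),
    (3, "de", 1, "sent"),
    (3, "de", 1, "delivered"),
    (4, "de", 2, "sent"),        # other country: filtered out
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emailevent "
    "(user_id INTEGER, language TEXT, country INTEGER, event TEXT)"
)
conn.executemany("INSERT INTO emailevent VALUES (?, ?, ?, ?)", rows)

# One subquery computes the distinct count per (event, language);
# the outer query pivots the events into columns.
pivot_sql = """
SELECT
  language,
  SUM(CASE WHEN event = 'sent' THEN uniqueevent ELSE 0 END) AS uniquesent,
  SUM(CASE WHEN event = 'delivered' THEN uniqueevent ELSE 0 END) AS uniquedelivered
FROM (
  SELECT event, language, COUNT(DISTINCT user_id) AS uniqueevent
  FROM emailevent
  WHERE country = 1 AND event IN ('sent', 'delivered')
  GROUP BY event, language
) AS events
GROUP BY language
ORDER BY language
"""
pivot_result = conn.execute(pivot_sql).fetchall()
print(pivot_result)
```

The pivot is safe here because each (event, language) pair yields exactly one row in the subquery, so the conditional sum just picks that row's count.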
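Could you lift the conditions you are looking for into fields in a subquery, and then count the distinct values of those fields? In other words, something like: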
select
...report_columns...,
t1.uniquesent,
t1.uniquedelivered,
from [main table]
left join (
select
language,
exact_count_distinct(sent) as uniquesent,
exact_count_distinct(delivered) as uniquedelivered,
from (
select
language,
if (country=1 and event='sent', e.user_id, null) as sent,
if (country=1 and event='delivered', e.user_id, null) as delivered,
from emailevent e
) group by language
) as t1
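The same approach can be tried out locally. This is a rough sketch, again using Python's sqlite3 in place of BigQuery, with CASE ... END yielding NULL where the condition fails so that COUNT(DISTINCT ...) skips those rows. The sample data is hypothetical; the column names follow the question.

```python
import sqlite3

# Hypothetical sample rows standing in for the emailevent table.
rows = [
    # (user_id, language, country, event)
    (1, "en", 1, "sent"),
    (1, "en", 1, "sent"),        # same user sent twice: must count once
    (2, "en", 1, "sent"),
    (1, "en", 1, "delivered"),
    (3, "de", 1, "sent"),
    (3, "de", 1, "delivered"),
    (4, "de", 2, "sent"),        # other country: becomes NULL, not counted
    (5, "fr", 2, "sent"),        # language with no matching rows at all
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emailevent "
    "(user_id INTEGER, language TEXT, country INTEGER, event TEXT)"
)
conn.executemany("INSERT INTO emailevent VALUES (?, ?, ?, ?)", rows)

# Each condition is turned into a nullable user_id expression, and
# COUNT(DISTINCT ...) ignores the NULLs, so one scan gives every column.
conditional_sql = """
SELECT
  language,
  COUNT(DISTINCT CASE WHEN country = 1 AND event = 'sent'
                      THEN user_id END) AS uniquesent,
  COUNT(DISTINCT CASE WHEN country = 1 AND event = 'delivered'
                      THEN user_id END) AS uniquedelivered
FROM emailevent
GROUP BY language
ORDER BY language
"""
conditional_result = conn.execute(conditional_sql).fetchall()
print(conditional_result)
```

One difference from the filtered subqueries in the question: since there is no WHERE clause, a language with no matching rows still appears, with counts of 0 instead of being absent.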
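If you run an exact count distinct over too many distinct values, this can land you in resources_exceeded territory. Note that if you use count distinct with a bucket count (in legacy SQL, COUNT(DISTINCT field, n)), you get exact counts up to the bucket count. Often people care about exactness only for small numbers; once the numbers are large, an approximation is fine.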