Optimizing a large query to use fewer JOINs, with count distinct involved

On Google BigQuery, we have a report with about 10 columns, such as:

+----------------+-----------------+---------------+-------------+
|     uniquesent | uniquedelivered | uniquebounced | uniqueopens |
+----------------+-----------------+---------------+-------------+

We have a long query that uses a lot of joins to compute these values. The big query is organized roughly like this:

select
  ...report_columns...,
  sent.uniquesent,
  delivered.uniquedelivered,
from [main table]
left join (
  select
    language,
    exact_count_distinct(e.user_id) as uniquesent
  from emailevent e
  where country=1 and event='sent'
  group by 1
) as sent
left join (
  select
    language,
    exact_count_distinct(e.user_id) as uniquedelivered
  from emailevent e
  where country=1 and event='delivered'
  group by 1
) as delivered

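and the other 10 or so similar items in this list of JOINs follow the same pattern. Imagine also that the query contains group by day/week/month parts, and it becomes very complicated even to read. For some of these queries we also get an error message: Resources exceeded.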

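We would like to rewrite and optimize the query so that it returns the same numbers but runs more efficiently. Let me know if you have further questions, but mainly we want to get rid of the joins somehow and make the query more compact and better performing.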

We have already compacted parts of the query using the following syntax:

sum(if(p.country_id=1 AND event = "userblocked" AND JSON_EXTRACT_SCALAR(e.meta,'$.reason') contains 'drop_status',1,0)) as bounced,
sum(if(p.country_id=1 AND event = "userblocked" AND JSON_EXTRACT_SCALAR(e.meta,'$.reason') contains 'spam_report',1,0)) as spam_reported

but this syntax does not work for distinct counts.

For the block you posted, you can do something like this to reduce the number of joins.

select
  ...report_columns...,
  SUM(IF(event='sent', unique_event, 0)) as uniquesent,
  SUM(IF(event='delivered', unique_event, 0)) as uniquedelivered
from [main table]
left join (
  select
    event,
    language,
    exact_count_distinct(e.user_id) as unique_event
  from emailevent e
  where country=1 and event in ('sent', 'delivered')
  group by event, language
) as sent
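
As a sanity check on the pivot idea above, here is a minimal sketch on toy data. It runs on Python's built-in sqlite3 rather than BigQuery, so `count(distinct ...)` and `case when` stand in for `exact_count_distinct` and `IF`; the rows and the function name `pivoted_unique_counts` are invented for illustration and the join to the main table is left out.

```python
import sqlite3

# Toy stand-in for the emailevent table; these rows are invented.
ROWS = [
    # (user_id, language, country, event)
    (1, "en", 1, "sent"),
    (1, "en", 1, "sent"),       # second send to the same user
    (1, "en", 1, "delivered"),
    (2, "en", 1, "sent"),
    (3, "de", 1, "sent"),
    (3, "de", 1, "delivered"),
    (4, "de", 2, "sent"),       # other country, filtered out
]

def pivoted_unique_counts(rows):
    """Return {language: (uniquesent, uniquedelivered)} from one grouped subquery."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table emailevent "
        "(user_id int, language text, country int, event text)"
    )
    conn.executemany("insert into emailevent values (?, ?, ?, ?)", rows)
    query = """
        select
          language,
          sum(case when event = 'sent' then unique_event else 0 end),
          sum(case when event = 'delivered' then unique_event else 0 end)
        from (
          -- one distinct count per (event, language) pair
          select event, language, count(distinct user_id) as unique_event
          from emailevent
          where country = 1 and event in ('sent', 'delivered')
          group by event, language
        )
        group by language
    """
    result = {lang: (sent, delivered) for lang, sent, delivered in conn.execute(query)}
    conn.close()
    return result

print(pivoted_unique_counts(ROWS))
```

Summing is safe here only because the inner query yields a single row per (event, language) pair, so each outer sum picks up exactly one distinct count. Summing distinct counts across several rows of the same event (for example per-day counts rolled up to a month) would count a user once per row.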

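Can you promote the conditions you are looking for into fields in a subquery, and then count the distinct values of those fields? In other words, something like: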

select
  ...report_columns...,
  t1.uniquesent,
  t1.uniquedelivered,
from [main table]
left join (
  select
    language,
    exact_count_distinct(sent) as uniquesent,
    exact_count_distinct(delivered) as uniquedelivered,
  from (
    select
      language,
      if (country=1 and event='sent', e.user_id, null) as sent,
      if (country=1 and event='delivered', e.user_id, null) as delivered,
    from emailevent e
  ) group by language
) as t1
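
The conditional-field approach above can be checked on toy data with a minimal sketch. It runs on Python's built-in sqlite3 instead of BigQuery, so `count(distinct ...)` and `case when` stand in for `exact_count_distinct` and `if`; the rows and the function name `conditional_unique_counts` are invented for illustration. It relies on `count(distinct ...)` skipping NULLs, which is how SQLite behaves.

```python
import sqlite3

# Toy stand-in for the emailevent table; these rows are invented.
EVENTS = [
    # (user_id, language, country, event)
    (1, "en", 1, "sent"),
    (1, "en", 1, "sent"),       # second send to the same user
    (1, "en", 1, "delivered"),
    (2, "en", 1, "sent"),
    (3, "de", 1, "sent"),
    (3, "de", 1, "delivered"),
    (4, "fr", 2, "sent"),       # other country only
]

def conditional_unique_counts(rows):
    """Return {language: (uniquesent, uniquedelivered)} in a single pass, no joins."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table emailevent "
        "(user_id int, language text, country int, event text)"
    )
    conn.executemany("insert into emailevent values (?, ?, ?, ?)", rows)
    query = """
        select
          language,
          count(distinct sent) as uniquesent,
          count(distinct delivered) as uniquedelivered
        from (
          -- user_id survives only where the condition holds, otherwise NULL
          select
            language,
            case when country = 1 and event = 'sent' then user_id end as sent,
            case when country = 1 and event = 'delivered' then user_id end as delivered
          from emailevent
        )
        group by language
    """
    result = {lang: (sent, delivered) for lang, sent, delivered in conn.execute(query)}
    conn.close()
    return result

print(conditional_unique_counts(EVENTS))
```

One difference from the filtered subqueries is visible in the toy data: because there is no where clause, a language with no matching events (here "fr") still produces a row with counts of 0, where the original left join would have produced NULL.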

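This may get you into resources_exceeded territory if you are doing exact count distincts over too many distinct values. Note that if you use count distinct with a bucket count, you will get exact counts up to the bucket count. A lot of the time people care about exact counts for small numbers, and once the numbers get large, an approximation is fine.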