Multiple Self-Join based on GROUP BY results

Question

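I am trying to collect details about backup activity from a PostgreSQL database table on a backup appliance (Avamar). The table has several columns, including client_name, dataset, plugin_name, type, completed_ts, status_code, bytes_modified, and more. Simplified example: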

| session_id | client_name | dataset |         plugin_name |             type |         completed_ts | status_code | bytes_modified |
|------------|-------------|---------|---------------------|------------------|----------------------|-------------|----------------|
|          1 |    server01 | Windows | Windows File System | Scheduled Backup | 2017-12-05T01:00:00Z |       30900 |       11111111 |
|          2 |    server01 | Windows | Windows File System | Scheduled Backup | 2017-12-04T01:00:00Z |       30000 |       22222222 |
|          3 |    server01 | Windows | Windows File System | Scheduled Backup | 2017-12-03T01:00:00Z |       30000 |       22222222 |
|          4 |    server01 | Windows | Windows File System | Scheduled Backup | 2017-12-02T01:00:00Z |       30000 |       22222222 |
|          5 |    server01 | Windows |         Windows VSS | Scheduled Backup | 2017-12-01T01:00:00Z |       30000 |       33333333 |
|          6 |    server02 | Windows | Windows File System | Scheduled Backup | 2017-12-05T02:00:00Z |       30000 |       44444444 |
|          7 |    server02 | Windows | Windows File System | Scheduled Backup | 2017-12-04T02:00:00Z |       30900 |       55555555 |
|          8 |    server03 | Windows | Windows File System | On-Demand Backup | 2017-12-05T03:00:00Z |       30000 |       66666666 |
|          9 |    server04 | Windows | Windows File System |         Validate | 2017-12-05T03:00:00Z |       30000 |       66666666 |

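Each client_name (server) can have multiple datasets, and each dataset can have multiple plugin_names. So I wrote a SQL statement that does a GROUP BY on these three columns to get a list of "job" activity over time. (http://sqlfiddle.com/#!15/f15556/1)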

select
  client_name,
  dataset,
  plugin_name
from v_activities_2
where
  type like '%Backup%'
group by
  client_name, dataset, plugin_name

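Each of these jobs can succeed or fail, as indicated by the status_code column. Using a self-join with a subquery, I am able to get the Last Good backup along with its completed_ts (completion time), bytes_modified, and so on: (http://sqlfiddle.com/#!15/f15556/16)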

select
  a2.client_name,
  a2.dataset,
  a2.plugin_name,
  a2.LastGood,
  a3.status_code,
  a3.bytes_modified as LastGood_bytes
from v_activities_2 a3

join (
  select
    client_name,
    dataset,
    plugin_name,
    max(completed_ts) as LastGood
  from v_activities_2 a2
  where
    type like '%Backup%'
    and status_code in (30000,30005)   -- Successful (Good) Status codes
  group by
    client_name, dataset, plugin_name
) as a2
on a3.client_name  = a2.client_name and
   a3.dataset      = a2.dataset and
   a3.plugin_name  = a2.plugin_name and
   a3.completed_ts = a2.LastGood

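I can do the same thing separately to get the details of the last attempt by removing the status_code line from the WHERE clause: http://sqlfiddle.com/#!15/f15556/3. Note that most of the time LastGood and LastAttempt are the same row, but sometimes they are not, depending on whether the last backup succeeded.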

The problem I am having is merging these two statements into one (if possible), so that I get this result:

| client_name | dataset |         plugin_name |             lastgood |  lastgood_bytes |          lastattempt | lastattempt_bytes |
|-------------|---------|---------------------|----------------------|-----------------|----------------------|-------------------|
|    server01 | Windows | Windows File System | 2017-12-04T01:00:00Z |        22222222 | 2017-12-05T01:00:00Z |          11111111 |
|    server01 | Windows |         Windows VSS | 2017-12-01T01:00:00Z |        33333333 | 2017-12-01T01:00:00Z |          33333333 |
|    server02 | Windows | Windows File System | 2017-12-05T02:00:00Z |        44444444 | 2017-12-05T02:00:00Z |          44444444 |
|    server03 | Windows | Windows File System | 2017-12-05T03:00:00Z |        66666666 | 2017-12-05T03:00:00Z |          66666666 |

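I tried adding another RIGHT JOIN at the end (http://sqlfiddle.com/#!15/f15556/4) and got NULL rows. After some reading, I understand that the first two tables are joined into a temporary table before the second join happens, but by then the data I need is lost, so I get NULL rows.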

I am using PostgreSQL 8 through Groovy scripts. I also have only read-only access to the database.

Answer 1

You apparently have two intermediate inner join output tables, and you want columns from each of them about things identified by a common key. So inner join them on that key.

select
  g.client_name,
  g.dataset,
  g.plugin_name,
  LastGood,
  g.status_code as LastGood_status,
  LastGood_bytes,
  LastAttempt,
  l.status_code as LastAttempt_status,
  LastAttempt_bytes
from
( -- cut & pasted Last Good http://sqlfiddle.com/#!15/f15556/16
    select
      a2.client_name,
      a2.dataset,
      a2.plugin_name,
      a2.LastGood,
      a3.status_code,
      a3.bytes_modified as LastGood_bytes
    from v_activities_2 a3
    join (
      select
        client_name,
        dataset,
        plugin_name,
        max(completed_ts) as LastGood
      from v_activities_2 a2
      where
        type like '%Backup%'
        and status_code in (30000,30005)   -- Successful (Good) Status codes
      group by
        client_name, dataset, plugin_name
    ) as a2
    on a3.client_name  = a2.client_name and
       a3.dataset      = a2.dataset and
       a3.plugin_name  = a2.plugin_name and
       a3.completed_ts = a2.LastGood
) as g
join 
( -- cut & pasted Last Attempt http://sqlfiddle.com/#!15/f15556/3
    select
      a1.client_name,
      a1.dataset,
      a1.plugin_name,
      a1.LastAttempt,
      a3.status_code,
      a3.bytes_modified as LastAttempt_bytes
    from v_activities_2 a3
    join (
      select
        client_name,
        dataset,
        plugin_name,
        max(completed_ts) as LastAttempt
      from v_activities_2 a2
      where
        type like '%Backup%'
      group by
        client_name, dataset, plugin_name
    ) as a1
    on a3.client_name  = a1.client_name and
       a3.dataset      = a1.dataset and
       a3.plugin_name  = a1.plugin_name and
       a3.completed_ts = a1.LastAttempt
) as l
on l.client_name  = g.client_name and
   l.dataset      = g.dataset and
   l.plugin_name  = g.plugin_name
order by client_name, dataset, plugin_name

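This applies one of the approaches quoted below, which come from an answer about another query. The correspondence between the chunks of code may not be obvious, though: its intermediate joins are LEFT JOINs where yours are INNER JOINs, and its GROUP_CONCAT plays the role of your max. (It lists more approaches than apply here, because of the particulars of GROUP_CONCAT and of that query.)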

A correct symmetrical INNER JOIN approach: LEFT JOIN q1 & q2--1:many--then GROUP BY & GROUP_CONCAT (which is what your first query did); then separately similarly LEFT JOIN q1 & q3--1:many--then GROUP BY & GROUP_CONCAT; then INNER JOIN the two results ON user_id--1:1.

A correct cumulative LEFT JOIN approach: JOIN q1 & q2--1:many--then GROUP BY & GROUP_CONCAT; then left join that & q3--1:many--then GROUP BY & GROUP_CONCAT.

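Whether this actually serves your purpose depends, in general, on your actual specification and constraints. Even if the two joins you link to are what you want, you need to explain exactly what you mean by "merge". You do not say what you want when the two joins have different sets of values for the grouped columns. Force yourself to state in plain English which rows are in the result in terms of which rows are in the input.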
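
To make the "different sets of values" case concrete: a job that has attempts but no good run has no row in the Last Good subquery, so the inner join above drops it. If such jobs should still appear, one possible variant (a sketch only, assuming that is the specification you want) keeps the Last Attempt side and LEFT JOINs the Last Good side to it, leaving LastGood and LastGood_bytes NULL for those jobs:

```sql
-- Sketch, not a drop-in answer: assumes every job with a last attempt should
-- appear, with NULLs where it has no good run.
select
  l.client_name,
  l.dataset,
  l.plugin_name,
  g.LastGood,
  g.LastGood_bytes,
  l.LastAttempt,
  l.LastAttempt_bytes
from
( -- Last Attempt per job, with details from the matching row
    select
      a1.client_name,
      a1.dataset,
      a1.plugin_name,
      a1.LastAttempt,
      a3.bytes_modified as LastAttempt_bytes
    from v_activities_2 a3
    join (
      select client_name, dataset, plugin_name,
             max(completed_ts) as LastAttempt
      from v_activities_2
      where type like '%Backup%'
      group by client_name, dataset, plugin_name
    ) as a1
    on a3.client_name  = a1.client_name and
       a3.dataset      = a1.dataset and
       a3.plugin_name  = a1.plugin_name and
       a3.completed_ts = a1.LastAttempt
) as l
left join
( -- Last Good per job; a job with no good run has no row here
    select
      a2.client_name,
      a2.dataset,
      a2.plugin_name,
      a2.LastGood,
      a3.bytes_modified as LastGood_bytes
    from v_activities_2 a3
    join (
      select client_name, dataset, plugin_name,
             max(completed_ts) as LastGood
      from v_activities_2
      where type like '%Backup%'
        and status_code in (30000,30005)   -- Successful (Good) Status codes
      group by client_name, dataset, plugin_name
    ) as a2
    on a3.client_name  = a2.client_name and
       a3.dataset      = a2.dataset and
       a3.plugin_name  = a2.plugin_name and
       a3.completed_ts = a2.LastGood
) as g
on g.client_name  = l.client_name and
   g.dataset      = l.dataset and
   g.plugin_name  = l.plugin_name
order by l.client_name, l.dataset, l.plugin_name
```

The outer join is safe in this direction because a job with a good run always has a last attempt, but not the other way around.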

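PS 1 You have undocumented/undeclared/unenforced constraints. Declare them when possible; otherwise enforce them with triggers. If they are not in code, document them in the question text. Constraints are fundamental to reasoning about multiple instances of a subrow value in a join, and to GROUP BY.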
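
One constraint the queries above seem to rely on (this is my reading, not something stated in the question) is that a job never has two rows with the same completed_ts. If it did, joining back on max(completed_ts) would return more than one detail row for that job. With read-only access you cannot declare the constraint, but a sketch like the following can at least check whether the current data satisfies it:

```sql
-- Sketch: list (job, completed_ts) combinations that occur more than once.
-- No filter on type, because the detail join above matches rows of any type.
select
  client_name,
  dataset,
  plugin_name,
  completed_ts,
  count(*) as row_count
from v_activities_2
group by
  client_name, dataset, plugin_name, completed_ts
having count(*) > 1
```

An empty result only says the assumption holds for the data present today; it does not enforce anything.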

PS 2 Learn the syntax and semantics of select. Learn what left/right outer join on returns: the rows that inner join on returns, plus the unmatched left/right table rows extended by NULLs.
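
A minimal illustration of that rule, using made-up inline rows rather than your table (the names job_a, job_b and the aliases j and g are hypothetical, and a VALUES list in FROM needs PostgreSQL 8.2 or later):

```sql
-- Two jobs on the left; only job_a has a matching row on the right.
select
  j.job,
  g.last_good
from (values ('job_a'::text), ('job_b'::text)) as j(job)
left join (values ('job_a'::text, '2017-12-04'::text)) as g(job, last_good)
  on g.job = j.job
order by j.job
```

This returns job_a with its last_good value, which is the row an inner join would also return, plus job_b with last_good set to NULL, which is the unmatched left row extended by NULLs. Changing left join to join drops the job_b row.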

PS 3 Is there any rule of thumb to construct SQL query from a human-readable description?

Answer 2

Here is another approach, which also works but is harder to follow, and may be better suited to my dataset: http://sqlfiddle.com/#!15/f15556/114

select
  Actvty.client_name,
  Actvty.dataset,
  Actvty.plugin_name,
  ActvtyGood.LastGood,
  ActvtyGood.status_code as LastGood_status,
  ActvtyGood.bytes_modified as LastGood_bytes,
  ActvtyOnly.LastAttempt,
  Actvty.status_code as LastAttempt_status,
  Actvty.bytes_modified as LastAttempt_bytes
from v_activities_2 Actvty

-- 1. Get last attempt of each job (which may or may not match last good)
join (
  select
    client_name,
    dataset,
    plugin_name,
    max(completed_ts) as LastAttempt
  from v_activities_2
  where
    type like '%Backup%'
  group by
    client_name, dataset, plugin_name
) as ActvtyOnly
on Actvty.client_name  = ActvtyOnly.client_name and
   Actvty.dataset      = ActvtyOnly.dataset and
   Actvty.plugin_name  = ActvtyOnly.plugin_name and
   Actvty.completed_ts = ActvtyOnly.LastAttempt

-- 4. join the list of good runs with the table of last attempts, there would never be a job that has a last good without also a last attempt.
join (

  -- 3. join last good runs with the full table to get the additional details of each
  select
    ActvtyGoodSub.client_name,
    ActvtyGoodSub.dataset,
    ActvtyGoodSub.plugin_name,
    ActvtyGoodSub.LastGood,
    ActvtyAll.status_code,
    ActvtyAll.bytes_modified
  from v_activities_2 ActvtyAll

  -- 2. Get last Good run of each job
  join (
    select
      client_name,
      dataset,
      plugin_name,
      max(completed_ts) as LastGood
    from v_activities_2
    where
      type like '%Backup%'
      and status_code in (30000,30005)   -- Successful (Good) Status codes
    group by
      client_name, dataset, plugin_name
  ) as ActvtyGoodSub
  on ActvtyAll.client_name  = ActvtyGoodSub.client_name and
     ActvtyAll.dataset      = ActvtyGoodSub.dataset and
     ActvtyAll.plugin_name  = ActvtyGoodSub.plugin_name and
     ActvtyAll.completed_ts = ActvtyGoodSub.LastGood

) as ActvtyGood
on Actvty.client_name  = ActvtyGood.client_name and
   Actvty.dataset      = ActvtyGood.dataset and
   Actvty.plugin_name  = ActvtyGood.plugin_name
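
For comparison, here is a shorter sketch of the same result that swaps in a different technique: PostgreSQL's DISTINCT ON instead of max() plus a join back to the table. DISTINCT ON is a PostgreSQL extension; I believe it is available in 8.x, but check it against your exact version. It is not the method used in either query above.

```sql
-- Sketch: DISTINCT ON keeps the first row per job in ORDER BY order,
-- so sorting by completed_ts desc picks the latest row of each job.
select
  l.client_name,
  l.dataset,
  l.plugin_name,
  g.completed_ts   as LastGood,
  g.status_code    as LastGood_status,
  g.bytes_modified as LastGood_bytes,
  l.completed_ts   as LastAttempt,
  l.status_code    as LastAttempt_status,
  l.bytes_modified as LastAttempt_bytes
from (
  -- last attempt of each job
  select distinct on (client_name, dataset, plugin_name)
    client_name, dataset, plugin_name,
    completed_ts, status_code, bytes_modified
  from v_activities_2
  where type like '%Backup%'
  order by client_name, dataset, plugin_name, completed_ts desc
) as l
join (
  -- last good run of each job
  select distinct on (client_name, dataset, plugin_name)
    client_name, dataset, plugin_name,
    completed_ts, status_code, bytes_modified
  from v_activities_2
  where type like '%Backup%'
    and status_code in (30000,30005)   -- Successful (Good) Status codes
  order by client_name, dataset, plugin_name, completed_ts desc
) as g
on g.client_name  = l.client_name and
   g.dataset      = l.dataset and
   g.plugin_name  = l.plugin_name
order by l.client_name, l.dataset, l.plugin_name
```

Two differences from the queries above are worth knowing. The detail columns come from a row that itself passed the type filter, and if two rows of a job share the latest completed_ts, DISTINCT ON returns an arbitrary one of them rather than both.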
