JOIN with GROUP BY causing SUM() logic issues

Question

Query:

sel TableName, DatabaseName, sum(CurrentPerm/(1024*1024*1024)) as Size_in_GB
        from dbc.tablesize
        group by 1,2
        order by Size_in_GB desc

Result:

+-----------+--------+------------+
| TableName | DBName | Size_in_GB |
+-----------+--------+------------+
| WRP       | A      |  28,350.01 |
| CPC       | B      |  19,999.37 |
| SDF       | C      |  13,263.67 |
| DB1400    | D      |  13,200.26 |
+-----------+--------+------------+

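From the simple query above I can see that table WRP in database A is close to 28,350 GB.

Now I am trying to join another table, dbc.indices, so that I can filter on the column IndexType, but now Size_in_GB has changed for every table.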

sel a.TableName,a.DatabaseName, sum(CurrentPerm/(1024*1024*1024)) as Size_in_GB from dbc.tablesize a
join dbc.indices b on a.TableName = b.TableName and a.DatabaseName=b.DatabaseName
--where b.indexType='P'
group by 1,2
order by Size_in_GB desc

The result looks like this:

+-----------+--------+------------+
| TableName | DBName | Size_in_GB |
+-----------+--------+------------+
| WRP       | A      |  56,700.02 |
| CPC       | B      |  39,998.74 |
| DB1400    | D      |  39,600.78 |
+-----------+--------+------------+

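Now the same table is twice the size, i.e. WRP is 56,700 GB (and similarly for the other tables).

I am not sure what is wrong with the logic I used for the join.

P.S. My aim is to find all the tables which are greater than 100GB in size and have indexType as 'P'.

Edit: sharing the relevant columns from the DBC.INDICES table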

+--------------+------------+-------------+-----------+------------+---------------+------------+----------------+
| DatabaseName | TableName  | IndexNumber | IndexType | UniqueFlag |   IndexName   | ColumnName | ColumnPosition |
+--------------+------------+-------------+-----------+------------+---------------+------------+----------------+
| Some DB      | Some Table |           1 | P         | N          | IndexNamehere | ColumnA    |              1 |
+--------------+------------+-------------+-----------+------------+---------------+------------+----------------+

Answer 1

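Your join key is probably duplicated in the dbc.indices table. dbc.indices has more than one entry for a single TableName, so when you join it to dbc.tablesize the tablesize rows are repeated once per matching entry. SUM is then applied over the repeated rows, which gives the wrong total.

Try this approach: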

SELECT a.TableName,
       a.DatabaseName,
       Sum(CurrentPerm / ( 1024 * 1024 * 1024 )) AS Size_in_GB
FROM   dbc.tablesize a
       JOIN (SELECT DISTINCT b.TableName,
                             b.DatabaseName
             FROM   dbc.indices b
             --where b.indexType='P'
             ) b
         ON a.TableName = b.TableName
            AND a.DatabaseName = b.DatabaseName

GROUP  BY a.TableName,
          a.DatabaseName
ORDER  BY Size_in_GB DESC
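
The row duplication described above can be reproduced on a toy dataset. The following is a minimal sketch, assuming SQLite (through Python's sqlite3 module) as a stand-in for Teradata; the tables tablesize and indices and their contents are hypothetical and only mimic the shape of the DBC views.

```python
import sqlite3

# In-memory SQLite database standing in for Teradata (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tablesize (DatabaseName TEXT, TableName TEXT, CurrentPerm INTEGER);
    CREATE TABLE indices (DatabaseName TEXT, TableName TEXT, IndexType TEXT, ColumnPosition INTEGER);
    -- Two size rows for the same table; their true total is 100.
    INSERT INTO tablesize VALUES ('A', 'WRP', 60), ('A', 'WRP', 40);
    -- Two index rows for the same table, so the join key is duplicated.
    INSERT INTO indices VALUES ('A', 'WRP', 'P', 1), ('A', 'WRP', 'P', 2);
""")

# Plain join: every size row is repeated once per matching index row.
naive = conn.execute("""
    SELECT a.TableName, a.DatabaseName, SUM(a.CurrentPerm)
    FROM tablesize a
    JOIN indices b
      ON a.TableName = b.TableName AND a.DatabaseName = b.DatabaseName
    GROUP BY a.TableName, a.DatabaseName
""").fetchall()

# Join against a de-duplicated key list: each size row matches exactly once.
deduped = conn.execute("""
    SELECT a.TableName, a.DatabaseName, SUM(a.CurrentPerm)
    FROM tablesize a
    JOIN (SELECT DISTINCT TableName, DatabaseName FROM indices) b
      ON a.TableName = b.TableName AND a.DatabaseName = b.DatabaseName
    GROUP BY a.TableName, a.DatabaseName
""").fetchall()

print(naive)    # total is doubled
print(deduped)  # total is correct
```

With two index rows the naive total is exactly twice the real one; with three it would be three times, which is why the inflation factor can differ from table to table.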

Answer 2

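What is the confusion?

You evidently have tables with more than one index. Each index causes the table to appear more than once in the aggregation.

What you want: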

My aim is to find all the tables which are greater than 100GB in Size and have indexType as 'P'

I would suggest moving the index comparison to the where clause:

select t.TableName, t.DatabaseName,
       sum(t.CurrentPerm/(1024*1024*1024)) as Size_in_GB
from dbc.tablesize t
where exists (select 1
              from dbc.indices i
              where t.TableName = i.TableName and t.DatabaseName = i.DatabaseName and
                    i.indexType = 'P'
             )
group by 1,2
order by Size_in_GB desc

If you also want to add that filter, you can add having Size_in_GB > 100 before the order by.
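
As a rough illustration of the EXISTS approach combined with the size filter, here is a minimal sketch assuming SQLite (through Python's sqlite3 module) in place of Teradata. The tables tablesize and indices and their values are hypothetical, and CurrentPerm is treated as already being in GB to keep the numbers readable. The aggregate is repeated in HAVING instead of using the alias, to stay within standard SQL.

```python
import sqlite3

# In-memory SQLite database standing in for Teradata (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tablesize (DatabaseName TEXT, TableName TEXT, CurrentPerm INTEGER);
    CREATE TABLE indices (DatabaseName TEXT, TableName TEXT, IndexType TEXT);
    INSERT INTO tablesize VALUES
        ('A', 'WRP', 80), ('A', 'WRP', 70),   -- 150 in total, has 'P' index rows
        ('B', 'CPC', 120),                    -- 120, has a 'P' index row
        ('A', 'SMALL', 50),                   -- has a 'P' index row but is too small
        ('B', 'NOPI', 500);                   -- large, but no 'P' index row
    INSERT INTO indices VALUES
        ('A', 'WRP', 'P'), ('A', 'WRP', 'P'),
        ('B', 'CPC', 'P'),
        ('A', 'SMALL', 'P'),
        ('B', 'NOPI', 'S');
""")

# EXISTS only tests whether a matching index row is present, so the
# number of matching rows cannot multiply the rows being summed.
rows = conn.execute("""
    SELECT t.TableName, t.DatabaseName, SUM(t.CurrentPerm) AS TotalPerm
    FROM tablesize t
    WHERE EXISTS (SELECT 1
                  FROM indices i
                  WHERE i.TableName = t.TableName
                    AND i.DatabaseName = t.DatabaseName
                    AND i.IndexType = 'P')
    GROUP BY t.TableName, t.DatabaseName
    HAVING SUM(t.CurrentPerm) > 100
    ORDER BY TotalPerm DESC
""").fetchall()

print(rows)
```

WRP keeps its true total of 150 even though it has two matching index rows, while SMALL and NOPI are filtered out by HAVING and EXISTS respectively.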

Answer 3

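dbc.IndicesV (never use the old deprecated non-V views) has one row per column per index.

You can simply add a condition to restrict it to a single row: where IndexType = 'P' and ColumnPosition = 1

It is also more efficient to aggregate early, i.e. to aggregate before joining: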

select dt.*
from 
 (
   select TableName, DatabaseName,
      sum(CurrentPerm/(1024*1024*1024)) as Size_in_GB
   from dbc.TableSizeV
   group by 1,2
   having Size_in_GB > 100
 ) as dt
join dbc.IndicesV b 
  on dt.TableName = b.TableName
 and dt.DatabaseName = b.DatabaseName
where IndexType = 'P' 
  and ColumnPosition = 1
order by Size_in_GB desc;
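
The aggregate-then-join pattern can be tried out on toy data. This is a minimal sketch, assuming SQLite (through Python's sqlite3 module) instead of Teradata; the tables tablesize and indices, the alias TotalPerm, and all values are hypothetical, and CurrentPerm is treated as already being in GB.

```python
import sqlite3

# In-memory SQLite database standing in for Teradata (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tablesize (DatabaseName TEXT, TableName TEXT, CurrentPerm INTEGER);
    CREATE TABLE indices (DatabaseName TEXT, TableName TEXT, IndexType TEXT, ColumnPosition INTEGER);
    INSERT INTO tablesize VALUES ('A', 'WRP', 80), ('A', 'WRP', 70), ('B', 'TINY', 5);
    -- WRP has a two-column primary index: one row per index column.
    INSERT INTO indices VALUES
        ('A', 'WRP', 'P', 1), ('A', 'WRP', 'P', 2), ('B', 'TINY', 'P', 1);
""")

# The derived table sums and filters first, so the later join can only
# repeat finished totals; it can no longer inflate them.
QUERY = """
    SELECT dt.TableName, dt.DatabaseName, dt.TotalPerm
    FROM (
        SELECT TableName, DatabaseName, SUM(CurrentPerm) AS TotalPerm
        FROM tablesize
        GROUP BY TableName, DatabaseName
        HAVING SUM(CurrentPerm) > 100
    ) AS dt
    JOIN indices b
      ON dt.TableName = b.TableName AND dt.DatabaseName = b.DatabaseName
    WHERE b.IndexType = 'P' {extra}
    ORDER BY dt.TotalPerm DESC
"""

# With ColumnPosition = 1, at most one row of each index matches.
one_row_per_table = conn.execute(QUERY.format(extra="AND b.ColumnPosition = 1")).fetchall()
# Without it, a multi-column index yields one output row per index column.
one_row_per_column = conn.execute(QUERY.format(extra="")).fetchall()

print(one_row_per_table)
print(one_row_per_column)
```

In both variants the total stays at 150 because the sum is computed before the join; the ColumnPosition = 1 condition only removes the repeated output row.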

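But why filter on IndexType = 'P' at all? Don't you care about other objects larger than 100GB (NoPI/Columnar tables, Join Indexes)? By the way, this does not return all tables with a PI, only those where IndexNumber = 1.

Depending on your needs, you may be better off joining to dbc.TablesV.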

Answer 4

P.S - My aim is to find all the tables which are greater than 100GB in Size and have indexType as 'P'

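If you only want to find tables for which a certain index exists, you should not join at all. Use EXISTS instead. That puts your condition where it belongs, in the WHERE or HAVING clause, and the condition can no longer duplicate records (in your case: when a table has more than one matching index).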

select tablename, databasename, sum(currentperm/(1024*1024*1024)) as size_in_gb 
from dbc.tablesize ts
group by tablename, databasename
having sum(currentperm/(1024*1024*1024)) > 100
and exists
(
  select *
  from dbc.indices i
  where i.tablename = ts.tablename and i.databasename = ts.databasename
  and i.indexType = 'P'
)
order by Size_in_GB desc;

Tags: sql, join, teradata