JOIN with GROUP BY causing SUM() logic issues

Query:

sel TableName, DatabaseName, sum(CurrentPerm/(1024*1024*1024)) as Size_in_GB
        from dbc.tablesize
        group by 1,2
        order by Size_in_GB desc

Result:

+-----------+--------+------------+
| TableName | DBName | Size_in_GB |
+-----------+--------+------------+
| WRP       | A      |  28,350.01 |
| CPC       | B      |  19,999.37 |
| SDF       | C      |  13,263.67 |
| DB1400    | D      |  13,200.26 |
+-----------+--------+------------+

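From the simple query above I can see that table WRP in database A is close to 28,350 GB.

Now I am trying to join another table, dbc.indices, so that I can filter on the column IndexType, but the Size_in_GB of every table has changed.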

sel a.TableName,a.DatabaseName, sum(CurrentPerm/(1024*1024*1024)) as Size_in_GB from dbc.tablesize a
join dbc.indices b on a.TableName = b.TableName and a.DatabaseName=b.DatabaseName
--where b.indexType='P'
group by 1,2
order by Size_in_GB desc

The result looks like this:

+-----------+--------+------------+
| TableName | DBName | Size_in_GB |
+-----------+--------+------------+
| WRP       | A      |  56,700.02 |
| CPC       | B      |  39,998.74 |
| DB1400    | D      |  39,600.78 |
+-----------+--------+------------+

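Now the same table is twice the size, i.e. WRP is 56,700 GB (the other tables are inflated in a similar way).

I am not sure what is wrong with the logic I used for the join.

P.S. My aim is to find all the tables which are greater than 100GB in size and have indexType 'P'.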

Edit: sharing the relevant columns from the DBC.INDICES table:

+--------------+------------+-------------+-----------+------------+---------------+------------+----------------+
| DatabaseName | TableName  | IndexNumber | IndexType | UniqueFlag |   IndexName   | ColumnName | ColumnPosition |
+--------------+------------+-------------+-----------+------------+---------------+------------+----------------+
| Some DB      | Some Table |           1 | P         | N          | IndexNamehere | ColumnA    |              1 |
+--------------+------------+-------------+-----------+------------+---------------+------------+----------------+

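Your join key is probably duplicated in the dbc.indices table. dbc.indices has more than one entry for a single TableName, so the dbc.tablesize rows are repeated when you join, and SUM is then applied to the repeated rows, which gives the wrong total.

Try this approach: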

SELECT a.TableName,
       a.DatabaseName,
       Sum(CurrentPerm / ( 1024 * 1024 * 1024 )) AS Size_in_GB
FROM   dbc.tablesize a
       JOIN (SELECT DISTINCT b.TableName,
                             b.DatabaseName
             FROM   dbc.indices b
             --where b.indexType='P'
             ) b
         ON a.TableName = b.TableName
            AND a.DatabaseName = b.DatabaseName

GROUP  BY a.TableName,
          a.DatabaseName
ORDER  BY Size_in_GB DESC 
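
The row duplication described above can be reproduced on toy data. The following is only a sketch: it uses Python's built-in sqlite3 module instead of Teradata, and the tables `tablesize` and `indices`, their reduced column sets, and all values are made up for illustration. It assumes one size row per AMP and one index row per indexed column.

```python
import sqlite3

# Toy stand-ins for dbc.tablesize and dbc.indices (hypothetical schema and values).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tablesize (DatabaseName TEXT, TableName TEXT, Vproc INTEGER, CurrentPerm REAL);
    CREATE TABLE indices (DatabaseName TEXT, TableName TEXT, IndexType TEXT, ColumnPosition INTEGER);
    -- Two size rows for the same table: 10 + 20 = 30 in total.
    INSERT INTO tablesize VALUES ('A', 'WRP', 0, 10), ('A', 'WRP', 1, 20);
    -- A two-column index produces two rows for the same table.
    INSERT INTO indices VALUES ('A', 'WRP', 'P', 1), ('A', 'WRP', 'P', 2);
""")

# Baseline: sum without any join.
plain = conn.execute("SELECT SUM(CurrentPerm) FROM tablesize").fetchone()[0]

# Direct join: every size row is matched by both index rows, so the sum doubles.
joined = conn.execute("""
    SELECT SUM(a.CurrentPerm)
    FROM tablesize a
    JOIN indices b
      ON a.TableName = b.TableName AND a.DatabaseName = b.DatabaseName
""").fetchone()[0]

# Join against a DISTINCT list of tables: each size row is matched once.
deduped = conn.execute("""
    SELECT SUM(a.CurrentPerm)
    FROM tablesize a
    JOIN (SELECT DISTINCT TableName, DatabaseName FROM indices) b
      ON a.TableName = b.TableName AND a.DatabaseName = b.DatabaseName
""").fetchone()[0]

print(plain, joined, deduped)  # 30.0 60.0 30.0
```

With two matching index rows the joined total is exactly twice the real one, which matches the doubling seen for WRP and CPC in the question. DB1400 went from 13,200.26 to 39,600.78, a factor of three, which would be consistent with three matching rows.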

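What is the confusion?

You evidently have tables with more than one index. Each index row causes the table to appear more than once in the aggregation.

What you want:

My aim is to find all the tables which are greater than 100GB in Size and have indexType as 'P'

I would suggest moving the index comparison to the where clause: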

select t.TableName, t.DatabaseName,
       sum(t.CurrentPerm/(1024*1024*1024)) as Size_in_GB
from dbc.tablesize t
where exists (select 1
              from dbc.indices i
              where t.TableName = i.TableName and t.DatabaseName = i.DatabaseName and
                    i.indexType = 'P'
             )
group by 1,2
order by Size_in_GB desc

If you also want the size filter, you can add having Size_in_GB > 100 before the order by.
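
As a rough check of the EXISTS plus HAVING combination, here is a small sketch on toy data. It runs on SQLite through Python's sqlite3 module rather than on Teradata, the table names, columns, and sizes are hypothetical, and the aggregate is repeated in HAVING instead of using the alias so that it does not rely on dialect-specific alias handling.

```python
import sqlite3

# Toy stand-ins for dbc.tablesize and dbc.indices (hypothetical schema and values).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tablesize (DatabaseName TEXT, TableName TEXT, CurrentPerm REAL);
    CREATE TABLE indices (DatabaseName TEXT, TableName TEXT, IndexType TEXT, ColumnPosition INTEGER);
    INSERT INTO tablesize VALUES
        ('A', 'WRP', 80), ('A', 'WRP', 70),   -- 150 in total, has a 'P' index
        ('B', 'CPC', 120),                    -- large, but no 'P' index
        ('C', 'SDF', 40);                     -- has a 'P' index, but is too small
    INSERT INTO indices VALUES
        ('A', 'WRP', 'P', 1), ('A', 'WRP', 'P', 2),
        ('B', 'CPC', 'S', 1),
        ('C', 'SDF', 'P', 1);
""")

# EXISTS only tests whether a matching index row is present; it never
# multiplies the size rows, however many index rows match.
rows = conn.execute("""
    SELECT t.TableName, t.DatabaseName, SUM(t.CurrentPerm) AS Size_in_GB
    FROM tablesize t
    WHERE EXISTS (SELECT 1
                  FROM indices i
                  WHERE i.TableName = t.TableName
                    AND i.DatabaseName = t.DatabaseName
                    AND i.IndexType = 'P')
    GROUP BY t.TableName, t.DatabaseName
    HAVING SUM(t.CurrentPerm) > 100
    ORDER BY Size_in_GB DESC
""").fetchall()

print(rows)  # [('WRP', 'A', 150.0)]
```

WRP is reported once with its true total even though it has two 'P' rows in the toy index table.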

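dbc.IndicesV (never use the old, deprecated non-V views) has one row per index per column.

You can simply add a condition to restrict it to a single row: where IndexType = 'P' and ColumnPosition = 1

It is also more efficient to aggregate early, i.e. to aggregate before joining: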

select dt.*
from 
 (
   select TableName, DatabaseName,
      sum(CurrentPerm/(1024*1024*1024)) as Size_in_GB
   from dbc.TableSizeV
   group by 1,2
   having Size_in_GB > 100
 ) as dt
join dbc.IndicesV b 
  on dt.TableName = b.TableName
 and dt.DatabaseName = b.DatabaseName
where IndexType = 'P' 
  and ColumnPosition = 1
order by Size_in_GB desc;

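But why filter on IndexType = 'P' at all? Don't you care about other objects larger than 100GB (NoPI/columnar tables, join indexes)? By the way, this does not return all tables with a PI; only IndexNumber = 1 does.

Depending on your needs, you may be better off joining to dbc.TablesV.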
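
The aggregate-before-join pattern above can be tried out on toy data as well. This is a sketch only, run on SQLite via Python's sqlite3 module and not on Teradata; the tables, columns, and values are invented for illustration, and HAVING repeats the aggregate instead of using the alias.

```python
import sqlite3

# Toy stand-ins for dbc.TableSizeV and dbc.IndicesV (hypothetical schema and values).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tablesize (DatabaseName TEXT, TableName TEXT, CurrentPerm REAL);
    CREATE TABLE indices (DatabaseName TEXT, TableName TEXT, IndexType TEXT, ColumnPosition INTEGER);
    INSERT INTO tablesize VALUES
        ('A', 'WRP', 80), ('A', 'WRP', 70),   -- 150 in total
        ('B', 'CPC', 120);                    -- no 'P' index
    INSERT INTO indices VALUES
        ('A', 'WRP', 'P', 1), ('A', 'WRP', 'P', 2),   -- two-column 'P' index
        ('B', 'CPC', 'S', 1);
""")

# Step 1 (derived table dt): sum the sizes per table and keep the large ones.
# Step 2: join to the index rows, keeping only the first column of each 'P'
# index so that at most one index row matches per table.
rows = conn.execute("""
    SELECT dt.TableName, dt.DatabaseName, dt.Size_in_GB
    FROM (SELECT TableName, DatabaseName, SUM(CurrentPerm) AS Size_in_GB
          FROM tablesize
          GROUP BY TableName, DatabaseName
          HAVING SUM(CurrentPerm) > 100) AS dt
    JOIN indices b
      ON dt.TableName = b.TableName
     AND dt.DatabaseName = b.DatabaseName
    WHERE b.IndexType = 'P'
      AND b.ColumnPosition = 1
    ORDER BY dt.Size_in_GB DESC
""").fetchall()

print(rows)  # [('WRP', 'A', 150.0)]
```

Because the sum is computed before the join, any extra index rows could at worst repeat an output row; they can no longer inflate the size.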

P.S - My aim is to find all the tables which are greater than 100GB in Size and have indexType as 'P'

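If you only want to find tables for which a certain index exists, you should not join at all. Use EXISTS instead. This puts your condition where it belongs, in the WHERE or HAVING clause, and the condition can no longer duplicate records (in your case: when a table has multiple matching indexes).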

select tablename, databasename, sum(currentperm/(1024*1024*1024)) as size_in_gb 
from dbc.tablesize ts
group by tablename, databasename
having sum(currentperm/(1024*1024*1024)) > 100
and exists
(
  select *
  from dbc.indices i
  where i.tablename = ts.tablename and i.databasename = ts.databasename
  and i.indexType = 'P'
)
order by Size_in_GB desc;