TD wrongly estimates the number of rows for a grouped query despite collected statistics
This seems strange to me. TD overestimates the number of rows: the actual count is 42 million, while the estimate is 943 million.
The query is simple:
select ID, sum(amount)
from v_tb -- view
where REPORT_DATE between Date '2017-11-01' and Date '2017-11-30' -- report_date has date format
group by 1
The plan:
1) First, we lock tb in view v_tb for access.
2) Next, we do an all-AMPs SUM step to aggregate from 1230 partitions
of tb in view v_tb with a
condition of ("(tb.REPORT_DATE >= DATE '2017-11-01') AND
(tb.REPORT_DATE <= DATE '2017-11-30')")
, grouping by field1 ( ID). Aggregate
Intermediate Results are computed locally, then placed in Spool 1.
The input table will not be cached in memory, but it is eligible
for synchronized scanning. The size of Spool 1 is estimated with
low confidence to be 943,975,437 rows (27,375,287,673 bytes). The
estimated time for this step is 1 minute and 26 seconds.
3) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 1 minute and 26 seconds.
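One way to cross-check this (a sketch using Teradata's diagnostic mode; the view name is taken from the query above) is to ask the optimizer which statistics it would recommend for this request:

-- Append the optimizer's statistics recommendations to the EXPLAIN output
-- for the current session (run before the EXPLAIN).
DIAGNOSTIC HELPSTATS ON FOR SESSION;

EXPLAIN
SELECT ID, SUM(amount)
FROM v_tb
WHERE REPORT_DATE BETWEEN DATE '2017-11-01' AND DATE '2017-11-30'
GROUP BY 1;

-- Switch the diagnostic off again afterwards.
DIAGNOSTIC HELPSTATS NOT ON FOR SESSION;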
Statistics are collected on ID, report_date, and (ID, report_date), according to DBC.StatsV - all of them are up to date. No nulls - TRUE.
UniqueValueCount
for ID, report_date, and (ID, report_date) is 36 million, 839, and 1,232 million values respectively - which seems correct.
Why does TD overestimate the row count? Shouldn't the final estimate simply follow the UniqueValueCount of ID, since that is the column I group by?
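For reference, these figures come from the data dictionary roughly like this (a sketch; the database name and the physical table behind v_tb are placeholders):

-- Inspect the collected statistics for the table behind v_tb.
SELECT ColumnName,
       UniqueValueCount,
       SampleSizePct,
       LastCollectTimeStamp
FROM DBC.StatsV
WHERE DatabaseName = 'PRD2_BDS'
  AND TableName = 'tb'
ORDER BY ColumnName;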
UPD1:
-- estimates 32 mln rows
select ID, sum(amount)
from v_tb -- view
where REPORT_DATE between Date '2017-11-01' and Date '2017-11-01' -- report_date has date format
group by 1
-- estimates 89 mln rows
select ID, sum(amount)
from v_tb -- view
where REPORT_DATE between Date '2017-11-01' and Date '2017-11-02' -- report_date has date format
group by 1
So the problem is in the WHERE predicate.
SampleSizePct equals 5.01 - does this mean the sample size was only 5%? - Yes.
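If sampled statistics are the root cause, re-collecting them without sampling should bring the estimate back in line. A sketch, assuming TD 14.10+ syntax and that tb is the real table behind v_tb:

-- Re-collect full, unsampled statistics so the optimizer stops
-- extrapolating from a 5% sample.
COLLECT STATISTICS
    USING NO SAMPLE
    COLUMN (REPORT_DATE),
    COLUMN (ID),
    COLUMN (ID, REPORT_DATE)
ON tb;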
UPD2: the previous query is part of a bigger query, which looks like this:
select top 100000000
base.*
, case when CPE_MODEL_NEW.device_type in ('Smartphone', 'Phone', 'Tablet', 'USB modem') then CPE_MODEL_NEW.device_type
else 'other' end as device_type
, usg_mbou
, usg_arpu_content
, date '2017-11-30' as max_report_date
, macroregion_name
from (
select
a.SUBS_ID
, a.tac
, MSISDN
, BRANCH_ID
, max(bsegment) bsegment
, max((date '2017-11-30' - cast (activation_dttm as date))/30.4167) as LT_month
, Sum(REVENUE_COMMERCE) REVENUE_COMMERCE
, max(LAST_FLASH_DTTM) LAST_FLASH_DTTM
from PRD2_BDS_V2.SUBS_CLR_D a
where a.REPORT_DATE between Date '2017-11-01' and Date '2017-11-30'
group by 1,2,3,4 --, 8, 9
) base
left join CPE_MODEL_NEW on base.tac = CPE_MODEL_NEW.tac
left join
(
select SUBS_ID, sum(case when TRAFFIC_TYPE_ID = 4 /*DATA*/ then all_vol / (1024 * 1024) else 0 end) usg_mbou
,sum(case when COST_BAND_ID IN (3,46,49,56) then rated_amount else 0 end) usg_arpu_content
from PRD2_BDS_V2.SUBS_USG_D where SUBS_USG_D.REPORT_DATE between Date '2017-11-01' and Date '2017-11-30'
group by 1
) SUBS_USG_D
on SUBS_USG_D.SUBS_ID = base.SUBS_ID
LEFT JOIN PRD2_DIC_V.BRANCH AS BRANCH ON base.BRANCH_ID = BRANCH.BRANCH_ID
LEFT JOIN PRD2_DIC_V2.REGION AS REGION ON BRANCH.REGION_ID = REGION.REGION_ID
AND Date '2017-11-30' >= REGION.SDATE AND REGION.EDATE >= Date '2017-11-01'
LEFT JOIN PRD2_DIC_V2.MACROREGION AS MACROREGION ON REGION.MACROREGION_ID = MACROREGION.MACROREGION_ID
AND Date '2017-11-30' >= MACROREGION.SDATE AND Date '2017-11-01' <= MACROREGION.EDATE
The query fails with a spool space problem on almost the last step:
We do an All-AMPs STAT FUNCTION step from Spool 10 by way of an all-rows scan into Spool 29, which is redistributed by hash code to all AMPs. The result rows are put into Spool 9, which is redistributed by hash code to all AMPs.
There are no product joins and no erroneous duplication to all AMPs that could cause the spool problem. But there is another issue - very high skew:
Snapshot CPU skew: 99.7%
Snapshot I/O skew: 99.7%
Spool uses only 30 GB, although at the beginning of query execution it easily uses more than 300 GB.
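Per-AMP spool consumption can be watched while the query runs (a sketch; 'MY_USER' is a placeholder for the user executing the query). A single heavily loaded AMP here would confirm that spool is concentrated rather than exhausted overall:

-- Per-AMP spool usage for the executing user.
SELECT Vproc,
       CurrentSpool,
       PeakSpool
FROM DBC.DiskSpaceV
WHERE DatabaseName = 'MY_USER'
ORDER BY CurrentSpool DESC;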
The tables themselves are not skewed.
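For completeness, this is roughly how the base tables were checked for skew (a sketch over DBC.TableSizeV; database and table names taken from the query above):

-- Per-table skew: compare the largest AMP's share against the average.
SELECT TableName,
       MAX(CurrentPerm) AS MaxPermPerAMP,
       AVG(CurrentPerm) AS AvgPermPerAMP,
       (1 - AVG(CurrentPerm) / NULLIFZERO(MAX(CurrentPerm))) * 100 AS SkewPct
FROM DBC.TableSizeV
WHERE DatabaseName = 'PRD2_BDS'
  AND TableName IN ('SUBS_CLR_D', 'SUBS_USG_D')
GROUP BY TableName;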
Full explain:
1) First, we lock TELE2_UAT.CPE_MODEL_NEW for access, we lock
PRD2_DIC.REGION in view PRD2_DIC_V2.REGION for access, we lock
PRD2_DIC.MACROREGION in view PRD2_DIC_V2.MACROREGION for access,
we lock PRD2_DIC.BRANCH in view PRD2_DIC_V.BRANCH for access, we
lock PRD2_BDS.SUBS_CLR_D for access, and we lock
PRD2_BDS.SUBS_USG_D for access.
2) Next, we do an all-AMPs SUM step to aggregate from 1230 partitions
of PRD2_BDS.SUBS_CLR_D with a condition of (
"(PRD2_BDS.SUBS_CLR_D.REPORT_DATE >= DATE '2017-11-01') AND
(PRD2_BDS.SUBS_CLR_D.REPORT_DATE <= DATE '2017-11-30')"), and the
grouping identifier in field 1. Aggregate Intermediate Results
are computed locally,skipping sort when applicable, then placed in
Spool 4. The input table will not be cached in memory, but it is
eligible for synchronized scanning. The size of Spool 4 is
estimated with low confidence to be 1,496,102,647 rows (
285,755,605,577 bytes). The estimated time for this step is 1
minute and 55 seconds.
3) We execute the following steps in parallel.
1) We do an all-AMPs RETRIEVE step from Spool 4 (Last Use) by
way of an all-rows scan into Spool 2 (used to materialize
view, derived table, table function or table operator base)
(all_amps) (compressed columns allowed), which is built
locally on the AMPs with Field1 ("UniqueId"). The size of
Spool 2 is estimated with low confidence to be 1,496,102,647
rows (140,633,648,818 bytes). Spool AsgnList:
"Field_1" = "UniqueId",
"Field_2" = "SUBS_ID",
"Field_3" = "TAC",
"Field_4" = "MSISDN",
"Field_5" = "BRANCH_ID",
"Field_6" = "Field_6",
"Field_7" = "Field_7",
"Field_8" = "Field_8",
"Field_9" = "Field_9".
The estimated time for this step is 57.85 seconds.
2) We do an all-AMPs SUM step to aggregate from 1230 partitions
of PRD2_BDS.SUBS_USG_D with a condition of ("(NOT
(PRD2_BDS.SUBS_USG_D.SUBS_ID IS NULL )) AND
((PRD2_BDS.SUBS_USG_D.REPORT_DATE >= DATE '2017-11-01') AND
(PRD2_BDS.SUBS_USG_D.REPORT_DATE <= DATE '2017-11-30'))"),
and the grouping identifier in field 1. Aggregate
Intermediate Results are computed locally,skipping sort when
applicable, then placed in Spool 7. The input table will not
be cached in memory, but it is eligible for synchronized
scanning. The size of Spool 7 is estimated with low
confidence to be 943,975,437 rows (42,478,894,665 bytes).
The estimated time for this step is 1 minute and 29 seconds.
4) We execute the following steps in parallel.
1) We do an all-AMPs RETRIEVE step from Spool 7 (Last Use) by
way of an all-rows scan into Spool 1 (used to materialize
view, derived table, table function or table operator
SUBS_USG_D) (all_amps) (compressed columns allowed), which is
built locally on the AMPs with Field1 ("UniqueId"). The size
of Spool 1 is estimated with low confidence to be 943,975,437
rows (42,478,894,665 bytes). Spool AsgnList:
"Field_1" = "UniqueId",
"Field_2" = "SUBS_ID",
"Field_3" = "Field_3",
"Field_4" = "Field_4".
The estimated time for this step is 16.75 seconds.
2) We do an all-AMPs RETRIEVE step from Spool 2 (Last Use) by
way of an all-rows scan into Spool 11 (all_amps) (compressed
columns allowed), which is redistributed by hash code to all
AMPs to all AMPs with hash fields ("Spool_2.SUBS_ID"). Then
we do a SORT to order Spool 11 by row hash. The size of
Spool 11 is estimated with low confidence to be 1,496,102,647
rows (128,664,827,642 bytes). Spool AsgnList:
"SUBS_ID" = "Spool_2.SUBS_ID",
"TAC" = "TAC",
"MSISDN" = "MSISDN",
"BRANCH_ID" = "BRANCH_ID",
"BSEGMENT" = "BSEGMENT",
"LT_MONTH" = "LT_MONTH",
"REVENUE_COMMERCE" = "REVENUE_COMMERCE",
"LAST_FLASH_DTTM" = "LAST_FLASH_DTTM".
The estimated time for this step is 4 minutes and 8 seconds.
5) We execute the following steps in parallel.
1) We do an all-AMPs RETRIEVE step from Spool 1 (Last Use) by
way of an all-rows scan into Spool 12 (all_amps) (compressed
columns allowed), which is redistributed by hash code to all
AMPs to all AMPs with hash fields ("Spool_1.SUBS_ID"). Then
we do a SORT to order Spool 12 by row hash. The size of
Spool 12 is estimated with low confidence to be 943,975,437
rows (34,927,091,169 bytes). Spool AsgnList:
"SUBS_ID" = "Spool_1.SUBS_ID",
"USG_MBOU" = "USG_MBOU",
"USG_ARPU_CONTENT" = "USG_ARPU_CONTENT".
The estimated time for this step is 1 minute and 5 seconds.
2) We do an all-AMPs RETRIEVE step from PRD2_DIC.BRANCH in view
PRD2_DIC_V.BRANCH by way of an all-rows scan with a condition
of ("NOT (PRD2_DIC.BRANCH in view PRD2_DIC_V.BRANCH.BRANCH_ID
IS NULL)") into Spool 13 (all_amps) (compressed columns
allowed), which is redistributed by hash code to all AMPs to
all AMPs with hash fields ("PRD2_DIC.BRANCH.REGION_ID").
Then we do a SORT to order Spool 13 by row hash. The size of
Spool 13 is estimated with high confidence to be 107 rows (
1,712 bytes). Spool AsgnList:
"BRANCH_ID" = "BRANCH_ID",
"REGION_ID" = "PRD2_DIC.BRANCH.REGION_ID".
The estimated time for this step is 0.02 seconds.
6) We execute the following steps in parallel.
1) We do an all-AMPs JOIN step (No Sum) from PRD2_DIC.REGION in
view PRD2_DIC_V2.REGION by way of a RowHash match scan with a
condition of ("(PRD2_DIC.REGION in view
PRD2_DIC_V2.REGION.EDATE >= DATE '2017-11-01') AND
(PRD2_DIC.REGION in view PRD2_DIC_V2.REGION.SDATE <= DATE
'2017-11-30')"), which is joined to Spool 13 (Last Use) by
way of a RowHash match scan. PRD2_DIC.REGION and Spool 13
are right outer joined using a merge join, with condition(s)
used for non-matching on right table ("NOT
(Spool_13.REGION_ID IS NULL)"), with a join condition of (
"Spool_13.REGION_ID = PRD2_DIC.REGION.ID"). The result goes
into Spool 14 (all_amps) (compressed columns allowed), which
is redistributed by hash code to all AMPs to all AMPs with
hash fields ("PRD2_DIC.REGION.MACROREGION_CODE"). Then we do
a SORT to order Spool 14 by row hash. The size of Spool 14
is estimated with low confidence to be 107 rows (2,461 bytes).
Spool AsgnList:
"MACROREGION_CODE" = "PRD2_DIC.REGION.MACROREGION_CODE",
"BRANCH_ID" = "{RightTable}.BRANCH_ID".
The estimated time for this step is 0.03 seconds.
2) We do an all-AMPs RETRIEVE step from TELE2_UAT.CPE_MODEL_NEW
by way of an all-rows scan with no residual conditions into
Spool 17 (all_amps) (compressed columns allowed), which is
duplicated on all AMPs with hash fields (
"TELE2_UAT.CPE_MODEL_NEW.TAC"). Then we do a SORT to order
Spool 17 by row hash. The size of Spool 17 is estimated with
high confidence to be 49,024,320 rows (2,696,337,600 bytes).
Spool AsgnList:
"TAC" = "TELE2_UAT.CPE_MODEL_NEW.TAC",
"DEVICE_TYPE" = "DEVICE_TYPE".
The estimated time for this step is 2.81 seconds.
3) We do an all-AMPs JOIN step (No Sum) from Spool 11 (Last Use)
by way of a RowHash match scan, which is joined to Spool 12
(Last Use) by way of a RowHash match scan. Spool 11 and
Spool 12 are left outer joined using a merge join, with
condition(s) used for non-matching on left table ("NOT
(Spool_11.SUBS_ID IS NULL)"), with a join condition of (
"Spool_12.SUBS_ID = Spool_11.SUBS_ID"). The result goes into
Spool 18 (all_amps) (compressed columns allowed), which is
built locally on the AMPs with hash fields ("Spool_11.TAC").
Then we do a SORT to order Spool 18 by row hash. The size of
Spool 18 is estimated with low confidence to be 1,496,102,648
rows (152,602,470,096 bytes). Spool AsgnList:
"BRANCH_ID" = "{LeftTable}.BRANCH_ID",
"TAC" = "Spool_11.TAC",
"SUBS_ID" = "{LeftTable}.SUBS_ID",
"MSISDN" = "{LeftTable}.MSISDN",
"BSEGMENT" = "{LeftTable}.BSEGMENT",
"LT_MONTH" = "{LeftTable}.LT_MONTH",
"REVENUE_COMMERCE" = "{LeftTable}.REVENUE_COMMERCE",
"LAST_FLASH_DTTM" = "{LeftTable}.LAST_FLASH_DTTM",
"USG_MBOU" = "{RightTable}.USG_MBOU",
"USG_ARPU_CONTENT" = "{RightTable}.USG_ARPU_CONTENT".
The estimated time for this step is 3 minutes and 45 seconds.
7) We execute the following steps in parallel.
1) We do an all-AMPs JOIN step (No Sum) from
PRD2_DIC.MACROREGION in view PRD2_DIC_V2.MACROREGION by way
of a RowHash match scan with a condition of (
"(PRD2_DIC.MACROREGION in view PRD2_DIC_V2.MACROREGION.EDATE
>= DATE '2017-11-01') AND (PRD2_DIC.MACROREGION in view
PRD2_DIC_V2.MACROREGION.SDATE <= DATE '2017-11-30')"), which
is joined to Spool 14 (Last Use) by way of a RowHash match
scan. PRD2_DIC.MACROREGION and Spool 14 are right outer
joined using a merge join, with condition(s) used for
non-matching on right table ("NOT (Spool_14.MACROREGION_CODE
IS NULL)"), with a join condition of (
"Spool_14.MACROREGION_CODE = PRD2_DIC.MACROREGION.MR_CODE").
The result goes into Spool 19 (all_amps) (compressed columns
allowed), which is duplicated on all AMPs with hash fields (
"Spool_14.BRANCH_ID"). The size of Spool 19 is estimated
with low confidence to be 34,240 rows (1,712,000 bytes).
Spool AsgnList:
"BRANCH_ID" = "Spool_14.BRANCH_ID",
"MR_NAME" = "{LeftTable}.MR_NAME".
The estimated time for this step is 0.04 seconds.
2) We do an all-AMPs JOIN step (No Sum) from Spool 17 (Last Use)
by way of a RowHash match scan, which is joined to Spool 18
(Last Use) by way of a RowHash match scan. Spool 17 and
Spool 18 are right outer joined using a merge join, with
condition(s) used for non-matching on right table ("NOT
(Spool_18.TAC IS NULL)"), with a join condition of (
"Spool_18.TAC = Spool_17.TAC"). The result goes into Spool
22 (all_amps) (compressed columns allowed), which is built
locally on the AMPs with hash fields ("Spool_18.BRANCH_ID").
The size of Spool 22 is estimated with low confidence to be
1,496,102,648 rows (204,966,062,776 bytes). Spool AsgnList:
"BRANCH_ID" = "Spool_18.BRANCH_ID",
"SUBS_ID" = "{RightTable}.SUBS_ID",
"TAC" = "{RightTable}.TAC",
"MSISDN" = "{RightTable}.MSISDN",
"BSEGMENT" = "{RightTable}.BSEGMENT",
"LT_MONTH" = "{RightTable}.LT_MONTH",
"REVENUE_COMMERCE" = "{RightTable}.REVENUE_COMMERCE",
"LAST_FLASH_DTTM" = "{RightTable}.LAST_FLASH_DTTM",
"DEVICE_TYPE" = "{LeftTable}.DEVICE_TYPE",
"USG_MBOU" = "{RightTable}.USG_MBOU",
"USG_ARPU_CONTENT" = "{RightTable}.USG_ARPU_CONTENT".
The estimated time for this step is 1 minute and 23 seconds.
8) We do an all-AMPs JOIN step (No Sum) from Spool 19 (Last Use) by
way of an all-rows scan, which is joined to Spool 22 (Last Use) by
way of an all-rows scan. Spool 19 is used as the hash table and
Spool 22 is used as the probe table in a right outer joined using
a single partition classical hash join, with condition(s) used for
non-matching on right table ("NOT (Spool_22.BRANCH_ID IS NULL)"),
with a join condition of ("Spool_22.BRANCH_ID = Spool_19.BRANCH_ID").
The result goes into Spool 10 (all_amps) (compressed columns
allowed), which is built locally on the AMPs with Field1 ("28364").
The size of Spool 10 is estimated with low confidence to be
1,496,102,648 rows (260,321,860,752 bytes). Spool AsgnList:
"Field_1" = "28364",
"Spool_10.SUBS_ID" = "{ Copy }{RightTable}.SUBS_ID",
"Spool_10.TAC" = "{ Copy }{RightTable}.TAC",
"Spool_10.MSISDN" = "{ Copy }{RightTable}.MSISDN",
"Spool_10.BRANCH_ID" = "{ Copy }{RightTable}.BRANCH_ID",
"Spool_10.BSEGMENT" = "{ Copy }{RightTable}.BSEGMENT",
"Spool_10.LT_MONTH" = "{ Copy }{RightTable}.LT_MONTH",
"Spool_10.REVENUE_COMMERCE" = "{ Copy
}{RightTable}.REVENUE_COMMERCE",
"Spool_10.LAST_FLASH_DTTM" = "{ Copy }{RightTable}.LAST_FLASH_DTTM",
"Spool_10.DEVICE_TYPE" = "{ Copy }{RightTable}.DEVICE_TYPE",
"Spool_10.USG_MBOU" = "{ Copy }{RightTable}.USG_MBOU",
"Spool_10.USG_ARPU_CONTENT" = "{ Copy
}{RightTable}.USG_ARPU_CONTENT",
"Spool_10.MR_NAME" = "{ Copy }{LeftTable}.MR_NAME".
The estimated time for this step is 1 minute and 45 seconds.
9) We do an all-AMPs STAT FUNCTION step from Spool 10 by way of an
all-rows scan into Spool 29, which is redistributed by hash code
to all AMPs. The result rows are put into Spool 9 (group_amps),
which is built locally on the AMPs with Field1 ("Field_1"). This
step is used to retrieve the TOP 100000000 rows. Load
distribution optimization is used. If this step retrieves less
than 100000000 rows, then execute step 10. The size is estimated
with low confidence to be 100,000,000 rows (25,000,000,000 bytes).
10) We do an all-AMPs STAT FUNCTION step from Spool 10 (Last Use) by
way of an all-rows scan into Spool 29 (Last Use), which is
redistributed by hash code to all AMPs. The result rows are put
into Spool 9 (group_amps), which is built locally on the AMPs with
Field1 ("Field_1"). This step is used to retrieve the TOP
100000000 rows. The size is estimated with low confidence to be
100,000,000 rows (25,000,000,000 bytes).
11) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 9 are sent back to the user as the result of
statement 1.
What can I do here?
Most databases show wrong estimates, and that is fine as long as the relationships between those estimates are good enough to produce a suitable execution plan.
Now, if you think the execution plan is wrong, then you should seriously care about those estimates. Have you updated the table statistics recently?
Otherwise I wouldn't worry about it too much.