SQL: Bucket calculated results

Question

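I'm pulling data from a table of transaction data and would like the results bucketed by average transaction size per account, with columns showing the number of accounts, the number of transactions, the sum of transaction amounts, and the average transaction size. Essentially something like this: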

**raw data**                    
date        acct_nr    trans_am         
1/3/2017    1234       400          
1/20/2017   1234       700          
1/22/2017   1234       1100
1/22/2017   2345       300
1/23/2017   2345       800
1/24/2017   3456       1500
1/25/2017   4567       250
1/25/2017   4567       300
1/26/2017   4567       350

**current results**                 
month   tier            acct_ct trans_ct    trans_am    trans_avg
201701  a. >=250 <500   3       5           1600        320
201701  b. >=500 <1000  2       2           1500        750
201701  c. >=1000 <1500 2       2           2600        1300

**expected results**                    
month   tier            acct_ct trans_ct    trans_am    trans_avg (this column should be the key for bucketing, per account)
201701  a. >=250 <500   1       3           900         300
201701  b. >=500 <1000  2       5           3300        660
201701  c. >=1000 <1500 1       1           1500        1500

This is the script I'm currently using; it gives me the current results rather than the expected results:

select
  cldr.year_month
  ,case
    when tran.tran_am >= 0 and tran.tran_am < 100 then 'a. >=0 <100'
    when tran.tran_am >= 100 and tran.tran_am < 250 then 'b. >=100 <250'
    when tran.tran_am >= 250 and tran.tran_am < 500 then 'c. >=250 <500'
    when tran.tran_am >= 500 and tran.tran_am < 1000 then 'd. >=500 <1000'
    when tran.tran_am >= 1000 and tran.tran_am < 1500 then 'e. >=1000 <1500'
    when tran.tran_am >= 1500 and tran.tran_am < 2000 then 'f. >=1500 <2000'
    when tran.tran_am >= 2000 and tran.tran_am < 2500 then 'g. >=2000 <2500'
    when tran.tran_am >= 2500 and tran.tran_am < 5000 then 'h. >=2500 <5000'
    when tran.tran_am >= 5000 and tran.tran_am < 10000 then 'i. >=5000 <10000'
    when tran.tran_am >= 10000 then 'j. >=10000'
    else 'z. other'
    end as trans_am_tier
  ,count(distinct tran.acct_id) as acct_ct
  ,sum(tran.tran_am) as trans_am
  ,count(tran.tran_id) as trans_ct
  ,(trans_am / trans_ct) as trans_avg

  from reports.tran as tran

  inner join reports.date as cldr on cldr.calendar_date=tran.tran_eff_dt
  inner join reports.acct as acct on tran.acct_id=acct.acct_id

  where tran.ext_tran_cd in ('ACHDD','ACHID','ACHRDD')
  and tran.tran_eff_dt between '2017-01-01' and '2017-04-30'
  and tran.prod_type = '4400'
  and acct.acct_stat <> 4
  and acct.dp_cust_nbr NOT IN (1007,1101)

  group by 1,2
  order by 1,2

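I know this has to do with the fact that I'm bucketing on tran.tran_am instead of trans_avg. Would this be accomplished with a subquery, essentially calculating trans_avg first and then bucketing? I'm not sure how I would go about that.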

Basically, the result should be "for every account number, count # of transactions and average the transaction amount for those transactions. Then, based on that averaged transaction amount, place that account number with associated transaction count and average transaction size into one of the defined buckets, and then sum the total number of accounts per bucket". So the results should be grouped by account and transaction tier, and the bucketing should be determined by trans_avg.

By the way, I'm an analyst with read-only access to the DBMS. I can't create temp tables or anything like that.

Edited to add raw data, current results, and expected results to clarify what I'm trying to achieve.

Answer 1

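You're right: the approach you want is to aggregate the data first, and then assign the aggregated records to tiers based on trans_avg rather than tran_am. You can do this with a subquery, like so: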

-- Create sample data.
create table [tran]
(
    tran_id bigint,
    acct_id bigint,
    tran_am bigint,
    tran_eff_dt date
);
insert [tran] values
    (1, 1234, 400, '20170103'),
    (2, 1234, 700, '20170120'),
    (3, 1234, 1100, '20170122');

create table calendar
(
    calendar_date date,
    year_month char(6)
);
insert calendar values
    ('20170103', '201701'),
    ('20170120', '201701'),
    ('20170122', '201701');

-- Aggregate transactions first, then assign to a tier.
select
    TransactionsByMonth.year_month,
    case
        when TransactionsByMonth.trans_avg >= 0 and TransactionsByMonth.trans_avg < 100 then 'a. >=0 <100'
        when TransactionsByMonth.trans_avg >= 100 and TransactionsByMonth.trans_avg < 250 then 'b. >=100 <250'
        when TransactionsByMonth.trans_avg >= 250 and TransactionsByMonth.trans_avg < 500 then 'c. >=250 <500'
        when TransactionsByMonth.trans_avg >= 500 and TransactionsByMonth.trans_avg < 1000 then 'd. >=500 <1000'
        when TransactionsByMonth.trans_avg >= 1000 and TransactionsByMonth.trans_avg < 1500 then 'e. >=1000 <1500'
        when TransactionsByMonth.trans_avg >= 1500 and TransactionsByMonth.trans_avg < 2000 then 'f. >=1500 <2000'
        when TransactionsByMonth.trans_avg >= 2000 and TransactionsByMonth.trans_avg < 2500 then 'g. >=2000 <2500'
        when TransactionsByMonth.trans_avg >= 2500 and TransactionsByMonth.trans_avg < 5000 then 'h. >=2500 <5000'
        when TransactionsByMonth.trans_avg >= 5000 and TransactionsByMonth.trans_avg < 10000 then 'i. >=5000 <10000'
        when TransactionsByMonth.trans_avg >= 10000 then 'j. >=10000'
        else 'z. other'
    end as trans_am_tier,
    TransactionsByMonth.acct_ct,
    TransactionsByMonth.trans_am,
    TransactionsByMonth.trans_ct,
    TransactionsByMonth.trans_avg
from
    (
        select
            calendar.year_month,
            count(distinct [tran].acct_id) as acct_ct,
            sum([tran].tran_am) as trans_am,
            count([tran].tran_id) as trans_ct,
            sum([tran].tran_am) / count([tran].tran_id) as trans_avg
        from
            [tran]
            inner join calendar on [tran].tran_eff_dt = calendar.calendar_date
        group by
            calendar.year_month
    ) TransactionsByMonth;

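Note that I omitted some of the joins and WHERE clause expressions from your original query, just to simplify the task of recreating the data set. I also changed the definition of the trans_avg column, because my DBMS doesn't allow me to define one element of the SELECT list in terms of an alias defined earlier in the same list. (I don't have Teradata.)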
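One detail worth checking in the trans_avg expression: the sample table declares tran_am as bigint, and in DBMSs that use integer division for integer operands (SQL Server and PostgreSQL, for example) sum / count then truncates the average. The sketch below shows the difference on a hypothetical table tran_demo; the table name and the decimal(18,2) cast are illustrative assumptions, not part of the original query.

```sql
-- Hypothetical demo table: three transactions for one account, sum 2200.
create table tran_demo
(
    tran_id bigint,
    tran_am bigint
);
insert into tran_demo values (1, 400);
insert into tran_demo values (2, 700);
insert into tran_demo values (3, 1100);

select
    -- Integer operands: truncates to 733 where the DBMS does integer division.
    sum(tran_am) / count(tran_id) as trans_avg_truncated,
    -- Casting one operand to decimal first keeps the fraction
    -- (about 733.33; the number of decimal places varies by DBMS).
    cast(sum(tran_am) as decimal(18,2)) / count(tran_id) as trans_avg_exact
from
    tran_demo;
```

Truncation only changes the tier for averages just below a boundary (999.5 stays in the >=500 <1000 tier either way), but it does change the reported trans_avg. If tran_am is already a decimal column in the real table, the cast is unnecessary.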

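Another option is to use a common table expression, or CTE. Although there are things you can do with a CTE that you can't do with a subquery (such as writing recursive queries), in this case it really is just a matter of personal preference. I prefer CTEs because I find them easier to read, especially when you need several of them; multiple nested subqueries get messy quickly. The CTE approach looks like this: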

with TransactionsByMonth as
(
    select
        calendar.year_month,
        count(distinct [tran].acct_id) as acct_ct,
        sum([tran].tran_am) as trans_am,
        count([tran].tran_id) as trans_ct,
        sum([tran].tran_am) / count([tran].tran_id) as trans_avg
    from
        [tran]
        inner join calendar on [tran].tran_eff_dt = calendar.calendar_date
    group by
        calendar.year_month
)
select
    TransactionsByMonth.year_month,
    case
        when TransactionsByMonth.trans_avg >= 0 and TransactionsByMonth.trans_avg < 100 then 'a. >=0 <100'
        when TransactionsByMonth.trans_avg >= 100 and TransactionsByMonth.trans_avg < 250 then 'b. >=100 <250'
        when TransactionsByMonth.trans_avg >= 250 and TransactionsByMonth.trans_avg < 500 then 'c. >=250 <500'
        when TransactionsByMonth.trans_avg >= 500 and TransactionsByMonth.trans_avg < 1000 then 'd. >=500 <1000'
        when TransactionsByMonth.trans_avg >= 1000 and TransactionsByMonth.trans_avg < 1500 then 'e. >=1000 <1500'
        when TransactionsByMonth.trans_avg >= 1500 and TransactionsByMonth.trans_avg < 2000 then 'f. >=1500 <2000'
        when TransactionsByMonth.trans_avg >= 2000 and TransactionsByMonth.trans_avg < 2500 then 'g. >=2000 <2500'
        when TransactionsByMonth.trans_avg >= 2500 and TransactionsByMonth.trans_avg < 5000 then 'h. >=2500 <5000'
        when TransactionsByMonth.trans_avg >= 5000 and TransactionsByMonth.trans_avg < 10000 then 'i. >=5000 <10000'
        when TransactionsByMonth.trans_avg >= 10000 then 'j. >=10000'
        else 'z. other'
    end as trans_am_tier,
    TransactionsByMonth.acct_ct,
    TransactionsByMonth.trans_am,
    TransactionsByMonth.trans_ct,
    TransactionsByMonth.trans_avg
from
    TransactionsByMonth;

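As I mentioned, I don't have Teradata installed, but I believe everything in the queries is standard SQL (apart from the square brackets around [tran], which are SQL Server identifier quoting), so hopefully it will work for you, or at least point you in the right direction.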
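Both queries above aggregate per month only, while the question's expected results need one average per account before the tiers are assigned. The following is a minimal sketch of how the CTE approach might be extended with chained CTEs. It assumes a hypothetical table sample_tran that holds the question's raw data with year_month stored directly (the real query would get it from the calendar join), uses generic SQL that has not been tested on Teradata, and lists only the tiers the sample data touches.

```sql
-- Hypothetical sample table with the question's nine raw rows.
create table sample_tran
(
    tran_id    integer,
    acct_id    integer,
    tran_am    decimal(12,2),
    year_month char(6)
);
insert into sample_tran values (1, 1234,  400, '201701');
insert into sample_tran values (2, 1234,  700, '201701');
insert into sample_tran values (3, 1234, 1100, '201701');
insert into sample_tran values (4, 2345,  300, '201701');
insert into sample_tran values (5, 2345,  800, '201701');
insert into sample_tran values (6, 3456, 1500, '201701');
insert into sample_tran values (7, 4567,  250, '201701');
insert into sample_tran values (8, 4567,  300, '201701');
insert into sample_tran values (9, 4567,  350, '201701');

with AccountsByMonth as
(
    -- Step 1: one row per account per month.
    select
        year_month,
        acct_id,
        count(tran_id) as trans_ct,
        sum(tran_am)   as trans_am,
        avg(tran_am)   as trans_avg
    from sample_tran
    group by year_month, acct_id
),
TieredAccounts as
(
    -- Step 2: bucket each account on its own average.
    select
        year_month,
        case
            when trans_avg >= 250  and trans_avg < 500  then 'c. >=250 <500'
            when trans_avg >= 500  and trans_avg < 1000 then 'd. >=500 <1000'
            when trans_avg >= 1000 and trans_avg < 1500 then 'e. >=1000 <1500'
            when trans_avg >= 1500 and trans_avg < 2000 then 'f. >=1500 <2000'
            else 'z. other' -- remaining tiers omitted for brevity
        end as trans_am_tier,
        trans_ct,
        trans_am
    from AccountsByMonth
)
-- Step 3: summarise the accounts in each tier.
select
    year_month,
    trans_am_tier,
    count(*)      as acct_ct,
    sum(trans_ct) as trans_ct,
    sum(trans_am) as trans_am,
    sum(trans_am) / sum(trans_ct) as trans_avg
from TieredAccounts
group by year_month, trans_am_tier
order by year_month, trans_am_tier;
```

On the sample data this should put one account in the >=250 <500 tier (3 transactions, 900 total, average 300) and two accounts in the >=500 <1000 tier (5 transactions, 3300 total, average 660). The remaining account has an average of exactly 1500, which the half-open bounds place in >=1500 <2000 rather than in the >=1000 <1500 tier shown in the question's expected results.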

Answer 2

Based on your narrative, you need to calculate the average per account first (using a Derived Table or CTE) and then count the rows per tier:

select
/*Then, based on that averaged transaction amount, place that account number with associated transaction count and average transaction size into one of the defined buckets, and then sum the total number of accounts per bucket*/
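  dt.year_month -- the derived table's alias is dt; cldr is not visible out here
  ,case -- WHENs are tested in order, so only the upper limit needs checking
    when dt.trans_avg < 0 then 'z. other' -- catch negatives before the upper-limit tests
    when dt.trans_avg < 100 then 'a. >=0 <100'
    when dt.trans_avg < 250 then 'b. >=100 <250'
    when dt.trans_avg < 500 then 'c. >=250 <500'
    when dt.trans_avg < 1000 then 'd. >=500 <1000'
    when dt.trans_avg < 1500 then 'e. >=1000 <1500'
    when dt.trans_avg < 2000 then 'f. >=1500 <2000'
    when dt.trans_avg < 2500 then 'g. >=2000 <2500'
    when dt.trans_avg < 5000 then 'h. >=2500 <5000'
    when dt.trans_avg < 10000 then 'i. >=5000 <10000'
    when dt.trans_avg >= 10000 then 'j. >=10000'
    else 'z. other' -- only reached when trans_avg is NULL
    end as trans_am_tier
   ,count(*) as acct_ct -- dt has one row per account per month
   ,sum(dt.trans_ct) as trans_ct
   ,sum(dt.trans_am) as trans_am
   ,sum(dt.trans_am) / sum(dt.trans_ct) as trans_avg -- tier-level average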
from
 (
    select
    /*for every account number, count # of transactions and average the transaction amount for those transactions
    */
       cldr.year_month
      ,acct.acct_id
      ,sum(tran.tran_am) as trans_am
      ,count(tran.tran_id) as trans_ct
      ,(trans_am / trans_ct) as trans_avg -- reuses the aliases above (Teradata allows this); avg(tran.tran_am) would also work
    from reports.tran as tran

      inner join reports.date as cldr on cldr.calendar_date=tran.tran_eff_dt
      inner join reports.acct as acct on tran.acct_id=acct.acct_id

    where tran.ext_tran_cd in ('ACHDD','ACHID','ACHRDD')
      and tran.tran_eff_dt between '2017-01-01' and '2017-04-30'
      and tran.prod_type = '4400'
      and acct.acct_stat <> 4
      and acct.dp_cust_nbr NOT IN (1007,1101)

    group by 1,2
 ) as dt
group by 1,2
order by 1,2
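Because a CASE expression returns the result of the first WHEN that matches, dropping the lower limits is safe only if values below the first tier are caught before the upper-limit-only tests. The sketch below compares the two orderings on a hypothetical table avg_demo (the table and its values are illustrative assumptions) with the tier list shortened to two tiers.

```sql
-- Hypothetical demo values, including a negative average and a NULL.
create table avg_demo
(
    trans_avg decimal(12,2)
);
insert into avg_demo values (-50);
insert into avg_demo values (0);
insert into avg_demo values (99.99);
insert into avg_demo values (100);
insert into avg_demo values (null);

select
    trans_avg,
    -- Lower limit kept on the first tier only: -50 fails the first test
    -- and then matches "< 250", so it lands in tier b.
    case
        when trans_avg >= 0 and trans_avg < 100 then 'a. >=0 <100'
        when trans_avg < 250 then 'b. >=100 <250'
        else 'z. other'
    end as tier_unguarded,
    -- Negatives handled first: every later WHEN can test the upper limit alone.
    case
        when trans_avg < 0 then 'z. other'
        when trans_avg < 100 then 'a. >=0 <100'
        when trans_avg < 250 then 'b. >=100 <250'
        else 'z. other'
    end as tier_guarded
from
    avg_demo;
```

The two columns differ only for -50 ('b. >=100 <250' versus 'z. other'). The NULL row reaches ELSE in both, because every comparison against NULL is unknown.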

sql

aggregate-functions

teradata