SQL: Bucketing calculated results
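I am pulling data from a table that contains transaction data, and I would like results that bucket the data by average transaction size per account, with columns showing the number of accounts, the number of transactions, the total transaction amount, and the average transaction size. Essentially something like this: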
**raw data**
date       acct_nr  trans_am
1/3/2017   1234     400
1/20/2017  1234     700
1/22/2017  1234     1100
1/22/2017  2345     300
1/23/2017  2345     800
1/24/2017  3456     1500
1/25/2017  4567     250
1/25/2017  4567     300
1/26/2017  4567     350
**current results**
month   tier             acct_ct  trans_ct  trans_am  trans_avg
201701  a. >=250 <500    3        5         1600      320
201701  b. >=500 <1000   2        2         1500      750
201701  c. >=1000 <1500  2        2         2600      1300
**expected results**
month   tier             acct_ct  trans_ct  trans_am  trans_avg (this column should be the key for bucketing, per account)
201701  a. >=250 <500    1        3         900       300
201701  b. >=500 <1000   2        5         3300      660
201701  c. >=1000 <1500  1        1         1500      1500
This is the script I am currently using. It gives me the current results rather than the expected results:
select
cldr.year_month
,case
when tran.tran_am >= 0 and tran.tran_am < 100 then 'a. >=0 <100'
when tran.tran_am >= 100 and tran.tran_am < 250 then 'b. >=100 <250'
when tran.tran_am >= 250 and tran.tran_am < 500 then 'c. >=250 <500'
when tran.tran_am >= 500 and tran.tran_am < 1000 then 'd. >=500 <1000'
when tran.tran_am >= 1000 and tran.tran_am < 1500 then 'e. >=1000 <1500'
when tran.tran_am >= 1500 and tran.tran_am < 2000 then 'f. >=1500 <2000'
when tran.tran_am >= 2000 and tran.tran_am < 2500 then 'g. >=2000 <2500'
when tran.tran_am >= 2500 and tran.tran_am < 5000 then 'h. >=2500 <5000'
when tran.tran_am >= 5000 and tran.tran_am < 10000 then 'i. >=5000 <10000'
when tran.tran_am >= 10000 then 'j. >=10000'
else 'z. other'
end as trans_am_tier
,count(distinct tran.acct_id) as acct_ct
,sum(tran.tran_am) as trans_am
,count(tran.tran_id) as trans_ct
,(trans_am / trans_ct) as trans_avg
from reports.tran as tran
inner join reports.date as cldr on cldr.calendar_date=tran.tran_eff_dt
inner join reports.acct as acct on tran.acct_id=acct.acct_id
where tran.ext_tran_cd in ('ACHDD','ACHID','ACHRDD')
and tran.tran_eff_dt between '2017-01-01' and '2017-04-30'
and tran.prod_type = '4400'
and acct.acct_stat <> 4
and acct.dp_cust_nbr NOT IN (1007,1101)
group by 1,2
order by 1,2
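I know this happens because I am bucketing on tran.tran_am rather than on trans_avg. Could this be done with a subquery, essentially calculating trans_avg first and then bucketing? I am not sure how I would do that.
Basically, the result should be "for every account number, count # of transactions and average the transaction amount for those transactions. Then, based on that averaged transaction amount, place that account number with associated transaction count and average transaction size into one of the defined buckets, and then sum the total number of accounts per bucket". So the results should be grouped by account and transaction tier, and the bucketing should be driven by trans_avg.
By the way, I am an analyst and only have read access to the DBMS. I cannot create temporary tables or anything like that.
Edited to add raw data, current results, and expected results to clarify what I am trying to achieve.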
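You are right: the approach you want is to aggregate the data first and then assign the aggregated records to tiers based on trans_avg rather than tran_am. You can indeed do this with a subquery, like so: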
-- Create sample data.
create table [tran]
(
tran_id bigint,
acct_id bigint,
tran_am bigint,
tran_eff_dt date
);
insert [tran] values
(1, 1234, 400, '20170103'),
(2, 1234, 700, '20170120'),
(3, 1234, 1100, '20170122');
create table calendar
(
calendar_date date,
year_month char(6)
);
insert calendar values
('20170103', '201701'),
('20170120', '201701'),
('20170122', '201701');
-- Aggregate transactions first, then assign to a tier.
select
TransactionsByMonth.year_month,
case
when TransactionsByMonth.trans_avg >= 0 and TransactionsByMonth.trans_avg < 100 then 'a. >=0 <100'
when TransactionsByMonth.trans_avg >= 100 and TransactionsByMonth.trans_avg < 250 then 'b. >=100 <250'
when TransactionsByMonth.trans_avg >= 250 and TransactionsByMonth.trans_avg < 500 then 'c. >=250 <500'
when TransactionsByMonth.trans_avg >= 500 and TransactionsByMonth.trans_avg < 1000 then 'd. >=500 <1000'
when TransactionsByMonth.trans_avg >= 1000 and TransactionsByMonth.trans_avg < 1500 then 'e. >=1000 <1500'
when TransactionsByMonth.trans_avg >= 1500 and TransactionsByMonth.trans_avg < 2000 then 'f. >=1500 <2000'
when TransactionsByMonth.trans_avg >= 2000 and TransactionsByMonth.trans_avg < 2500 then 'g. >=2000 <2500'
when TransactionsByMonth.trans_avg >= 2500 and TransactionsByMonth.trans_avg < 5000 then 'h. >=2500 <5000'
when TransactionsByMonth.trans_avg >= 5000 and TransactionsByMonth.trans_avg < 10000 then 'i. >=5000 <10000'
when TransactionsByMonth.trans_avg >= 10000 then 'j. >=10000'
else 'z. other'
end as trans_am_tier,
TransactionsByMonth.acct_ct,
TransactionsByMonth.trans_am,
TransactionsByMonth.trans_ct,
TransactionsByMonth.trans_avg
from
(
select
calendar.year_month,
count(distinct [tran].acct_id) as acct_ct,
sum([tran].tran_am) as trans_am,
count([tran].tran_id) as trans_ct,
sum([tran].tran_am) / count([tran].tran_id) as trans_avg
from
[tran]
inner join calendar on [tran].tran_eff_dt = calendar.calendar_date
group by
calendar.year_month
) TransactionsByMonth;
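Note that I omitted some of the joins and WHERE clause expressions from your original query, just to simplify the task of recreating the data set. I also changed the definition of the trans_avg column, because my DBMS does not allow an element of the SELECT list to be defined in terms of an alias defined earlier in the same list. (I don't have Teradata.)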
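One thing to note about the query above: its subquery groups by month only, so the tier is assigned to the monthly average across all accounts. For the per-account bucketing the question describes, the inner query would presumably also group by acct_id, and an outer query would then total the accounts per tier. A minimal sketch against the same sample tables, in the same SQL Server style syntax (the names ByAccount and Tiered are made up here for illustration):

```sql
-- Sketch only: aggregate per account, assign a tier, then total per tier.
-- Runs against the sample [tran] and calendar tables created above.
select
    Tiered.year_month,
    Tiered.trans_am_tier,
    count(*)                                          as acct_ct,   -- one inner row per account
    sum(Tiered.trans_ct)                              as trans_ct,
    sum(Tiered.trans_am)                              as trans_am,
    sum(Tiered.trans_am) * 1.0 / sum(Tiered.trans_ct) as trans_avg  -- * 1.0 avoids integer division
from
(
    select
        ByAccount.year_month,
        ByAccount.trans_ct,
        ByAccount.trans_am,
        case
            when ByAccount.trans_avg <  0     then 'z. other'
            when ByAccount.trans_avg <  100   then 'a. >=0 <100'
            when ByAccount.trans_avg <  250   then 'b. >=100 <250'
            when ByAccount.trans_avg <  500   then 'c. >=250 <500'
            when ByAccount.trans_avg <  1000  then 'd. >=500 <1000'
            when ByAccount.trans_avg <  1500  then 'e. >=1000 <1500'
            when ByAccount.trans_avg <  2000  then 'f. >=1500 <2000'
            when ByAccount.trans_avg <  2500  then 'g. >=2000 <2500'
            when ByAccount.trans_avg <  5000  then 'h. >=2500 <5000'
            when ByAccount.trans_avg <  10000 then 'i. >=5000 <10000'
            when ByAccount.trans_avg >= 10000 then 'j. >=10000'
            else 'z. other'  -- only reached when trans_avg is NULL
        end as trans_am_tier
    from
    (
        select
            calendar.year_month,
            [tran].acct_id,
            count([tran].tran_id)                             as trans_ct,
            sum([tran].tran_am)                               as trans_am,
            sum([tran].tran_am) * 1.0 / count([tran].tran_id) as trans_avg
        from
            [tran]
            inner join calendar on [tran].tran_eff_dt = calendar.calendar_date
        group by
            calendar.year_month,
            [tran].acct_id
    ) ByAccount
) Tiered
group by
    Tiered.year_month,
    Tiered.trans_am_tier
order by
    Tiered.year_month,
    Tiered.trans_am_tier;
```

With the three sample rows this should return a single row in tier 'd. >=500 <1000': one account, three transactions, and an average of about 733.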
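Another option is to use a common table expression, or CTE. Although there are things you can do with a CTE that you cannot do with a subquery (such as writing recursive queries), in this case it is really just a matter of personal preference. I prefer CTEs because I find them easier to read, especially when you need several of them; multiple nested subqueries get messy quickly. The CTE approach looks like this: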
with TransactionsByMonth as
(
select
calendar.year_month,
count(distinct [tran].acct_id) as acct_ct,
sum([tran].tran_am) as trans_am,
count([tran].tran_id) as trans_ct,
sum([tran].tran_am) / count([tran].tran_id) as trans_avg
from
[tran]
inner join calendar on [tran].tran_eff_dt = calendar.calendar_date
group by
calendar.year_month
)
select
TransactionsByMonth.year_month,
case
when TransactionsByMonth.trans_avg >= 0 and TransactionsByMonth.trans_avg < 100 then 'a. >=0 <100'
when TransactionsByMonth.trans_avg >= 100 and TransactionsByMonth.trans_avg < 250 then 'b. >=100 <250'
when TransactionsByMonth.trans_avg >= 250 and TransactionsByMonth.trans_avg < 500 then 'c. >=250 <500'
when TransactionsByMonth.trans_avg >= 500 and TransactionsByMonth.trans_avg < 1000 then 'd. >=500 <1000'
when TransactionsByMonth.trans_avg >= 1000 and TransactionsByMonth.trans_avg < 1500 then 'e. >=1000 <1500'
when TransactionsByMonth.trans_avg >= 1500 and TransactionsByMonth.trans_avg < 2000 then 'f. >=1500 <2000'
when TransactionsByMonth.trans_avg >= 2000 and TransactionsByMonth.trans_avg < 2500 then 'g. >=2000 <2500'
when TransactionsByMonth.trans_avg >= 2500 and TransactionsByMonth.trans_avg < 5000 then 'h. >=2500 <5000'
when TransactionsByMonth.trans_avg >= 5000 and TransactionsByMonth.trans_avg < 10000 then 'i. >=5000 <10000'
when TransactionsByMonth.trans_avg >= 10000 then 'j. >=10000'
else 'z. other'
end as trans_am_tier,
TransactionsByMonth.acct_ct,
TransactionsByMonth.trans_am,
TransactionsByMonth.trans_ct,
TransactionsByMonth.trans_avg
from
TransactionsByMonth;
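As I mentioned, I don't have Teradata installed, but I believe everything here should be standard SQL, so hopefully it works for you or at least points you in the right direction.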
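To illustrate the point about needing several CTEs, here is a sketch that chains three of them and buckets per account. It is self-contained: the nine rows from the question are inlined with a VALUES constructor, which works in SQL Server and PostgreSQL (other systems may need a different way to inline rows), so treat the RawData CTE as a stand-in for the real tables. The CTE names, and mapping every date straight to 201701, are assumptions made for the example.

```sql
with RawData as
(
    -- Stand-in for the joined transaction and calendar tables.
    select v.year_month, v.acct_id, v.tran_am
    from (values
        ('201701', 1234,  400),
        ('201701', 1234,  700),
        ('201701', 1234, 1100),
        ('201701', 2345,  300),
        ('201701', 2345,  800),
        ('201701', 3456, 1500),
        ('201701', 4567,  250),
        ('201701', 4567,  300),
        ('201701', 4567,  350)
    ) as v (year_month, acct_id, tran_am)
),
ByAccount as
(
    -- Step 1: one row per account and month.
    select
        year_month,
        acct_id,
        count(*)                      as trans_ct,
        sum(tran_am)                  as trans_am,
        sum(tran_am) * 1.0 / count(*) as trans_avg
    from RawData
    group by year_month, acct_id
),
Tiered as
(
    -- Step 2: assign each account to a tier by its own average.
    select
        year_month,
        trans_ct,
        trans_am,
        case
            when trans_avg <  0     then 'z. other'
            when trans_avg <  100   then 'a. >=0 <100'
            when trans_avg <  250   then 'b. >=100 <250'
            when trans_avg <  500   then 'c. >=250 <500'
            when trans_avg <  1000  then 'd. >=500 <1000'
            when trans_avg <  1500  then 'e. >=1000 <1500'
            when trans_avg <  2000  then 'f. >=1500 <2000'
            when trans_avg <  2500  then 'g. >=2000 <2500'
            when trans_avg <  5000  then 'h. >=2500 <5000'
            when trans_avg <  10000 then 'i. >=5000 <10000'
            when trans_avg >= 10000 then 'j. >=10000'
            else 'z. other'  -- only reached when trans_avg is NULL
        end as trans_am_tier
    from ByAccount
)
-- Step 3: total the accounts and transactions per tier.
select
    year_month,
    trans_am_tier,
    count(*)                            as acct_ct,
    sum(trans_ct)                       as trans_ct,
    sum(trans_am)                       as trans_am,
    sum(trans_am) * 1.0 / sum(trans_ct) as trans_avg
from Tiered
group by year_month, trans_am_tier
order by year_month, trans_am_tier;
```

On the sample data this gives three rows: tier 'c. >=250 <500' with 1 account, 3 transactions, 900 total, average 300; tier 'd. >=500 <1000' with 2 accounts, 5 transactions, 3300 total, average 660; and tier 'f. >=1500 <2000' with 1 account, 1 transaction, 1500 total, average 1500. Account 3456 lands in 'f. >=1500 <2000' rather than the >=1000 <1500 tier shown in the question's expected results, because the lower bounds are inclusive.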
Based on your narrative, you need to calculate the average per account first (using a Derived Table or a CTE) and then count the number of rows per tier:
select
   /* Then, based on that averaged transaction amount, place that account number with associated transaction count and average transaction size into one of the defined buckets, and then sum the total number of accounts per bucket */
   dt.year_month
  ,case -- no need to repeat the lower limit
     when dt.trans_avg < 0 then 'z. other' -- negatives must be caught first, or they would land in the next tier
     when dt.trans_avg < 100 then 'a. >=0 <100'
     when dt.trans_avg < 250 then 'b. >=100 <250'
     when dt.trans_avg < 500 then 'c. >=250 <500'
     when dt.trans_avg < 1000 then 'd. >=500 <1000'
     when dt.trans_avg < 1500 then 'e. >=1000 <1500'
     when dt.trans_avg < 2000 then 'f. >=1500 <2000'
     when dt.trans_avg < 2500 then 'g. >=2000 <2500'
     when dt.trans_avg < 5000 then 'h. >=2500 <5000'
     when dt.trans_avg < 10000 then 'i. >=5000 <10000'
     when dt.trans_avg >= 10000 then 'j. >=10000'
     else 'z. other' -- this can only happen when trans_avg is NULL
   end as trans_am_tier
  ,count(*) as acct_ct -- the derived table has one row per account and month
  ,sum(dt.trans_ct) as trans_ct
  ,sum(dt.trans_am) as trans_am
from
 (
   select
      /* for every account number, count # of transactions and average the transaction amount for those transactions */
      cldr.year_month
     ,acct.acct_id
     ,sum(tran.tran_am) as trans_am
     ,count(tran.tran_id) as trans_ct
     ,(trans_am / trans_ct) as trans_avg -- why not a simple avg(tran.tran_am)?
   from reports.tran as tran
   inner join reports.date as cldr on cldr.calendar_date=tran.tran_eff_dt
   inner join reports.acct as acct on tran.acct_id=acct.acct_id
   where tran.ext_tran_cd in ('ACHDD','ACHID','ACHRDD')
     and tran.tran_eff_dt between '2017-01-01' and '2017-04-30'
     and tran.prod_type = '4400'
     and acct.acct_stat <> 4
     and acct.dp_cust_nbr NOT IN (1007,1101)
   group by 1,2
 ) as dt
group by 1,2
order by 1,2
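On the avg() remark in the comment above, here is a sketch of the per-account derived table written with avg(), in case it helps. The cast to decimal(18,2) is an assumption, since the type of tran_am is not given; it is there so that systems which average integers to an integer do not drop the fractional part. The WHERE filters and the reports.acct join from the question are left out only to keep it short.

```sql
-- Sketch: per-account aggregation using avg() instead of sum() / count().
-- Filters and the reports.acct join from the question are omitted for brevity.
select
     cldr.year_month
    ,tran.acct_id
    ,sum(tran.tran_am)                        as trans_am
    ,count(tran.tran_id)                      as trans_ct
    ,avg(cast(tran.tran_am as decimal(18,2))) as trans_avg  -- cast is an assumption, see above
from reports.tran as tran
inner join reports.date as cldr on cldr.calendar_date = tran.tran_eff_dt
group by cldr.year_month, tran.acct_id
```

One difference to be aware of: avg() skips rows where tran_am is NULL, whereas sum(tran_am) / count(tran_id) still counts those rows in the denominator, so the two only agree when the amount is never NULL.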