Get the sum of each possibility based on a column value in Hive (aggregate table)
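I have a table with the following columns.
For the above table, I need to get the count of each cd by date, based on the combination of ind values, and I expect the following output table.
For row 2 of the output table, id 45 has one ok and one No, so the date 2020-02-24 should be counted as 1 because it has one ok.
Similarly, row 4 has not ok and No, so for this combination we need to take not ok at the max date for id 30.
I need to develop this in Hive. Can anyone suggest how to achieve it? I tried writing separate subqueries, but the many joins hurt performance (I was writing a separate query to calculate each combination and then joining the results).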
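Update for other scenarios:
I have the following data in the table.
After assigning weights, it looks like this:
First case: when we group by date, for 1/1/2020 I get a count of 1, which is correct.
Second case: for date 1/2/2020 we are supposed to get a count of 1 for not ok only, but it gives 2 (because it is looking at the first-case row for cd 1 on 1/2/2020).
One more scenario:
When the same cd has multiple records on different dates, the result is not correct.
I have 2 "ok" records for cd 1 on different dates. We should count only 1 and drop the other ok, whether it is the 1/1/2020 one or the 1/2/2020 one, because it is the same cd.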
Any help is much appreciated.
Thanks,
Babu
Use conditional aggregation:
select date,
sum(case when ind = 'ok' then 1 else 0 end) ok_count,
sum(case when ind = 'No' then 1 else 0 end) no_count,
sum(case when ind = 'not ok' then 1 else 0 end) not_ok_count
from mytable
group by date
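To see what conditional aggregation returns, here is a small sketch that runs the same query shape against an in-memory SQLite database from Python. SQLite is only a stand-in for Hive here, the sample rows are copied from the mytable_test listing in the other answer, and the column names dt and cd (instead of date and id) as well as the function name counts_per_date are my own choices.

```python
import sqlite3

# Sample rows (dt, cd, ind) copied from the mytable_test listing in this thread.
ROWS = [
    ("2020-08-24", 10, "ok"),
    ("2020-02-21", 45, "No"),
    ("2020-02-24", 45, "ok"),
    ("2020-08-25", 20, "No"),
    ("2020-10-09", 30, "not ok"),
    ("2020-10-13", 30, "not ok"),
    ("2020-10-21", 30, "No"),
    ("2020-10-23", 30, "No"),
    ("2020-09-14", 12, "No"),
]

QUERY = """
    SELECT dt,
           SUM(CASE WHEN ind = 'ok'     THEN 1 ELSE 0 END) AS ok_count,
           SUM(CASE WHEN ind = 'No'     THEN 1 ELSE 0 END) AS no_count,
           SUM(CASE WHEN ind = 'not ok' THEN 1 ELSE 0 END) AS not_ok_count
    FROM mytable
    GROUP BY dt
    ORDER BY dt
"""


def counts_per_date(rows):
    """Load the rows into SQLite and count every row under its own date."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (dt TEXT, cd INTEGER, ind TEXT)")
    conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in counts_per_date(ROWS):
        print(row)
```

Note that this counts every row, so cd 45 contributes both its No on 2020-02-21 and its ok on 2020-02-24. It does not apply any precedence between indicators.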
Or, if you only want to consider the latest row per id, you can pre-filter with row_number() first:
select date,
sum(case when ind = 'ok' then 1 else 0 end) ok_count,
sum(case when ind = 'No' then 1 else 0 end) no_count,
sum(case when ind = 'not ok' then 1 else 0 end) not_ok_count
from (
select t.*, row_number() over(partition by id order by date desc) rn
from mytable t
) t
where rn = 1
group by date
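A quick way to check the latest-row variant is to run it on the sample rows from this thread. The sketch below again uses SQLite from Python as a stand-in for Hive (window functions need SQLite 3.25 or newer). The column names dt and cd and the function name latest_row_counts are assumptions of mine, not names from the question.

```python
import sqlite3

# Sample rows (dt, cd, ind) copied from the mytable_test listing in this thread.
ROWS = [
    ("2020-08-24", 10, "ok"),
    ("2020-02-21", 45, "No"),
    ("2020-02-24", 45, "ok"),
    ("2020-08-25", 20, "No"),
    ("2020-10-09", 30, "not ok"),
    ("2020-10-13", 30, "not ok"),
    ("2020-10-21", 30, "No"),
    ("2020-10-23", 30, "No"),
    ("2020-09-14", 12, "No"),
]

# Keep only the most recent row of each cd, then pivot the indicator by date.
QUERY = """
    SELECT dt,
           SUM(CASE WHEN ind = 'ok'     THEN 1 ELSE 0 END) AS ok_count,
           SUM(CASE WHEN ind = 'No'     THEN 1 ELSE 0 END) AS no_count,
           SUM(CASE WHEN ind = 'not ok' THEN 1 ELSE 0 END) AS not_ok_count
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (PARTITION BY cd ORDER BY dt DESC) AS rn
        FROM mytable t
    ) latest
    WHERE rn = 1
    GROUP BY dt
    ORDER BY dt
"""


def latest_row_counts(rows):
    """Count indicators using only the latest row of each cd."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (dt TEXT, cd INTEGER, ind TEXT)")
    conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in latest_row_counts(ROWS):
        print(row)
```

On this data the latest row of cd 30 is the No on 2020-10-23, so this variant reports No for cd 30 rather than the not ok on 2020-10-13 that the question asks for. It fits only when "latest row wins" is the intended rule.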
If you need the ind counts at the latest date for each ID, the query would look like this:
select dt,count(case when ind='ok' then 1 end) as ok_count,
count(case when ind='No' then 1 end) as No_count,
count(case when ind='not ok' then 1 end) as not_ok_count
from mytable_test where dt in (select max(dt) from mytable_test group by cd) group by dt;
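However, if there are certain truth-table conditions, for example, for a given ID:
- If it has both ok and No, pick ok.
- If it has both No and not ok, pick not ok.
then the following may not be a very efficient approach, but it works.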
select dt,count(case when ind='ok' then 1 end) as ok_count,
count(case when ind='No' then 1 end) as No_count,
count(case when ind='not ok' then 1 end) as not_ok_count
from mytable_test where dt in (
select max(a.dt) from mytable_test a,(select cd, (case when ind_to_consider=0 then 'No' when ind_to_consider=1 then 'ok' when ind_to_consider=2 then 'not ok' end ) as decoeded_ind from (select cd,max(ind_wt) as ind_to_consider from (select dt,cd,ind,(case when ind='ok' then 1 when ind='No' then 0 when ind='not ok' then 2 end ) as ind_wt from mytable_test) wt group by cd) decoder) k where a.cd=k.cd and a.ind=k.decoeded_ind group by a.cd,a.ind) group by dt;
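Explanation
First, assign weights to the ind conditions you provided.
In this case, based on your example, I have assumed that No has the lowest weight, ok is in the middle, and not ok is the highest.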
select dt,cd,ind,(case when ind='ok' then 1 when ind='No' then 0 when ind='not ok' then 2 end ) as ind_wt from mytable_test
+-------------+-----+---------+---------+--+
| dt | cd | ind | ind_wt |
+-------------+-----+---------+---------+--+
| 2020-08-24 | 10 | ok | 1 |
| 2020-02-21 | 45 | No | 0 |
| 2020-02-24 | 45 | ok | 1 |
| 2020-08-25 | 20 | No | 0 |
| 2020-10-09 | 30 | not ok | 2 |
| 2020-10-13 | 30 | not ok | 2 |
| 2020-10-21 | 30 | No | 0 |
| 2020-10-23 | 30 | No | 0 |
| 2020-09-14 | 12 | No | 0 |
+-------------+-----+---------+---------+--+
Next, get the maximum weight for each cd (from the wt block):
select cd,max(ind_wt) as ind_to_consider from (select dt,cd,ind,(case when ind='ok' then 1 when ind='No' then 0 when ind='not ok' then 2 end ) as ind_wt from mytable_test) wt group by cd
+-----+------------------+--+
| cd | ind_to_consider |
+-----+------------------+--+
| 10 | 1 |
| 12 | 0 |
| 20 | 0 |
| 30 | 2 |
| 45 | 1 |
+-----+------------------+--+
Now decode the weights back into indicators, so that you can get the latest date for each cd and its maximum indicator.
select max(a.dt) from mytable_test a,(select cd, (case when ind_to_consider=0 then 'No' when ind_to_consider=1 then 'ok' when ind_to_consider=2 then 'not ok' end ) as decoeded_ind from (select cd,max(ind_wt) as ind_to_consider from (select dt,cd,ind,(case when ind='ok' then 1 when ind='No' then 0 when ind='not ok' then 2 end ) as ind_wt from mytable_test) wt group by cd) decoder) k where a.cd=k.cd and a.ind=k.decoeded_ind group by a.cd,a.ind
+-------------+--+
| _c0 |
+-------------+--+
| 2020-08-24 |
| 2020-09-14 |
| 2020-08-25 |
| 2020-10-13 |
| 2020-02-24 |
+-------------+--+
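The result above lists only dates, which makes it hard to tell which cd each date belongs to. As a rough sketch, the same encode, max, decode steps can be run with cd and ind kept in the output. This uses SQLite from Python as a stand-in for Hive. The alias decoded_ind, the explicit JOIN ... ON form, and the function name chosen_rows are my own, while the weights (No = 0, ok = 1, not ok = 2) follow the assumption stated in the explanation.

```python
import sqlite3

# Sample rows (dt, cd, ind) copied from the mytable_test listing above.
ROWS = [
    ("2020-08-24", 10, "ok"),
    ("2020-02-21", 45, "No"),
    ("2020-02-24", 45, "ok"),
    ("2020-08-25", 20, "No"),
    ("2020-10-09", 30, "not ok"),
    ("2020-10-13", 30, "not ok"),
    ("2020-10-21", 30, "No"),
    ("2020-10-23", 30, "No"),
    ("2020-09-14", 12, "No"),
]

# Inner query: weight each ind, take the max weight per cd, decode it back.
# Outer query: latest date on which each cd carried that winning indicator.
QUERY = """
    SELECT a.cd, a.ind, MAX(a.dt) AS latest_dt
    FROM mytable_test a
    JOIN (
        SELECT cd,
               CASE MAX(CASE ind WHEN 'No' THEN 0
                                 WHEN 'ok' THEN 1
                                 WHEN 'not ok' THEN 2 END)
                   WHEN 0 THEN 'No'
                   WHEN 1 THEN 'ok'
                   WHEN 2 THEN 'not ok'
               END AS decoded_ind
        FROM mytable_test
        GROUP BY cd
    ) k ON a.cd = k.cd AND a.ind = k.decoded_ind
    GROUP BY a.cd, a.ind
    ORDER BY a.cd
"""


def chosen_rows(rows):
    """Return (cd, winning ind, latest date of that ind) for every cd."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable_test (dt TEXT, cd INTEGER, ind TEXT)")
    conn.executemany("INSERT INTO mytable_test VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in chosen_rows(ROWS):
        print(row)
```

Keeping cd and ind in this intermediate result also makes it possible to filter the final pivot on the full (cd, ind, dt) combination instead of on the date alone.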
Then use these dates to get the pivot:
select dt,count(case when ind='ok' then 1 end) as ok_count,
count(case when ind='No' then 1 end) as No_count,
count(case when ind='not ok' then 1 end) as not_ok_count
from mytable_test where dt in (
select max(a.dt) from mytable_test a,(select cd, (case when ind_to_consider=0 then 'No' when ind_to_consider=1 then 'ok' when ind_to_consider=2 then 'not ok' end ) as decoeded_ind from (select cd,max(ind_wt) as ind_to_consider from (select dt,cd,ind,(case when ind='ok' then 1 when ind='No' then 0 when ind='not ok' then 2 end ) as ind_wt from mytable_test) wt group by cd) decoder) k where a.cd=k.cd and a.ind=k.decoeded_ind group by a.cd,a.ind) group by dt;
+-------------+-----------+-----------+---------------+--+
| dt | ok_count | no_count | not_ok_count |
+-------------+-----------+-----------+---------------+--+
| 2020-02-24 | 1 | 0 | 0 |
| 2020-08-24 | 1 | 0 | 0 |
| 2020-08-25 | 0 | 1 | 0 |
| 2020-09-14 | 0 | 1 | 0 |
| 2020-10-13 | 0 | 0 | 1 |
+-------------+-----------+-----------+---------------+--+
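One caveat, which seems to match the cases described in the question's update: filtering with dt in (...) keeps every row on a selected date, including rows of another cd that happen to share that date. A possible alternative is to rank the rows of each cd by weight and then by date, and keep only the top row, so each cd is counted exactly once. The sketch below compares both on SQLite from Python (a stand-in for Hive, window functions need SQLite 3.25 or newer). The extra row for cd 45 on 2020-08-24 is hypothetical and added only to trigger the collision; the names run, DATE_FILTER_QUERY and RANKED_QUERY are mine, and the ranked query has not been verified on Hive here.

```python
import sqlite3

# Sample rows (dt, cd, ind) from the listing above, plus one hypothetical row.
ROWS = [
    ("2020-08-24", 10, "ok"),
    ("2020-02-21", 45, "No"),
    ("2020-02-24", 45, "ok"),
    ("2020-08-25", 20, "No"),
    ("2020-10-09", 30, "not ok"),
    ("2020-10-13", 30, "not ok"),
    ("2020-10-21", 30, "No"),
    ("2020-10-23", 30, "No"),
    ("2020-09-14", 12, "No"),
    ("2020-08-24", 45, "No"),  # hypothetical: shares a date with cd 10
]

COUNTS = """
           COUNT(CASE WHEN ind = 'ok'     THEN 1 END) AS ok_count,
           COUNT(CASE WHEN ind = 'No'     THEN 1 END) AS no_count,
           COUNT(CASE WHEN ind = 'not ok' THEN 1 END) AS not_ok_count
"""
WEIGHT = "CASE ind WHEN 'No' THEN 0 WHEN 'ok' THEN 1 WHEN 'not ok' THEN 2 END"

# Same logic as the answer's query: select rows by date only.
DATE_FILTER_QUERY = f"""
    SELECT dt, {COUNTS}
    FROM mytable_test
    WHERE dt IN (
        SELECT MAX(a.dt)
        FROM mytable_test a
        JOIN (
            SELECT cd,
                   CASE MAX({WEIGHT})
                       WHEN 0 THEN 'No' WHEN 1 THEN 'ok' WHEN 2 THEN 'not ok'
                   END AS decoded_ind
            FROM mytable_test
            GROUP BY cd
        ) k ON a.cd = k.cd AND a.ind = k.decoded_ind
        GROUP BY a.cd, a.ind
    )
    GROUP BY dt
    ORDER BY dt
"""

# Alternative: one row per cd, highest weight first, latest date as tie-break.
RANKED_QUERY = f"""
    SELECT dt, {COUNTS}
    FROM (
        SELECT dt, cd, ind,
               ROW_NUMBER() OVER (
                   PARTITION BY cd
                   ORDER BY {WEIGHT} DESC, dt DESC
               ) AS rn
        FROM mytable_test
    ) ranked
    WHERE rn = 1
    GROUP BY dt
    ORDER BY dt
"""


def run(query, rows):
    """Execute a query against a fresh in-memory copy of the sample table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable_test (dt TEXT, cd INTEGER, ind TEXT)")
    conn.executemany("INSERT INTO mytable_test VALUES (?, ?, ?)", rows)
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print("date filter:", run(DATE_FILTER_QUERY, ROWS))
    print("ranked     :", run(RANKED_QUERY, ROWS))
```

With the hypothetical row included, the date filter also counts cd 45's No under 2020-08-24, while the ranked version counts cd 45 once, as ok on 2020-02-24. The ranked version also reads the table once instead of three times, which may help with the performance concern in the question.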