Get the sum of each possibility based on the column value in Hive - aggregate table

Question

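I have a table with the following columns.

For the table above, I need to get the count of each cd by date, based on the combination of ind values, and I expect the following output table.

For row 2 of the output table, id 45 has one ok and one No, so the date 2020-02-24 should be counted as 1, because it has one ok.

Similarly, for row 4, id 30 has not ok and No, so for this combination we need to take the max date of the not ok rows for id 30.

I need to develop this in Hive. Can someone suggest how to achieve it? I tried writing separate subqueries, but the many joins hurt performance (I am writing a separate query to compute each combination and then joining the results).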

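Update for other scenarios:

I have the following data in the table.

When we assign the weights, it looks like this:

First case: when we group by date, for 1/1/2020 I get a count of 1, which is correct.

Second case: for the date 1/2/2020 we are supposed to get only a count of 1 for notOk, but it gives 2 (because it is also looking at the first-case row for cd 1 on 1/2/2020).

Another scenario:

When I have multiple records for the same cd on different dates, it does not give the correct result.

I have 2 "ok" rows for cd 1 on different dates. So we need to count only 1 and drop the other ok, whether it is 1/1/2020 or 1/2/2020, because it is the same cd.

Any help is much appreciated.

Thanks, Babu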

Answer 1

Use conditional aggregation:

select date,
    sum(case when ind = 'ok'     then 1 else 0 end) ok_count,
    sum(case when ind = 'No'     then 1 else 0 end) no_count,
    sum(case when ind = 'not ok' then 1 else 0 end) not_ok_count
from mytable
group by date

Or, if you only want to consider the latest row per id, we can pre-filter with row_number() first:

select date,
    sum(case when ind = 'ok'     then 1 else 0 end) ok_count,
    sum(case when ind = 'No'     then 1 else 0 end) no_count,
    sum(case when ind = 'not ok' then 1 else 0 end) not_ok_count
from (
    select t.*, row_number() over(partition by id order by date desc) rn
    from mytable t
) t
where rn = 1
group by date

Answer 2

If you need the ind counts for the latest date of a given ID, the query would look like this:

select dt,count(case when ind='ok' then 1 end) as ok_count,
count(case when ind='No' then 1 end) as No_count,
count(case when ind='not ok' then 1 end) as not_ok_count 
from mytable_test where dt in (select max(dt) from mytable_test group by cd) group by dt;

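However, if there are real conditions on the table, for example, for a given ID:
- if it has both ok and No, pick ok;
- if it has both No and not ok, pick not ok;

then the following may not be a very efficient approach, but it works.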

select dt,
    count(case when ind = 'ok'     then 1 end) as ok_count,
    count(case when ind = 'No'     then 1 end) as No_count,
    count(case when ind = 'not ok' then 1 end) as not_ok_count
from mytable_test
where dt in (
    select max(a.dt)
    from mytable_test a,
        (select cd,
            (case when ind_to_consider = 0 then 'No' when ind_to_consider = 1 then 'ok' when ind_to_consider = 2 then 'not ok' end) as decoeded_ind
         from (select cd, max(ind_wt) as ind_to_consider
               from (select dt, cd, ind,
                        (case when ind = 'ok' then 1 when ind = 'No' then 0 when ind = 'not ok' then 2 end) as ind_wt
                     from mytable_test) wt
               group by cd) decoder) k
    where a.cd = k.cd and a.ind = k.decoeded_ind
    group by a.cd, a.ind)
group by dt;

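Explanation

First assign some weights to the ind conditions you gave. In this case, based on your example, I assume No has the lowest weight, ok is in the middle, and not ok is the highest.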

select dt,cd,ind,(case when ind='ok' then 1 when ind='No' then 0 when ind='not ok' then 2 end ) as ind_wt from  mytable_test

    +-------------+-----+---------+---------+--+
    |     dt      | cd  |   ind   | ind_wt  |
    +-------------+-----+---------+---------+--+
    | 2020-08-24  | 10  | ok      | 1       |
    | 2020-02-21  | 45  | No      | 0       |
    | 2020-02-24  | 45  | ok      | 1       |
    | 2020-08-25  | 20  | No      | 0       |
    | 2020-10-09  | 30  | not ok  | 2       |
    | 2020-10-13  | 30  | not ok  | 2       |
    | 2020-10-21  | 30  | No      | 0       |
    | 2020-10-23  | 30  | No      | 0       |
    | 2020-09-14  | 12  | No      | 0       |
    +-------------+-----+---------+---------+--+
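
To try the steps in this explanation yourself, the sample rows above could be loaded with something like the following. This is only a sketch: the column types (string for dt and ind, int for cd) are my assumption, since the question does not state them. ISO-formatted date strings sort correctly as text, so max(dt) still returns the latest date.

```sql
-- Assumed schema: the real table's column types are not given in the question.
create table mytable_test (dt string, cd int, ind string);

-- The nine sample rows shown in the weight table above.
insert into table mytable_test values
    ('2020-08-24', 10, 'ok'),
    ('2020-02-21', 45, 'No'),
    ('2020-02-24', 45, 'ok'),
    ('2020-08-25', 20, 'No'),
    ('2020-10-09', 30, 'not ok'),
    ('2020-10-13', 30, 'not ok'),
    ('2020-10-21', 30, 'No'),
    ('2020-10-23', 30, 'No'),
    ('2020-09-14', 12, 'No');
```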

Next, get the max weight for each cd (from the wt block):

select cd, max(ind_wt) as ind_to_consider
from (select dt, cd, ind,
        (case when ind = 'ok' then 1 when ind = 'No' then 0 when ind = 'not ok' then 2 end) as ind_wt
      from mytable_test) wt
group by cd

+-----+------------------+--+
| cd  | ind_to_consider  |
+-----+------------------+--+
| 10  | 1                |
| 12  | 0                |
| 20  | 0                |
| 30  | 2                |
| 45  | 1                |
+-----+------------------+--+

Now you have to decode the weight back to the indicator, so that you can get the latest date for each cd and its max indicator.

select max(a.dt)
from mytable_test a,
    (select cd,
        (case when ind_to_consider = 0 then 'No' when ind_to_consider = 1 then 'ok' when ind_to_consider = 2 then 'not ok' end) as decoeded_ind
     from (select cd, max(ind_wt) as ind_to_consider
           from (select dt, cd, ind,
                    (case when ind = 'ok' then 1 when ind = 'No' then 0 when ind = 'not ok' then 2 end) as ind_wt
                 from mytable_test) wt
           group by cd) decoder) k
where a.cd = k.cd and a.ind = k.decoeded_ind
group by a.cd, a.ind

+-------------+--+
|     _c0     |
+-------------+--+
| 2020-08-24  |
| 2020-09-14  |
| 2020-08-25  |
| 2020-10-13  |
| 2020-02-24  |
+-------------+--+

Then use these dates to get the pivot:

select dt,
    count(case when ind = 'ok'     then 1 end) as ok_count,
    count(case when ind = 'No'     then 1 end) as No_count,
    count(case when ind = 'not ok' then 1 end) as not_ok_count
from mytable_test
where dt in (
    select max(a.dt)
    from mytable_test a,
        (select cd,
            (case when ind_to_consider = 0 then 'No' when ind_to_consider = 1 then 'ok' when ind_to_consider = 2 then 'not ok' end) as decoeded_ind
         from (select cd, max(ind_wt) as ind_to_consider
               from (select dt, cd, ind,
                        (case when ind = 'ok' then 1 when ind = 'No' then 0 when ind = 'not ok' then 2 end) as ind_wt
                     from mytable_test) wt
               group by cd) decoder) k
    where a.cd = k.cd and a.ind = k.decoeded_ind
    group by a.cd, a.ind)
group by dt;



+-------------+-----------+-----------+---------------+--+
|     dt      | ok_count  | no_count  | not_ok_count  |
+-------------+-----------+-----------+---------------+--+
| 2020-02-24  | 1         | 0         | 0             |
| 2020-08-24  | 1         | 0         | 0             |
| 2020-08-25  | 0         | 1         | 0             |
| 2020-09-14  | 0         | 1         | 0             |
| 2020-10-13  | 0         | 0         | 1             |
+-------------+-----------+-----------+---------------+--+
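
Since the question mentions that many joins hurt performance, here is an alternative sketch that applies the same weights in a single pass with row_number(). It is not the query from this answer, just one possible variation, and it assumes the same mytable_test(dt, cd, ind) table and the same weights (No = 0, ok = 1, not ok = 2). For each cd it keeps exactly one row: the highest-weight ind, and among those the latest dt.

```sql
-- Sketch only: assumes mytable_test(dt, cd, ind) and the weights used above.
select dt,
    count(case when ind = 'ok'     then 1 end) as ok_count,
    count(case when ind = 'No'     then 1 end) as no_count,
    count(case when ind = 'not ok' then 1 end) as not_ok_count
from (
    -- Rank the rows of each cd: highest weight first, then latest date.
    select w.dt, w.cd, w.ind,
        row_number() over (partition by w.cd order by w.ind_wt desc, w.dt desc) as rn
    from (
        -- Same weighting as the ind_wt column in the explanation.
        select dt, cd, ind,
            case when ind = 'ok' then 1 when ind = 'No' then 0 when ind = 'not ok' then 2 end as ind_wt
        from mytable_test
    ) w
) ranked
where rn = 1          -- one surviving row per cd
group by dt;
```

On the nine sample rows this should give the same five result rows as the table above. Because only one row per cd survives the rn = 1 filter, a cd that has two ok rows on different dates is counted once, under its latest ok date, and a row belonging to a different cd that happens to share a date is not pulled in the way it can be with dt in (...).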


Tags: sql, hive, pivot, greatest-n-per-group, hiveql