Combine multiple select queries into one to avoid multiple passes over a huge table

Question

Here is a much-simplified version of the problem at hand.

Table A has columns rz_id and sHashA. Table A is huge.

Table B has columns scode and sHashB. There can be many sHashB values for a given scode value. Table B is much smaller than table A.

For each scode value (about 200 of them) I have to run a query like the following (here scode is 500).

select count(distinct rz_id) from A where substr(sHashA, 1, 5) in (select substr(sHashB, 1, 5) from B where scode = 500);

I wrote one such query for each scode value, so I ended up with 200 queries:

select count(distinct rz_id) from A where substr(sHashA, 1, 5) in (select substr(sHashB, 1, 5) from B where scode = 500);
select count(distinct rz_id) from A where substr(sHashA, 1, 5) in (select substr(sHashB, 1, 5) from B where scode = 501);
select count(distinct rz_id) from A where substr(sHashA, 1, 5) in (select substr(sHashB, 1, 5) from B where scode = 502);
.
.
.
select count(distinct rz_id) from A where substr(sHashA, 1, 5) in (select substr(sHashB, 1, 5) from B where scode = 700);

The problem is that this ends up scanning table A 200 times, which is time-consuming. I would like to do it in a single pass (a single query).

I thought of building a table with as many rows as table A, with one additional column per scode from table B, using a query like this:

select /*+ streamtable(a) */ a.*,
  if(substr(sHashA, 1, 5) in (select substr(sHashB, 1, 5) from B where scode = 500), 1, 0) as scode_500,
  if(substr(sHashA, 1, 5) in (select substr(sHashB, 1, 5) from B where scode = 501), 1, 0) as scode_501,
  ...
  if(substr(sHashA, 1, 5) in (select substr(sHashB, 1, 5) from B where scode = 700), 1, 0) as scode_700
from A a;

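For every row of table A, this would output 0 or 1 in each of the 200 scode columns. Later I can sum the columns to get the counts. I came up with this table because I am also interested in estimating the overlap in counts between any two scodes.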

But I get a parse error, and I suspect that a subquery is not allowed inside an IF expression.

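So the final question is: how do I reduce all of these queries to a single query, so that I go through the rows of the huge table only once? Please also suggest alternative ways to handle this counting, keeping in mind that I am also interested in the overlap.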

Answer 1

How about something like this:

select count(distinct A.rz_id), B.scode
from A,B
where substr(A.sHashA, 1, 5) = substr(B.sHashB, 1,5)
and B.scode in (500,501,...)
group by B.scode

This gets all the counts in a single pass over table A.
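
For the overlap part of the question, one possible approach is to materialize the distinct (rz_id, scode) matches once and then work only with that smaller table. The following is a rough sketch, not a tested solution: the table name rz_scode_pairs and the output column names are made up for illustration, and it assumes the join on the 5-character prefixes produces an intermediate result that your cluster can handle.

```sql
-- Hypothetical intermediate table (name is an assumption, not from the question).
-- One row per distinct (rz_id, scode) match; table A is read once here.
-- A filter such as "where b.scode in (...)" can be added to limit the scodes.
create table rz_scode_pairs as
select distinct a.rz_id, b.scode
from A a
join B b
  on substr(a.sHashA, 1, 5) = substr(b.sHashB, 1, 5);

-- Per-scode counts. Because the pairs are already distinct, count(rz_id)
-- here matches count(distinct rz_id) in the original per-scode queries.
select scode, count(rz_id) as n_rz
from rz_scode_pairs
group by scode;

-- Pairwise overlap: number of rz_id values that match both scodes.
-- The condition p1.scode < p2.scode keeps each unordered pair once.
select p1.scode as scode_x, p2.scode as scode_y, count(*) as n_shared
from rz_scode_pairs p1
join rz_scode_pairs p2
  on p1.rz_id = p2.rz_id
where p1.scode < p2.scode
group by p1.scode, p2.scode;
```

The second and third queries never touch table A, so the expensive scan happens only when rz_scode_pairs is built. Pairs of scodes with no rz_id in common simply do not appear in the overlap result.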


sql

hadoop

hive

bigdata

hiveql