PIG: How to create a percentage (%) based table?
I am trying to create a table that shows the percentage of occurrences. For example, I have a table named example that contains the following data:
class, value
------ -------
1 , abc
1 , abc
1 , xyz
1 , abc
2 , xyz
2 , abc
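Here, for class value 1, 'abc' occurs 3 times and 'xyz' occurs only once, out of 4 occurrences in total. For class value 2, 'abc' and 'xyz' each occur once (2 occurrences in total).
So the output would be: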
class, %_of_abc, %_of_xyz
------ -------- --------
1 , 75 , 25
2 , 50 , 50
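Any idea how to do this when both column values vary? I was thinking of using GROUP, but I am not sure how grouping by class value would help me.
It is a bit complex, but here is a solution: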
grunt> Dump A;
(1,abc)
(1,abc)
(1,xyz)
(1,abc)
(2,xyz)
(2,abc)
grunt> B = Group A by class;
grunt> C = foreach B generate group as class:int, COUNT(A) as cnt;
grunt> D = Group A by (class,value);
grunt> E = foreach D generate FLATTEN(group), COUNT(A) as tot_cnt;
grunt> F = foreach E generate $0 as class:int, $1 as value:chararray, tot_cnt;
grunt> G = JOIN F BY class,C BY class;
grunt> H = foreach G generate $0 as class, $1 as value, ($2*100/$4) as perc;
grunt> Dump H;
(1,xyz,25)
(1,abc,75)
(2,xyz,50)
(2,abc,50)
grunt> I = group H by class;
grunt> J = FOREACH I generate group as class, FLATTEN(BagToTuple(H.perc));
grunt> Dump J;
(1,75,25)
(2,50,50)
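For reference, the same steps can be sketched as one script. This is only a sketch under a few assumptions that are not in the original post: the input is a comma-separated file without padding spaces at the hypothetical path 'example.csv', the counts are cast to double so that shares such as 1/3 are not truncated by integer division, and the bag is ordered by value before BagToTuple so that the percentage columns come out in a predictable order (a bag has no guaranteed order otherwise).

```pig
-- Hypothetical input path; the schema matches the example table above.
A = LOAD 'example.csv' USING PigStorage(',') AS (class:int, value:chararray);

-- Rows per class (the denominator).
B = GROUP A BY class;
C = FOREACH B GENERATE group AS class, COUNT(A) AS cnt;

-- Rows per (class, value) pair (the numerator).
D = GROUP A BY (class, value);
E = FOREACH D GENERATE FLATTEN(group) AS (class, value), COUNT(A) AS tot_cnt;

-- Attach the per-class total to every (class, value) count.
G = JOIN E BY class, C BY class;

-- Assumption: cast to double to avoid truncating long division.
H = FOREACH G GENERATE E::class AS class, E::value AS value,
        (double)E::tot_cnt * 100 / C::cnt AS perc;

-- One row per class, with the percentages ordered by value (abc, then xyz).
I = GROUP H BY class;
J = FOREACH I {
    sorted = ORDER H BY value;
    GENERATE group AS class, FLATTEN(BagToTuple(sorted.perc));
};
DUMP J;
```

Note that this layout only lines up as columns if every class contains every value. A class that has no 'xyz' rows at all would produce a shorter tuple instead of a 0 in the %_of_xyz position.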