Count of values across fields in Pig
I have the following test data.
A B C
M O
M M M
M M M
N O
P N
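I want to get the total number of entries in this sample test data, i.e. 12.
I have the code below to do this, but it gives an incorrect result.
Any help on how to correct it would be appreciated.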
test= LOAD 'testdata' USING PigStorage(',') as (A:chararray,B:chararray,C:chararray);
values = FOREACH test GENERATE (A==''?'null':(A is null?'null':A)) as A,(B==''?'null':(B is null?'null':B)) as B,(C==''?'null':(C is null?'null':C)) as C;
grp = GROUP values ALL;
counting = FOREACH grp GENERATE group, COUNT(values.A)+COUNT(values.B)+COUNT(values.C);
This gives 15 instead of 12.
I also want to get the count of each value, e.g. M=7, N=2, O=2, P=1.
I wrote the code below.
test= LOAD 'testdata' USING PigStorage(',') as (A:chararray,B:chararray,C:chararray);
values = FOREACH test GENERATE (A==''?'null':(A is null?'null':A)) as A,(B==''?'null':(B is null?'null':B)) as B,(C==''?'null':(C is null?'null':C)) as C;
grp = GROUP values ALL;
A = FOREACH grp {
B =FILTER test.A=='M' OR test.B=='M' OR test.C=='M';
GENERATE group, COUNT(B);
};
I get the error "Scalar has more than one row in the output".
You are also counting the column names in the final count. Modify the script to ignore the first row, then group and count.
test= LOAD 'testdata' USING PigStorage(',') as (A:chararray,B:chararray,C:chararray);
ranked = rank test;
test1 = FILTER ranked BY ($0 > 1); -- Note: rank_test should also work in place of $0.
values = FOREACH test1 GENERATE (A==''?'null':(A is null?'null':A)) as A,(B==''?'null':(B is null?'null':B)) as B,(C==''?'null':(C is null?'null':C)) as C;
grp = GROUP values ALL;
counting = FOREACH grp GENERATE group, COUNT(values.A)+COUNT(values.B)+COUNT(values.C);
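For the per-value counts (M=7, N=2, ...), here is one possible sketch rather than a tested solution. It assumes the file is comma-delimited and begins with the A,B,C header line. The relation names body, cells, entries, all_grp, by_val, total and per_val are made up for illustration. Two points motivate it. First, COUNT skips only real nulls, so replacing blanks with the string 'null' makes the blank cells count too. Second, inside the nested FOREACH over grp, test.A most likely refers to the relation test as a scalar rather than to the bag values, which would explain the "Scalar has more than one row" error.

```pig
-- Sketch only: assumes 'testdata' is comma-delimited and starts with a header line.
test    = LOAD 'testdata' USING PigStorage(',') AS (A:chararray, B:chararray, C:chararray);
ranked  = RANK test;                      -- prepends a sequential row number as $0
body    = FILTER ranked BY $0 > 1;        -- drop the header row; omit this step if there is none

-- Turn the three columns of each row into three single-field rows.
cells   = FOREACH body GENERATE FLATTEN(TOBAG(A, B, C)) AS val;

-- Keep only real entries: discard nulls and empty strings.
entries = FILTER cells BY val IS NOT NULL AND val != '';

-- Total number of entries across all fields.
all_grp = GROUP entries ALL;
total   = FOREACH all_grp GENERATE COUNT(entries) AS n;

-- Number of occurrences of each distinct value.
by_val  = GROUP entries BY val;
per_val = FOREACH by_val GENERATE group AS val, COUNT(entries) AS n;

DUMP total;
DUMP per_val;
```

Flattening the columns into a single field first means both the total and the per-value counts come from ordinary GROUP and COUNT, with no hard-coded filter for each value such as 'M'.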