Basic statistics with Apache Pig
I'm trying to use Apache Pig to characterize the fraction of rows that have certain properties.
For example, if the data looks like this:
a,15
a,16
a,17
b,3
b,16
I would like to get:
a,0.6
b,0.4
I'm trying the following:
A = LOAD 'my file' USING PigStorage(',');
total = FOREACH (GROUP A ALL) GENERATE COUNT(A);
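total comes out as (5), but when I try to use this 'total':
fractions = FOREACH (GROUP A by $0) GENERATE COUNT(A)/total;
I get an error.
Apparently COUNT() returns some kind of projection, and the two projections (the one used for the total and the one used for the fractions) have to be consistent. Is there a way to make this work? Or is it possible to just convert total to a number and avoid this projection-consistency requirement?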
You have to project the value and cast it to double:
total = FOREACH (GROUP A ALL) GENERATE COUNT(A);
rows = FOREACH (GROUP A by $0) GENERATE group,COUNT(A);
fractions = FOREACH rows GENERATE rows.$0,(double)rows.$1/(double)total.$0;
Another way to do the same thing:
test = LOAD 'test.txt' USING PigStorage(',') AS (one:chararray,two:int);
B = GROUP test by $0;
C = FOREACH B GENERATE group, COUNT(test.$0);
D = GROUP test ALL;
E = FOREACH D GENERATE group,COUNT(test.$0);
F = CROSS C,E;
-- F columns: $0 = group key, $1 = per-group count, $2 = 'all', $3 = total count
G = FOREACH F GENERATE $0,$1,$3,(double)$1/(double)$3;
Output:
(a,3,5,0.6)
(b,2,5,0.4)
For some reason, the following modification of what @inquisitive-mind suggested worked:
total = FOREACH (GROUP A ALL) GENERATE COUNT(A);
rows = FOREACH (GROUP A by $0) GENERATE group as colname, COUNT(A) as cnt;
fractions = FOREACH rows GENERATE colname, cnt/(double)total.$0;
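A likely reason the modification works: inside FOREACH rows, an expression such as rows.$0 is read by Pig as a scalar projection of the whole relation rows, and that is only valid when the relation has exactly one row. total has exactly one row, so total.$0 is fine, while the per-group fields should be referenced directly by name or position. Below is a minimal end-to-end sketch under that assumption. The file name data.csv and the field names label, value, n and cnt are made up for illustration and do not come from the posts above.

```pig
-- Hypothetical input file and schema, for illustration only.
A = LOAD 'data.csv' USING PigStorage(',') AS (label:chararray, value:int);

-- Single-row relation holding the total row count.
total = FOREACH (GROUP A ALL) GENERATE COUNT(A) AS n;

-- One row per distinct label with its row count.
counts = FOREACH (GROUP A BY label) GENERATE group AS label, COUNT(A) AS cnt;

-- total.n is a scalar projection (valid because total has one row).
-- Casting to double avoids integer division of two long counts.
fractions = FOREACH counts GENERATE label, (double)cnt / (double)total.n;

DUMP fractions;
```

On the sample data from the question this should yield (a,0.6) and (b,0.4). The order of the rows in the output is not guaranteed.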