Column splitting into groups using Pig Latin

Question

I have a table with two columns (code:chararray, sp:double).

I want to split the second field, sp, into groups based on conditions such as (<25), (>25 <45), (>=45).

Input

code sp
t001 60.0
t001 75.0
a003 34.0
t001 60.0
a003 23.0
a003 23.0
t001 45.0
t001 10.0
t001 8.0
a003 20.0
t001 38.0
a003 55.0
a003 50.0
t001 08.0
a003 44.0

Expected output:

code    bin1     bin2        bin3
       (<25)   (>25 <45)    >=45
t001    3          1          4 
a003    3          2          2

I am trying the following script:

data = LOAD 'Sandy/rd.csv' using PigStorage(',') As (code:chararray,sp:double);

data2 = DISTINCT data;

selfiltnew = FOREACH data2 generate code, sp;
group_new = GROUP selfiltnew by (code,sp);

newselt = FOREACH group_new GENERATE selfiltnew.code AS code,selfiltnew.sp AS sp;

bin1 = filter newselt by sp < 25.0;
grp1 = FOREACH bin1 GENERATE newselt.code AS code, COUNT(newselt.sp) AS (sp1:double);

bin2 = filter newselt by sp < 45 and sp >= 25;
grp2 = FOREACH bin3 GENERATE newselt.code AS code, COUNT(newselt.sp) AS (sp2:double);

bin3 = filter newselt by sp >=75;
grp3 = FOREACH bin3 GENERATE newselt.code AS code, COUNT(newselt.sp) AS (sp3:double);

newbin = JOIN grp1 by code,grp2 by code,grp3 by code;

newtable = FOREACH newbin GENERATE grp1::group.code AS code, SUM(sp1) AS bin1,SUM(sp2) AS bin2,SUM(sp3) AS bin3;

data2 = FOREACH newtable GENERATE code, bin1, bin2, bin3;
dump newtable;

How can I get the correct output using Pig Latin?

Answer 1

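You must use GROUP BY before using COUNT.

Source: COUNT
Usage
Use the COUNT function to compute the number of elements in a bag. COUNT requires a preceding GROUP ALL statement for global counts and a GROUP BY statement for group counts.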

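-- Filter the relation loaded in the question (data), then GROUP before COUNT.
bin1 = FILTER data BY sp < 25.0;
grouped1 = GROUP bin1 BY code;
-- After GROUP, the bag of rows is named after the grouped relation (bin1).
grp1 = FOREACH grouped1 GENERATE group AS code, COUNT(bin1.sp) AS (sp1:double);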
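
To illustrate the two cases the quoted documentation mentions, here is a minimal sketch of a global count (GROUP ALL) next to a per-group count (GROUP BY). It assumes the file from the question is space-delimited, as the sample input suggests; the relation names all_rows, total, by_code and per_code are made up for this example.

```pig
-- Assumption: the input is space-delimited, as in the sample shown in the question.
data = LOAD 'Sandy/rd.csv' USING PigStorage(' ') AS (code:chararray, sp:double);

-- Global count: GROUP ALL puts every row into a single bag named data.
all_rows = GROUP data ALL;
total = FOREACH all_rows GENERATE COUNT(data) AS n;

-- Per-group count: GROUP BY creates one bag of rows per distinct code.
by_code = GROUP data BY code;
per_code = FOREACH by_code GENERATE group AS code, COUNT(data) AS n;

DUMP total;
DUMP per_code;
```

In both cases COUNT is applied to the bag produced by the GROUP statement, which is why it cannot be called directly on a filtered relation.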

Answer 2

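Looking at your desired output, DISTINCT is not needed. Some of the steps you are performing are also unnecessary. Note that if the source is space-delimited, you should use PigStorage(' ') instead of PigStorage(','). Building on what @inquisitive_mind pointed out, the code is as follows: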

data = LOAD 'Sandy/rd.csv' using PigStorage(' ') As (code:chararray,sp:double);
bin1 = filter data by sp < 25.0;
grouped1 = GROUP bin1 by code;
grp1 = FOREACH grouped1 GENERATE group AS code, COUNT(bin1.sp) AS (sp1:double);
bin2 = filter data by (sp >= 25.0 AND sp<45);
grouped2 = GROUP bin2 by code;
grp2 = FOREACH grouped2 GENERATE group AS code, COUNT(bin2.sp) AS (sp2:double);
bin3 = filter data by sp >= 45.0;
grouped3 = GROUP bin3 by code;
grp3 = FOREACH grouped3 GENERATE group AS code, COUNT(bin3.sp) AS (sp3:double);
result= JOIN grp1 BY code, grp2 by code, grp3 by code;
final_result = FOREACH result GENERATE grp1::code as code, grp1::sp1 as bin1, grp2::sp2 as bin2, grp3::sp3 as bin3;

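For the sample input, this yields the counts from the expected output: t001 has 3, 1, 4 and a003 has 3, 2, 2.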
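
As a possible alternative to the three filter/group/join passes in Answer 2, the bins can also be counted in a single GROUP by turning each condition into a 0/1 flag and summing the flags. This is only a sketch under the same assumptions as Answer 2 (space-delimited input, bin2 covering sp >= 25 and sp < 45); the names flagged, b1, b2 and b3 are introduced here for illustration. One difference worth noting: the inner JOIN in Answer 2 drops any code that has no rows in one of the bins, whereas this version keeps such codes and reports 0 for the empty bin.

```pig
-- Assumption: space-delimited input, same bin boundaries as Answer 2.
data = LOAD 'Sandy/rd.csv' USING PigStorage(' ') AS (code:chararray, sp:double);

-- Turn each bin condition into a 0/1 flag with the bincond operator (cond ? a : b).
flagged = FOREACH data GENERATE code,
    (sp < 25.0 ? 1 : 0) AS b1,
    ((sp >= 25.0 AND sp < 45.0) ? 1 : 0) AS b2,
    (sp >= 45.0 ? 1 : 0) AS b3;

-- One GROUP is enough; summing the flags gives the number of rows per bin.
grouped = GROUP flagged BY code;
final_result = FOREACH grouped GENERATE group AS code,
    SUM(flagged.b1) AS bin1,
    SUM(flagged.b2) AS bin2,
    SUM(flagged.b3) AS bin3;

DUMP final_result;
```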


apache-pig

hadoop2