Advice on simplifying my Pig code below

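Here is my code. It uses two GROUP ALL operations, and it works. My goal is to produce the number of unique users across all students together with their total scores, plus the number of unique users among students located in CA. Is there a good way to simplify the code so that it uses only one GROUP operation, or any other constructive idea for simplifying it, for example using only one FOREACH? Thanks.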

student_all = group student all;
student_all_summary = FOREACH student_all GENERATE COUNT_STAR(student) as uu_count, SUM(student.mathScore) as count1,SUM(student.verbScore) as count2;

student_CA = filter student by LID==1;
student_CA_all = group student_CA all;
student_CA_all_summary = FOREACH student_CA_all GENERATE COUNT_STAR(student_CA);

Sample input (student ID, location ID, mathScore, verbScore):

1 1 10 20
2 1 20 30
3 1 30 40
4 2 30 50
5 2 30 50
6 3 30 50

Sample output (unique users, unique users in CA, sum of mathScore over all students, sum of verbScore over all students):

6 3 150 240

Thanks in advance, Lin

You may be looking for this.

data = load '/tmp/temp.csv' USING PigStorage(' ') as (sid:int,lid:int, ms:int, vs:int);

gdata = group data all;

result = foreach gdata {
        student_CA = filter data by lid == 1; 
        student_CA_sum = SUM( student_CA.sid ) ;
        student_CA_count = COUNT( student_CA.sid ) ;
        mathScore = SUM(data.ms);
        verbScore = SUM(data.vs);
        GENERATE student_CA_sum as student_CA_sum, student_CA_count as student_CA_count, mathScore  as mathScore, verbScore as verbScore;
 };

The output is:

grunt> dump result
    (6,3,150,240)
grunt> describe result
    result: {student_CA_sum: long,student_CA_count: long,mathScore: long,verbScore: long}

First, load the file (student) from the Hadoop file system. Then run the following.

split student into student_CA if locationId == 1, student_Other if locationId != 1;

student_CA_all = group student_CA all;

student_CA_all_summary = FOREACH student_CA_all GENERATE COUNT_STAR(student_CA) as uu_count,COUNT_STAR(student_CA)as locationCACount, SUM(student_CA.mathScore) as mScoreCount,SUM(student_CA.verbScore) as vScoreCount;

student_Other_all = group student_Other all;

student_Other_all_summary = FOREACH student_Other_all GENERATE COUNT_STAR(student_Other) as uu_count,0 as locationOtherCount:long, SUM(student_Other.mathScore) as mScoreCount,SUM(student_Other.verbScore) as vScoreCount;

student_CAandOther_all_summary = UNION student_CA_all_summary, student_Other_all_summary;

student_summary_all = group student_CAandOther_all_summary all;

student_summary = foreach student_summary_all generate SUM(student_CAandOther_all_summary.uu_count) as studentIdCount, SUM(student_CAandOther_all_summary.locationCACount) as locationCount, SUM(student_CAandOther_all_summary.mScoreCount) as mathScoreCount , SUM(student_CAandOther_all_summary.vScoreCount) as verbScoreCount;

Output:

dump student_summary;
(6,3,150,240)

Hope this helps :)

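While working on your problem I also ran into an issue with Pig. I think it is caused by improper exception handling in the UNION command. If you run that command, it can hang your command-line prompt without a proper error message. If you like, I can share a snippet that reproduces it.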

The accepted answer has a logic error.

Try it with the following input file:

1 1 10 20
2 1 20 30
3 1 30 40
4 2 30 50
5 2 30 50
6 3 30 50
7 1 10 10

The output will be

(13,4,160,250)

The output should be

(7,4,160,250)

I have modified the script so that it works correctly.

data = load '/tmp/temp.csv' USING PigStorage(' ') as (sid:int,lid:int, ms:int, vs:int);

gdata = group data all;

result = foreach gdata {
    -- total number of students (the name student_CA_sum is kept from the accepted answer)
    student_CA_sum = COUNT( data.sid ) ;
    -- students located in CA (lid == 1)
    student_CA = filter data by lid == 1;
    student_CA_count = COUNT( student_CA.sid ) ;
    mathScore = SUM(data.ms);
    verbScore = SUM(data.vs);
    GENERATE student_CA_sum as student_CA_sum, student_CA_count as student_CA_count, mathScore  as mathScore, verbScore as verbScore;
};

Output

(7,4,160,250)
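
A possible variation on the script above: mark the CA rows with a 0/1 flag before grouping, so the FOREACH needs no nested FILTER and a single GROUP ALL is enough. This is only a minimal sketch, assuming the same file path and schema as the scripts above; the names flagged, is_ca, grouped, summary, and student_count are made up for illustration.

```pig
-- Assumed: same space-delimited file and schema as in the scripts above.
data = LOAD '/tmp/temp.csv' USING PigStorage(' ') AS (sid:int, lid:int, ms:int, vs:int);

-- Flag CA rows (lid == 1) with 1 and all other rows with 0.
flagged = FOREACH data GENERATE sid, (lid == 1 ? 1 : 0) AS is_ca, ms, vs;

grouped = GROUP flagged ALL;

-- Summing the flag counts the CA students; COUNT_STAR counts every row.
summary = FOREACH grouped GENERATE
    COUNT_STAR(flagged) AS student_count,
    SUM(flagged.is_ca)  AS student_CA_count,
    SUM(flagged.ms)     AS mathScore,
    SUM(flagged.vs)     AS verbScore;

DUMP summary;
```

On the seven-row input above, this should print (7,4,160,250), the same as the corrected script.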