Pig: Filtering out the last tuple in a relation
I have the following data in HDFS, and I want to remove the last row.
/user/cloudera/test/testfile.csv
Day,TimeCST,Conditions
1,12:53 AM,Clear
1,1:53 AM,Clear
1,2:53 AM,Clear
1,3:53 AM,Clear
1,4:53 AM,Clear
1,5:53 AM,Clear
1,6:53 AM,Clear
1,7:53 AM,Clear
1,8:53 AM,Clear
1,9:53 AM,Clear
1,10:53 AM,Clear
1,11:53 AM,Clear
1,12:53 PM,Clear
1,1:53 PM,Clear
1,2:53 PM,Clear
1,3:53 PM,Clear
1,4:53 PM,Clear
1,5:53 PM,Clear
First, I load the data, filter out the header, and get the number of rows/tuples:
rawdata = LOAD 'hdfs:/user/cloudera/test/testfile.csv' using PigStorage(',') AS (day:int, timecst:chararray, condition:chararray);
filtereddata = FILTER rawdata BY day > 0; --filters out header
rowcount = FOREACH (GROUP filtereddata ALL) GENERATE COUNT_STAR(filtereddata);
dump rowcount; --Prints (18)
Next, I rank the data and then try to use the generated row number to filter out the last row/tuple:
ranked = RANK filtereddata;
weatherdata = FILTER ranked BY $0 != rowcount.$0;
The filter operation above fails with the following error:
ERROR 2017: Internal error creating job configuration.
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias weatherdata.....
However, if I hard-code the row count into my script, the job runs fine:
weatherdata = FILTER ranked BY $0 != 18;
I'd like to avoid hard-coding the row count. Can you see where I might be going wrong? Thanks.
Apache Pig version 0.12.0-cdh5.5.0 (exported)
compiled Nov 09 2015, 12:41:48
You might want to cast:
weatherdata = FILTER ranked BY $0 != (int)rowcount.$0;
Combining a cast with a named variable seems to solve the problem. The following works:
rawdata = LOAD 'hdfs:/home/hduser/test/testfile.csv' using PigStorage(',') AS (day:int, timecst:chararray, condition:chararray);
filtereddata = FILTER rawdata BY day > 0; --filters out header
rowcount = FOREACH (GROUP filtereddata ALL) GENERATE COUNT_STAR(filtereddata) AS mycount:long;
ranked = RANK filtereddata;
weatherdata = FILTER ranked BY $0 != rowcount.mycount;
dump weatherdata;
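For readers less familiar with Pig Latin, the same drop-the-last-tuple logic can be sketched in plain Python. This is only an illustration of the approach (filter the header, count, rank, then drop the tuple whose rank equals the count), not part of the original answer; the sample data is abbreviated from the question's testfile.csv:

```python
import csv
import io

# Abbreviated sample of the question's testfile.csv, header included.
data = """Day,TimeCST,Conditions
1,12:53 AM,Clear
1,1:53 AM,Clear
1,2:53 AM,Clear
"""

rows = list(csv.reader(io.StringIO(data)))

# Like "FILTER rawdata BY day > 0": drop the header row, whose first
# field ("Day") is not a positive integer.
filtereddata = [r for r in rows if r[0].isdigit() and int(r[0]) > 0]

# Like "COUNT_STAR(filtereddata)" over "GROUP filtereddata ALL".
rowcount = len(filtereddata)

# Like "RANK filtereddata": attach a 1-based row number to each tuple.
ranked = list(enumerate(filtereddata, start=1))

# Like "FILTER ranked BY $0 != rowcount.mycount": keep every tuple
# except the one whose rank equals the total row count.
weatherdata = [row for rank, row in ranked if rank != rowcount]

print(weatherdata)  # every row except the final one
```

In Python the comparison is between two plain ints, so no cast is needed; in Pig the cast (and the named `mycount` field) matters because the scalar projected from `rowcount` is compared against the `long` rank that RANK generates.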