How to filter out empty output files?
Here is my Pig script.
data = load 's3a://sessionlog/2016-05-28/' using SegmentationDataLoader() as (cookie:chararray,tags_and_pageref:map[]);
tags_data = foreach data generate cookie, tags_and_pageref#'tags' as score_tag_bag;
flattened_data = FOREACH tags_data GENERATE cookie, FLATTEN(score_tag_bag) as score_tag;
converted_flattened_data = FOREACH flattened_data GENERATE cookie, (long)score_tag#'score' as score, score_tag#'tag' as tag;
-- dump converted_flattened_data;
-- tuple_data = FOREACH flattened_data GENERATE cookie, TOTUPLE(tags) as tag_tuple;
-- splited_data = FOREACH flattened_data GENERATE cookie, TOKENIZE(tags) as score_tags:bag{t1:(),t2:()};
grouped_data = group converted_flattened_data by (cookie,tag);
acc_data = foreach grouped_data generate group.cookie as cookie, group.tag as tag,SUM(converted_flattened_data.score) as score;
pageref_data = foreach data generate cookie, tags_and_pageref#'pageref' as pageref_bag;
flattened_pageref_data = FOREACH pageref_data GENERATE cookie, FLATTEN(pageref_bag) as score_tag;
filtered_data = FILTER flattened_pageref_data BY score_tag is not null and not IsEmpty(score_tag);
store acc_data into 'segmentation/2016-05-28/4' using PigStorage(',');
store filtered_data into 'pagerefdata/2016-05-28/4' using PigStorage(',');
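However, the output for pagerefdata consists entirely of empty files. How can I filter them out? They are all empty, and I don't want them written at all.
Thanks in advance.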
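I'm not sure I fully understand your question, but in the third line from the bottom you seem to be trying to filter out empty rows by testing for null. Have you tried the following?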
filtered_data = FILTER flattened_pageref_data BY SIZE(TRIM(score_tag)) > 0;
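For context, here is a rough sketch of how such a filter might sit in the script so that only rows whose trimmed value has non-zero length are kept. It assumes the flattened pageref entries are strings (hence the `(chararray)` cast, which is my addition and would not be appropriate if the entries are maps like the tags). The `pig.output.lazy` line is a separate assumption about the cause: if the empty files come from tasks that emit no records rather than from blank rows, lazy output creation may be what actually prevents them.

```pig
-- Assumption about the cause: create output files lazily, so that tasks
-- which emit no records do not write an empty part file.
set pig.output.lazy true;

-- Keep only rows whose flattened value is present and non-blank.
-- The (chararray) cast is an assumption: untyped map values are bytearray,
-- and TRIM expects a chararray.
filtered_data = FILTER flattened_pageref_data
    BY score_tag IS NOT NULL
    AND SIZE(TRIM((chararray)score_tag)) > 0;

store filtered_data into 'pagerefdata/2016-05-28/4' using PigStorage(',');
```

If the filter removes every row, the output directory may still be created; the sketch only aims to avoid writing rows (and, with lazy output, part files) that carry no data.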