How to get DISTINCT values of a group of fields in PIG?
Is it possible to get the following output in PIG? Can I GROUP BY the 1st and 2nd fields and then apply DISTINCT on the 3rd field?
For example, I have this input data:
12345|9658965|52145
12345|9658965|52145
12345|9658965|52145
23456|8541232|96589
23456|8541232|96585
I want output like this:
12345|9658965|52145
23456|8541232|96589
23456|8541232|96585
Method 1: Using DISTINCT
Reference: http://pig.apache.org/docs/r0.12.0/basic.html#distinct
The DISTINCT operator should help here, since it removes duplicate tuples from a relation:
test = LOAD 'test.csv' USING PigStorage('|');
distinct_recs = DISTINCT test;
DUMP distinct_recs;
Method 2: GROUP BY all fields
test = LOAD 'test.csv' USING PigStorage('|');
grp_all_fields = GROUP test BY ($0, $1, $2);
uniq_recs = FOREACH grp_all_fields GENERATE FLATTEN(group);
DUMP uniq_recs;
Both approaches give the expected output for the input shared in the question.
Try this; it is very similar:
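A = LOAD 'test.csv' USING PigStorage('|') AS (a1, a2, a3);
-- Group on the 3rd field, then de-duplicate the (a1, a2) pairs inside each group.
unique =
    FOREACH (GROUP A BY a3) {
        b = A.(a1, a2);
        s = DISTINCT b;
        -- group holds the a3 value; it is emitted last under the alias a4.
        GENERATE FLATTEN(s), group AS a4;
    };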
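The question as literally asked (GROUP BY the 1st and 2nd fields, then DISTINCT on the 3rd) can also be written with a nested FOREACH. This is only a minimal sketch: it assumes the same test.csv, the field names a1, a2, a3 are just illustrative, and the chararray types are my assumption about the data.

```pig
-- Assumed schema: three '|'-separated fields, loaded as chararray.
A = LOAD 'test.csv' USING PigStorage('|') AS (a1:chararray, a2:chararray, a3:chararray);

-- Group on the first two fields.
by_first_two = GROUP A BY (a1, a2);

-- Inside each group, keep only the distinct values of the third field.
distinct_thirds = FOREACH by_first_two {
    thirds = A.a3;
    uniq_thirds = DISTINCT thirds;
    -- FLATTEN(group) restores a1 and a2; FLATTEN(uniq_thirds) emits one row per distinct a3.
    GENERATE FLATTEN(group) AS (a1, a2), FLATTEN(uniq_thirds) AS a3;
};

DUMP distinct_thirds;
```

For the sample input this should give the same three rows as the other approaches, though the row order is not guaranteed. When every field takes part in the de-duplication anyway, a plain DISTINCT on the whole relation is simpler; the nested form is mainly useful when you also want per-group aggregates in the same FOREACH.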