How to perform GROUP BY and then use DISTINCT on another column in Pig

I've just started learning Pig and need some help with the following problem. Thanks in advance!
For example, I have input like this:

Occupation    Category   Name

Actress       Acting     Marion Cotillard
Actor         Acting     Liam Nelson
Tennis Plyr   Athletics  Roger Federer
Football Plyr Athletics  Neymar
Actor         Acting     Tom Hanks
Actress       Acting     Elizabeth Banks
US Senator    Politics   Elizabeth Warren
Football Plyr Athletics  Mesut Ozil

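I want to know how many distinct occupations each category contains. For example, Acting has two: Actress and Actor, so the result would be 2. The problem I'm facing: I can't apply DISTINCT on the 'Occupation' column to the output of 'group by Category'. :(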

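Apply DISTINCT first, then GROUP BY Category. Assuming you have already loaded the data into relation A:

1. Select the 2 columns you need (Occupation and Category) after loading.
2. Apply DISTINCT to that relation.
3. Group by Category.
4. Count the occupations in each category.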

B = FOREACH A GENERATE Occupation AS Occupation, Category AS Category;
C = DISTINCT B;
D = GROUP C BY Category;
E = FOREACH D GENERATE group, COUNT(C.Occupation);
DUMP E;
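
For completeness, the steps above can be put together with a load step into one script. This is only a sketch: the path 'occupations.tsv' is a hypothetical placeholder, and tab-delimited input is an assumption (the sample values such as "Tennis Plyr" contain spaces, so a space delimiter would not work). The field names in the schema are also illustrative.

```pig
-- Hypothetical input path; tab-delimited fields are assumed.
A = LOAD 'occupations.tsv' USING PigStorage('\t')
    AS (Occupation:chararray, Category:chararray, Name:chararray);

-- Keep only the two columns that matter for the count.
B = FOREACH A GENERATE Occupation, Category;

-- Remove duplicate (Occupation, Category) pairs.
C = DISTINCT B;

-- One group per category; each bag now holds unique occupations.
D = GROUP C BY Category;

-- Count the tuples in each bag.
E = FOREACH D GENERATE group AS Category, COUNT(C) AS occupation_count;
DUMP E;
```

On the sample data in the question this should give 2 for Acting, 2 for Athletics and 1 for Politics; the order of the dumped tuples is not guaranteed.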

Try this:

x = load '<data>' using PigStorage('\t') as (occupation:chararray, category:chararray, name:chararray);

x_grouped = group x by category;

x_grouped_distinct = foreach x_grouped { cat = distinct x.occupation; generate $0, cat, COUNT(cat); };

dump x_grouped_distinct;