Pig on grouping

I'm trying to count the number of occurrences of each member_id. The data looks like this: (member_id, item_type)

2020292 Abc

2020292 Acd

2020292 Abc

2938201 CDE

The output should then look like this (id, count):

2020292 3

2938201 1

I tried the following:

data=FOREACH data GENERATE member_id, item_type;
grouping=group data by member_id;
count_elements=foreach grouping generate flatten(group) as member_id, COUNT(data) as num_elements;

I also tried similar code for count_elements, such as 'foreach grouping generate member_id, COUNT(data) as num_elements;' and 'foreach grouping generate flatten(group) as member_id, COUNT(data.item_type) as num_elements;', and none of them worked. Any help is greatly appreciated. Thanks.

Input:

2020292,Abc
2020292,Acd
2020292,Abc
2938201,CDE

Code:

read = load 'test.data' using PigStorage(',') as (id:int,item_typ:chararray);
grouped_Data = group read by id;
describe grouped_Data;
count_val = foreach grouped_Data GENERATE group as (member_id:int),COUNT(read) as (rec_cnt:int);
dump count_val;

Output:

(2020292,3)
(2938201,1)
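
A likely reason the 'generate member_id, ...' variant from the question fails: after GROUP, the relation has only two top-level fields, `group` (the key) and a bag named after the grouped relation. The sketch below reuses the relation names from the code above and assumes the same 'test.data' file; the output names member_id and rec_cnt are just labels chosen for illustration.

```pig
-- Sketch, assuming the same comma-separated 'test.data' file as above.
read = LOAD 'test.data' USING PigStorage(',') AS (id:int, item_typ:chararray);
grouped_Data = GROUP read BY id;

-- After GROUP the only top-level fields are `group` and the bag `read`,
-- so `GENERATE id, ...` is rejected here; the key must be referenced as `group`.
by_key = FOREACH grouped_Data GENERATE group AS member_id, COUNT(read) AS rec_cnt;

-- Counting a projected column of the bag is also valid. Note that COUNT
-- skips tuples whose first field is null; COUNT_STAR includes them.
by_col = FOREACH grouped_Data GENERATE group AS member_id, COUNT(read.item_typ) AS rec_cnt;

DUMP by_key;
```

Running DESCRIBE on the grouped relation is a quick way to confirm which field names are available before writing the FOREACH.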

Jenny, I've added code for your question, and also for the question you raised in the comment above (on @Learner's answer).

Input data:

2020292,Abc
2020292,Acd
2020292,Abc
2938201,CDE

Sample data for id_list:

2020292
2020291
2020290

Pig script:

data = LOAD '/pigsamples/groupdata' USING PigStorage(',') 
       AS (member_id:INT, item_type:CHARARRAY);
id_list_data = LOAD '/pigsamples/groupidlist' USING PigStorage(',') AS (member_id:INT);

group_data = GROUP data BY member_id;
count_grouped_data = FOREACH group_data GENERATE group AS member_id, COUNT(data) AS count;

join_data = JOIN count_grouped_data BY member_id, id_list_data BY member_id;

group_joined_data = FOREACH join_data GENERATE count_grouped_data::member_id
                    AS id, count_grouped_data::count AS count_item_type;

DUMP group_joined_data;

Output:

(2020292,3)