How to use the MAX and COUNT functions together on two different tables that are combined with a JOIN?
-- Pig program
User = LOAD 'path' USING PigStorage(',') as (id:int, reputation:int, displayname:chararray, loc:chararray, age:int);
Post = LOAD 'path' USING PigStorage(',') as (id:int, post_type:int, creationdate:chararray, score:int, viewcount:int, owneruser_id:int, title:chararray, answercount:chararray, commentcount:chararray);
a = JOIN User BY id, Post BY id;
DUMP a;
User_Group = Group a ALL;
Max_reputation = foreach User_Group Generate(User.displayname, User.reputation, Post.id), MAX(User.reputation), COUNT(Post.id);
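So basically I have two different tables, User and Post, and I applied a JOIN to them.
Problem statement: find the display name and the number of posts of the user with the highest reputation.
So I need the display name and reputation from User, and the id from Post.
I want to apply MAX(User.reputation) and COUNT(Post.id) to the result of the JOIN, i.e. a.
Please help.
Which is more useful: applying the JOIN and then computing MAX and COUNT, or computing MAX and COUNT and then applying the JOIN?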
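Problem statement: find the display name and the number of posts of the user with the highest reputation.
First, use the relation User to find the display name of the user with the highest reputation.
Then join with the relation Post to collect all posts of that top user. Finally, group by the user id and compute the count.
The following code will help you achieve this: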
User = LOAD 'path' USING PigStorage(',') AS (id:int, reputation:int, displayname:chararray, loc:chararray, age:int);
Post = LOAD 'path' USING PigStorage(',') AS (id:int, post_type:int, creationdate:chararray, score:int, viewcount:int, owneruser_id:int, title:chararray, answercount:chararray);
-- Put all users in a single group so that the ordering below is global, not per id.
User_grp = GROUP User ALL;
User_each = FOREACH User_grp {
    User_order = ORDER User BY reputation DESC;
    User_limit = LIMIT User_order 1;
    User_nested = FOREACH User_limit GENERATE id, displayname;
    GENERATE FLATTEN(User_nested) AS (user_id, displayname);
};
-- Assumes owneruser_id holds the id of the user who wrote the post.
User_join = JOIN User_each BY user_id, Post BY owneruser_id;
User_grouping = GROUP User_join BY user_id;
User_output = FOREACH User_grouping GENERATE group AS user_id, MAX(User_join.displayname) AS displayname, COUNT(User_join.post_type) AS post_cnts;
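To check the order of operations on paper (pick the top user first, then join, then count), here is a minimal plain-Python sketch. It is not Pig code. The sample rows and the `top_user_post_count` helper are hypothetical, and it assumes that `owneruser_id` in Post refers to `id` in User.

```python
# Hypothetical sample rows; field names follow the Pig schemas above.
users = [
    {"id": 1, "reputation": 50, "displayname": "alice"},
    {"id": 2, "reputation": 900, "displayname": "bob"},
    {"id": 3, "reputation": 300, "displayname": "carol"},
]
posts = [
    {"id": 10, "owneruser_id": 2},
    {"id": 11, "owneruser_id": 1},
    {"id": 12, "owneruser_id": 2},
    {"id": 13, "owneruser_id": 3},
]


def top_user_post_count(users, posts):
    """Return (displayname, post_count) for the highest-reputation user."""
    # MAX step: look at User alone and keep the single top row.
    top = max(users, key=lambda u: u["reputation"])
    # JOIN + COUNT steps: only posts owned by that one user are counted.
    count = sum(1 for p in posts if p["owneruser_id"] == top["id"])
    return top["displayname"], count


result = top_user_post_count(users, posts)
print(result)  # ('bob', 2)
```

Doing the MAX step first means the join only has to match one user row against Post. Unlike an inner JOIN in Pig, this sketch returns a count of 0 when the top user has no posts, instead of producing no row at all.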