Merging Multiple Relations in PIG
Hi everyone, I'm trying to solve the following problem. I have a file whose records have attributes like this:
(id#123, event#sasa, value#abcde, time#213, userid#21321)
To get the total data count I would do:
data_count = FOREACH (GROUP data ALL) GENERATE COUNT(data);
To get the total number of users I would do:
group_users = GROUP data BY userid;
grp_all = GROUP group_users ALL;
count_users = FOREACH grp_all GENERATE COUNT(group_users);
Now I would like to know how to merge them into a single output file with the schema:
(id, event, value, time, total_data, total_users)
Thanks a lot.
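One possible approach (a sketch only; the path `data.csv` and all relation names are placeholders, and the input is assumed to be comma-separated with the five fields above): compute both counts in a single GROUP ALL pass, then CROSS the one-tuple result back onto every row, which effectively appends the two totals as scalar columns.

```pig
-- Placeholder path and field names, assuming a comma-separated input.
data = LOAD 'data.csv' USING PigStorage(',')
       AS (id, event, value, time, userid);

-- One pass over everything: total rows and distinct users.
grouped = GROUP data ALL;
totals = FOREACH grouped {
    uniq = DISTINCT data.userid;
    GENERATE COUNT(data) AS total_data, COUNT(uniq) AS total_users;
};

-- 'totals' holds exactly one tuple, so CROSS just appends the two
-- counts to every original row.
merged = CROSS data, totals;
result = FOREACH merged GENERATE id, event, value, time, total_data, total_users;
STORE result INTO 'output' USING PigStorage(',');
```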
Not sure what "total data" means, but if you want the total user count carried back onto the original rows, you need to use FLATTEN a couple of times. Pig is not SQL: it works with bags, and FLATTEN turns a bag back into rows. For example:
data = LOAD './data.csv' USING PigStorage(',') AS (e_id, e_name, value, time, userid);
group_users = GROUP data BY userid;
grp_all = GROUP group_users ALL;
DESCRIBE grp_all;
-- grp_all: {group: chararray,group_users: {(group: bytearray,data: {(e_id: bytearray,e_name: bytearray,value: bytearray,time: bytearray,userid: bytearray)})}}
uniq_users = FOREACH grp_all GENERATE FLATTEN(group_users), COUNT(group_users) as total_users;
DESCRIBE uniq_users;
-- uniq_users: {group_users::group: bytearray,group_users::data: {(e_id: bytearray,e_name: bytearray,value: bytearray,time: bytearray,userid: bytearray)},total_users: long}
original = FOREACH uniq_users GENERATE FLATTEN(data), total_users;
DESCRIBE original;
-- original: {group_users::data::e_id: bytearray,group_users::data::e_name: bytearray,group_users::data::value: bytearray,group_users::data::time: bytearray,group_users::data::userid: bytearray,total_users: long}
DUMP original;
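If "total data" simply means the overall row count, one hedged way to add it on top of the result above is a second GROUP ALL plus a CROSS (the grouped relation holds a single tuple, so the CROSS behaves like appending a scalar column). The relation names here are made up:

```pig
-- Sketch only: 'original' is the relation produced above.
all_rows  = GROUP original ALL;
row_total = FOREACH all_rows GENERATE COUNT(original) AS total_data;
with_both = CROSS original, row_total;
DUMP with_both;
```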
I did it with this script:
d1 = LOAD 'data' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
d2 = FOREACH d1 GENERATE
        json#'event' AS EVENT,
        json#'params'#'uid' AS USER,
        ToDate(((long)json#'ts') * 1000) AS DATE;
grpd = GROUP d2 BY EVENT;
uniq2 = FOREACH grpd {
    usr = d2.USER;
    unq_usr = DISTINCT usr;
    GENERATE group,
             d2.DATE,
             COUNT(d2.EVENT),
             COUNT(unq_usr);
};
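For readability, the columns produced by the nested FOREACH can be given explicit names with AS and the result written out; a sketch (the output path and column names are placeholders):

```pig
uniq2 = FOREACH grpd {
    usr     = d2.USER;
    unq_usr = DISTINCT usr;
    GENERATE group           AS event,
             d2.DATE         AS dates,
             COUNT(d2.EVENT) AS total_events,
             COUNT(unq_usr)  AS total_users;
};
STORE uniq2 INTO 'event_counts' USING PigStorage(',');
```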