Pig group by two fields yields strange result
I have a problem grouping data with Apache Pig.
Loading the data:
client_trace_send = LOAD '/user/hduser1/adm_project/client_trace50.csv' using PigStorage(',') as (code:chararray, client_id:int, loc_ts:int, length:int, op:chararray, err_code:int, time:long, thread_id:int);
Limiting and inspecting the data:
client_trace_send_small = LIMIT client_trace_send 10;
DUMP client_trace_send_small;
The loaded data:
(msg_snd,0,1,46,enrol_req,-1,1414250523591,9)
(res_rcv,0,1,25,enrol_resp,,1414250523655,9)
(msg_snd,1,2,48,query_queue,-1,1414250523655,9)
(res_rcv,1,2,14,err,19,1414250523661,9)
(msg_snd,1,3,59,peek_req,-1,1414250523661,9)
(res_rcv,1,3,13,err,0,1414250523662,9)
(msg_snd,1,4,59,peek_req,-1,1414250523662,9)
(res_rcv,1,4,13,err,0,1414250523663,9)
(msg_snd,1,5,59,peek_req,-1,1414250523663,9)
Now I want to group the data above on the fields "client_id" and "loc_ts".
GROUPED = GROUP client_trace_send_small by (client_id,loc_ts);
Looking at the result:
DUMP GROUPED;
Strangely, it is:
((0,1),{(msg_snd,0,1,46,enrol_req,-1,1414250523591,9)})
((1,2),{(msg_snd,1,2,48,query_queue,-1,1414250523655,9)})
((1,3),{(msg_snd,1,3,59,peek_req,-1,1414250523661,9)})
((1,4),{(msg_snd,1,4,59,peek_req,-1,1414250523662,9)})
((1,5),{(msg_snd,1,5,59,peek_req,-1,1414250523663,9)})
((8,28493),{(msg_snd,8,28493,62,pop_req,-1,1414251764157,16)})
((9,25976),{(msg_snd,9,25976,66,query_sender,-1,1414251764148,17)})
((19,28250),{(msg_snd,19,28250,64,pop_req,-1,1414251764152,27)})
((31,27977),{(msg_snd,31,27977,65,peek_req,-1,1414251764152,39)})
Some of the values do not even appear in the loaded data.
For the first group I would expect something like this:
((0,1),{(msg_snd,0,1,46,enrol_req,-1,1414250523591,9),(res_rcv,0,1,25,enrol_resp,,1414250523655,9)})
What is going wrong here?
Thanks in advance for your help.
Regards
This happens because LIMIT takes an arbitrary set of tuples. The Pig documentation says:
There is no guarantee which tuples will be returned, and the tuples that are returned can change from one run to the next.
Also, because your script contains two DUMP statements, Pig splits the execution into two pipelines, and each pipeline executes the LIMIT separately. You therefore get two different data sets, one for each sub-pipeline.
What we can be sure of is that the dumped tuples clearly come from the input file, and that each set contains at most 10 rows (fewer than 10 if the file has fewer lines).
You can check your EXPLAIN plan and its explanation.
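One way to make the sample reproducible (a sketch, not part of the original post; the output path is hypothetical) is to impose a total order before the LIMIT, so that every sub-pipeline picks the same first 10 tuples, or to materialize the sample with STORE and work from the stored copy:

```pig
-- Order by a combination of fields that is (close to) unique,
-- so LIMIT returns the same first 10 tuples on every run.
client_trace_send = LOAD '/user/hduser1/adm_project/client_trace50.csv'
    USING PigStorage(',')
    AS (code:chararray, client_id:int, loc_ts:int, length:int,
        op:chararray, err_code:int, time:long, thread_id:int);

ordered = ORDER client_trace_send BY time, client_id, loc_ts;
client_trace_send_small = LIMIT ordered 10;

-- Alternatively, materialize the sample once; then both the DUMP
-- and the GROUP can read the very same stored data.
STORE client_trace_send_small INTO '/user/hduser1/adm_project/sample10'
    USING PigStorage(',');
```

With either variant, the GROUP on (client_id, loc_ts) operates on the same 10 tuples that the DUMP showed, so each group contains both the msg_snd and the res_rcv row as expected.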