Handling One:Many Relations on Pig Join
I have two relations in Pig like this:
rel_A: {key: chararray, some_string: chararray, some_metric: long}
rel_B: {key: chararray, some_metric2: long}
So, for example, rel_A might look like
{('A', 'aaa', 1)
('A', 'aab', 2)
('B', 'aaa', 3)
('B', 'bbb', 1)
('C', 'whatever', 5)}
and rel_B might look like
{('A', 100)
('B', 250)
('C', 0)}
I want to join them so that I get:
{('A', 'aaa', 1, 100)
('A', 'aab', 2, 100)
('B', 'aaa', 3, 250)
('B', 'bbb', 1, 250)
('C', 'whatever', 5, 0)}
This seems conceptually simple to me, and it looks like a plain left outer join, but I ran into a problem when I tried the following:
joined_thing = JOIN rel_A BY key LEFT OUTER, rel_B BY key;
-- The error appears here
agged_flat = FOREACH joined_thing GENERATE
    rel_A::key          AS key,
    rel_A::some_string  AS some_string,
    rel_A::some_metric  AS some_metric,
    rel_B::some_metric2 AS some_metric2;
This throws:
Error: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (A,aaa,1), 2nd : (A,aab,2)
I'm sure I'm missing some conceptual basics here, but I've been stuck trying to get this to work. Any help would be much appreciated!
Solved. The script above works as written. The specific error came from the fact that
{('A', 'aaa', 1),
('A', 'aab', 2),
('B', 'aaa', 3),
('B', 'bbb', 1),
('C', 'whatever', 5)}
was actually showing up as a nested tuple, like so:
{(('A', 'aaa', 1)
('A', 'aab', 2)),
('B', 'aaa', 3),
('B', 'bbb', 1),
('C', 'whatever', 5)}
Although only LOAD statements preceded the script shown above, one of them used a custom library that had a bug I failed to detect. Apologies!
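For anyone hitting something similar: one way to catch this kind of problem is to inspect the relation right after the load, before joining. The following is only a sketch. The input path is hypothetical, and PigStorage stands in for whatever loader is actually used (the custom library in question is not named here).

```pig
-- Hypothetical path; PigStorage is a stand-in for the real loader.
rel_A = LOAD 'input/rel_A.csv' USING PigStorage(',')
        AS (key: chararray, some_string: chararray, some_metric: long);

-- DESCRIBE prints the schema Pig believes the relation has.
DESCRIBE rel_A;

-- DUMP shows the tuples actually produced, which is where an
-- unexpectedly nested tuple would become visible.
first_rows = LIMIT rel_A 5;
DUMP first_rows;
```

If the dumped tuples are nested differently from what DESCRIBE reports, the loader is a likely suspect rather than the JOIN.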
This works as expected; the problem may be a loading error.
rel_A = LOAD '/user/data/A_rel.txt' USING PigStorage(',') as (key: chararray, some_string: chararray, some_metric: long);
rel_B = LOAD '/user/data/B_rel.txt' USING PigStorage(',') as (key: chararray, some_metric2: long);
joined_thing = JOIN rel_A BY key LEFT OUTER, rel_B BY key;
(A,aab,2,A,100)
(A,aaa,1,A,100)
(B,bbb,1,B,250)
(B,aaa,3,B,250)
(C,whatever,5,C,0)
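Note that the joined output above carries the key twice, once from each relation. To get the four-column shape asked for in the question, a projection after the join should be enough. A minimal sketch, assuming the same files and schemas as the LOAD statements above:

```pig
rel_A = LOAD '/user/data/A_rel.txt' USING PigStorage(',')
        AS (key: chararray, some_string: chararray, some_metric: long);
rel_B = LOAD '/user/data/B_rel.txt' USING PigStorage(',')
        AS (key: chararray, some_metric2: long);

-- One row of rel_B matches many rows of rel_A; each rel_A row
-- is paired with its single matching rel_B row.
joined_thing = JOIN rel_A BY key LEFT OUTER, rel_B BY key;

-- Keep rel_A's key and drop the duplicate key column from rel_B.
-- The :: operator disambiguates fields by their source relation.
agged_flat = FOREACH joined_thing GENERATE
    rel_A::key          AS key,
    rel_A::some_string  AS some_string,
    rel_A::some_metric  AS some_metric,
    rel_B::some_metric2 AS some_metric2;

DUMP agged_flat;
```

With the sample data from the question, each row should come out with four fields, for example (A,aaa,1,100). The row order is not guaranteed.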