How to build a super bag from two sub-bags in Apache Pig on Hadoop
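Suppose I have two bags, B1 and B2. How can I build a super bag that contains both of them? I want this because I'd like to call DataFu's SetDifference UDF, which appears to be invoked on a relation that contains two bags.
In my case I already have the two bags, B1 and B2. I think I need to assemble the super bag called "input" in this sample: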
http://datafu.incubator.apache.org/docs/datafu/guide/set-operations.html
differenced = FOREACH input {
    -- input bags must be sorted
    sorted_b1 = ORDER B1 by val;
    sorted_b2 = ORDER B2 by val;
    GENERATE SetDifference(sorted_b1,sorted_b2);
}
Update:
Here is my code and the error message it produces. Any good ideas would be much appreciated.
register datafu-1.2.0.jar;
define setDifference datafu.pig.sets.SetDifference();
-- input1.txt: {(3),(4),(1),(2),(7),(5),(6)}
-- input2.txt: {(1),(3),(5),(12)}
A = load 'input1.txt' AS (B1:bag{T:tuple(val:int)});
B = load 'input2.txt' AS (B1:bag{T:tuple(val:int)});
sorted_b1 = ORDER A by val;
sorted_b2 = ORDER B by val;
differenced = setDifference(sorted_b1,sorted_b2);
-- expected produces: ({(2),(4),(6),(7)})
DUMP differenced;
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <file TestDataFu3.pig, line 11> Cannot expand macro 'setDifference'. Reason: Macro must be defined before expansion.
Thanks in advance,
Lin
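OK, I see what you're asking now: your bags are in different files. You'll need to load both files and then join them so that the two bags end up in the same relation.
Script: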
REGISTER /path/to/jars/datafu-1.2.0.jar;
DEFINE SetDifference datafu.pig.sets.SetDifference();
data1 = LOAD 'input1' AS (B1:bag{T1:tuple(val1:int)});
data2 = LOAD 'input2' AS (B2:bag{T2:tuple(val2:int)});
A = JOIN data1 BY 1, data2 BY 1;
diff = FOREACH A {
    S1 = ORDER B1 BY val1;
    S2 = ORDER B2 BY val2;
    GENERATE SetDifference(S1, S2);
};
DUMP diff;
Output:
({(2),(4),(6),(7)})
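As far as I can tell, the ERROR 1200 in the question comes from calling the UDF as a standalone statement: Pig reads `alias = name(args);` as a macro invocation, so a UDF has to be called inside a FOREACH ... GENERATE. The sketch below is a variant of the script above that assumes each input file holds exactly one row and uses CROSS instead of the constant-key JOIN. The alias `both` is just an illustrative name, and the paths and schema are copied from the script above.

```pig
-- Sketch only: assumes input1 and input2 each contain a single row (one bag).
REGISTER /path/to/jars/datafu-1.2.0.jar;
DEFINE SetDifference datafu.pig.sets.SetDifference();

data1 = LOAD 'input1' AS (B1:bag{T1:tuple(val1:int)});
data2 = LOAD 'input2' AS (B2:bag{T2:tuple(val2:int)});

-- With one row on each side, the cross product is one row holding both bags.
both = CROSS data1, data2;

diff = FOREACH both {
    -- SetDifference expects its input bags to be sorted
    S1 = ORDER B1 BY val1;
    S2 = ORDER B2 BY val2;
    -- The UDF is called inside GENERATE, not as a top-level statement
    GENERATE SetDifference(S1, S2);
};

DUMP diff;
```

If either file had more than one row, CROSS would pair every row of one with every row of the other, so this only makes sense for the single-row case described in the question.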
Hope this helps.