PIG UDF to convert tuple to multiple tuple output

I am new to Pig, and I am trying to create a UDF that takes a tuple and returns multiple tuples based on a delimiter. So I wrote a UDF to read the data file below:

2012/01/01 Name1 Category1|Category2|Category3
2012/01/01 Name2 Category2|Category3
2012/01/01 Name3 Category1|Category5

Basically, I am trying to read the $2 field:

Category1|Category2|Category3
Category2|Category3
Category1|Category5

and get the output as:

Category1, Category2, Category3
Category2, Category3
Category1, Category5

Below is the UDF code I wrote:

    package com.test.multipleTuple;    
    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    public class TupleToMultipleTuple extends EvalFunc<String> {

        @Override
        public String exec(Tuple input) throws IOException {

            // Auxiliary tuple to collect the split fields
            Tuple aux = TupleFactory.getInstance().newTuple();

            if (input == null || input.size() == 0)
                return null;
            try {
                String del = "\\|"; // escape '|' so split() treats it as a literal
                String str = (String) input.get(0);

                String field[] = str.split(del);
                for (String nxt : field) {
                    aux.append(nxt.trim());
                }
            } catch (Exception e) {
                throw new IOException("Caught exception processing input row ", e);
            }

            return aux.toDelimitedString(",");
        }
    }

Created the jar --> TupleToMultipleTuple.jar

But I am getting the following error on execution:

 Pig Stack Trace
    ---------------
    ERROR 1066: Unable to open iterator for alias B

    org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias B
        at org.apache.pig.PigServer.openIterator(PigServer.java:892)
        at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:774)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:372)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
        at org.apache.pig.Main.run(Main.java:547)
        at org.apache.pig.Main.main(Main.java:158)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
    Caused by: java.io.IOException: Job terminated with anomalous status FAILED
        at org.apache.pig.PigServer.openIterator(PigServer.java:884)
        ... 13 more

Can you please help me resolve this issue? Thanks.

Pig script used to apply the UDF:

    REGISTER TupleToMultipleTuple.jar;
    DEFINE myFunc com.test.multipleTuple.TupleToMultipleTuple();
    A = load 'data.txt' USING PigStorage(' ');
    B = foreach A generate myFunc($2);
    dump B;

You can use the built-in STRSPLIT function like this:

    flatten(STRSPLIT($2, '[|]', 3)) as (cat1:chararray, cat2:chararray, cat3:chararray)

You will get three fields named cat1, cat2 and cat3, of type chararray, split out of the original field on the '|' delimiter.
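
For context, here is a minimal end-to-end sketch of that approach against the sample data above (the field names dt, name and categories are assumptions, not from the original post):

    -- PigStorage(' ') splits the three space-separated columns; field names are hypothetical
    A = load 'data.txt' USING PigStorage(' ') as (dt:chararray, name:chararray, categories:chararray);
    -- STRSPLIT yields a tuple of up to 3 parts, which flatten expands into separate fields
    B = foreach A generate flatten(STRSPLIT(categories, '[|]', 3)) as (cat1:chararray, cat2:chararray, cat3:chararray);
    dump B;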

Found the problem. It was in converting the DataByteArray to a String: because the LOAD statement declares no schema, Pig passes the field to the UDF as a DataByteArray, so the (String) cast fails. Using toString() fixes it:

    package com.test.multipleTuple;
    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    public class TupleToMultipleTuple extends EvalFunc<String> {

        @Override
        public String exec(Tuple input) throws IOException {

            // Auxiliary tuple to collect the split fields
            Tuple aux = TupleFactory.getInstance().newTuple();

            if (input == null || input.size() == 0)
                return null;
            try {
                String del = "\\|"; // escape '|' so split() treats it as a literal
                String str = input.get(0).toString(); // works for DataByteArray input

                String field[] = str.split(del);
                for (String nxt : field) {
                    aux.append(nxt.trim());
                }
            } catch (Exception e) {
                throw new IOException("Caught exception processing input row ", e);
            }

            return aux.toDelimitedString(",");
        }
    }
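
As an aside (not part of the original fix), the same ClassCastException could likely also be avoided at the script level: declaring a chararray schema in the LOAD statement makes Pig pass the field to the UDF as a String. A minimal sketch, assuming the hypothetical field names dt, name and categories:

    REGISTER TupleToMultipleTuple.jar;
    DEFINE myFunc com.test.multipleTuple.TupleToMultipleTuple();
    -- with a declared schema the UDF receives a String instead of a DataByteArray
    A = load 'data.txt' USING PigStorage(' ') as (dt:chararray, name:chararray, categories:chararray);
    B = foreach A generate myFunc(categories);
    dump B;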