How to convert a bag into multiple bags using Apache Pig?
I have a file containing two sets of data, as shown below:
1,abc,10,dss
2,efgh,as
1,abc,10,1234
2,efgh,as
1,abc,10,7899
2,efgh,as
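Records starting with 1 form one set, and records starting with 2 form a different set, so the two have different structures. How can I separate these two sets of records?
Here is one approach: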
A = LOAD '/user/data/split.txt' as line:chararray;
B = FOREACH A GENERATE FLATTEN(TOKENIZE(line,' '));
B1 = filter B by $0 matches '1.*';
B2 = filter B by $0 matches '2.*';
DUMP B1;
DUMP B2;
or
SPLIT B INTO B1 IF ($0 matches '1.*'), B2 IF ($0 matches '2.*');
With the updated version of the input, here is another solution:
A = LOAD '/user/data/split.txt' as line:chararray;
B1 = filter A by $0 matches '1.*';
B2 = filter A by $0 matches '2.*';
or
SPLIT A INTO B1 IF ($0 matches '1.*'), B2 IF ($0 matches '2.*');
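Since the two sets have different structures, a possible next step is to give each relation its own schema once the lines are separated. The following is a minimal sketch, assuming the comma-separated sample shown in the question; the field names (id, name, qty, code, flag) are hypothetical and do not come from the original post.

```pig
-- Load each record as one whole line; the sample contains no tabs
-- (the default PigStorage delimiter), so nothing is split at this point.
A = LOAD '/user/data/split.txt' AS (line:chararray);

-- Route each line by its leading character.
SPLIT A INTO B1 IF (line MATCHES '1.*'), B2 IF (line MATCHES '2.*');

-- Break each line on commas and name the fields.
-- The field names are illustrative assumptions, not part of the source data.
C1 = FOREACH B1 GENERATE FLATTEN(STRSPLIT(line, ',', 4))
     AS (id:chararray, name:chararray, qty:chararray, code:chararray);
C2 = FOREACH B2 GENERATE FLATTEN(STRSPLIT(line, ',', 3))
     AS (id:chararray, name:chararray, flag:chararray);

DUMP C1;
DUMP C2;
```

Note that the pattern '1.*' also matches lines whose first field is 10, 11, and so on; if that matters for your data, a stricter pattern such as '1,.*' may be safer.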