NOT IN clause in PIG

I am trying to write the Pig equivalent of this SQL query:

select * from A where A.ID NOT IN (select id from B)

This is what I have so far:

sourcenew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Source.txt' USING PigStorage(',') as (ID:int,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray);
destnew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Destination.txt' USING PigStorage(',') as (ID:int,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray);
c= FOREACH destnew GENERATE ID;
D=FILTER sourcenew BY NOT ID (c.ID);
 org.apache.pig.tools.pigscript.parser.ParseException: Encountered " <PATH> "D=FILTER "" at line 1, column 1.
Was expecting one of:
<EOF> 
"cat" ...
"clear" ...<EOF>

Any help resolving this error would be appreciated. I get it when executing the last line.

Use a LEFT OUTER JOIN and then filter for the rows where the right-hand side is null.

sourcenew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Source.txt' USING PigStorage(',') as (ID:int,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray);
destnew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Destination.txt' USING PigStorage(',') as (ID:int,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray);
c = FOREACH destnew GENERATE ID;
d = JOIN sourcenew BY ID LEFT OUTER, destnew BY ID;
-- Both relations have an ID field, so refer to the right-hand one as destnew::ID
e = FILTER d BY destnew::ID is null;

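Note: I wrote a sample script using a couple of test files, and the working solution is below. In your case, check whether the data is being loaded correctly from your files.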
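One way to run that check, as a rough sketch: print the schema and a few rows of the loaded relation before attempting the join. The alias src_preview is made up for illustration; the path and schema are copied from the question.

```pig
-- Load the source file exactly as in the question.
sourcenew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Source.txt' USING PigStorage(',') AS (ID:int,Name:chararray,FirstName:chararray,LastName:chararray,Vertical_Name:chararray,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray,Salary:chararray,StateName:chararray);

-- Print the schema Pig has assigned to the relation.
DESCRIBE sourcenew;

-- src_preview is a hypothetical alias: look at a handful of rows only.
src_preview = LIMIT sourcenew 5;
DUMP src_preview;
```

If ID comes back empty in every row, the delimiter or the declared type probably does not match the file, since Pig yields null for a field it cannot cast.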

test1.txt

1   abc
2   def
3   ghi
4   jkl
5   mno
6   pqr
7   stu
8   vwx
1   abc
2   def
3   ghi
4   jkl
1   abc
2   def
3   ghi
1   abc
2   def

test2.txt

1
2
3
4

Script

A = LOAD 'test1.txt' USING PigStorage('\t') AS (aid:int,name:chararray);
B = LOAD 'test2.txt' USING PigStorage('\t') AS (bid:int);
C = JOIN A BY aid LEFT OUTER,B BY bid;
D = FILTER C BY bid is null;
DUMP D;

So in the example above, records 5, 6, 7, and 8 should appear in the result, because those IDs are not in test2.txt.
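The filtered relation D still carries the bid column from B, which is null in every remaining row. To get closer to what the SQL query would return (only the columns of A), a projection could be added at the end. This is a sketch under the same test files; the alias E is made up for illustration.

```pig
-- Same sample script as above, plus a final projection.
A = LOAD 'test1.txt' USING PigStorage('\t') AS (aid:int, name:chararray);
B = LOAD 'test2.txt' USING PigStorage('\t') AS (bid:int);

-- Keep every row of A; bid is null where B has no matching id.
C = JOIN A BY aid LEFT OUTER, B BY bid;
D = FILTER C BY bid is null;

-- E is a hypothetical alias: keep only A's columns and drop the always-null bid.
E = FOREACH D GENERATE aid, name;
DUMP E;
```

Because aid, name, and bid are distinct field names here, they can be referenced without the A:: or B:: prefix. When both relations share a field name, as with ID in the question, the prefixed form is needed.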