Apache Pig Query - Dataset Joins ERROR 1031
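I have the following four tasks to work on, but I am confused about how to join the two datasets to get any of the tasks to work...
A) Write a query that reports the names of the customers with the fewest transactions. The output should be the customer name and the number of transactions.
B) Join customers and transactions using a broadcast (replicated) join. Report for each customer: CustomerID, Name, Salary, NumOfTransactions, TotalSum, MinItems (where NumOfTransactions is the total number of transactions done by the customer, TotalSum is the sum of the field "TransTotal" for that customer, and MinItems is the minimum number of items in the transactions done by the customer).
C) Report the country codes that have more than 5,000 or fewer than 2,000 customers.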
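D) Assume we want to design an analytics task on the data as follows: the Age attribute is divided into six groups, which are [10, 20), [20, 30), [30, 40), [40, 50), [50, 60), and [60, 70]. Within each of these age ranges, the data is further divided by Gender, i.e., each of the 6 age groups is split into two groups. For each group, report: Age Range, Gender, MinTransTotal, MaxTransTotal, AvgTransTotal. Note: the bracket "[" means the lower bound of the range is included, and ")" means the upper bound is excluded.
Here is my start: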
hadoop fs -mkdir /piginput
sudo hadoop fs -put customer.txt /piginput
sudo hadoop fs -put transaction.txt /piginput
sudo hadoop fs -put transaction_small.txt /piginput
pig
customers = LOAD '/piginput/customers.txt' USING PigStorage(',') AS (id:int,name:chararray,age:int,gender:chararray,CountryCode:int,salary:float);
transactions = LOAD '/piginput/transaction.txt' USING PigStorage(',') as (trans_id:int, id:int, age:int, total:float, num_items:int, description:chararray);
alldata = JOIN customers BY id, transactions BY id;
by_clusters_terms_count = FOREACH alldata COUNT(id);
This produces the error:
Pig Stack Trace
ERROR 1031: Incompatable schema: left is "id:NULL,name:NULL,num_items:NULL", right is "customers::id:int"
Failed to parse: Pig script failed to parse:
<line 4, column 26> pig script failed to validate: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1031: Incompatable schema: left is "id:NULL,name:NULL,num_items:NULL", right is "customers::id:int"
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:196)
at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1684)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1657)
at org.apache.pig.PigServer.registerQuery(PigServer.java:600)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1069)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:501)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:228)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:203)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
at org.apache.pig.Main.run(Main.java:542)
at org.apache.pig.Main.main(Main.java:156)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
Caused by:
<line 4, column 26> pig script failed to validate: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1031: Incompatable schema: left is "id:NULL,name:NULL,num_items:NULL", right is "customers::id:int"
at org.apache.pig.parser.LogicalPlanBuilder.buildForeachOp(LogicalPlanBuilder.java:1041)
at org.apache.pig.parser.LogicalPlanGenerator.foreach_clause(LogicalPlanGenerator.java:15870)
at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1933)
at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:1102)
at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:560)
at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:421)
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:188)
... 15 more
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1031: Incompatable schema: left is "id:NULL,name:NULL,num_items:NULL", right is "customers::id:int"
at org.apache.pig.newplan.logical.relational.LogicalSchema.merge(LogicalSchema.java:760)
at org.apache.pig.newplan.logical.relational.LOGenerate.getSchema(LOGenerate.java:158)
at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:123)
at org.apache.pig.newplan.logical.relational.LOGenerate.accept(LOGenerate.java:245)
at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:114)
at org.apache.pig.parser.LogicalPlanBuilder.buildForeachOp(LogicalPlanBuilder.java:1039)
... 21 more
Any ideas? Did I join the datasets incorrectly, and is that what causes the problem?
customers = LOAD 'hdfs://hadoop-VirtualBox:8020/piginput/customer.txt' USING PigStorage(',') AS (id:int,name:chararray,age:int,gender:chararray,CountryCode:int,salary:float);
A = foreach customers generate id, name;
transactions = LOAD 'hdfs://hadoop-VirtualBox:8020/piginput/transaction_small.txt' USING PigStorage(',') as (trans_id:int, cust_id:int, total:float, num_items:int, description:chararray);
B = foreach transactions generate cust_id,num_items;
alldata = JOIN A BY id, B BY cust_id;
C = GROUP alldata by $0;
This ended up solving the problem.
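The script above stops at the GROUP step, so task A still needs a count per customer and a filter for the smallest count. One way this might be finished is sketched below. It assumes the same files and field layout as the script above, uses the plain /piginput paths from the upload commands, and the relation names grouped, counts, everything, fewest, least_active and result are made up for illustration. It groups by both id and name so the name is still available after grouping, and it relies on Pig's scalar projection (fewest.min_trans), which only works because fewest holds exactly one row.

```pig
-- Sketch only, not a tested solution; paths and schemas are taken from the script above.
customers = LOAD '/piginput/customer.txt' USING PigStorage(',')
    AS (id:int, name:chararray, age:int, gender:chararray, CountryCode:int, salary:float);
transactions = LOAD '/piginput/transaction_small.txt' USING PigStorage(',')
    AS (trans_id:int, cust_id:int, total:float, num_items:int, description:chararray);

A = FOREACH customers GENERATE id, name;
B = FOREACH transactions GENERATE cust_id, num_items;
alldata = JOIN A BY id, B BY cust_id;

-- One row per customer with the number of joined transactions.
-- After the join the fields are A::id, A::name, B::cust_id, B::num_items.
grouped = GROUP alldata BY (A::id, A::name);
counts = FOREACH grouped GENERATE
    FLATTEN(group) AS (id:int, name:chararray),
    COUNT(alldata) AS num_trans;

-- Smallest transaction count over all customers, as a single-row relation.
everything = GROUP counts ALL;
fewest = FOREACH everything GENERATE MIN(counts.num_trans) AS min_trans;

-- fewest has exactly one row, so it can be used as a scalar in the filter.
least_active = FILTER counts BY num_trans == fewest.min_trans;
result = FOREACH least_active GENERATE name, num_trans;
DUMP result;
```

Because this is an inner join, customers with no transactions at all never show up in counts. If those should be reported as having zero transactions, a LEFT OUTER join from A would be needed instead, along with a count that ignores the null transaction side.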