Apache Pig - ORDER BY error java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
I am trying to do an ORDER BY in Apache Pig. It fails with the error java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
Here is part of the code:
movies = LOAD 'ml-1m/movies.dat' Using PigStorage('\t') as (movieblob:chararray);
A = FOREACH movies GENERATE STRSPLIT (movieblob,'::',3) as movie1:(id:int,title:chararray,genre:chararray);
B = FOREACH A GENERATE movie1.id, movie1.title, TOKENIZE (movie1.genre,'|');
C = FOREACH B GENERATE id, title, FLATTEN ($2) as genre;
C_FLAT = FOREACH C GENERATE FLATTEN ($0) as id:int, FLATTEN ($1) as title:chararray, FLATTEN ($2) as genre:chararray;
\de C_FLAT
C_FLAT: {id: int,title: chararray,genre: chararray}
C_FLAT_ORDER = ORDER C_FLAT BY id;
\d C_FLAT_ORDER
2017-01-20 16:17:53,727 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1066: Unable to open iterator for alias C_FLAT_ORDER. Backend error :
java.lang.ClassCastException: java.lang.String cannot be cast to
java.lang.Integer
A workaround that I tried, and that works, is:
store C_FLAT INTO 'ml-1m/movies/output' Using PigStorage('|');
m1 = LOAD 'ml-1m/movies/output' Using PigStorage('|') as
(id:int,title:chararray,genre:chararray);
order_m1 = ORDER m1 BY id;
grunt> \de m1
m1: {id: int,title: chararray,genre: chararray}
grunt> \d order_m1
A few lines of the output:
(3945,Digimon: The Movie (2000),Children's)
(3946,Get Carter (2000),Action)
(3946,Get Carter (2000),Drama)
(3946,Get Carter (2000),Thriller)
(3947,Get Carter (1971),Thriller)
(3948,Meet the Parents (2000),Comedy)
(3949,Requiem for a Dream (2000),Drama)
I also tried
C1 = FOREACH C GENERATE (int)id, title,genre;
and then ordering by id, but that failed too. Any help would be appreciated.
Some of the data from the input file 'ml-1m/movies.dat':
3945::Digimon: The Movie (2000)::Adventure|Animation|Children's
3946::Get Carter (2000)::Action|Drama|Thriller
3947::Get Carter (1971)::Thriller
3948::Meet the Parents (2000)::Comedy
3949::Requiem for a Dream (2000)::Drama
3950::Tigerland (2000)::Drama
3951::Two Family House (2000)::Drama
3952::Contender, The (2000)::Drama|Thriller
If you are just trying to order the data, this is all you need:
movies = LOAD '/tmp/ml-1m/movies.dat' Using PigStorage('\t') as (movieblob:chararray);
A = FOREACH movies GENERATE FLATTEN(STRSPLIT(movieblob, '::', 3)) AS (id:int,title:chararray,genres:chararray);
B = ORDER A BY id;
I think your problem comes from trying to turn this
(3946,Get Carter (2000),Action|Drama|Thriller)
into this
(3946,Get Carter (2000),Action)
(3946,Get Carter (2000),Drama)
(3946,Get Carter (2000),Thriller)
In that case, refer to:
- Splitting a tuple into multiple tuples in Pig
- In Apache Pig how can I serialise columns into rows?
- Pig Latin split columns to rows
inpt = LOAD '/tmp/ml-1m/movies.dat' Using PigStorage('\t') as (line:chararray);
splt = FOREACH inpt GENERATE FLATTEN(STRSPLIT($0, '::', 3));
movies = FOREACH splt GENERATE (int) $0 as id, (chararray) $1 as title, FLATTEN(STRSPLITTOBAG($2, '\|', -1)) AS genre;
ordered_movies = ORDER movies BY id;
\d ordered_movies
which produces
(3945,Digimon: The Movie (2000),Adventure)
(3945,Digimon: The Movie (2000),Animation)
(3945,Digimon: The Movie (2000),Children's)
(3946,Get Carter (2000),Action)
(3946,Get Carter (2000),Drama)
(3946,Get Carter (2000),Thriller)
(3947,Get Carter (1971),Thriller)
(3948,Meet the Parents (2000),Comedy)
(3949,Requiem for a Dream (2000),Drama)
(3950,Tigerland (2000),Drama)
(3951,Two Family House (2000),Drama)
(3952,Contender, The (2000),Drama)
(3952,Contender, The (2000),Thriller)
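To check the intended result independently of Pig, the same steps (split each line on '::', cast the id to an integer, emit one row per '|'-separated genre, sort by id) can be sketched in plain Python. This is only an illustration of the transformation, not Pig or Spark code; the function name `explode_movies` and the two-line sample are assumptions made for this example.

```python
def explode_movies(lines):
    """Turn 'id::title::genres' lines into (id, title, genre) rows sorted by id.

    Hypothetical helper for illustration; it mirrors the Pig steps above.
    """
    rows = []
    for line in lines:
        # At most three fields, like STRSPLIT(line, '::', 3) in the Pig script.
        movie_id, title, genres = line.split("::", 2)
        for genre in genres.split("|"):
            # int() is the explicit cast; without it the id would stay a string.
            rows.append((int(movie_id), title, genre))
    # sorted() is stable, so genres keep their input order within each id.
    return sorted(rows, key=lambda row: row[0])


sample = [
    "3946::Get Carter (2000)::Action|Drama|Thriller",
    "3945::Digimon: The Movie (2000)::Adventure|Animation|Children's",
]
for row in explode_movies(sample):
    print(row)
```

Because the id is converted to an integer before sorting, the comparison is numeric rather than lexicographic, which is the behaviour ORDER BY on an int field is expected to give.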
I still feel a Spark DataFrame would be more useful.