Issue loading data from MovieLens into Pig
I am trying to load some data into Pig:
Records:
11::American President, The (1995)::Comedy|Drama|Romance
12::Dracula: Dead and Loving It (1995)::Comedy|Horror
Script used:
loadMoviesDs = LOAD '/Users/Prateek/Downloads/ml-10M100K/movies.dat'
USING PigStorage(':')
AS (Movieid:long, dummy1, Title:chararray, dummy2, Genere:chararray);
Output:
11,,American President, The (1995),,Comedy|Drama|Romance
12,,Dracula,, Dead and Loving It (1995)
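How do I handle the colon (:) after "Dracula"?
Because of that colon, the title column is split in two. The schema only has room for three real columns, so the last column (Comedy|Horror) for movie ID 12 is not loaded.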
You can do this with REGEX_EXTRACT_ALL.
Here is a snippet that does it:
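-- The lines contain no tabs, so the default loader reads each line into one field.
A = LOAD '/Users/Prateek/Downloads/ml-10M100K/movies.dat'
AS (f1:chararray);
-- Split only on the double colon "::"; a single ":" inside a title is left alone.
B = FOREACH A GENERATE REGEX_EXTRACT_ALL(f1, '(.*)::(.*)::(.*)');
-- REGEX_EXTRACT_ALL returns one tuple; flatten it into three fields.
C = FOREACH B GENERATE FLATTEN($0);
D = FOREACH C GENERATE $0 AS (MovieID:long), $1 AS (Title:chararray), $2 AS (Genre:chararray);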
DUMP D;
I got the following output (each record is a tuple). The ":" after "Dracula" is left intact:
(11,American President, The (1995),Comedy|Drama|Romance)
(12,Dracula: Dead and Loving It (1995),Comedy|Horror)
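If you want to sanity-check the pattern outside of Pig, here is a minimal sketch in Python. It assumes Python's re module treats this particular pattern the same way as the Java regex engine that Pig uses, which should hold for a pattern this simple. The function names split_on_single_colon and parse_record are made up for illustration and are not part of Pig.

```python
import re

# The two sample records from the question.
RECORDS = [
    "11::American President, The (1995)::Comedy|Drama|Romance",
    "12::Dracula: Dead and Loving It (1995)::Comedy|Horror",
]

# Same pattern as in the Pig snippet: three groups separated by "::".
PATTERN = re.compile(r"(.*)::(.*)::(.*)")


def split_on_single_colon(line):
    """Mimic splitting on ':' alone: every colon becomes a field boundary."""
    return line.split(":")


def parse_record(line):
    """Return (movie_id, title, genre), or None if the line does not match."""
    match = PATTERN.fullmatch(line)
    if match is None:
        return None
    movie_id, title, genre = match.groups()
    return int(movie_id), title, genre


if __name__ == "__main__":
    for line in RECORDS:
        # Number of fields from the naive split, then the regex-based parse.
        print(len(split_on_single_colon(line)), parse_record(line))
```

Splitting on a single ":" gives five fields for record 11 but six for record 12, which is why the five-column schema drops the genre. The regex only treats "::" as a separator, so the title keeps its colon.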