Issue loading data from MovieLens into Pig

I am trying to load some data into Pig:

Records:

11::American President, The (1995)::Comedy|Drama|Romance

12::Dracula: Dead and Loving It (1995)::Comedy|Horror

Script used:

loadMoviesDs = LOAD '/Users/Prateek/Downloads/ml-10M100K/movies.dat' 
               USING PigStorage(':') 
               AS (Movieid:long, dummy1, Title:chararray, dummy2, Genere:chararray);

Output:

 11,,American President, The (1995),,Comedy|Drama|Romance
 12,,Dracula,, Dead and Loving It (1995)

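How do I handle the colon (:) after "Dracula"?

Because of that colon, the second column (the title) is split into two, and since we only have 3 columns in total, the last column (Comedy|Horror) for Movieid 12 is not loaded.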
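The field counts can be checked outside Pig. Below is a minimal Python sketch, assuming that `PigStorage(':')` behaves like a plain split on every single colon; `split_fields` and `records` are hypothetical names used only for this illustration.

```python
# The two sample records from movies.dat shown above.
records = [
    "11::American President, The (1995)::Comedy|Drama|Romance",
    "12::Dracula: Dead and Loving It (1995)::Comedy|Horror",
]


def split_fields(record):
    """Split on every single ':' (assumed to mimic PigStorage(':')).

    Each '::' separator therefore produces an empty field between
    the real fields.
    """
    return record.split(":")


for record in records:
    fields = split_fields(record)
    print(len(fields), fields)
```

Under that assumption the first record yields 5 fields, which matches the 5-field schema in the script, while the second yields 6 because of the extra colon in the title, so the genre lands in a position the schema does not cover.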

You can do this with REGEX_EXTRACT_ALL.

Here is a snippet that does it:

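-- Load each line as a single chararray so nothing is split on ':' yet.
A = LOAD '/Users/Prateek/Downloads/ml-10M100K/movies.dat'
               AS (f1:chararray);
-- Capture the three '::'-separated parts; a lone ':' inside the title is kept.
B = FOREACH A GENERATE REGEX_EXTRACT_ALL(f1, '(.*)::(.*)::(.*)');
C = FOREACH B GENERATE FLATTEN($0);
D = FOREACH C GENERATE $0 AS (MovieID:long), $1 AS (Title:chararray), $2 AS (Genre:chararray);
DUMP D;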

I got the following output (each record is a tuple). The ":" after "Dracula" is left intact:

(11,American President, The (1995),Comedy|Drama|Romance)
(12,Dracula: Dead and Loving It (1995),Comedy|Horror)
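If you want to sanity-check the pattern before running the Pig job, the same regular expression can be tried in Python. This is only a sketch: `parse_movie` is a hypothetical helper, and `fullmatch` is used on the assumption that the whole line has to match the pattern, as it does in the script above.

```python
import re

# Same pattern as in the Pig script: three groups separated by '::'.
PATTERN = re.compile(r"(.*)::(.*)::(.*)")


def parse_movie(line):
    """Return (movie_id, title, genres), or None if the line does not match."""
    match = PATTERN.fullmatch(line)
    if match is None:
        return None
    movie_id, title, genres = match.groups()
    return int(movie_id), title, genres


sample = "12::Dracula: Dead and Loving It (1995)::Comedy|Horror"
print(parse_movie(sample))
```

The groups are anchored on the two-character `::` separator, so a single colon inside the title stays part of the second group.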