Unable to dump a relation in Pig

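I have been stuck on a problem for a long time, and any help would be greatly appreciated. I have a dataset file in the /home/hadoop/pig directory. I can view the file, so it is not a permissions issue. The dataset has 4 columns separated by "::" as the delimiter. I am running Pig in local mode from the /home/hadoop/pig directory.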

ratingsData = LOAD 'ratings.dat' AS (line:chararray);

ratings = FOREACH ratingsData GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*?)::(.*?)::(.*?)::(.*?)')) AS (uid:int, mid:int, rating:int, timestamp:long);

grouped_mid = GROUP ratings BY mid;

dump grouped_mid;

The script above fails. I can successfully dump the 'ratingsData' and 'ratings' relations, but not grouped_mid. Here is the strange part: the script below runs successfully.

ratingsData = LOAD 'ratings.dat' AS (line:chararray);

ratings = FOREACH ratingsData GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*?)::(.*?)::(.*?)::(.*?)')) AS (uid:int, mid:int, rating:int, timestamp:long);

STORE ratings INTO 'ratingInfo.txt';

X = LOAD 'ratingInfo.txt' AS (uid:int, mid:int, rating:int, timestamp:long);

grouped_mid = GROUP X BY mid;

dump grouped_mid;

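Obviously, the second script has a redundant step: I am just storing a relation and loading it back again. I would like to avoid that. Any clarification or explanation would be much appreciated.

Thanks a lot.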

For reference, see: pig join with java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer

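You can modify the script as follows, adding an explicit (tuple(int, int, int, long)) cast on the result of REGEX_EXTRACT_ALL so that the extracted strings are actually converted to the declared types: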

ratingsData = LOAD 'ratings.dat' AS (line:chararray);

ratings = FOREACH ratingsData GENERATE FLATTEN((tuple(int, int, int, long))REGEX_EXTRACT_ALL(line,'(.*?)::(.*?)::(.*?)::(.*?)')) AS (uid:int, mid:int, rating:int, timestamp:long);

grouped_mid = GROUP ratings BY mid;

dump grouped_mid;

Tested.
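If the tuple cast feels opaque, another way to express the same idea is to keep the extracted fields as chararray first and cast each one in a second FOREACH. The following is only a sketch and has not been run against this dataset; the relation name `parsed` is made up for illustration, and the behavior may vary between Pig versions.

```pig
-- Load each line as a single string, as in the original script.
ratingsData = LOAD 'ratings.dat' AS (line:chararray);

-- Keep the regex groups as chararray, which is what REGEX_EXTRACT_ALL produces.
-- 'parsed' is a hypothetical intermediate relation name.
parsed = FOREACH ratingsData GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, '(.*?)::(.*?)::(.*?)::(.*?)')) AS (uid:chararray, mid:chararray, rating:chararray, timestamp:chararray);

-- Cast each field explicitly so the values really are int/long before grouping.
ratings = FOREACH parsed GENERATE (int)uid AS uid, (int)mid AS mid, (int)rating AS rating, (long)timestamp AS timestamp;

grouped_mid = GROUP ratings BY mid;

dump grouped_mid;
```

This costs one extra FOREACH but avoids the STORE and reload step from the question.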