Apache Pig outputs null values when loading with the int datatype
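I am using pig-0.16.0.
I am trying to join two tab-delimited files (.tsv) with a Pig script. Some of the columns hold integers, so I tried to load them as int. However, every column I declare as int is loaded without data and shows up empty. My join produced no output, so I took a step back and found that the problem already occurs at the load step. Here is part of my Pig script: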
REGISTER /usr/local/pig/lib/piggybank.jar;
-- $file1 = streaminputs/forum_node.tsv
-- $file2 = streaminputs/forum_users.tsv
u_f_n = LOAD '$file1' USING PigStorage('\t') AS (id: long, title: chararray, tagnames: chararray, author_id: long, body: chararray, node_type: chararray, parent_id: long, abs_parent_id: long, added_at: chararray, score: int, state_string: chararray, last_edited_id: long, last_activity_by_id: long, last_activity_at: chararray, active_revision_id: int, extra:chararray, extra_ref_id: int, extra_count:int, marked: chararray);
LUFN = LIMIT u_f_n 10;
STORE LUFN INTO 'pigout/LN';
u_f_u = LOAD '$file2' USING PigStorage('\t') AS (author_id: long, reputation: chararray, gold: chararray, silver: chararray, bronze: chararray);
LUFUU = LIMIT u_f_u 10;
STORE LUFUU INTO 'pigout/LU';
I also tried long, but the problem is the same; only chararray seems to work here. What could be the problem?
Snippets from the two input .tsv files:
forum_nodes.tsv:
"id" "title" "tagnames" "author_id" "body" "node_type" "parent_id" "abs_parent_id" "added_at" "score" "state_string" "last_edited_id" "last_activity_by_id" "last_activity_at" "active_revision_id" "extra" "extra_ref_id" "extra_count" "marked"
"5339" "Whether pdf of Unit and Homework is available?" "cs101 pdf" "100000458" "" "question" "\N" "\N" "2012-02-25 08:09:06.787181+00" "1" "" "\N" "100000921" "2012-02-25 08:11:01.623548+00" "6922" "\N" "\N" "204" "f"
forum_users.tsv:
"user_ptr_id" "reputation" "gold" "silver" "bronze"
"100006402" "18" "0" "0" "0"
"100022094" "6354" "4" "12" "50"
"100018705" "76" "0" "3" "4"
"100021176" "213" "0" "1" "5"
"100045508" "505" "0" "1" "5"
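You need to strip the quotes so that Pig can recognize the value as an int; otherwise the field comes out blank. PigStorage does not treat double quotes specially, so the field is read as the text "1" with the quotes included, and casting that to int fails and yields null.
You should use CSVLoader or CSVExcelStorage; see my tests:
Sample file: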
"1","test"
Test 1 - Using CSVLoader:
If you want to ignore the quotes, you can use CSVLoader or CSVExcelStorage.
Pig commands:
register '/usr/lib/pig/piggybank.jar' ;
define CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
file1 = load 'file1.txt' using CSVLoader(',') as (f1:int, f2:chararray);
Output:
(1,test)
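Since the files in the question are tab-separated rather than comma-separated, the same idea can probably be applied with CSVExcelStorage, which accepts a delimiter argument. The following is only a minimal, untested sketch. It assumes the piggybank path and the `$file2` parameter from the question, assumes that your piggybank build supports the `SKIP_INPUT_HEADER` option, and declares the reputation and badge columns as int, which is a guess based on the sample data (the question loads them as chararray).

```pig
REGISTER /usr/local/pig/lib/piggybank.jar;

-- CSVExcelStorage removes the enclosing double quotes from each field.
-- Arguments: delimiter, multiline treatment, end-of-line treatment, header treatment.
u_f_u = LOAD '$file2'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage('\t', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
    AS (author_id: long, reputation: int, gold: int, silver: int, bronze: int);

-- Inspect a few rows to confirm the numeric columns are no longer empty.
LUFUU = LIMIT u_f_u 10;
DUMP LUFUU;
```

Values such as `\N` in forum_nodes.tsv are not numbers, so they would still come out as null in a long or int column; that is likely the behavior you want for missing values.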
Test 2 - Replacing the double quotes:
Pig commands:
file1 = load 'file1.txt' using PigStorage(',');
data = foreach file1 generate REPLACE($0,'\"','') as (f1:int), $1 as (f2:chararray);
Output:
(1,"test")
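Applied to the tab-separated forum_users.tsv from the question, the quote-stripping approach might look like the sketch below. It is untested and rests on a few assumptions: the `$file2` parameter from the question, the relation names `raw` and `typed` (made up for illustration), and the int types for the reputation and badge columns. Every column is loaded as chararray first, the quotes are removed with the built-in REPLACE, and the result is cast explicitly.

```pig
-- Load every column as chararray so the raw text, quotes included, is preserved.
raw = LOAD '$file2' USING PigStorage('\t')
    AS (author_id: chararray, reputation: chararray, gold: chararray,
        silver: chararray, bronze: chararray);

-- Strip the double quotes from each field, then cast to the numeric type.
typed = FOREACH raw GENERATE
    (long) REPLACE(author_id, '"', '')  AS author_id,
    (int)  REPLACE(reputation, '"', '') AS reputation,
    (int)  REPLACE(gold, '"', '')       AS gold,
    (int)  REPLACE(silver, '"', '')     AS silver,
    (int)  REPLACE(bronze, '"', '')     AS bronze;

-- The header row cannot be cast to long, so its author_id is null; drop it.
u_f_u = FILTER typed BY author_id IS NOT NULL;
```

This avoids the piggybank dependency, at the cost of listing every column twice.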
Test 3 - Using the data as is:
Pig commands:
file1 = load 'file1.txt' using PigStorage(',') as (f1:int, f2:chararray);
Output:
(,"test")