My Pig script is creating empty files in HDFS
I have this script:
--Insert a new column based on filename
Data = LOAD '/user/cloudera/Source_Data' using PigStorage('\t','-tagFile');
Data_Schema = FOREACH Data GENERATE
(chararray)$1 AS Date,
(chararray)$2 AS ID,
(chararray)$3 AS Interval,
(chararray)$4 AS Code,
(chararray)$5 AS S_In,
(chararray)$6 AS S_Out,
(chararray)$7 AS C_In,
(chararray)$8 AS C_Out,
(chararray)$9 AS Traffic;
--Split into different directories
SPLIT Data_Schema INTO Src1 IF (Date == '2016-06-25.txt'),
Src2 IF (Date == '2014-07-31.txt'),
Src3 IF (Date == '2016-01-01.txt');
STORE Src1 INTO '/user/cloudera/Source_DatA/2016-06-25' using PigStorage('\t');
STORE Src2 INTO '/user/cloudera/Source_Data/2014-07-31.txt' using PigStorage('\t');
STORE Src2 INTO '/user/cloudera/Source_Data/2016-01-01' using PigStorage('\t');
And here is an example of my original source data:
10000 1388530800000 39 8.600870350350515 13.86183926855984 1.7218329193014124 3.424444103320796 25.972920214509095
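When I execute it, it runs successfully, but the files in HDFS contain no data...
Note that I add a new column based on the file name. That's why I have one more column in the FOREACH statement...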
If your input files are named 2016-06-25.txt, 2014-07-31.txt and 2016-01-01.txt, then the column added by -tagFile is referenced by $0, and it contains the file name.
You have to do it like this:
Data_Schema = FOREACH Data GENERATE
(chararray)$0 AS Date,
(chararray)$1 AS ID,
...
Or simply specify the schema when loading the files and keep the rest as it is:
Data = LOAD '/user/cloudera/Source_Data' using PigStorage('\t','-tagFile') as (Date:chararray, ID:chararray, ...);
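To confirm that the first column really holds the file names before running the SPLIT, a quick check along these lines may help. It is only a sketch: the column list is taken from the aliases in the question, and the relation names By_File and Counts are made up for illustration.

```pig
-- With -tagFile, PigStorage prepends the source file name as the first field,
-- so the first name in the schema (Date here) receives the file name.
Data = LOAD '/user/cloudera/Source_Data'
       USING PigStorage('\t', '-tagFile')
       AS (Date:chararray, ID:chararray, Interval:chararray, Code:chararray,
           S_In:chararray, S_Out:chararray, C_In:chararray, C_Out:chararray,
           Traffic:chararray);

-- Count rows per file name (By_File and Counts are hypothetical names).
-- The values printed here are what the SPLIT conditions have to match.
By_File = GROUP Data BY Date;
Counts  = FOREACH By_File GENERATE group AS file_name, COUNT(Data) AS row_count;
DUMP Counts;
```

If the file names printed by DUMP differ from the literals used in the SPLIT conditions, every branch stays empty and the STORE statements write empty output, which matches the symptom described in the question.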