How to ignore "," in data fields
I am trying to generate the following...
Input
396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift, but at the same it is a curse",Obey_Jony09
396124440112951296,"00:00 #MAW",WesleyBitton
A = LOAD '/user/root/data/tweets.csv' USING PigStorage(',') as (users:chararray, tweets:chararray);
B = FILTER A by users == '396124436476092416';
Truncated output
(396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift)
Expected output
(396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift, but at the same it is a curse")
I don't want to read each row as a single line.
You can load the data with CSVLoader.
However, if you don't want to do that, here is a workaround in Apache Pig itself:
-- Load your data
A = LOAD 'your/path/users.csv' USING TextLoader() AS (unparsed:chararray);
-- Replace the " characters with | so that your tweets are delimited
B = FOREACH A GENERATE REPLACE(unparsed, '\"', '|') AS parsed:chararray;
-- Store your temporarily parsed data at your location
STORE B INTO 'your/path/parsed_users.csv' USING PigStorage('|');
-- Load your parsed data
C = LOAD 'your/path/parsed_users.csv' USING PigStorage('|') AS (users:chararray, tweets:chararray);
-- Dump your data; it will still contain an extra comma (,), but you can remove it with the REPLACE function, you get the idea.
DUMP C;
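To finish the workaround, the leftover comma can be stripped with the built-in REPLACE function. This is only a sketch: it assumes the parsed file stored above, that the split on | leaves the first field as the id followed by a trailing comma, and that no tweet contains a | character. The relation names D and E and the field name users_clean are illustrative, not part of the original answer.

```pig
-- Reload the intermediate file written by the workaround above
C = LOAD 'your/path/parsed_users.csv' USING PigStorage('|') AS (users:chararray, tweets:chararray);
-- Assumption: users looks like '396124436476092416,' after the split on |,
-- so removing every comma from that field leaves just the id
D = FOREACH C GENERATE REPLACE(users, ',', '') AS users_clean:chararray, tweets;
-- The filter from the question now matches on the cleaned id
E = FILTER D BY users_clean == '396124436476092416';
DUMP E;
```

REPLACE is applied only to the id field, so the comma inside the tweet text is left untouched.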
This data follows the standard CSV format, so you just need to use CSVLoader, which supports double-quoted fields that contain commas and other double-quotes escaped with backslashes.
Here is how to use it:
register file:/home/hadoop/lib/pig/piggybank.jar
DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
A = LOAD '/user/root/data/tweets.csv' USING CSVLoader AS (users:chararray, tweets:chararray);
B = FILTER A by users == '396124436476092416';
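The sample input has three columns, while the schema above names only two. A variant that names all three might look like the sketch below. The field names id, tweet and handle are assumptions chosen for illustration; the jar path and the CSVLoader definition are the ones shown above.

```pig
REGISTER file:/home/hadoop/lib/pig/piggybank.jar;
DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
-- Hypothetical field names for the three columns in the sample input
A = LOAD '/user/root/data/tweets.csv' USING CSVLoader AS (id:chararray, tweet:chararray, handle:chararray);
B = FILTER A BY id == '396124436476092416';
-- Keep only the id and the tweet text, as in the expected output
C = FOREACH B GENERATE id, tweet;
DUMP C;
```

Naming the third column explicitly makes it easy to keep or drop the handle in the FOREACH.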