Pig FILTER and getting the original dataset
I have a Pig input file that looks like this:
1, cornflakes, Regular, Post, 10
2, cornflakes, Regular,General Mills, 12
3, cornflakes, Mixed Nuts, Post, 14
4, chocolate syrup, Regular, Hersheys, 5
5, chocolate syrup, No High Fructose, Hersheys, 8
6, chocolate syrup, Regular, Ghirardeli, 6
7, chocolate syrup, Strawberry Flavor, Ghirardeli, 7
I need to filter out the cornflakes priced below 12, and then I need to use the original dataset for the next filter.
total = LOAD 'location_of_file' using PigStorage('\t') as (item_sl : int, item : chararray, type: chararray, manufacturer: chararray, price : int);
filter1 = FILTER total BY item == 'cornflakes' AND price < 12;
Now, after filter1, I need to use the original dataset for the next filtering step.
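Running the statement
filter1 = FILTER total BY item == 'cornflakes' AND price < 12;
does not change the original relation, total. Instead, it creates a new relation, filter1. You now have two relations to work with, and you can refer to total anywhere later in the script. For example: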
total = LOAD 'location_of_file' ... -- total relation is created
filter1 = FILTER total BY item == 'cornflakes' AND price < 12; -- filter1 is created
...
filter2 = filter total by ... -- filter2 is created
...
/* Now count rows of original total (total is unchanged) */
grouped = group total by all;
total_row_count = foreach grouped generate COUNT(total) as cnt;
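To make the elided parts concrete, here is a minimal sketch of two independent filters that both read from total. It assumes the file is comma-delimited with no padding around the values (the sample rows suggest commas rather than tabs); the second condition is hypothetical and chosen only for illustration.

```pig
-- Assumption: comma-delimited input with no stray whitespace around values.
total = LOAD 'location_of_file' USING PigStorage(',')
        AS (item_sl:int, item:chararray, type:chararray,
            manufacturer:chararray, price:int);

-- First filter: cornflakes priced below 12.
filter1 = FILTER total BY item == 'cornflakes' AND price < 12;

-- Second filter reads from total again, not from filter1.
-- Hypothetical condition, not taken from the question.
filter2 = FILTER total BY item == 'chocolate syrup' AND price > 5;

DUMP filter1;
DUMP filter2;
```

Because each FILTER names total as its input, neither result depends on the other.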
Use SPLIT:
total = LOAD '/output/systemhawk/file_inventory/test34.txt' USING PigStorage(',') AS (item_sl : int, item : chararray, type: chararray, manufacturer: chararray, price : int);
SPLIT total INTO filter1 IF (item == 'cornflakes' AND price < 12),filter2 OTHERWISE;
DUMP filter2;
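As far as I know, the OTHERWISE branch needs Pig 0.10 or later. If it is not available, roughly the same result can be sketched with two FILTER statements, one of them negated. This is an approximation rather than a guaranteed equivalent: a row whose condition evaluates to null (for example, a null item) is dropped by both filters, and OTHERWISE may treat such rows differently.

```pig
total = LOAD '/output/systemhawk/file_inventory/test34.txt' USING PigStorage(',')
        AS (item_sl:int, item:chararray, type:chararray,
            manufacturer:chararray, price:int);

-- Rows matching the condition.
filter1 = FILTER total BY (item == 'cornflakes' AND price < 12);

-- Rows not matching it; rows where the condition is null land in neither relation.
filter2 = FILTER total BY NOT (item == 'cornflakes' AND price < 12);

DUMP filter2;
```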
Why don't you use SPLIT?
total = LOAD 'location_of_file' using PigStorage('\t') as (item_sl : int, item : chararray, type: chararray, manufacturer: chararray, price : int);
SPLIT total INTO f1_total IF (your condition), f2_total IF (your conditions);
After that, f1_total holds the filtered rows and f2_total holds the rest. Set the conditions to suit your needs.
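A minimal sketch with the placeholders filled in, assuming comma-delimited input without padding; both conditions are hypothetical stand-ins for "your condition". Note that SPLIT does not partition by itself: a row can match more than one condition, or none, so the two conditions have to be written as complements if f2_total is meant to be "everything else".

```pig
total = LOAD 'location_of_file' USING PigStorage(',')
        AS (item_sl:int, item:chararray, type:chararray,
            manufacturer:chararray, price:int);

-- Hypothetical conditions, for illustration only.
SPLIT total INTO f1_total IF (item == 'cornflakes' AND price < 12),
                 f2_total IF (item == 'chocolate syrup');

-- total is still available after the SPLIT.
grouped = GROUP total ALL;
total_row_count = FOREACH grouped GENERATE COUNT(total) AS cnt;

DUMP f1_total;
DUMP f2_total;
DUMP total_row_count;
```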