Pig filter and getting the original dataset

Question

I have a Pig input file like the one shown below:

1, cornflakes, Regular, Post, 10
2, cornflakes, Regular,General Mills, 12
3, cornflakes, Mixed Nuts, Post, 14
4, chocolate syrup, Regular, Hersheys, 5
5, chocolate syrup, No High Fructose, Hersheys, 8
6, chocolate syrup, Regular, Ghirardeli, 6
7, chocolate syrup, Strawberry Flavor, Ghirardeli, 7

I need to filter out the cornflakes priced below 12, and I need to use the original dataset for the next filtering step.

total = LOAD 'location_of_file' using PigStorage('\t') as (item_sl : int, item : chararray, type: chararray, manufacturer: chararray, price : int);
filter1 = FILTER total BY item == 'cornflakes' AND price < 12;

Now, after filter1, I need to use the original dataset for the next level of filtering.

Answer 1

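When you run the statement

filter1 = FILTER total BY item == 'cornflakes' AND price < 12;

it does not change the original relation, total. Instead, it creates a new relation, filter1. You now have two relations to work with, and you can refer to total anywhere in the script. For example: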

total = LOAD 'location_of_file' ...   -- total relation is created
filter1 = FILTER total BY item == 'cornflakes' AND price < 12; -- filter1 is created
...
filter2 = filter total by ... -- filter2 is created
...

/* Now count rows of original total (total is unchanged) */
grouped = group total by all;
total_row_count = foreach grouped generate COUNT(total) as cnt;
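
To make the point concrete, here is a minimal sketch of two independent filters that both read from total. The LOAD line is copied from the question; the second condition (chocolate syrup priced above 6) is only an assumed example, not something the question specifies.

```pig
-- Load once; path, delimiter and schema are taken from the question.
total = LOAD 'location_of_file' USING PigStorage('\t')
        AS (item_sl:int, item:chararray, type:chararray,
            manufacturer:chararray, price:int);

-- First filter: cornflakes priced below 12.
filter1 = FILTER total BY item == 'cornflakes' AND price < 12;

-- Second filter reads from total again, not from filter1.
-- This condition is hypothetical, chosen only for illustration.
filter2 = FILTER total BY item == 'chocolate syrup' AND price > 6;

-- Each relation can be inspected or stored independently.
DUMP filter1;
DUMP filter2;
```

Since filter2 is defined on total rather than on filter1, the first filter has no effect on which rows the second one sees.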

Answer 2

Use SPLIT

total = LOAD '/output/systemhawk/file_inventory/test34.txt' USING PigStorage(',') AS (item_sl : int, item : chararray, type: chararray, manufacturer: chararray, price : int);
SPLIT total INTO filter1 IF (item == 'cornflakes' AND price < 12),filter2 OTHERWISE;
DUMP filter2;

Answer 3

Why don't you use SPLIT?

total = LOAD 'location_of_file' using PigStorage('\t') as (item_sl : int, item : chararray, type: chararray, manufacturer: chararray, price : int);
SPLIT total INTO f1_total IF (your condition), f2_total IF (your conditions);

From then on you can use the filtered set as f1_total and the rest as f2_total. Apply the conditions according to your needs.
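
As a rough sketch of what that could look like with the placeholders filled in: the first condition comes from the question, while the second one and the lowercase alias names are assumptions made for illustration.

```pig
-- Path, delimiter and schema are taken from the question.
total = LOAD 'location_of_file' USING PigStorage('\t')
        AS (item_sl:int, item:chararray, type:chararray,
            manufacturer:chararray, price:int);

-- Each output relation has its own condition.
-- The second condition is hypothetical.
SPLIT total INTO
    f1_total IF (item == 'cornflakes' AND price < 12),
    f2_total IF (item == 'chocolate syrup' AND price > 6);

DUMP f1_total;
DUMP f2_total;
```

Keep in mind that SPLIT does not partition the input exclusively: a row that satisfies both conditions lands in both relations, and a row that satisfies neither lands in none of them. The original relation total is still available afterwards as well.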

apache-pig