Pig: using SPLIT on a GROUP BY with HAVING COUNT(*)
The input file is:
2, cornflakes, Regular,General Mills, 12
3, cornflakes, Mixed Nuts, Post, 14
4, chocolate syrup, Regular, Hersheys, 5
5, chocolate syrup, No High Fructose, Hersheys, 8
6, chocolate syrup, Regular, Ghirardeli, 6
7, chocolate syrup, Strawberry Flavor, Ghirardeli, 7
filter3 = LOAD 'location_of_file' using PigStorage('\t') as (item_sl : int, item : chararray, type: chararray, manufacturer: chararray, price : int);
SPLIT filter3 INTO filter4 IF (FOREACH (filter3 GROUP BY item) GENERATE group, COUNT(item < 3)), filter6_pass OTHERWISE;
This is like using GROUP BY on item with HAVING COUNT(*) < 3 in SQL.
The expected output is:
4, chocolate syrup, Regular, Hersheys, 5
5, chocolate syrup, No High Fructose, Hersheys, 8
6, chocolate syrup, Regular, Ghirardeli, 6
7, chocolate syrup, Strawberry Flavor, Ghirardeli, 7
Group by item, get the count, then apply a filter on the count:
A = LOAD 'location_of_file' using PigStorage('\t') as (item_sl : int, item : chararray, type: chararray, manufacturer: chararray, price : int);
B = GROUP A BY item;
C = FOREACH B GENERATE group,COUNT(A.item) AS Total;
D = FILTER C BY Total > 3;
E = JOIN A BY item, D BY $0;
F = FOREACH E GENERATE $0..$4;
DUMP F;
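The answer above keeps only the groups that pass the filter. If both branches of the question's SPLIT are needed, one possible approach is to attach the group size to every row and split on that field. The following is a minimal sketch, not a tested solution: `counted`, `total`, and `result` are illustrative names that do not appear in the original post, the `'\t'` delimiter is copied from the post and may need to match the actual file, and the `OTHERWISE` branch assumes a Pig version that supports it (0.10 or later, as far as I know).

```pig
-- Sketch only: 'counted', 'total' and 'result' are illustrative names, not from the original post.
-- The delimiter is copied from the post; adjust it to match the actual file.
A = LOAD 'location_of_file' USING PigStorage('\t')
    AS (item_sl:int, item:chararray, type:chararray, manufacturer:chararray, price:int);

-- One bag of rows per item.
B = GROUP A BY item;

-- Re-emit every original row with the size of its group attached as a sixth field.
counted = FOREACH B GENERATE FLATTEN(A), COUNT(A) AS total;

-- Rows whose item has fewer than 3 entries go to filter4, all others to filter6_pass.
SPLIT counted INTO filter4 IF total < 3, filter6_pass OTHERWISE;

-- Drop the helper count to get back the five original columns.
result = FOREACH filter6_pass GENERATE $0 .. $4;
DUMP result;
```

With the sample data, cornflakes has two rows and chocolate syrup has four, so the chocolate syrup rows would land in `filter6_pass` and the cornflakes rows in `filter4`. This variant avoids the JOIN because `FLATTEN(A)` already carries the original rows alongside their group count.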