How to check COUNT of filtered elements in PIG

I have the following dataset, on which I need to perform some steps based on each car's company name.

            (23,Nissan,12.43)
            (23,Nissan Car,16.43)
            (23,Honda Car,13.23)
            (23,Toyota Car,17.0)
            (24,Honda,45.0)
            (24,Toyota,12.43)
            (24,Nissan Car,12.43)


    A = LOAD 'data.txt' AS (code:int, name:chararray, rating:double);
    G = GROUP A BY (code, REGEX_EXTRACT(name,'(?i)(^.+?\b)\s*(Car)*$',1));
    DUMP G;

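I am grouping the cars by code and base company name, so that, for example, all 'Nissan' and 'Nissan Car' records fall into one group, and likewise for the other companies.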

    /* Grouped data based on code and company's first name*/ 
            ((23,Nissan),{(23,Nissan,12.43),(23,Nissan Car,16.43)})
            ((23,Honda),{(23,Honda Car,13.23)})
            ((23,Toyota),{(23,Toyota Car,17.0)})
            ((24,Nissan),{(24,Nissan Car,12.43)})
            ((24,Honda),{(24,Honda,45.0)})
            ((24,Toyota),{(24,Toyota,12.43)})

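Now I want to filter each group depending on whether it contains a tuple whose name matches the group name. If it does, take that tuple from the group and ignore the others; if no such tuple exists, take all tuples of the group.

The output should be: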

            ((23,Nissan),{(23,Nissan,12.43)})  // Since this group contains a row with group's name i.e. Nissan
            ((23,Honda),{(23,Honda Car,13.23)})
            ((23,Toyota),{(23,Toyota Car,17.0)})
            ((24,Nissan),{(24,Nissan Car,12.43)})
            ((24,Honda),{(24,Honda,45.0)})
            ((24,Toyota),{(24,Toyota,12.43)})

            R = FOREACH G { OW = FILTER A BY name==group.; IF COUNT(OW) > 0}

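Can anyone help me with how to do this? After filtering by group name, how do I find the count of the filtered tuples and get the required data?

OK. Let's take the following records as your input.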

    23,Nissan,12.43
    23,Nissan Car,16.43
    23,Honda Car,13.23
    23,Toyota Car,17.0
    24,Honda,45.0
    24,Toyota,12.43
    25,Toyato Car,23.8
    25,Toyato Car,17.2
    24,Nissan Car,12.43

For the input above, suppose the following is the intermediate output:

    ((23,Honda),{(23,Honda,Honda Car,13.23)})
    ((23,Nissan),{(23,Nissan,Nissan,12.43),(23,Nissan,Nissan Car,16.43)})
    ((23,Toyota),{(23,Toyota,Toyota Car,17.0)})
    ((24,Honda),{(24,Honda,Honda,45.0)})
    ((24,Nissan),{(24,Nissan,Nissan Car,12.43)})
    ((24,Toyota),{(24,Toyota,Toyota,12.43)})
    ((25,Toyato),{(25,Toyato,Toyato Car,23.8),(25,Toyato,Toyato Car,17.2)})

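From that intermediate output, say you are looking for the following output, per your requirement: the number of exact brand-name matches in each group, or the number of remaining tuples when there is no exact match.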

    (23,Honda,1)
    (23,Nissan,1)
    (23,Toyota,1)
    (24,Honda,1)
    (24,Nissan,1)
    (24,Toyota,1)
    (25,Toyato,2)

Here is the code:

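    -- Load the comma-separated records.
    nissan_load = LOAD '/user/cloudera/inputfiles/nissan.txt' USING PigStorage(',') AS (code:int, name:chararray, rating:double);

    -- Derive the base brand name by dropping an optional trailing "Car".
    nissan_each = FOREACH nissan_load GENERATE code, TRIM(REGEX_EXTRACT(name,'(?i)(^.+?\b)\s*(Car)*$',1)) AS brand_name, name, rating;

    nissan_grp = GROUP nissan_each BY (code, brand_name);

    nissan_final_each = FOREACH nissan_grp {
        -- 1 for every tuple whose name is exactly the brand name, 0 otherwise.
        A = FOREACH nissan_each GENERATE (brand_name == TRIM(name) ? 1 : 0) AS cnt;
        B = (int)SUM(A);

        -- 1 for every tuple whose name differs from the brand name, 0 otherwise.
        C = FOREACH nissan_each GENERATE (brand_name != TRIM(name) ? 1 : 0) AS extra_cnt;
        D = SUM(C);

        -- Report the exact-match count if there is one, else the count of the other tuples.
        GENERATE FLATTEN(group) AS (code, brand_name), (SUM(A.cnt) != 0 ? B : D) AS final_cnt;
    };

    DUMP nissan_final_each;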

Try this code with different inputs as well.
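
If you want the tuples themselves rather than counts, as in the expected output in the question, one possible approach is to filter each group's bag for exact matches and fall back to the whole bag when the filter comes back empty. The following is only a minimal, untested sketch: it assumes the `nissan_grp` relation built above, and the names `nissan_picked`, `exact` and `cars` are made up for illustration.

    nissan_picked = FOREACH nissan_grp {
        -- Tuples whose full name is exactly the brand name, e.g. 'Nissan'.
        exact = FILTER nissan_each BY brand_name == TRIM(name);
        -- Keep only the exact matches when there are any, otherwise keep the whole group.
        GENERATE group, (IsEmpty(exact) ? nissan_each : exact) AS cars;
    };

    DUMP nissan_picked;

Each tuple in `cars` still carries the extra `brand_name` column from `nissan_each`; project it away in a further FOREACH if it is not wanted.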