Making a count for 'Others' outside of the top 5 results with Pig

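I have generated hundreds of groups, and I want to avoid sifting through all of them and look only at the groups with the most results. To do this, I count them, order them, and then limit the output to the top 5 results.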

counts = foreach (group distinctVals by (description)) generate group, COUNT_STAR(distinctVals) as count;
ordered = order counts by count desc;
limited = limit ordered 5;
dump limited;

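However, I would also like to count how many results did not make it into the "top 5" and lump them together into a single group, simply called Other.

So my output would look something like this: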

(John ,38436)
(Steve ,13654)
(Sarah ,9334)
(Rick ,3241)
(Morty ,784)
(Other ,3421)

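Use RANK. After ordering the data, apply RANK to the ordered relation. This adds a new column, rank_ordered, as the first column. You can then use that rank column to FILTER the dataset into two relations, say limited and other. Once you have the other relation, GROUP it with ALL and SUM the third column, i.e. $2 or the count column. Finally, UNION limited and other_sum.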

counts = foreach (group distinctVals by (description)) generate group, COUNT_STAR(distinctVals) as count;
ordered = order counts by count desc;
ordered1 = rank ordered;

limited = FILTER ordered1 BY rank_ordered <= 5;
other = FILTER ordered1 BY rank_ordered > 5;

other_grp = GROUP other ALL;
-- Sum the count column, which is $2 once RANK has prepended rank_ordered
other_sum = FOREACH other_grp GENERATE SUM(other.$2);

final = UNION limited,other_sum;
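
One thing to watch with the script above: limited carries three columns (rank_ordered, group, count) while other_sum carries only one, so the rows coming out of the UNION will not all look like the (name, count) pairs in the question. Below is a minimal sketch of one way to line the two relations up before the UNION. It assumes ordered1 is laid out as (rank_ordered, group, count), and the aliases top5, rest, rest_grp, rest_sum and result, along with the field names description and total, are illustrative names that are not part of the original script.

```pig
-- Sketch only: assumes ordered1 = rank ordered; from the script above,
-- so $0 is rank_ordered, $1 is the group key and $2 is the count.
top5 = FOREACH (FILTER ordered1 BY $0 <= 5)
       GENERATE $1 AS description, $2 AS total;
rest = FILTER ordered1 BY $0 > 5;

-- Collapse every remaining row into one labelled total.
rest_grp = GROUP rest ALL;
rest_sum = FOREACH rest_grp
           GENERATE 'Other' AS description, SUM(rest.$2) AS total;

-- Both relations now share the layout (description, total).
result = UNION top5, rest_sum;
DUMP result;
```

UNION does not preserve tuple order, so the 'Other' row is not guaranteed to come last. If there are five or fewer groups, rest is empty and no 'Other' row is produced at all.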