Pig: Count only specific rows

Question

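I have data with the fields location, sentiment, and brand. I want to count the positive, negative, and neutral rows for each brand at each location.

Assuming x holds the data, I did: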

a1 = GROUP x BY (location, brand);
a2 = FOREACH a1 GENERATE FLATTEN(group) AS (location, brand), COUNT(x.sentiment=="positive"?1:0) AS positive_count, COUNT(x.sentiment=="negative"?1:0) AS negative_count, COUNT(x.sentiment=="neutral"?1:0) as neutral_count;

But I get a syntax error: Unexpected character '"'

I also tried grouping by all three fields (location, sentiment, and brand), but that only gives me one count per row, like:

{location: "newyork", brand: "pampers", sentiment = "positive", count = 10}
{location: "newyork", brand: "pampers", sentiment = "negative", count = 2}
{location: "newyork", brand: "pampers", sentiment = "neutral", count = 20}

I want separate fields for positive_count, negative_count, and neutral_count in a single row, like this:

{location: "newyork", brand: "pampers", positive_count = 10, negative_count = 2, neutral_count = 20}
{location: "london", brand: "pampers", positive_count = 12, negative_count = 0, neutral_count = 35}
{location: "newyork", brand: "huggies", positive_count = 40, negative_count = 6, neutral_count = 10}

Can anyone help?

Answer 1

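Use single quotes. Pig Latin string literals must be single-quoted, which is why the double quotes trigger the syntax error.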

a1 = GROUP x BY (location, brand);
a2 = FOREACH a1 GENERATE FLATTEN(group) AS (location, brand), 
                    COUNT(x.sentiment=='positive'?1:0) AS positive_count, 
                    COUNT(x.sentiment=='negative'?1:0) AS negative_count, 
                    COUNT(x.sentiment=='neutral'?1:0) as neutral_count;

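Edit

Fixing the quotes is not enough on its own: COUNT counts every non-null value, so a 1 and a 0 are counted alike. Filtering inside a nested FOREACH gives the per-sentiment counts instead.

Sample input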

newyork pampers positive
newyork pampers positive
newyork pampers negative
newyork pampers positive
newyork pampers positive
newyork pampers neutral
newyork pampers positive
newyork pampers negative
newyork pampers neutral
newyork pampers positive
newyork pampers positive
newyork pampers neutral

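Script

-- A holds the sample input with fields (location, brand, sentiment)
B = GROUP A BY (location, brand);
C = FOREACH B {
        -- one filtered bag per sentiment value within each group
        A1 = FILTER A BY sentiment == 'positive';
        A2 = FILTER A BY sentiment == 'negative';
        A3 = FILTER A BY sentiment == 'neutral';
        GENERATE FLATTEN(group) AS (location, brand),
                 COUNT(A1) AS positive_count,
                 COUNT(A2) AS negative_count,
                 COUNT(A3) AS neutral_count;
    };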

Output

(newyork,pampers,7,2,3)
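The conditional idea from the question can also be made to work by summing 0/1 flags rather than counting them. The following is only a sketch under assumptions: the file name 'sentiments.tsv', the tab delimiter, and the aliases flags, grouped, and counts are hypothetical and not taken from the thread.

```pig
-- Assumption: 'sentiments.tsv' is a hypothetical tab-separated file
-- with one (location, brand, sentiment) record per line.
x = LOAD 'sentiments.tsv' USING PigStorage('\t')
        AS (location:chararray, brand:chararray, sentiment:chararray);

-- Turn each row into three 0/1 flags, one per sentiment value.
flags = FOREACH x GENERATE
            location,
            brand,
            (sentiment == 'positive' ? 1 : 0) AS is_positive,
            (sentiment == 'negative' ? 1 : 0) AS is_negative,
            (sentiment == 'neutral'  ? 1 : 0) AS is_neutral;

grouped = GROUP flags BY (location, brand);

-- SUM adds the flags, so only matching rows contribute to each total.
counts = FOREACH grouped GENERATE
            FLATTEN(group) AS (location, brand),
            SUM(flags.is_positive) AS positive_count,
            SUM(flags.is_negative) AS negative_count,
            SUM(flags.is_neutral)  AS neutral_count;

DUMP counts;
```

If the file contains the twelve sample rows shown above, this should print (newyork,pampers,7,2,3), the same result as the nested FILTER script. Because every row carries a flag for all three sentiments, a group with no negative rows gets 0 rather than being left out.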

Answer 2

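I filtered the alias holding the raw data once per sentiment, counted the rows in each group, and then joined the three results together.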

p = FILTER y BY (sentiment == 'positive');
p1 = GROUP p BY (location, brand, avl_author_type);
p2 = FOREACH p1 GENERATE FLATTEN(group) AS (location, brand, avl_author_type), COUNT(p) AS positive_counts;

n = FILTER y BY (sentiment == 'negative');
n1 = GROUP n BY (location, brand, avl_author_type);
n2 = FOREACH n1 GENERATE FLATTEN(group) AS (location, brand, avl_author_type), COUNT(n) AS negative_counts;

ne = FILTER y BY (sentiment == 'neutral');
ne1 = GROUP ne BY (location, brand, avl_author_type);
ne2 = FOREACH ne1 GENERATE FLATTEN(group) AS (location, brand, avl_author_type), COUNT(ne) AS neutral_counts;

j1 = JOIN p2 BY (location, brand, avl_author_type) LEFT OUTER, n2 BY (location, brand, avl_author_type);
j2 = FOREACH j1 GENERATE p2::location as location, p2::brand as brand, p2::avl_author_type as avl_author_type, p2::positive_counts as positive_counts, n2::negative_counts as negative_counts;

j3 = JOIN j2 BY (location, brand, avl_author_type) LEFT OUTER, ne2 BY (location, brand, avl_author_type);
j4 = FOREACH j3 GENERATE j2::location as location, j2::brand as brand, j2::avl_author_type as avl_author_type, j2::positive_counts as positive, j2::negative_counts as negative, ne2::neutral_counts as neutral;

A bit verbose, but it works.
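One caveat with the join approach: after a LEFT OUTER join, a group with no negative or neutral rows ends up with a null count rather than 0, and a group with no positive rows is dropped entirely because both joins start from p2. A possible follow-up step for the null case, sketched here with a hypothetical alias j5 that builds on j4 from the script above:

```pig
-- Hypothetical follow-up step: replace null counts left by the
-- LEFT OUTER joins with 0. COUNT returns a long, hence the 0L literal.
j5 = FOREACH j4 GENERATE
         location,
         brand,
         avl_author_type,
         positive,
         (negative IS NULL ? 0L : negative) AS negative,
         (neutral  IS NULL ? 0L : neutral)  AS neutral;
```

This does not bring back groups that have no positive rows; covering those would need a FULL OUTER join or one of the single-pass approaches from Answer 1.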


group-by

count

apache-pig