Apache Pig: unable to perform grouping and counting
I am new to Pig scripting and cannot figure out where I am going wrong. Please help me with this problem.
My data:
(catA,myid_1,2014,store1,appl)
(catA,myid_2,2014,store1,milk)
(catA,myid_3,2014,store1,appl)
(catA,myid_4,2014,store1,milk)
(catA,myid_5,2015,store1,milk)
(catB,myid_6,2014,store2,milk)
(catB,myid_7,2014,store2,appl)
This is the expected result:
(catA,2014,milk,2)
(catA,2014,apple,2)
(catA,2015,milk,1)
(catB,2014,milk,1)
(catB,2014,apple,1)
I need to count the food items by category and year.
Here is my Pig script:
list = LOAD 'shop' USING PigStorage(',') AS (category:chararray,id:chararray,mdate:chararray,my_store:chararray,item:chararray);
list_of = FOREACH list GENERATE category,SUBSTRING(mdate,0,4) as my_date,my_store,item;
StoreG = GROUP list_of BY (category,my_date,my_store);
result = FOREACH StoreG
{
food_list = FOREACH list_of GENERATE item;
food_count = DISTINCT food_list;
GENERATE FLATTEN(group) AS (category,my_date,my_store),COUNT(food_count);
}
DUMP result;
The script above produces the following output:
(catA,2014,store1,2)
(catA,2015,store1,1)
(catB,2014,store2,2)
Can anyone tell me where my script is wrong?
Thanks.
StoreG = GROUP list_of BY (category,my_date,my_store);
should be
StoreG = GROUP list_of BY (category,my_date,item);
because your expected result is grouped by item, not by store.
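To make the fix concrete, here is a minimal sketch of the whole script with the corrected grouping key. It assumes the same 'shop' input and schema as in the question; the aliases shop_rows, by_item and nb_items are illustrative names that do not appear in the original script. Because my_store no longer takes part in the grouping, it is dropped from the projection.

```pig
-- Load the raw rows (same path and schema as in the question).
shop_rows = LOAD 'shop' USING PigStorage(',')
    AS (category:chararray, id:chararray, mdate:chararray,
        my_store:chararray, item:chararray);

-- Keep only the fields needed for the count; the year is the first 4 characters of mdate.
list_of = FOREACH shop_rows GENERATE category, SUBSTRING(mdate, 0, 4) AS my_date, item;

-- Group by item instead of store, so each group holds the rows of a single item.
by_item = GROUP list_of BY (category, my_date, item);

-- One output row per (category, my_date, item) with the number of rows in that group.
result = FOREACH by_item GENERATE
    FLATTEN(group) AS (category, my_date, item),
    COUNT(list_of) AS nb_items;

DUMP result;
```

Each output row should then have the shape (category, my_date, item, nb_items), which matches the layout of the expected result. No nested DISTINCT is needed here, since the item is already part of the group key.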
Here is one way to do it. It is not the most elegant, but it works:
list = LOAD 'shop' USING PigStorage(',') AS (category:chararray,id:chararray,mdate:chararray,my_store:chararray,item:chararray);
list_of = FOREACH list GENERATE category,SUBSTRING(mdate,0,4) AS my_date,my_store,item;
StoreG = GROUP list_of BY (category,my_date,my_store,item);
result = FOREACH StoreG GENERATE
group.category AS category,
group.my_date AS my_date,
group.my_store AS my_store,
group.item AS item,
COUNT(list_of.item) AS nb_items;
DUMP result;
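When we add the alias item to the GROUP BY statement, each distinct item ends up in its own group, so COUNT returns the number of rows for that item. This takes the place of finding the distinct items and then counting them, as you did in the nested block.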
If you still want to use your own code, just add the relation food_list.item to the GENERATE statement, as in the code below:
result = FOREACH StoreG
{
food_list = FOREACH list_of GENERATE item;
food_count = DISTINCT food_list;
GENERATE FLATTEN(group) AS (category,my_date,my_store),food_list.item,COUNT(food_count);
}