Pig Script Based On Grouping
I have a dataset like this.
cus_ID BRAND AMOUNT
1 5 10
2 4 20
3 5 15
1 5 20
1 4 30
2 3 15
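I want to use Pig to find the top 5 brands and, for each of those top 5 brands, the top 10 customer IDs.
For your first goal (finding the top 5 brands), here is a starting point (code untested):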
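mydata = LOAD ... <load your data from your file or other source>
-- the statements below assume the loaded schema names its fields brand and amount
grouped = GROUP mydata BY brand;
-- one row per brand with its total amount
flattened = FOREACH grouped GENERATE
    group AS brand,
    SUM(mydata.amount) AS amount_per_brand;
-- sort by total, highest first, then keep the first five rows
ordered = ORDER flattened BY amount_per_brand DESC;
topfivebrand = LIMIT ordered 5;
DUMP topfivebrand;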
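For the second goal (the top 10 customers within each of those brands), one possible continuation is sketched below. Like the script above, it is untested. It assumes that `mydata` also exposes a field named `cus_id` (the real name depends on your LOAD schema) and that "top" means the largest summed amount. The relation names `cust_group`, `cust_totals`, `top_brand_keys`, `joined`, `in_top_brands`, `by_brand` and `top_customers` are made up for illustration.

```pig
-- total amount per (brand, customer) pair; cus_id is an assumed field name
cust_group = GROUP mydata BY (brand, cus_id);
cust_totals = FOREACH cust_group GENERATE
    group.brand AS brand,
    group.cus_id AS cus_id,
    SUM(mydata.amount) AS amount_per_customer;

-- keep only the customers of the five brands found earlier
top_brand_keys = FOREACH topfivebrand GENERATE brand AS top_brand;
joined = JOIN cust_totals BY brand, top_brand_keys BY top_brand;
in_top_brands = FOREACH joined GENERATE
    cust_totals::brand AS brand,
    cust_totals::cus_id AS cus_id,
    cust_totals::amount_per_customer AS amount_per_customer;

-- within each brand, sort customers by their total and keep at most ten
by_brand = GROUP in_top_brands BY brand;
top_customers = FOREACH by_brand {
    sorted = ORDER in_top_brands BY amount_per_customer DESC;
    top10 = LIMIT sorted 10;
    GENERATE FLATTEN(top10);
};
DUMP top_customers;
```

The nested FOREACH block is what makes the limit apply per brand. A plain LIMIT on the whole relation would keep ten rows in total instead of ten per brand.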
This should get you started! :)